*Article* **HiBuffer: Buffer Analysis of 10-Million-Scale Spatial Data in Real Time**

#### **Mengyu Ma, Ye Wu \*, Wenze Luo, Luo Chen, Jun Li and Ning Jing**

College of Electronic Science, National University of Defense Technology, Changsha 410073, China; mamengyu10@nudt.edu.cn (M.M.); luowenze12@nudt.edu.cn (W.L.); luochen@nudt.edu.cn (L.C.); junli@nudt.edu.cn (J.L.); ningjing@nudt.edu.cn (N.J.)

**\*** Correspondence: yewugfkd@nudt.edu.cn; Tel.: +86-137-5518-6456

Received: 30 October 2018; Accepted: 27 November 2018; Published: 30 November 2018

**Abstract:** Buffer analysis, a fundamental function in a geographic information system (GIS), identifies the areas surrounding geographic features within a given distance. Real-time buffer analysis for large-scale spatial data remains a challenging problem, since the computational scales of conventional data-oriented methods expand rapidly with increasing data volume. In this paper, we introduce HiBuffer, a visualization-oriented model for real-time buffer analysis. An efficient buffer generation method is proposed which introduces spatial indexes and a corresponding query strategy. Buffer results are organized into a tile-pyramid structure to enable stepless zooming. Moreover, a fully optimized hybrid parallel processing architecture is proposed for the real-time buffer analysis of large-scale spatial data. Experiments using real-world datasets show that our approach can reduce computation time by up to several orders of magnitude while preserving superior visualization effects. Additional experiments were conducted to analyze the influence of spatial data density, buffer radius, and request rate on HiBuffer performance, and the results demonstrate the adaptability and stability of HiBuffer. The parallel scalability of HiBuffer was also tested, showing that HiBuffer achieves high parallel acceleration performance. Experimental results verify that HiBuffer is capable of handling 10-million-scale data.

**Keywords:** buffer analysis; real-time; visualization-oriented; tile-pyramid; parallel computing

#### **1. Introduction**

A buffer in a geographic information system (GIS) is defined as the zone around a spatial object, measured in units of time or distance [1]. Buffer analysis is a basic GIS spatial operation used for overlay analysis, proximity analysis, spatial data query, and so on. Buffer generation is the core issue in buffer analysis, and several methods for solving the buffer generation problem have been proposed. According to the type of buffer produced, these methods fall into two categories: raster-based and vector-based buffer generation methods. Raster-based methods use the values of the pixels in raster images to indicate buffer zones. Raster buffers suffer from sawtooth distortion when zooming in, which limits their effective resolution (see Figure 1a). Vector-based methods use vector polygons to represent buffer results. Figure 2 shows the process of vector buffer generation. Compared with raster buffers, vector buffers require much less storage space, and zooming in does not sacrifice resolution. However, in practice, circles or circular arcs in vector buffers are simplified to regular polygons or regular-polygon segments to reduce computational complexity. As shown in Figure 1b, distortion therefore still occurs in vector buffers when zooming in.

**Figure 1.** Distortion of raster and vector buffers while zooming in.

**Figure 2.** Generating vector buffers.

Buffer generation is computationally demanding. Furthermore, with the rapid development of surveying and mapping technology, spatial data are being produced at ever larger scales, which further increases the computational cost of buffer generation. Previous studies have proposed many optimization strategies for the problem, most of which focus on accelerating the construction of buffer zones. For vector-based methods, many approaches have been put forward to address the buffer boundary-dissolving problem [2–4]. Although excellent performance has been achieved, these optimized methods are serial implementations, so their performance is limited when processing large-scale spatial data. Developments in parallel computing technologies provide a prerequisite for high-performance buffer generation, and several parallel strategies have been proposed to overcome the bottlenecks [5–8].

In most existing studies, buffers of spatial objects are generated separately first, and then the buffers are merged to get the final results. Such an approach is data-oriented and straightforward. However, the computational scales expand rapidly with the volume of spatial objects; as a result, it is difficult for the traditional data-oriented methods to provide real-time buffer analysis of large-scale spatial data.

In this paper, we present a visualization-oriented parallel buffer analysis model, HiBuffer, which provides interactive, online buffer analysis of large-scale spatial data. Different from traditional data-oriented methods, the core problem in HiBuffer is to determine whether the pixels for display are in the buffer of a spatial object or not. To the best of our knowledge, this is a brand new approach to buffer generation, with the following advantages: (1) real-time analysis; (2) unlimited precision; (3) insensitivity to data volume. Several techniques are combined in HiBuffer to achieve this performance. One feature of HiBuffer is an efficient buffer generation method that incorporates spatial indexes and a corresponding query strategy. Organized into a tile-pyramid structure, the buffer results of HiBuffer are provided to users with a stepless zooming feature. Moreover, a fully optimized hybrid parallel processing architecture is proposed in HiBuffer to achieve real-time buffering of large-scale spatial data. Using real-world datasets, we designed and conducted a series of experiments to evaluate the performance of HiBuffer, including its real-time property, its parallel scalability, and the influence of spatial data density, buffer radius, and request rate.

The remainder of this paper proceeds as follows. Section 2 highlights the related work and literature. In Section 3, the techniques of HiBuffer are described in detail. The experimental results are presented and discussed in Section 4, with an online demonstration of HiBuffer introduced in Section 5. Conclusions are drawn in Section 6.

#### **2. Related Work**

The calculation of buffers is an essential operation of GIS. According to the computing architecture, studies on the problem can be classified into two categories: the calculation of buffers using a serial computing model and the construction of buffers in a parallel computing framework.

The first category mainly concentrates on the generation of a buffer zone for each spatial object. Dong [9] introduced a method which utilizes the rotation transform point formula and a recursion approach to accelerate buffer generation. Ren [10] proposed a method that reduces the computation by removing the non-characteristic points from the original geometry based on the Douglas–Peucker algorithm. Peng [11] introduced a method based on expansion operations in mathematical morphology, in which the vector data are rasterized and expanded to generate buffers. Wang [12] introduced buffer generation methods based on a vector boundary tracing strategy. Jiechen [13] proposed a method using run-length encoding to encode raster buffers, which reduces memory occupation while creating the buffers. Zalik [14] presented an algorithm for constructing the geometric outlines of a given set of line segments by using a sweep-line approach. Based on Zalik's method, Bhatia [15] presented an algorithm for constructing buffers of vector feature layers and dissolving the buffers based on a sweep-line approach and vector algebra.

With the development of computer hardware, there has been a rapid expansion in processor numbers, thus making parallel computing an increasingly important issue for processing large-scale spatial data. Parallel computing is an effective way to accelerate buffer generation.

Pang [16] proposed a method for buffer analysis based on a service-oriented distributed grid environment. The method uses a master-slave architecture in which all entities are assigned equally to slave computing nodes, and the master node then combines the results from all slave nodes to generate the final buffer zones. Huang [5] introduced a parallel buffer algorithm which consists of a point-based partition method and a binary-union-tree-based buffer boundary-dissolving method. Fan [6] proposed a parallel buffer algorithm based on area merging to improve the performance of buffer analysis on large datasets. Wang [17] proposed a parallel buffer generation method based on an arc partition strategy, which achieves better load balance than the point partition strategy.

In our previous work, we proposed a parallel vector buffer generation method, the Hilbert Curve Partition Buffer Method (HPBM) [8], which takes advantage of the in-memory architecture of Spark [18]. HPBM uses the Hilbert space-filling curve to aggregate adjacent objects, thereby optimizing the data distribution among the distributed data blocks and reducing the cost of swapping data among different blocks during parallel processing. Experiments showed that the method outperforms current popular GIS software and existing parallel buffer algorithms. However, HPBM still cannot generate buffers for large-scale spatial data in real time.

In summary, all the methods mentioned above are data-oriented and straightforward, and their computational scales expand rapidly with the volume of spatial objects. Even with parallel acceleration technologies, it is difficult for traditional data-oriented methods to provide buffer analysis of large-scale spatial data in real time.

#### **3. Methodology**

In this section, the key technologies of HiBuffer are introduced. Specifically, HiBuffer is used for the buffer analysis of point and linestring objects, which are commonly used to represent points of interest (POI) and roads in maps. In HiBuffer, buffer generation is visualization-oriented; namely, the core problem of HiBuffer is to determine whether the pixels for display are in the buffer of any spatial object or not.

In HiBuffer, we utilize spatial indexes to determine whether a pixel is in the buffers of spatial objects, and accordingly, an efficient buffer generation method named Spatial-Index-Based Buffer Generation (SIBBG) is proposed. The buffers generated by SIBBG are actually raster based. To avoid the sawtooth distortion of these buffers, we propose the Tile-Pyramid-Based Stepless Zooming (TPBSZ) method. To support real-time buffer analysis of large-scale spatial data, parallel computing technologies are used to accelerate computation, and thus we designed the Hybrid-Parallel-Based Process Architecture (HPBPA).

#### *3.1. Spatial-Index-Based Buffer Generation*

Spatial indexes are used to organize spatial data. As an efficient tree data structure widely used for indexing spatial data, R-tree was proposed in 1984 [19] and has been thoroughly studied by researchers [20,21]. The spatial queries using R-tree, including the bounding-box query and nearest-neighbor search, have been fully optimized theoretically and practically. In SIBBG, we utilize R-tree to determine whether a pixel is in the buffers of spatial objects.

As shown in Figure 3, the problem of determining whether a pixel *P* is in the buffers, given a radius *R*, can be abstracted as determining whether the circle of radius *R* centered at *P* intersects any spatial object. An intuitive solution is as follows: (1) calculate the vector buffer *Circ* of *P* with radius *R*; (2) use the INTERSECT operator provided by R-tree to determine whether *Circ* intersects any spatial object. However, this approach has some disadvantages. Firstly, as shown in Figure 1b, *Circ* is in fact not a circle but a regular polygon, which introduces calculation error. Secondly, as R-tree is implemented by grouping nearby objects and representing them with their minimum bounding rectangles at the next higher level of the tree, R-tree works well only for bounding-box queries, rather than queries using other polygon shapes like *Circ*.

**Figure 3.** Different situations of pixel P in buffers or not in them, with a given radius R. (**a**) P in buffers of points; (**b**) P not in buffers of points; (**c**) P in buffers of linestrings; (**d**) P not in buffers of linestrings.

As the nearest-neighbor search has a much higher computational complexity than the bounding-box query in R-tree, we introduce inner and outer boxes (Figure 4) to optimize the spatial queries in SIBBG. The process of SIBBG is described in Algorithm 1. The inner and outer boxes handle different situations. When there are many spatial objects within distance *R* of *P*, we query the spatial objects intersecting the inner box, as a dense neighborhood is very likely to contain an object intersecting the inner box; when there are few spatial objects in the neighborhood of *P*, we use the outer box to filter out the spatial objects that are far from *P*. Compared with traditional methods, the performance of SIBBG is less sensitive to the data size.

**Figure 4.** Inner and outer boxes of pixel P with a given radius R.

**Algorithm 1** Spatial-Index-Based Buffer Generation.

**Input:** Pixel P, radius R, and spatial index R-tree.

**Output:** True or False (whether P is in the buffers of spatial objects with a given radius R).

*r* ← *R* × √2/2

*InnerBox* ← BOX(*P*.*x* − *r*, *P*.*y* − *r*, *P*.*x* + *r*, *P*.*y* + *r*)

*Tmp* ← an object satisfying *Rtree*.INTERSECT(*InnerBox*)

**if** *Tmp* is not *null* **then**

&emsp;**return** True

**else**

&emsp;*OuterBox* ← BOX(*P*.*x* − *R*, *P*.*y* − *R*, *P*.*x* + *R*, *P*.*y* + *R*)

&emsp;*Tmp* ← the object satisfying *Rtree*.INTERSECT(*OuterBox*) and *Rtree*.NEAREST(*P*)

&emsp;**if** *Tmp* is not *null* and DISTANCE(*Tmp*, *P*) ≤ *R* **then**

&emsp;&emsp;**return** True

**return** False
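The branch logic of Algorithm 1 can be sketched in Python (a simplified illustration, not the paper's C++ implementation; a plain list of point coordinates stands in for the R-tree, so the two box queries appear as linear scans rather than `Rtree.INTERSECT` calls):

```python
from math import sqrt, hypot

def in_buffer(p, radius, points):
    """Sketch of Algorithm 1 (SIBBG) for point objects.

    `points` stands in for the R-tree; each box test below would be
    an indexed bounding-box query in the real implementation.
    """
    px, py = p
    # Inner box: the square inscribed in the circle of radius R.
    # Any object inside it is within R of P, so we can accept early.
    r = radius * sqrt(2) / 2
    if any(abs(x - px) <= r and abs(y - py) <= r for x, y in points):
        return True
    # Outer box: the square circumscribing the circle. Objects outside
    # it are farther than R, so only candidates inside it need the
    # exact (nearest-neighbor) distance test.
    candidates = [(x, y) for x, y in points
                  if abs(x - px) <= radius and abs(y - py) <= radius]
    return any(hypot(x - px, y - py) <= radius for x, y in candidates)
```

Note how the cheap inner-box test resolves the dense case without any distance computation, while the outer box prunes the sparse case before the exact test.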

#### *3.2. Tile-Pyramid-Based Stepless Zooming*

Tile-pyramid is a multi-resolution data structure widely used for map browsing on the web. The structure of the tile-pyramid is shown in Figure 5. At the lowest level of the tile-pyramid (level 0), a single tile summarizes the whole map. At each zoom level *z*, there are up to 4*<sup>z</sup>* tiles. Each tile has the same size of *n* × *n* pixels, and tiles at the same level cover geographic ranges of equal size [22]. In TPBSZ, a tile-pyramid structure is employed to organize buffer results, provided as a Web Map Tile Service (WMTS). The tile size is set to 256 × 256 pixels, which is commonly used in tile-pyramid structures.
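As an illustration of the tile-pyramid arithmetic, the following sketch assumes the standard Web Mercator "slippy-map" scheme commonly used by WMTS tile-pyramids (the paper does not spell out its exact projection, so this mapping is an assumption):

```python
from math import asinh, tan, radians, floor, pi

TILE_SIZE = 256  # pixels per tile side, as in the paper

def tiles_at_level(z):
    """Number of tiles at zoom level z: 4**z (a single tile at level 0)."""
    return 4 ** z

def lonlat_to_tile(lon, lat, z):
    """Tile indexes (x, y) containing a WGS84 coordinate at zoom level z,
    assuming the standard Web Mercator slippy-map tiling scheme."""
    n = 2 ** z  # tiles per axis
    x = floor((lon + 180.0) / 360.0 * n)
    y = floor((1.0 - asinh(tan(radians(lat))) / pi) / 2.0 * n)
    return x, y
```

Since every tile is a fixed 256 × 256 pixels, the per-tile rendering cost is independent of zoom level; zooming in simply selects tiles from a deeper level.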

The advantages of TPBSZ are mainly reflected in the following two aspects: (1) stepless zooming with unlimited precision, since each tile is rendered at the exact resolution of the requested zoom level rather than interpolated from a fixed-resolution raster; (2) stable computational complexity, since every tile contains the same number of pixels regardless of zoom level.

**Figure 5.** Tile-pyramid structure.

#### *3.3. Hybrid-Parallel-Based Process Architecture*

In HiBuffer, hybrid parallel computing technologies are introduced to achieve real-time buffer analysis of large-scale spatial data. The HPBPA of HiBuffer is shown in Figure 6. HPBPA mainly comprises three parts: the **Hybrid-Parallel Buffer Tile Rendering Engine**, the **Multi-Thread Buffer Tile Server**, and the **In-Memory Messaging Framework**. The **Spatial Indexes Hub** consists of the R-tree indexes of the spatial objects. The R-tree indexes are pre-built quickly and stored in memory-mapped files [23], which are not loaded into memory in their entirety. This yields fewer disk I/Os and lower memory consumption, even when the indexes are very large. The **Task Pool** stores the tasks to be executed. Analysis results are provided to users as a WMTS, which can be browsed over the Internet. A buffer tile is created when it is first requested, and it is stored in the **Result Pool** to avoid repeated computation if it is requested again.

**Figure 6.** Hybrid-Parallel-Based Process Architecture of HiBuffer.

The **Hybrid-Parallel Buffer Tile Rendering Engine** adopts the hybrid Message Passing Interface (MPI)-Open Multiprocessing (OpenMP) parallel processing model to render buffer tiles. MPI is a multiprocess parallel processing model based on message passing; OpenMP is a multi-thread parallel processing model based on shared memory. In HiBuffer, we treat the rendering of one buffer tile as an independent task, and each task is processed with multiple OpenMP threads in one MPI process. As the task requests arrive as a stream, tasks are dynamically allocated to the MPI processes. An MPI process is suspended after its assigned task is accomplished, and new tasks are handled on a first-come-first-served basis. An example of the buffer tile rendering process is shown in Figure 7.
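To make the division of labor concrete, the following Python sketch mimics one rendering task: the rows of a tile are split among worker threads, playing the role of the OpenMP threads inside a single MPI process. The predicate `pixel_in_buffer` stands in for the SIBBG test of Algorithm 1; the actual engine is C++ with MPI and OpenMP, so this is only an illustration of the parallel decomposition:

```python
from concurrent.futures import ThreadPoolExecutor

def render_tile(tile_px, n_threads, pixel_in_buffer):
    """Render one tile_px x tile_px buffer tile with n_threads workers.

    Each worker handles an interleaved subset of rows, mirroring a
    static OpenMP loop schedule; pixel_in_buffer(row, col) stands in
    for the per-pixel SIBBG test.
    """
    tile = [[0] * tile_px for _ in range(tile_px)]

    def render_rows(start):
        # Worker `start` takes every n_threads-th row.
        for row in range(start, tile_px, n_threads):
            for col in range(tile_px):
                tile[row][col] = 1 if pixel_in_buffer(row, col) else 0

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(render_rows, range(n_threads)))  # wait for all workers
    return tile
```

In the real engine, one such task is dispatched per MPI process, so tile-level parallelism (across processes) and pixel-level parallelism (across threads) compose.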

The **Multi-Thread Buffer Tile Server** encapsulates the buffer analysis service as a WMTS. It adds tasks to the **Task Pool** and filters out unnecessary task requests, including (1) tiles whose distance from the Minimum Bounding Rectangle (MBR) of the spatial objects is larger than the given buffer radius; (2) tiles generated in previous tasks that are still in the **Result Pool**. Buffer tiles are returned to users from the **Result Pool** once finished. To improve concurrency, multi-thread technology is adopted in the tile server.
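The two filtering rules can be sketched as follows (a hypothetical helper, not the paper's code; boxes are axis-aligned `(min_x, min_y, max_x, max_y)` tuples and `result_pool` is any mapping of cached tiles keyed by tile identifier):

```python
def needs_render(tile_box, data_mbr, radius, result_pool, key):
    """Return True only if the tile must be rendered.

    Rule 2: skip tiles already cached in the result pool.
    Rule 1: skip tiles farther than `radius` from the data MBR,
    since no pixel in them can fall inside any buffer.
    """
    if key in result_pool:
        return False
    # Axis-aligned gap between the tile box and the data MBR
    # (zero on an axis where the boxes overlap).
    dx = max(data_mbr[0] - tile_box[2], tile_box[0] - data_mbr[2], 0.0)
    dy = max(data_mbr[1] - tile_box[3], tile_box[1] - data_mbr[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5 <= radius
```

This pre-filter is cheap (constant time per request), so it runs in the tile server threads before any task reaches the rendering engine.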

The **In-Memory Messaging Framework** is a messaging framework based on Redis, an in-memory key-value database. In this framework, tasks and results are transferred rapidly in memory without disk I/Os. The tasks are stored in a FIFO queue in Redis: tasks are pushed onto the queue and popped by suspended MPI processes. To avoid errors in parallel processing, the push and pop operations are performed in blocking mode. After a task is finished, the buffer tile is written to Redis, and a task completion message is sent to the tile server using the subscribe/publish functions of Redis. Tiles are set with expiration times, and expired tiles are cleaned up once the maximum memory limit is exceeded.
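The queue discipline can be illustrated with a thread-based stand-in: Python's `queue.Queue` plays the role of the Redis task list, where `put` corresponds to the server pushing a task (LPUSH in Redis) and the blocking `get` to a worker popping one (BRPOP). The tile keys below are arbitrary examples; the real framework runs across MPI processes against an actual Redis instance:

```python
import queue
import threading

tasks = queue.Queue()   # stand-in for the Redis FIFO task list
results = []            # stand-in for tiles written back to Redis

def worker():
    """One rendering process: block until a task arrives, then 'render'."""
    while True:
        task = tasks.get()        # blocking pop, like Redis BRPOP
        if task is None:          # sentinel used here to stop the worker
            break
        results.append(("tile", task))  # completion, like publish + SET

t = threading.Thread(target=worker)
t.start()
for tile_key in [(13, 4100, 3000), (13, 4101, 3000)]:  # example (z, x, y) keys
    tasks.put(tile_key)           # push, like Redis LPUSH
tasks.put(None)
t.join()
```

Because both push and pop block rather than poll, idle workers consume no CPU while waiting, and no task is lost or processed twice.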

**Figure 7.** Buffer tile rendering in a Message Passing Interface (MPI) process with four Open Multiprocessing (OpenMP) threads.

#### **4. Experimental Evaluation**

In this section, we report several experiments we conducted to evaluate the performance of HiBuffer. First, the ability of HiBuffer to support real-time buffer analysis of large-scale data was tested. Then, we carried out experiments to analyze the influence of different factors on HiBuffer, including spatial data density, buffer radius, and request rate. Finally, the parallel scalability of HiBuffer was tested by running it with varying numbers of MPI processes and OpenMP threads.

#### *4.1. Experimental Setup and Datasets*

All the experiments were conducted on an SMP server, described in Table 1. HiBuffer was implemented in C++. The experiments are based on MPICH 3.4, Boost C++ 1.64, and Geospatial Data Abstraction Library 2.1.2. Table 2 shows the datasets used in the experiments. L1, L2, L3, L4, and P1 are from OpenStreetMap, a digital map database built through crowdsourced volunteered geographic information. The larger datasets, L5 and P2, were provided by map service providers. It is worth noting that all the datasets were collected from the real world and have a geographically unbalanced distribution. Such datasets create challenges for efficient processing.

In the experiments, a task refers to a request for a buffer tile that lies within the MBR of the spatial objects and is not already in the **Result Pool**. The rendering time of a tile refers only to the time during which the tile is rendered in the **Hybrid-Parallel Buffer Tile Rendering Engine**, excluding the waiting time while the task is in the **Task Pool**; the rendering time of *N* tiles (*N* > 1) refers to the whole time cost of rendering the *N* tiles in HiBuffer, i.e., from the time the *N* tasks are generated in the **Task Pool** until all the tasks have been finished. The experimental settings are listed in Table 3. The benchmark buffer radius was set to 200 m, which is practically meaningful for datasets with a wide spatial range. The benchmark request rate was set to infinity, meaning that all the task requests are dispatched simultaneously. As there are 32 cores/64 threads in the server processors, the benchmark parallel configuration was 32 MPI processes with 2 OpenMP threads per process.





**Table 3.** Experimental settings.


*4.2. Experiment 1: Support Real-Time Buffer Analysis of Large-Scale Data with HiBuffer*

In order to highlight the superiority of HiBuffer, Table 4 shows a comparison of HiBuffer with HPBM, three optimized parallel methods, and the popular GIS software programs PostGIS, QGIS, and ArcGIS. HiBuffer was deployed and tested in the same hardware environment. We then carried out experiments on the tile rendering time to further evaluate the real-time characteristic of HiBuffer. For each dataset, we generated 5000 tasks through a test program, which requested different buffer tiles at different zoom levels randomly. We analyzed the tile rendering logs, and the experimental results are shown in Figure 8.


**Table 4.** Performance of traditional data-oriented methods [8] and HiBuffer.

<sup>a</sup> is based on three optimization methods: split-and-conquer buffer zone generation, vertices-based task decomposition, and tree-like merger [6]. <sup>b</sup> adopts strategies such as vertices-based task decomposition and tree-like merger [5]. <sup>c</sup> uses the equivalent-arc partition strategy [17].

As shown in Table 4, HPBM outperformed the other traditional data-oriented methods. However, HPBM failed to generate buffers for large spatial data in real time. Even for a small dataset such as L1, it took 9 s for HPBM to generate the buffer result. As data volumes grew, the computing time of the traditional methods increased significantly. For example, from L2 to L3, the data size rose from 2,336,250 segments to 6,991,831 segments (around 3 times), while the computing time using HPBM increased from 38.8 s to 332.3 s (around 8.6 times). The reason for the differing growth rates is that different data distributions lead to different buffer zone dissolving complexities. As a result, it is almost impossible to generate buffers for massive spatial data in real time using traditional data-oriented methods. Using HiBuffer, we can obtain the buffer results for these datasets (L1, L2, L3) in less than 1 s.

**Figure 8.** Tile rendering time of HiBuffer on different datasets.

Figure 8a shows the time consumed to render 5000 buffer tiles for each dataset. From L1 to L5, and from P1 to P2, the data size increases sequentially; however, there is no significant uptrend in the rendering time of 5000 buffer tiles. Surprisingly, L5, the largest dataset with more than 20 million linestring objects, produces better performance than datasets with much smaller scales (L1, L3, and L4). The experimental results show that HiBuffer is insensitive to data volume. Figure 8a also reports the number of tiles rendered per second; L4 produces the poorest performance, at 237.95 tiles per second. As the number of tiles on a screen is generally no more than 50, real-time buffer analysis with HiBuffer is possible for all the datasets. As shown in Figure 8b, the per-tile rendering time distributions on the different datasets are visualized with boxplots ('◦' represents outliers and '×' represents the average rendering time). Although L4 produces the poorest performance of all the datasets, most of its requested buffer tiles are rendered within 0.45 s, and the longest rendering time does not exceed 0.75 s. Suppose a browser requests 50 buffer tiles of L4 at once. Considering that there are 32 MPI processes, the 50 tasks will be processed in two rounds, with 14 (= 32 processes × 2 − 50 tasks) MPI processes idle in the second round; thus, the request will most likely be completed in less than 0.9 s (= 0.45 s × 2). In conclusion, HiBuffer is able to provide interactive, online buffer analysis of large-scale spatial data, which is difficult for traditional data-oriented methods.

#### *4.3. Experiment 2: Impact of Spatial Data Density in HiBuffer*

In this experiment, we focused on testing the impact of spatial data density on HiBuffer. Spatial data density refers to the number of spatial elements per unit area; we used the number of segments or points in a tile as its measure. To express spatial data density accurately, the tile level should be neither too low nor too high. Accordingly, we chose tiles of level 13, which have an appropriate spatial span (about 4.9 × 4.9 km) for all the datasets. For each dataset, we generated 5000 different requests for level-13 buffer tiles and studied the relationship between tile rendering time and the number of elements in the tiles. The results are shown in Figure 9.

Figure 9a–g illustrate the changes in tile rendering time as the number of elements in a tile increases, for each dataset. A point ('◦') in the figures represents one sample, recording the rendering time of a tile and the number of elements in it. The trend lines, generated by lowess regression, indicate the direction of change. As shown by the trend line of each dataset, the rendering time of a tile trends upward as the number of elements in the tile increases, meaning that performance decreases with increasing spatial data density in HiBuffer. This is because higher spatial data density leads to higher computational complexity in the SIBBG process. However, the performance degradation is not serious (take L4, for example: when the number of segments in a tile increases to about 28,000, the rendering time of the tile is still less than 0.5 s). The percentage lines illustrate the distribution of the 5000 tiles across different spatial data densities, which demonstrates the geographically unbalanced distribution of the datasets. Figure 9h compares the impact of spatial data density across the datasets. The average number of elements in the tiles indicates the spatial data density of a dataset, while the average rendering time of each tile indicates the buffer generation performance. As illustrated in the figure, spatial data density and rendering time roughly follow the same trend. This shows that datasets with higher spatial data density yield weaker performance in HiBuffer, which explains the earlier observation that HiBuffer is insensitive to data volume. Although L5 has the largest data volume, its lower spatial data density yields better performance than datasets of much smaller scale; conversely, L1, a small dataset, performs relatively poorly because its spatial data density is high.
The average rendering time of a tile which intersects with no elements was also calculated and is shown in the figure. For different datasets, the average rendering time of a no-element tile always remains at about 0.05 s. This demonstrates that the rendering time of a no-element tile is not related to the spatial data densities or the data volumes of the datasets.

#### *4.4. Experiment 3: Impact of Buffer Radius on HiBuffer*

In experiment 3, the impact of the buffer radius on HiBuffer was studied. The buffer radius was set to 200, 400, 600, 800, 1000, and 100,000 m in turn. For each radius, we generated 5000 buffer tile tasks of different zoom levels for each dataset. We calculated the average rendering time of each tile; the experimental results are shown in Figure 10.

As the buffer radius increases, more pixels in the tiles belong to the buffers of spatial objects, and more computation is required to generate the buffer tiles. This is corroborated by the experimental results: the average rendering time of each tile shows an uptrend with increasing buffer radius for all the datasets. To estimate the lower-bound performance of HiBuffer, the buffer radius was set to a very large value, 100,000 m, which is 500 times the benchmark buffer radius. The results show that even at this radius, the performance degradation of HiBuffer is not serious. Although L4 produces the poorest performance, its average rendering time of each tile is 0.2951 s, which can still support real-time buffer analysis.

**Figure 9.** Impact of spatial data density on HiBuffer.

**Figure 10.** Average rendering time of each tile with different buffer radii in HiBuffer.

#### *4.5. Experiment 4: Impact of Request Rate on HiBuffer*

In all experiments other than experiment 4, the request rate was set to infinity. This means that all the task requests were dispatched simultaneously, and HiBuffer kept running at full load until all tasks were finished. In practical applications, however, the task requests arrive as a stream at much lower request rates. In this experiment, the request rate was set to 50, 100, 200, 400, 800, and INF tiles per second. For each rate, we generated 5000 buffer tile tasks for each dataset.

The rendering time distributions of each tile at different request rates are illustrated in Figure 11. For each dataset, we used the number of tiles rendered per second at the request rate of INF tiles/s (Figure 8a) as the performance limit of HiBuffer. The request rate affects all the datasets in roughly the same way. When the request rate is below the performance limit, the rendering time of a tile increases noticeably as the request rate rises; this results from intensifying competition for resources among the processes in HiBuffer. In contrast, once the request rate exceeds the performance limit, the rendering time of a tile no longer changes significantly, because HiBuffer is already running at full load and further increases in the request rate have little effect on tile rendering. However, due to the growing waiting time of tasks in the **Task Pool**, the overall responsiveness of HiBuffer degrades rapidly as the request rate rises beyond the performance limit. Since the request rate was set to infinity in all experiments other than experiment 4, even higher performance than reported can be expected in practical applications, where request rates are lower.


**Figure 11.** Rendering time of each tile with different request rates in HiBuffer.

#### *4.6. Experiment 5: Parallel Scalability of HiBuffer*

To evaluate parallel scalability, HiBuffer was tested with 1, 2, 4, 8, 16, 32, and 64 MPI processes and 1, 2, and 4 OpenMP threads per process. For each combination of MPI processes and OpenMP threads, we generated 1000 buffer tile requests of different zoom levels for each dataset. The experimental results are plotted in Figure 12.

First, we analyzed the rendering time of 1000 tiles. The results show that, as the number of processes increases, HiBuffer achieves high parallel acceleration, which is approximately linear when the process number is below 16. However, the parallel acceleration diminishes as the process number increases past 16, especially when running with 4 OpenMP threads per process. This is because increasing the number of processes aggravates resource competition, which is worse still when running with multiple threads. For example, in Figure 12d, the rendering time of 1000 tiles with 4 threads actually increases as the number of processes grows from 16 to 32. Then, we compared the average rendering time of each tile for different OpenMP thread counts. As shown in the figures, multi-thread parallel processing can effectively reduce the rendering time of a tile when resource competition is not intense. Surprisingly, multi-thread parallel processing even causes performance degradation when the process number exceeds 16, as a result of resource competition.

Based on the experimental results and the analysis above, we can draw some conclusions on deploying HiBuffer in the given hardware environment: (1) under high load, 64 processes × 1 thread is suggested, because it takes the least time to generate 1000 tiles for all the datasets; (2) under low load, 16 processes × 4 threads is suggested, as this setting renders each tile quickly and can reduce the response time of buffer tile requests while the number of requests is low.

#### **5. Online Demo of HiBuffer**

An online demonstration of HiBuffer is provided on the Web (http://www.higis.org.cn:8080/hibuffer). The OSM Spain roads and points (L4 and P1) were used as test data. The demonstration was deployed on a server with four cores/eight threads (see Table 5) and set to run with four MPI processes and eight OpenMP threads. Compared with the experimental environment (see Table 1), the demonstration hardware has far fewer CPU cores and far less memory, which inevitably results in weaker buffer analysis performance. Even so, as the demonstration illustrates, it is still possible to provide interactive buffer analysis even for L4, the dataset with the poorest performance in HiBuffer. Figure 13 shows the analysis results of the online demo.

*ISPRS Int. J. Geo-Inf.* **2018**, *7*, 467

**Figure 12.** Parallel performance of HiBuffer with different numbers of MPI processes and OpenMP threads.



(**a**) L4 (100 m buffer)

**Figure 13.** Analysis results of the online demo.

#### **6. Conclusions and Future Work**

This paper presents a visualization-oriented parallel model, HiBuffer, for real-time buffer analysis of large-scale spatial data. In HiBuffer, we propose an efficient buffer generation method named SIBBG, which uses R-tree indexes to improve buffer generation speed; its performance is not sensitive to data size. We organize buffer results in a tile pyramid and thus propose TPBSZ, which produces satisfactory stepless-zooming visualization with unlimited precision and stable computational complexity. Parallel computing technologies are used to accelerate analysis through HPBPA, in which we designed the fully optimized Hybrid-Parallel Buffer Tile Rendering Engine, Multi-Thread Buffer Tile Server, and In-Memory Messaging Framework.

Our experimental results show that HiBuffer has significant advantages over data-oriented methods. Experiment 1 demonstrates the ability of HiBuffer to provide interactive, online buffer analysis of large-scale spatial data. Experiment 2 shows that HiBuffer performance decreases as spatial data density increases, but the degradation is modest. In experiment 3, we analyzed the impact of the buffer radius: performance decreases slightly as the radius grows, and HiBuffer still supports real-time buffer analysis at a radius of 100,000 m, a very large value. Experiment 4 analyzes the impact of the request rate, indicating that higher performance than the reported experimental results can be achieved in practical applications, since the request rate was set to infinity in all the experiments other than experiment 4. In experiment 5, we tested the parallel scalability of HiBuffer; the results show that HiBuffer achieves high parallel acceleration when resource competition is not intense. Moreover, an online demonstration of HiBuffer is provided on the Web.

HiBuffer does have limitations, and some directions for future research are worth noting:


**Author Contributions:** M.M. and Y.W. designed and implemented the algorithm; M.M. and L.C. performed the experiments and analyzed the data; J.L. and N.J. contributed to the construction of experimental environment; W.L. developed the visible interface of the demonstration system; M.M. wrote the paper and Y.W. helped to improve the language expression.

**Funding:** This research was funded by the National Natural Science Foundation of China, grant numbers 41471321 and 41871284.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Computer Code Availability:** The computer code of HiBuffer is open source, and an online demonstration is provided (https://github.com/MemoryMmy/Hibuffer).

#### **Abbreviations**

The following abbreviations are used in this manuscript:



#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Mobility Data Warehouses**

#### **Alejandro Vaisman 1,\* and Esteban Zimányi 2**


Received: 9 January 2019; Accepted: 29 March 2019; Published: 2 April 2019

**Abstract:** The interest in mobility data analysis has grown dramatically with the wide availability of devices that track the position of moving objects. Mobility analysis can be applied, for example, to analyze traffic flows. To support mobility analysis, trajectory data warehousing techniques can be used. Trajectory data warehouses typically include, as measures, segments of trajectories, linked to spatial and non-spatial contextual dimensions. This paper goes beyond this concept, by including, as measures, the trajectories of moving objects at any point in time. In this way, online analytical processing (OLAP) queries, typically including aggregation, can be combined with moving object queries, to express queries like "List the total number of trucks running at less than 2 km from each other for more than 50% of their route in the province of Antwerp" in a concise and elegant way. Existing proposals for trajectory data warehouses do not support queries like this, since they are based on either the segmentation of the trajectories, or a pre-aggregation of measures. The solution presented here is implemented using MobilityDB, a moving object database that extends the PostgreSQL database with temporal data types, allowing seamless integration with relational spatial and non-spatial data. This integration leads to the concept of mobility data warehouses. This paper discusses modeling and querying mobility data warehouses, providing a comprehensive collection of queries implemented using PostgreSQL and PostGIS as database backend, extended with the libraries provided by MobilityDB.

**Keywords:** mobility; data warehouses; spatiotemporal OLAP; mobility analytics

#### **1. Introduction**

Moving objects [1] (MOs) are objects (e.g., cars, trucks, pedestrians) whose spatial features change continuously in time. The data produced by MOs (e.g., by using attached devices like GPSs and smartphones) can be analyzed in many ways, for example, to discover mobility patterns [2]. This is called mobility data analysis, a technique that is currently used in many related fields, like traffic management, transportation, car pooling applications, smart cities, etc.

The increasing availability of MO data has triggered the interest in mobility data analysis, which has been growing steadily year after year. Moving object data generally come in the form of long sequences of spatiotemporal coordinates *x*, *y*, *t*. To facilitate analysis, these sequences are split into smaller portions of movement, called trajectories. Trajectories can be represented in two possible ways. A continuous trajectory represents the movement track of an object by means of a sequence of spatiotemporal points occurring within a certain interval, together with interpolation functions that allow computing the (approximate) position of the object at any instant. A discrete trajectory contains only a sequence of spatiotemporal points; no interpolation function is defined.

Moving object databases are databases that store the positions of MOs at any point in time. Although these databases are appropriate for querying, for example, current movement, by means of queries like "which taxis are within ten minutes from Brussels Central Station", they do not support, per se, complex analytical queries such as "For each month, list the total number of buses running at a speed higher than 60 km per hour for more than 40% of their total route." For this, integration with, for example, data warehousing technologies is needed, yielding the notion of mobility data warehouses. These are data warehouses that contain MO data that can be analyzed in conjunction with other kinds of data (e.g., spatial data), for instance, a road network, altitude data, and the like.

To represent MOs, the definition of appropriate data types is needed. The notion of temporal type refers to a collection of data types that capture the evolution over time of base types and spatial types. For instance, temporal integers may be used to represent the evolution in the number of employees in a department. Analogously, a temporal point may represent the evolution in time of the position of a vehicle, reported by a GPS device, which would yield a temporal geometry of type point. Over these kinds of data types, MO databases can be implemented. Thus, MobilityDB was developed (a demo of this database can be accessed at http://demo.mobilitydb.eu/, where the manual with a full description of the data types is also available). MobilityDB is a MO database (MOD) that builds on PostGIS (the spatial extension of PostgreSQL) and extends the type system of PostgreSQL and PostGIS with abstract data types (ADTs) to represent MO data. These ADTs are based on the notion of temporal types and their associated operations.

#### *1.1. Mobility Data Warehouses*

Conventional databases are used to support the day-to-day operations in an organization. The operations over these databases are usually denoted as online transactional processing (OLTP). To support data analysis, the notion of data warehousing was developed. A data warehouse (DW) collects large amounts of data from various data sources and reduces them to a form that can be used to analyze the behavior of an organization. DWs are based on the multidimensional model, which represents data as facts that can be analyzed along a collection of dimensions, composed of levels that form aggregation hierarchies. The multidimensional model builds on the data cube abstraction, where the axes of the cubes are the dimensions, and the cells contain measure values. Typically, DWs are exploited by means of the online analytical processing (OLAP) technique, which consists of a collection of operations that manipulate the data cube. The most popular OLAP operations are roll-up, which aggregates measure data along a dimension up to a certain aggregation level; drill-down, which is the inverse of the former; slice, which drops a dimension from the cube; and dice, which selects a sub-cube that satisfies a boolean condition.

Spatial data can represent geographic objects (e.g., mountains, cities), geographic phenomena (e.g., temperature, precipitation), etc. Similarly to conventional databases, spatial databases are typically used for operational applications in many different domains, rather than to support data analysis tasks. Spatial DWs (SDWs), on the other hand, combine spatial databases features and DW technologies, to provide more sophisticated data analysis, visualization, and manipulation capabilities. Thus, a SDW is a DW that is capable of representing, storing, and manipulating spatial data. Usually, to represent the extent of a spatial object, spatial data types are used. A SDW can thus take advantage of the operations associated with spatial data types, to allow queries like "give me the total number of theaters within one kilometer from my current position".

Most DWs (either spatial or not) assume that only facts evolve in time. However, dimension data, such as the category of a product, may also change across time. The most popular approach for tackling this issue is based on the notion of slowly changing dimensions [3]. An alternative approach for this issue is based on the concept of temporal databases. Built-in temporal semantics allow these kinds of databases to manage time-varying information. The combination of spatial and temporal databases and DWs leads to the notion of spatiotemporal DWs. Since MOs are in essence spatiotemporal, adding MO data to a DW leads to the notion of mobility DWs. This is, in a nutshell, the problem addressed by this paper: modeling and querying mobility DWs by means of adding temporal types to spatial DWs.

#### *1.2. Contributions and Motivation*

The authors of this paper had previously defined the notion of spatiotemporal queries as queries that can be expressed by Klug's relational algebra with aggregation, extended with spatial and moving types [4]. Based on this, the authors defined *spatiotemporal DWs* as DWs that support spatiotemporal queries. The classification in that work helps to characterize the different kinds of DW systems in terms of their expressiveness, and it is used in the discussion below.

Many efforts have been published under the concept of trajectory data warehouses [5–7]. These works basically propose storing, in the DW fact tables, aggregate measures (like the number of cars in a given latitude-longitude range), segments of trajectories, or spatiotemporal points, possibly together with semantic information. It is therefore clear that these approaches do not qualify as trajectory or, equivalently, spatiotemporal DWs (according to the classification in [4]), since they do not support spatiotemporal queries, given that they do not include moving data types. For example, storing aggregate measures [6] does not allow queries like "total distance covered by all cars in Brussels". Instead, they allow expressing queries like "average number of cars in downtown Brussels on Sunday mornings". These approaches conform, in some sense, to traditional spatial DWs. The other proposals (although implementations are not yet reported) aim at querying semantic trajectories, for example "total number of persons going from restaurants to theaters". Even in the case where spatiotemporal points are stored, these represent the raw trajectories rather than moving types, and thus everything must be solved using relational operations. On the other hand, there are systems that implement MO databases, notably SECONDO and Hermes. However, these systems are not aimed at building mobility DWs; therefore, implementing a mobility DW based on them is not a trivial task. Section 6 further elaborates on the discussion above.

The work presented in this paper extends and dramatically improves existing trajectory DW proposals. The implementation of MobilityDB gives the possibility of defining MOs as measures in a DW fact table. Integrating relational warehouse data with MO data allows realizing the notion of spatiotemporal queries defined in [4], as this paper will show through a collection of comprehensive examples. The resulting DW is called Mobility DW. In addition, the problem of conceptual and logical modeling of mobility DWs is also covered here. Throughout this paper, the well-known Northwind DW [8], extended with MO and spatial data, is used as a running example.

Note that although the work presented here is based on a centralized PostgreSQL database management system (DBMS), there are ongoing efforts to develop parallel versions of this DBMS, from which MobilityDB (and therefore, the mobility DW studied in this paper) can benefit. Further, in a Big Data scenario, horizontal scalability is crucial. This has encouraged the development of a horizontally-scalable version of Postgres, called Postgres-XL (https://www.postgres-xl.org), which can handle mixed workloads including OLTP transactions, OLAP queries, GIS queries, and so on. Last, but not least, the approach presented here can be extended to currently available Hadoop-based big data environments (http://spatialhadoop.cs.umn.edu, http://esri.github.io/gis-tools-for-hadoop).

#### *1.3. Paper Organization*

In Section 2, the notion of temporal types is defined and explained. Section 3 describes the implementation of temporal types in MobilityDB. Section 4 studies the modeling of mobility DWs, also providing a background on DW design, for the non-expert readers, while Section 5 presents a comprehensive set of queries to illustrate how a mobility DW can be exploited. Section 6 discusses related work and compares such work against the present proposal, and Section 7 concludes the paper.

#### **2. Background: Temporal Types**

To make this paper self-contained, this section presents some basic concepts on temporal types. Many of the concepts and operations can be found in [1,8]. However, to develop MobilityDB, new operations have been defined, and these are also explained here, together with a syntax for the temporal types.

Temporal types are functions that map time instants to values from a domain. They are implemented by means of applying a constructor temporal(·). For example, a value of type temporal(integer) is a continuous function *f* : instant → integer. Analogously, a temporal boolean is a function mapping instants to boolean values. In temporal database terminology, several notions of time can be defined. The most popular ones are valid time and transaction time [9]. The former represents the time period during which a database fact is valid in the modeled reality. The latter is the time period during which a fact stored in the database is considered to be true. These are orthogonal concepts. In the remainder, valid time is considered.

Temporal types may be undefined during certain time periods. For example, in Figure 1, Alice's salary is undefined between 1 July 2017 and 1 January 2018 (e.g., because she took a six-month leave), and this is denoted '⊥'. As a convention, closed-open intervals are used.

**Figure 1.** Representation of the evolution of salaries of two employees using the type temporal(integer).
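The semantics of such a partial, piecewise-constant temporal value can be sketched in ordinary code. The Python fragment below is an illustrative model only, not MobilityDB's implementation; the segment-list representation is this sketch's own. It stores Alice's salary from Figure 1 as closed-open segments, with gaps standing for '⊥':

```python
from datetime import date

# A temporal(integer) modeled as ordered, non-overlapping closed-open
# segments [start, end) -> value. Gaps between segments mean the value
# is undefined ('⊥'), e.g., Alice's six-month leave.
SalaryAlice = [
    (date(2017, 1, 1), date(2017, 7, 1), 25),
    (date(2018, 1, 1), date(2018, 4, 1), 30),
]

def value_at(temporal, t):
    """Return the value at instant t, or None where the function is undefined."""
    for start, end, value in temporal:
        if start <= t < end:          # closed-open convention
            return value
    return None

print(value_at(SalaryAlice, date(2017, 4, 15)))   # 25
print(value_at(SalaryAlice, date(2017, 7, 15)))   # None: salary undefined
```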

#### *2.1. Classes of Operations on Temporal Types*

Temporal types are associated with different kinds of operations, described next.

Projection over domain and range. Operations of this kind are, for example, GetTime and GetValues, which return, respectively, the domain and range of a temporal type.

Interaction with domain and range. Examples of operations of this kind are AtTimestamp, AtTimestampSet, AtPeriod, and AtPeriodSet, which restrict the function to a given timestamp (set) or period (set). Operations StartTimestamp and StartValue return, respectively, the first timestamp at which the function is defined and the corresponding value. The operations EndTimestamp and EndValue are analogous. The operations AtValue and AtRange restrict the temporal type to a value or to a range of values in the range of the function. The operations AtMin and AtMax restrict the function to the instants when its value is minimal or maximal, respectively.

**Example 1.** *The functions* GetTime(SalaryAlice) *and* GetValues(SalaryAlice) *return, respectively,* {[2017-01-01, 2017-07-01), [2018-01-01, 2018-04-01)} *and* {25, 30}*. Further,* AtTimestamp(SalaryAlice, 2017-07-15) *returns '*⊥*', since Alice's salary is undefined at that date. The operation* StartTimestamp(SalaryAlice) *returns 2017-01-01, while* AtValue(SalaryAlice, 25) *and* AtValue(SalaryAlice, 35) *return, respectively, a temporal integer with value* 25@[2017-01-01, 2017-07-01) *and '*⊥*', because no salary with value 35 exists.*
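A minimal sketch of these projection and interaction operations, assuming a closed-open segment-list model of a temporal integer (the representation and lowercase function names are this sketch's own, not MobilityDB's):

```python
from datetime import date

# Alice's salary from Figure 1 as closed-open (start, end, value) segments.
SalaryAlice = [
    (date(2017, 1, 1), date(2017, 7, 1), 25),
    (date(2018, 1, 1), date(2018, 4, 1), 30),
]

def get_time(temporal):
    """Domain projection: the periods over which the function is defined."""
    return [(s, e) for s, e, _ in temporal]

def get_values(temporal):
    """Range projection: the distinct values taken by the function."""
    return sorted({v for _, _, v in temporal})

def at_value(temporal, value):
    """Restrict the temporal value to the instants where it equals `value`."""
    return [(s, e, v) for s, e, v in temporal if v == value] or None

print(get_values(SalaryAlice))    # [25, 30]
print(at_value(SalaryAlice, 35))  # None: no salary with value 35 exists
```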

Temporal aggregation. These kinds of operations are crucial to mobility data analysis. Three basic operations take as argument a temporal integer or real and return a real value, namely: Integral, which returns the area under the curve defined by the function; Duration, which returns the duration of the temporal extent over which the function is defined; and Length, which returns the length of the curve defined by the function. From these operations, other derived operations can be defined, such as TWAvg, TWVariance, or TWStDev. The operation TWAvg computes the time-weighted average of a temporal value, taking into account the duration during which the function takes each value. TWVariance and TWStDev compute the variance and the standard deviation of a temporal type. Finally, MinValue and MaxValue return, respectively, the minimum and maximum value taken by the function. These can be obtained by Min(GetValues(·)) and Max(GetValues(·)), where Min and Max are the classic operations over numeric values.

**Example 2.** *In the example above,* TWAvg(SalaryAlice) *would yield a value of 26.66, given that Alice had a salary of 25 during 181 days and a salary of 30 during 90 days.*
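The time-weighted average of Example 2 can be reproduced with a short sketch, again assuming the closed-open segment-list model of a temporal integer (not MobilityDB code):

```python
from datetime import date

# Alice's salary from Figure 1: 181 days at 25, then 90 days at 30.
SalaryAlice = [
    (date(2017, 1, 1), date(2017, 7, 1), 25),
    (date(2018, 1, 1), date(2018, 4, 1), 30),
]

def tw_avg(temporal):
    """Integral of the step function divided by the total defined duration."""
    integral = sum((e - s).days * v for s, e, v in temporal)
    duration = sum((e - s).days for s, e, _ in temporal)
    return integral / duration

print(round(tw_avg(SalaryAlice), 2))  # 26.66
```

The result (25 × 181 + 30 × 90) / 271 ≈ 26.66 matches Example 2.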

Lifting. This class represents the generalization of the operations on nontemporal types to temporal types [10]. An operation for nontemporal types is lifted to allow any of the arguments to be replaced by a temporal type, and returns a temporal type. For example, the "less than" (<) operation has lifted versions (denoted by #<) where one or both of its arguments can be temporal types, yielding a temporal boolean. Intuitively, the result is computed at each instant using the nonlifted operation. When two temporal values are defined on different temporal extents, the result of a lifted operation can be defined either over the intersection of both extents or the union of them. In the remainder, for the lifted operations, the first option is assumed.

**Example 3.** *In Figure 1, the comparison* SalaryAlice #< SalaryBob *results in a temporal boolean with value* true@{[2017-04-01, 2017-07-01), [2018-01-01, 2018-04-01)}*.*
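A sketch of how such a lifted comparison can be evaluated over the intersection of the two extents. Bob's segment below is an assumption read off Figure 1 for illustration, not exact data from the paper:

```python
from datetime import date

# Two piecewise-constant temporal integers as closed-open segments.
SalaryAlice = [(date(2017, 1, 1), date(2017, 7, 1), 25),
               (date(2018, 1, 1), date(2018, 4, 1), 30)]
SalaryBob = [(date(2017, 4, 1), date(2018, 4, 1), 45)]  # assumed from Figure 1

def lifted_lt(a, b):
    """Evaluate a #< b on every non-empty overlap of defined segments;
    the result is only defined over the intersection of both extents."""
    out = []
    for s1, e1, v1 in a:
        for s2, e2, v2 in b:
            lo, hi = max(s1, s2), min(e1, e2)
            if lo < hi:                       # non-empty intersection
                out.append((lo, hi, v1 < v2))
    return sorted(out)

for start, end, flag in lifted_lt(SalaryAlice, SalaryBob):
    print(start, end, flag)
# true over [2017-04-01, 2017-07-01) and [2018-01-01, 2018-04-01)
```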

Lifted aggregation operations are such that the aggregate function is computed at each time instant, and the result is defined over the union of all the extents. Figure 2 shows an example of a lifted average for the two salaries in Figure 1.

**Figure 2.** Example of the temporal average operation.

#### *2.2. Temporal Types over Spatial Data*

Temporal types can be also defined over spatial data types, not only over basic types like integers or reals. For example, the trajectory of a truck can be represented using a temporal(point) type. Analogously to what was explained above, this type would be a continuous function *f* : instant → point. Some of the operations described in Section 2.1 are explained next for the spatial case. The example in Figure 3 shows two temporal points RouteT1 and RouteT2 that represent the delivery routes of two trucks T1 and T2 on a particular day. It can be seen that, for instance, truck T1 took 15 min to go from point (3, 3) to point (0, 0), while truck T2 took 15 min to go from point (4, 0) to point (1, 3). A constant speed between consecutive pairs of points is assumed.

**Figure 3.** Trajectories of two trucks as a projection of a temporal point into the plane.

An operation called Trajectory projects a temporal geometry into the plane. In the example, Trajectory(RouteT1) results in the leftmost line in Figure 3, without any temporal information. Note that the projection of a temporal point into the plane may consist of points and lines.

As an example of an operation of the class addressing the interaction with domain and range (as in Section 2.1) for the spatial case, the AtGeometry operation restricts the temporal point to a given geometry. For example, if Polygon denotes a polygon defined as Polygon((0 0, 0 2, 2 2, 2 0, 0 0)), then AtGeometry(RouteT1, Polygon) will return the value RouteT1 restricted to the period [8:05, 8:15].

Analogously to the non-spatial case, all operations over nontemporal spatial types are lifted. For example, the Distance function has lifted versions where one or both of its arguments can be temporal points and the result is a temporal real. In the example, Distance(RouteT1, RouteT2) returns the temporal real shown in Figure 4, where, for instance, the value evolves from √8@8:05 to 2@8:10 to √8@8:15. Note that in this case, the distance function has been approximated by a linear function.

**Figure 4.** Distance between the trajectories of the two trucks in Figure 3 represented as a temporal(real).
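The effect of lifting Distance can be sketched as follows. The endpoint coordinates are this sketch's own toy values (the exact routes of Figure 3 are not reproduced), and both points are interpolated linearly over a common normalized time:

```python
import math

# Two moving points, each given by a start and an end position and
# interpolated linearly over the same (normalized) time interval [0, 1].
T1 = ((3.0, 3.0), (0.0, 0.0))   # truck T1: (3,3) -> (0,0)
T2 = ((4.0, 0.0), (1.0, 3.0))   # truck T2: (4,0) -> (1,3)

def position(track, s):
    """Linear interpolation at normalized time s in [0, 1]."""
    (x0, y0), (x1, y1) = track
    return (x0 + s * (x1 - x0), y0 + s * (y1 - y0))

def distance_at(s):
    """Lifted Distance: the nonlifted distance evaluated at each instant."""
    (x1, y1), (x2, y2) = position(T1, s), position(T2, s)
    return math.hypot(x2 - x1, y2 - y1)

# The distance is NOT linear in s even though both tracks are.
print(distance_at(0.0))   # sqrt(10) ~ 3.162
print(distance_at(0.5))   # 1.0: closest approach at the midpoint
print(distance_at(1.0))   # sqrt(10) ~ 3.162
```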

Lifted topological operations return a temporal boolean. Examples of such operators are TIntersects, TDWithin, and TContains. For example, TIntersects(RouteT1, RouteT2) returns a temporal boolean with value false@[8:05, 8:15] since the two trucks were never at the same point at any instant of their route. Similarly, TDWithin(RouteT1, RouteT2, 2) returns a temporal boolean with value {false@[8:05, 8:10), true@[8:10, 8:10], false@(8:10, 8:15]} since the two trucks were only at a distance of two at 8:10. Finally, if Polygon is defined as above, then TContains(Polygon, RouteT1) will return a temporal boolean with value {false@[8:00, 8:05), true@[8:05, 8:15]}.

To conclude, several operations compute the rate of change for points. Operation Speed yields the usual concept of speed of a temporal point at any instant as a temporal real. Operation Direction returns the direction of the movement, that is, the angle between the *x*-axis and the tangent to the trajectory of the moving point. Finally, the Turn operation yields the change of direction at any instant.

#### **3. Temporal Types in MobilityDB**

Temporal support has been introduced in the SQL standard and implemented (in a limited way) in commercial database systems, by means of adding temporality to tables and associating a period with each row. Nevertheless, for mobility applications this approach does not suffice: such applications need the temporal evolution of the attribute values, along the lines of the data model proposed by Gadia and Nair [11]. This section describes the implementation in MobilityDB, a mobility database based on PostgreSQL and PostGIS as mentioned, of the temporal types presented in Section 2. For full details, the reader is referred to the manual (http://demo.mobilitydb.eu/mobilitydb-manual.pdf).

#### *3.1. Types in MobilityDB*

In order to manipulate temporal types, MobilityDB uses the timestamptz (a shorthand for timestamp with time zone) type provided by PostgreSQL, and three new types: period, timestampset, and periodset. The period type is a specialized version of the tstzrange (short for timestamp with time zone range) type provided by PostgreSQL. Its functionality is similar to the one of type tstzrange, but with a more efficient implementation. A value of the period type has two bounds, the lower bound and the upper bound, which are timestamptz values. The bounds can be inclusive (represented by "[" and "]"), or exclusive (represented by "(" and ")"). A period value with equal and inclusive bounds corresponds to a timestamptz value. An example of a period value is as follows

SELECT period '[2017-01-01 08:00:00, 2017-01-03 09:30:00)';
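The bound semantics of the period type can be sketched as follows; this is an illustrative Python model of the behavior described above, not PostgreSQL's tstzrange implementation:

```python
from datetime import datetime

# A period: two timestamptz bounds, each either inclusive or exclusive.
class Period:
    def __init__(self, lower, upper, lower_inc=True, upper_inc=False):
        # Equal bounds are only allowed if both are inclusive
        # (such a period corresponds to a single timestamptz value).
        assert lower < upper or (lower == upper and lower_inc and upper_inc)
        self.lower, self.upper = lower, upper
        self.lower_inc, self.upper_inc = lower_inc, upper_inc

    def contains(self, t):
        after_lower = t > self.lower or (self.lower_inc and t == self.lower)
        before_upper = t < self.upper or (self.upper_inc and t == self.upper)
        return after_lower and before_upper

# The period '[2017-01-01 08:00:00, 2017-01-03 09:30:00)' from the example.
p = Period(datetime(2017, 1, 1, 8), datetime(2017, 1, 3, 9, 30))
print(p.contains(datetime(2017, 1, 1, 8)))      # True: lower bound inclusive
print(p.contains(datetime(2017, 1, 3, 9, 30)))  # False: upper bound exclusive
```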

The timestampset type represents a set of distinct timestamptz values. A timestampset value must contain at least one element; a timestampset with a single element corresponds to a timestamptz value. The elements composing a timestampset value must be ordered. An example of a timestampset value is as follows

```
SELECT timestampset '{2017-01-01 08:00:00, 2017-01-03 09:30:00}';
```
Finally, the periodset type represents a set of disjoint period values. A periodset value must contain at least one element; a periodset with a single element corresponds to a period value. The elements composing a periodset value must be ordered. An example of a periodset value is as follows

SELECT periodset '{[2017-01-01 08:00:00, 2017-01-01 08:10:00], [2017-01-01 08:20:00, 2017-01-01 08:40:00]}';

Currently, MobilityDB provides six built-in temporal types, tbool, tint, tfloat, ttext, tgeompoint, and tgeogpoint, which are, respectively, based on the bool, int, float, and text types provided by PostgreSQL, as well as the geometry and geography types provided by PostGIS (the last two types restricted to 2D and 3D points). Temporal types may be discrete or continuous. Discrete temporal types (which are based on the bool, int, or text types) evolve in a stepwise manner, while continuous temporal types (which are based on the float, geometry, or geography types) evolve in a continuous manner. The duration of a temporal value indicates the temporal extent over which the evolution of values is recorded.

Temporal values come in four durations, namely, instant, instant set, sequence, and sequence set. A temporal instant value represents the value at a time instant, such as

SELECT tfloat '17.1@2018-01-01 08:00:00';

A temporal instant set value represents the evolution of the value at a set of time instants, where the values between these instants are unknown. For example:

SELECT tfloat '{17.1@2018-01-01 08:00:00, 17.5@2018-01-01 08:05:00, 18.1@2018-01-01 08:10:00}';

A temporal sequence value represents the evolution of the value during a sequence of time instants, where the values between these instants are interpolated using either a stepwise or a linear function. An example is as follows:

SELECT tint '(10@2018-01-01 08:00:00, 20@2018-01-01 08:05:00, 15@2018-01-01 08:10:00]';

As can be seen, a value of a sequence type has a lower and an upper bound that can be inclusive (represented by '[' and ']') or exclusive (represented by '(' and ')'). The value of a temporal sequence is interpreted by assuming that the period of time defined by every pair of consecutive values v1@t1 and v2@t2 is lower inclusive and upper exclusive, unless they are the first or the last instants of the sequence and in that case the bounds of the whole sequence apply. Furthermore, the value taken by the temporal sequence between two consecutive instants depends on whether the subtype is discrete or continuous. For example, the temporal sequence above represents that the value is 10 during (2018-01-01 08:00:00, 2018-01-01 08:05:00), 20 during [2018-01-01 08:05:00, 2018-01-01 08:10:00), and 15 at the end instant 2018-01-01 08:10:00. On the other hand, the following temporal sequence

SELECT tfloat '(10.1@2018-01-01 08:00:00, 20.2@2018-01-01 08:05:00, 15.2@2018-01-01 08:10:00]';

represents that the value evolves linearly from 10.1 to 20.2 during (2018-01-01 08:00:00, 2018-01-01 08:05:00) and from 20.2 to 15.2 during [2018-01-01 08:05:00, 2018-01-01 08:10:00].
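These interpretation rules can be sketched as follows. The Python model below is illustrative only and simplifies matters: the whole-sequence bound flags are ignored, so the first instant is treated as inclusive:

```python
from datetime import datetime

# The instants of the sequence '(10@08:00, 20@08:05, 15@08:10]' from the text.
instants = [(datetime(2018, 1, 1, 8, 0), 10),
            (datetime(2018, 1, 1, 8, 5), 20),
            (datetime(2018, 1, 1, 8, 10), 15)]

def value_at(seq, t, linear=False):
    """Each pair of consecutive instants is lower-inclusive/upper-exclusive;
    the last instant takes the last value. Stepwise for discrete base types
    (e.g., tint), linearly interpolated for continuous ones (e.g., tfloat)."""
    if t == seq[-1][0]:
        return seq[-1][1]
    for (t1, v1), (t2, v2) in zip(seq, seq[1:]):
        if t1 <= t < t2:
            if not linear:
                return v1                      # stepwise: keep v1 until t2
            frac = (t - t1) / (t2 - t1)
            return v1 + frac * (v2 - v1)       # linear interpolation
    return None                                # outside the sequence

print(value_at(instants, datetime(2018, 1, 1, 8, 3)))                   # 10
print(value_at(instants, datetime(2018, 1, 1, 8, 10)))                  # 15
print(value_at(instants, datetime(2018, 1, 1, 8, 2, 30), linear=True))  # 15.0
```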

Finally, a temporal sequence set value represents the evolution of the value at a set of sequences, where the values between these sequences are unknown, for example:

SELECT tfloat '{[17.2@2018-01-01 08:00:00, 17.5@2018-01-01 08:05:00], [18.2@2018-01-01 08:10:00, 18.5@2018-01-01 08:15:00]}';

#### *3.2. Using Temporal Types*

The operations for temporal types defined in Section 2 can be expressed in MobilityDB as explained next. For this, the following table definition is used:

```
CREATE TABLE Employee (
  SSN CHAR(9) PRIMARY KEY,
  FirstName VARCHAR(30),
  LastName VARCHAR(30),
  BirthDate DATE,
  SalaryHist TINT );
```
Tuples can be inserted in this table as follows:

```
INSERT INTO Employee VALUES
( '123456789', 'Alice', 'Cooper', '1980-01-01',
        TINT '{[25@2017-01-01, 25@2017-07-01), [30@2018-01-01, 30@2018-04-01)}'),
( '345345345', 'Bob', 'Brown', '1985-07-25',
        TINT '[45@2017-04-01, 60@2018-01-01, 60@2018-04-01)');
```
The values for the SalaryHist attribute above correspond to those in Figure 1. Given the above table with the two tuples inserted, the query

```
SELECT GetTime(E.SalaryHist), GetValues(E.SalaryHist)
FROM Employee E
```
returns the following values

```
{[2017-01-01, 2017-07-01), [2018-01-01, 2018-04-01)} {25,30}
{[2017-04-01, 2018-07-01)} {45,60}
```
The first column of the result above is of type periodset, while the second column is of type integer[] (array of integers) provided by PostgreSQL. Similarly, the query

```
SELECT ValueAtTimestamp(E.SalaryHist, '2017-04-15'),
        ValueAtTimestamp(E.SalaryHist, '2017-07-15')
FROM Employee E
```
*ISPRS Int. J. Geo-Inf.* **2019**, *8*, 170

returns the following values

```
25    NULL
45    45
```

where the NULL value above represents the fact that Alice's salary is undefined on 2017-07-15. The following query

```
SELECT AtPeriod(E.SalaryHist, '[2017-04-01, 2017-11-01)')
FROM Employee E
```

returns

```
{[25@2017-04-01, 25@2017-07-01)}
{[45@2017-04-01, 45@2017-11-01)}
```
Note that here, the temporal attributes have been restricted to the period given in the query. The next query is an example of aggregation, and asks for the minimum and maximum values, together with the instants or periods when they occurred:

```
SELECT AtMin(E.SalaryHist), AtMax(E.SalaryHist)
FROM Employee E
```
The result is:

```
{[25@2017-01-01, 25@2017-07-01)} {[30@2018-01-01, 30@2018-04-01)}
{[45@2017-04-01, 45@2018-01-01)} {[60@2018-01-01, 60@2018-07-01)}
```
The use of lifted operations in MobilityDB is illustrated next. The following query asks for the time periods during which the salary of Alice was lower than that of Bob.

```
SELECT E1.SalaryHist #< E2.SalaryHist
FROM Employee E1, Employee E2
WHERE E1.FirstName = 'Alice' and E2.FirstName = 'Bob'
```
The query returns the temporal boolean value

```
{[t@2017-04-01, t@2017-07-01), [t@2018-01-01, t@2018-04-01)}
```
Note that when Alice's salary is undefined, no comparison is performed. As an example of lifted aggregation, the next query asks for the average salary across time.

```
SELECT AVG(E.SalaryHist)
FROM Employee E
```

returns

```
{[25@2017-01-01, 25@2017-04-01), [35@2017-04-01, 35@2017-07-01),
      [45@2017-07-01, 45@2018-04-01), [60@2018-04-01, 60@2018-07-01)}
```
Figure 2 shows this result graphically.

To conclude, the next example illustrates how temporal point types can be used. Consider the following table, which stores the routes followed by trucks to deliver products.

```
CREATE TABLE Delivery (
  TruckId CHAR(6) PRIMARY KEY,
  DeliveryDate DATE,
  Route TGEOMPOINT );
```

Next, two tuples are inserted in this table, containing information about two deliveries performed by two trucks T1 and T2 during the same day (as depicted in Figure 3):

```
INSERT INTO Delivery VALUES
( 'T1', '2017-01-10',
        TGEOMPOINT '[Point(3 3)@2017-01-10 08:00, Point(0 0)@2017-01-10 08:15]' ),
( 'T2', '2017-01-10',
        TGEOMPOINT '[Point(4 0)@2017-01-10 08:05, Point(1 3)@2017-01-10 08:20]' );
```
The routes represent continuous trajectories. Therefore, a constant speed is assumed between any two consecutive points, and linear interpolation is used to determine the position of the trucks at any instant. Examples of lifted spatial operations are given next.

The following query computes the distance between the two trucks at any time instant.

```
SELECT Distance(D1.Route, D2.Route)
FROM Delivery D1, Delivery D2
WHERE D1.TruckId = 'T1' AND D2.TruckId = 'T2'
```
This query returns the temporal float depicted in Figure 4 as follows:

```
[2.82842712474619@2017-01-10 08:05:00, 2@2017-01-10 08:10:00,
      2.82842712474619@2017-01-10 08:15:00]
```
Finally, the following query uses the lifted tintersects topological operation to test whether or not the two trucks intersect:

```
SELECT tintersects(D1.Route, D2.Route)
FROM Delivery D1, Delivery D2
WHERE D1.TruckId = 'T1' AND D2.TruckId = 'T2'
```
and the result will be

```
[false@2017-01-10 08:05:00, false@2017-01-10 08:15:00]
```

#### **4. Modeling Mobility Data Warehouses**

This section studies how DWs can be extended with temporal types in order to support the analysis of mobility data. The well-known Northwind case study is used in order to introduce the main concepts. First, basic concepts about DW modeling are introduced to make this paper self-contained.

#### *4.1. A Short Introduction to Conceptual Modeling of Data Warehouses*

The conventional database design process includes the creation of database schemas at the conceptual, logical, and physical levels. Typically, databases are designed at the conceptual level using some variation of the well-known entity-relationship (ER) model, since it has been acknowledged that conceptual models allow better communication between designers and users for the purpose of understanding application requirements.

Unlike the case of conventional databases, there is no widely accepted conceptual model for multidimensional data. As a consequence, DW design is usually performed at the logical level, based on star and/or snowflake schemas [3], which are less intuitive for the final user. The present paper adopts the MultiDim model [8] to represent, at the conceptual level, all elements required in DW and OLAP applications, that is, dimensions, hierarchies, and facts with their associated measures. The main reason to adopt this model is that it has also been extended to support spatial data. A streamlined description of the main components of the model follows, based on Figure 5, which represents a DW for the well-known Northwind database.

**Figure 5.** Conceptual schema of the Northwind data warehouse.

A schema is composed of a set of dimensions and a set of facts. A dimension consists of either one level, or one or more hierarchies, and a hierarchy is in turn composed of a set of levels. Instances of a level are called members. For example, Product and Category are levels in Figure 5. As shown in the figure, a level has a set of associated attributes that describe the characteristics of its members, and one or more identifiers that uniquely identify its members. For example, in Figure 5, CategoryID is an identifier of the Category level. Each attribute of a level has a type, that is, a domain for its values (typically integer, real, or string). For each pair of levels in a hierarchy, the lower level is called the child and the higher level is called the parent. The relationships composing hierarchies are called parent-child relationships. The cardinalities of parent-child relationships indicate the minimum and maximum number of members in one level that can be related to a member in another level. For example, in Figure 5 there is a one-to-many relationship between the child level Product and the parent level Category. Finally, it is sometimes the case that two or more parent-child relationships are exclusive. This is represented using the symbol '⊗'. This is shown in Figure 5, where states can be aggregated either into regions or into countries. Thus, according to their type, states participate in only one of the relationships departing from the State level.

A fact relates several levels. For example, the Sales fact in Figure 5 relates the Employee, Customer, Supplier, Shipper, Order, Product, and Date levels. The same level can participate several times in a fact, playing different roles, such that each role is represented by a separate link between the level and the fact. For example, in Figure 5 the Date level participates in the Sales fact with the roles OrderDate, DueDate, and ShippedDate. Instances of a fact are called fact members. The cardinality of the relationship between facts and levels indicates the minimum and maximum number of fact members that can be related to level members. For example, in Figure 5 there is a one-to-many relationship between Sales and Product. A fact may contain (usually numeric) attributes commonly called measures, which are analyzed using the context provided by the dimensions. For example, the Sales fact in Figure 5 includes the measures Quantity, UnitPrice, Discount, SalesAmount, Freight, and NetAmount. Levels in a hierarchy are used to analyze factual data at various *granularities*, or levels of detail. In OLAP, this operation is called Roll-up; it aggregates measures along dimension hierarchies, up to a given dimension level, using an aggregate function. The aggregation function associated with a measure can be specified next to the measure name, and the SUM aggregation function is assumed by default.

#### *4.2. Conceptual Modeling of Mobility Data Warehouses*

The extension of the Northwind DW to support mobility analysis is depicted in Figure 6 and explained below. The aim is to build a mobility DW that keeps track of the deliveries of goods to customers. The company owns a fleet of trucks that load the goods in one of several warehouses, perform a delivery visiting the customers according to a delivery plan, and then return to the warehouse. There is nonspatial data about the customers, the trucks, and the warehouses that store the goods to be delivered to customers. There is also spatial data about the road network (more on this below). Further, the geographic hierarchies are also represented as spatial data. In addition, there are the trajectories followed by the trucks (derived as a projection of the deliveries). Figure 6 shows the conceptual schema depicting this scenario using the MultiDim model extended to support spatial data and temporal types. This is explained in more detail next.

**Figure 6.** Conceptual schema of the Northwind mobility data warehouse.

As shown in the figure, the fact Delivery is related to four dimensions: Truck, Date, Warehouse, and Customer, where the latter is related to the fact through a many-to-many relationship. The Customer dimension is composed of four levels, with a one-to-many parent-child relationship defined between each pair of levels. The level Truck has attributes TruckId, Licence, and Year. Spatial levels or attributes have an associated geometry (e.g., point, line, or region), which is indicated by a pictogram. In the example, the dimensions Customer and Warehouse are spatial and share a Geography hierarchy where a geometry is associated with each level in both dimensions. Also, the State level includes two continuous field (raster) attributes, namely LandUse and Elevation, which indicate, at each point, its land use (e.g., residential, industrial) and altitude, respectively. Finally, topological constraints are represented using pictograms in parent-child relationships. For example, the topological constraints in the dimension Geography indicate that a city is contained in its parent state, and similarly for the other parent-child relationships in the hierarchy.

The fact Delivery has an identifier DeliveryID, and eight measures. The first one, Route, keeps the position of the truck at any point in time. It is a *spatiotemporal measure* of type temporal point, as indicated by the symbol 't(•)'. The other measures are derived from Route. Measure Trajectory stores the geometry of the route traversed by the truck, which is a line, without any temporal information. Measures AvgSpeed, Duration, and NbCustomers are numeric, while StartTime and EndTime are timestamps.

In the example of Figure 6, the movement track of the deliveries is represented as a temporal point. In other words, the trajectory is a measure of this fact, allowing deliveries to be aggregated along the different dimensions. Another use of trajectory aggregation identifies "similar" trajectories and merges them in a single class. Thus, queries like "give me the total number of trajectories by class," or "List all the trajectories similar to the one followed by truck T1 on 25 November 2018" are possible.

Finally, note that the road network is not linked to any fact or dimension in the schema. This is a major difference with respect to other approaches that segment trajectories, and link trajectory segments to road segments.

The translation of the conceptual schema in Figure 6 into a snowflake schema is shown in Figure 7. The many-to-many relationship between the Delivery fact and the Customer dimension is represented in the table DeliveryCustomer, which contains the keys of the tables Delivery and Customer and the attribute Sequence, which states the order in which the customers were visited. The rest of the translation is straightforward, and it is omitted for the sake of brevity.
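To make the logical schema more concrete, the following DDL sketch shows how two of the tables of Figure 7 could be declared; the surrogate-key names and types are assumptions for illustration, not taken from the paper:

```sql
-- Hypothetical DDL for part of the snowflake schema of Figure 7.
-- Surrogate-key names and types are assumed for illustration.
CREATE TABLE Delivery (
  DeliveryKey INTEGER PRIMARY KEY,
  TruckKey    INTEGER REFERENCES Truck (TruckKey),
  DateKey     INTEGER REFERENCES Date (DateKey),
  Route       TGEOMPOINT,               -- spatiotemporal measure
  Trajectory  GEOMETRY(LineString),     -- derived: spatial projection of Route
  NbCustomers INTEGER );

-- Bridge table for the many-to-many relationship with Customer
CREATE TABLE DeliveryCustomer (
  DeliveryKey INTEGER REFERENCES Delivery (DeliveryKey),
  CustomerKey INTEGER REFERENCES Customer (CustomerKey),
  Sequence    INTEGER,                  -- order in which customers are visited
  PRIMARY KEY (DeliveryKey, CustomerKey) );
```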

**Figure 7.** Logical schema of the Northwind mobility data warehouse.

#### **5. Querying the Mobility Data Warehouse in SQL and MobilityDB**

In order to express queries over the Northwind mobility DW, the temporal types and their associated operations as defined in Section 3 are used. The mobility DW depicted in Figure 7 is used as an example. The queries target Belgium, in which the administrative divisions corresponding to states are called provinces. In what follows, the functions prefixed by ST\_ are PostGIS functions, while the functions without a prefix (e.g., length) are the temporal extensions from MobilityDB. The queries below are aimed at highlighting the advantages of defining MO data as a measure, rather than the typical solution of defining static segments of trajectories as measures [5–7]. Therefore, the queries combine typical OLAP operations, like Roll-up, Slice, and Dice, with operations on MOs (in this case, the trucks that perform the deliveries). For each query, the operations involved are explained, to highlight the wide variety of queries that the model and implementation allow.

**Query 1** (Roll-up + slice + distance)**.** *List the total distance traveled by trucks, per make and month.*

*This query involves aggregation along the* Date *and* Truck *dimensions, slicing out all other dimensions, and the computation of the distance traveled by the MOs.*
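The SQL for this query is omitted above. A plausible formulation over the schema of Figure 7 is the following sketch; the attribute Make of the Truck level and the key TruckKey are assumed here for illustration, as they are not spelled out in the text:

```sql
-- Sketch of Query 1; Make and TruckKey are assumed names
SELECT T.Make, DT.Year, DT.MonthNumber,
       SUM(length(D.Route)) / 1e3 AS TotalKm   -- length is in meters
FROM Delivery D, Truck T, Date DT
WHERE D.TruckKey = T.TruckKey AND D.DateKey = DT.DateKey
GROUP BY T.Make, DT.Year, DT.MonthNumber
ORDER BY T.Make, DT.Year, DT.MonthNumber
```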


The query uses the function length to compute the distance traveled by each delivery, and then aggregates the distances per truck make and month in the usual way.

An analysis of this query makes clear the advantage of the approach presented here with respect to approaches that segment the trajectories in order to provide a relational representation of a trajectory. The latter would require, for example, computing the length of each segment individually (see [8], Chapter 12, for a detailed explanation), and also joining each segment with other dimension tables. Further, proposals that partition the space into a grid, precomputing aggregated measures for each cell (see Section 6), cannot express this query, nor any of the queries in the next examples. Also, even though MO databases like SECONDO and Hermes (also see Section 6) could address the MO part of the queries, integrating them into a system that performs OLAP operations is far from trivial. Therefore, the mobility DW approach based on MobilityDB takes the best of both worlds, bridging the gap between them. The explanation above applies to all queries discussed below in this section and is thus not repeated, to avoid redundancy.

**Query 2** (Roll-up + slice + dice + duration)**.** *List the average duration per day of deliveries starting with a customer located in the province of Namur.*

*This query requires aggregation along the* Date *dimension, a* Roll-up *operation over the* Customer *dimension, a* Dice *operation to select the province of Namur at the* State *level, and the computation of the duration of the time spanned by the deliveries. Finally, all dimensions but* Date *are sliced out.*

```
SELECT DT.Date, AVG(duration(D.Route))
FROM Delivery D, DeliveryCustomer DC, Customer C, State S, Date DT
WHERE D.DeliveryKey = DC.DeliveryKey AND DC.Sequence = 1 AND
      DC.CustomerKey = C.CustomerKey AND C.StateKey = S.StateKey AND
      S.Name = 'Namur' AND D.DateKey = DT.DateKey
GROUP BY DT.Date
ORDER BY DT.Date
```

The query selects the first customer visited by a delivery using the attribute Sequence, and verifies that this customer is located in the province of Namur. Then, it uses the function duration to compute the duration of the delivery. Finally, the query computes the average of the durations per day.

**Query 3** (Roll-up + slice + dice + spatial + duration)**.** *For each month, compute the number of deliveries such that their route intersects the province of Namur for more than 20 min.*

*The query performs a* roll-up *operation along the* Date *and* Customer *dimensions, a* Dice *operation to select the province of Namur at the* State *level, a spatial operation to compute the intersection, and the computation of the duration of the time spanned by the deliveries within the province. Finally, all dimensions but* Date *are sliced out.*
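The SQL for this query is omitted above. Mirroring the Borough-based variant reported in the performance discussion at the end of this section (with Borough replaced back by State, as the text indicates), a plausible formulation is:

```sql
-- Sketch of Query 3, reconstructed from the Borough-based variant
SELECT T.Year, T.MonthNumber, COUNT(*)
FROM Delivery D, Date T, State S
WHERE D.DateKey = T.DateKey AND S.Name = 'Namur' AND D.Route && S.Geom AND
      duration(atGeometry(D.Route, S.Geom)) >= interval '20 min'
GROUP BY T.Year, T.MonthNumber
ORDER BY T.Year, T.MonthNumber
```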


This query uses the function atGeometry to restrict the route of the delivery to the geometry of the province of Namur. Then, the function duration computes the time spent within the province, and the query verifies that this duration is at least 20 min. Note that these 20 min need not be continuous. Finally, the count of the selected deliveries is performed as usual.

The term D.Route && S.Geom is optional and is included to enhance query performance. It verifies, using an index, that the spatiotemporal bounding box of the delivery, projected over the spatial dimension, intersects the bounding box of the geometry. In this way, the atGeometry function is only applied to the deliveries satisfying the bounding box condition.
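Such an index is not shown in the text; assuming the PostgreSQL GiST access method over the bounding boxes of the Route values, it could be created along the following lines:

```sql
-- Hypothetical: an index on the (spatiotemporal) bounding boxes of the routes,
-- supporting the && filter above
CREATE INDEX Delivery_Route_Idx ON Delivery USING GIST (Route);
```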

**Query 4** (Roll-up + slice + dice + spatial + duration)**.** *Same as the previous query but with the condition that the route intersects the province of Namur for more than 20 consecutive minutes.*

*The operations involved here are similar to the ones in the previous query. The queries differ in the way in which they compute the duration. Again, this is only possible when measures are actually represented as MOs.*


*ORDER BY DT.Year, DT.MonthNumber*
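Only the final ORDER BY clause of this query appears above. Based on the Borough-based variant reported in the performance discussion at the end of this section, the query can plausibly be reconstructed as:

```sql
-- Sketch of Query 4, reconstructed from the Borough-based variant
SELECT DT.Year, DT.MonthNumber, COUNT(*)
FROM Delivery D, Date DT, State S,
     unnest(periods(getTime(atGeometry(D.Route, S.Geom)))) P
WHERE D.DateKey = DT.DateKey AND S.Name = 'Namur' AND D.Route && S.Geom AND
      duration(P) >= interval '20 min'
GROUP BY DT.Year, DT.MonthNumber
ORDER BY DT.Year, DT.MonthNumber
```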

As in the previous query, the function atGeometry restricts the route of the delivery to the province of Namur. This results in a route that may be discontinuous in time, because the truck may enter and leave the province. The query uses the function getTime to obtain the set of periods of the restricted delivery, represented as a periodSet value. The function periods converts the latter value into an array of periods, and the PostgreSQL function unnest expands this array into a set of rows, each row containing a period P. Then, it is possible to verify that at least one of the periods P has a duration of at least 20 min.

**Query 5** (Roll-up + slice + distance + speed)**.** *For each month, compute the total number of deliveries that traveled at least 50 km at a speed higher than 100 km/h.*

*The operations involved in this query are: a* Roll-up *along the* Date *dimension, and the computation of the distance and speed functions on the MO data. Finally, slicing is performed to drop all dimensions but* Date*.*
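The SQL for this query is omitted above. Following the chain of functions described in the explanation (speed, atRange, getTime, atPeriodSet, length), a plausible sketch is the following; the range type name floatrange and the key names are assumptions:

```sql
-- Sketch of Query 5; floatrange '[100,)' denotes [100, infinity)
SELECT DT.Year, DT.MonthNumber, COUNT(*)
FROM Delivery D, Date DT
WHERE D.DateKey = DT.DateKey AND
      length(atPeriodSet(D.Route, getTime(
        atRange(speed(D.Route) * 3.6, floatrange '[100,)')))) / 1e3 >= 50
GROUP BY DT.Year, DT.MonthNumber
ORDER BY DT.Year, DT.MonthNumber
```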


The query uses the function speed to obtain a temporal float representing the speed of the route at each instant, expressed in units per second, depending on the spatial reference system, which in our case is meters per second. This temporal value is then transformed into kilometers per hour by multiplying it by 3.6. The function atRange restricts the speed to the float range [100, infinity]. The getTime function computes the set of periods during which the delivery travels at more than 100 km/h. The function atPeriodSet restricts the delivery to these periods and the length function computes the distance traveled by the restricted deliveries, expressed in meters, which is then converted to kilometers and compared against the value 50.

**Query 6** (Temporal aggregation)**.** *For each speed range of 20 km/h, give the total distance traveled by all trucks within that range.*

*This is a typical temporal aggregation query, as explained below. That is, operations over MOs are performed over the measure, without involving the dimensions. Only the fact table is used here.*
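The SQL for this query is omitted above. A plausible sketch, building the temporary table Ranges with a common table expression (an assumption; the paper only says that Ranges stores the ranges [0, 20), [20, 40), ..., [180, 200)), is:

```sql
-- Sketch of Query 6; the construction of Ranges is an assumption
WITH Ranges(RangeId, SpeedRange) AS (
  SELECT i, floatrange(20.0 * (i - 1), 20.0 * i, '[)')
  FROM generate_series(1, 10) AS i )
SELECT R.SpeedRange,
       SUM(length(atPeriodSet(D.Route, getTime(
         atRange(speed(D.Route) * 3.6, R.SpeedRange))))) / 1e3 AS TotalKm
FROM Ranges R, Delivery D
GROUP BY R.RangeId, R.SpeedRange
ORDER BY R.RangeId
```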


The idea of this query is to give an overall view of the speed behavior of the entire fleet of delivery trucks. The temporary table Ranges stores the ranges [0, 20), [20, 40), ..., [180, 200). In the main query, the speed of the route, obtained in meters per second, is transformed into kilometers per hour. Then, the function atRange restricts the route to the portions having a given speed range. The time periods comprising the restricted route are obtained with the getTime function. The overall route is then restricted to the obtained time periods with the function atPeriodSet, the distance traveled at the given speed range is obtained with the length function, and the distances for all trucks at the given speed range are added up with the SUM aggregation function.

**Query 7** (Slice + spatial)**.** *Compute the number of deliveries that traversed at least two states.*

*The OLAP operation here is a* Slice*, to drop all dimensions. The rest of the query involves spatial operations to perform intersections between the MOs' trajectories and the* State *dimension level. Note that, in fact, the intersection performs a "climbing" along the geographic dimension, skipping the intermediate levels in the hierarchy.*
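The SQL for this query is omitted above. A plausible sketch, assuming that states can be compared through their key StateKey, is:

```sql
-- Sketch of Query 7
SELECT COUNT(*)
FROM Delivery D
WHERE EXISTS (
  SELECT 1
  FROM State S1, State S2
  WHERE S1.StateKey < S2.StateKey AND
        ST_Intersects(trajectory(D.Route), S1.Geom) AND
        ST_Intersects(trajectory(D.Route), S2.Geom) )
```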


This query projects the route of the delivery to the spatial dimension using the function trajectory, which results in a geometry. The query then tests that the trajectory of the route intersects two states using the PostGIS function ST\_Intersects.

**Query 8** (Slice + distance + duration)**.** *List the pairs of deliveries that traveled at less than one kilometer from each other during more than half an hour.*

*Again, the OLAP operation here is a* Slice*, to drop all dimensions. Only the* Delivery *fact is kept. Then, computations on the temporal types are performed.*
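The SQL for this query is omitted above. A plausible sketch follows; the third argument of tdwithin (the distance, assumed here to be in meters) and the key names are assumptions:

```sql
-- Sketch of Query 8; 1000 m and DeliveryKey are assumed
SELECT D1.DeliveryKey, D2.DeliveryKey
FROM Delivery D1, Delivery D2
WHERE D1.DeliveryKey < D2.DeliveryKey AND
      duration(getTime(atValue(
        tdwithin(D1.Route, D2.Route, 1000), true))) > interval '30 min'
```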


The function tdwithin is the temporal version of the PostGIS function ST\_DWithin. It returns a temporal boolean that is true during the periods when the two routes are within the specified distance from each other. The function atValue restricts the temporal boolean to the periods when its value is true, and the function getTime obtains these periods. Finally, the query uses the duration function to obtain the corresponding interval and verifies that it is greater than half an hour.

The next example query involves the LandUse continuous field.

**Query 9** (Roll-up + slice + spatial (raster) + distance + duration)**.** *Compute the average duration of the deliveries that started in a residential area and ended in an industrial area on 1 February 2017.*

*The OLAP operations are a* slice*, to drop all dimensions but* Date *and* Customer*, and a* roll-up *over the* Customer *(only up to the bottom level) and* Date *dimensions. Spatial operations over raster data are also performed. Finally, computations on the temporal types are also performed.*

```
SELECT AVG(duration(D.Route))
FROM Delivery D, Date T, DeliveryCustomer DC1, DeliveryCustomer DC2,
        Customer C1, Customer C2, City Y1, City Y2, State S1, State S2
WHERE D.DateKey = T.DateKey AND T.Date = '2017-02-01' AND
        D.DeliveryKey = DC1.DeliveryKey AND DC1.Sequence = 1 AND
        D.DeliveryKey = DC2.DeliveryKey AND DC2.Sequence = D.NbCustomers AND
        DC1.CustomerKey = C1.CustomerKey AND
        DC2.CustomerKey = C2.CustomerKey AND
        C1.CityKey = Y1.CityKey AND C2.CityKey = Y2.CityKey AND
        Y1.StateKey = S1.StateKey AND Y2.StateKey = S2.StateKey AND
        ST_Intersects(C1.CustomerGeo,AtValue(S1.LandUse, 'Residential')) AND
        ST_Intersects(C2.CustomerGeo,AtValue(S2.LandUse, 'Industrial'))
```
The query starts by selecting the deliveries of the given date. It also selects, using the bridge table DeliveryCustomer, the first and last customers served by the delivery. Then, the query obtains the state of these customers with the subsequent joins. Further, the function AtValue restricts the land use fields of the corresponding states to the values of type residential or industrial. Finally, the function ST\_Intersects ensures that the start and end locations of the delivery are included in the restricted fields. Notice that, in PostGIS, the ST\_Intersects predicate can compute not only if two geometries intersect, but also if a geometry and a raster intersect.

Note that the query above does not involve temporal data, since it mentions neither a temporal geometry, such as the measure Route, nor a temporal field, such as Temperature. The next query involves both temporal attributes, and combines a field and a trajectory.

**Query 10** (Slice + spatial (raster) + distance)**.** *List the deliveries that have traveled along more than 50 km of roads at more than 1000 m of altitude.*


*A* slice *is performed to drop all dimensions. Spatial operations over raster data (in this case, representing elevation) are performed. Finally, computations on the temporal types are also performed.*
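The SQL for this query is omitted above. A rough sketch following the step-by-step explanation below is given next; since the paper includes raster data just as an example, the combination of AtRange on the Elevation field with ST\_Intersection and ST\_Length is illustrative only:

```sql
-- Illustrative sketch of Query 10; the field operations are assumptions
SELECT D.DeliveryKey
FROM Delivery D
WHERE ( SELECT SUM(ST_Length(ST_Intersection(trajectory(D.Route),
                AtRange(S.Elevation, floatrange '(1000,)'))))
        FROM State S
        WHERE ST_Intersects(trajectory(D.Route), S.Geom) ) / 1e3 > 50
```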


For each delivery, using the function ST\_Intersects, the inner query verifies that the route of the delivery and the geometry of the state intersect. Then, the elevation field of the state is restricted to the range of values higher than 1000 m with function AtRange. The function ST\_Intersection computes the intersection of the trajectory of the route and the restricted field, and the length of this route is computed with the function ST\_Length. The SUM aggregate function is then used to calculate the sum of the lengths of all the obtained routes for all states, and finally the outer query verifies that this sum is greater than 50 km.

#### *Discussion on Performance*

Since the paper is oriented to showing the features of the mobility data warehousing approach, it focuses on highlighting the expressiveness that MO data adds to the classic and spatial data warehousing approaches. Therefore, a thorough evaluation of query performance is outside the scope of this work. Nevertheless, this section reports a preliminary evaluation, developed as follows. The mobility DW depicted in Figure 7 was implemented over a PostgreSQL relational database. Delivery data (the MO data in attribute Route of the fact table Delivery) were produced using the data generator of the BerlinMOD benchmark for moving objects (http://dna.fernuni-hagen.de/secondo/BerlinMOD/BerlinMOD.html). However, since the BerlinMOD benchmark targets the city of Berlin, the geographic hierarchy was dropped, and the references to the State dimension were replaced by a Borough dimension, both in the schema and in the queries. The evaluation was performed with 520 delivery tuples. The minimum, maximum, and average length of the deliveries are 96, 4632, and 1870, respectively; therefore, this would be approximately equivalent in size to a "normalized" fact table of one million records. The rest of the data are based on the Northwind database. Thus, the table **Customer** contains 1430 tuples, and there are seven warehouses and 100 trucks. Queries 1–8 were run (Queries 9 and 10 were not run, since they include raster data just as an example), replacing, as mentioned, the references to State with references to Borough. For example, Query 3 now reads:

```
SELECT T.Year, T.MonthNumber, COUNT(*)
FROM Delivery D, Date T, Borough B
WHERE D.DateKey = T.DateKey AND B.Name = 'Mitte' AND D.Route && B.Geom AND
          duration(atGeometry(D.Route, B.Geom)) >= interval '20 min'
GROUP BY T.Year, T.MonthNumber
ORDER BY T.Year, T.MonthNumber
```

Analogously, Query 4 reads, for these experiments:

```
SELECT DT.Year, DT.MonthNumber, COUNT(*)
FROM Delivery D, Date DT, Borough B,
     unnest(periods(getTime(atGeometry(D.Route, B.Geom)))) P
WHERE D.DateKey = DT.DateKey AND B.Name = 'Mitte' AND D.Route && B.Geom AND
      duration(P) >= interval '20 min'
GROUP BY DT.Year, DT.MonthNumber
ORDER BY DT.Year, DT.MonthNumber
```

Table 1 shows the results obtained in these tests. Queries were run several times to eliminate the influence of cold caching (that is, a hot cache is assumed). Once this influence was eliminated, execution times were essentially the same for each run; therefore, just the average value is reported in the table, since the differences between runs were negligible. It can be seen that the queries that took longer were the ones that included spatial operations.

**Table 1.** Running queries.


Of course, these results are not intended to be conclusive, but they suggest that the addition of MOs as fact measures in a warehouse adds query expressiveness. Also, existing proposals that segment trajectories (see Section 6) require a "normalization" of the trajectory representation, which is costly due to the number of joins that queries like the ones presented here would require. This is not needed when trajectories are represented as MOs. In addition, note that the DW is implemented using standard object-relational technologies (in this case, PostgreSQL and PostGIS). Therefore, the effect of the inclusion of MOs can also be inferred from an evaluation of the implementation of the temporal types. This is available in the MobilityDB demo site, at URL http://demo.mobilitydb.eu/, where the system runs over the BerlinMOD benchmark data (the queries in the demo do not include warehouse data; they are purely MO queries). It can be observed on the demo site that typical MO queries run very fast over the benchmark, although this is not reported in this paper. Developing a benchmark to perform a comprehensive set of experiments is outside the scope of this paper and is planned as future work (Section 7).

#### **6. Related Work**

Trajectory DWs are strongly related to spatial databases and DWs. The main features of PostGIS, the OGC-based spatial extension to PostgreSQL, can be found in [12]. Also, the spatial extension of the MultiDim model presented here is based on the spatial data types of MADS [13], a spatiotemporal conceptual model. The notion of spatial OLAP (SOLAP) was introduced in [14], and it is reviewed in [15]. Other relevant work on SOLAP can be found in [16–18].

In order to support spatiotemporal data, a data model and associated query language for MO data were needed. This is achieved in SECONDO (http://dna.fernuni-hagen.de/secondo/) [19], a MO database system developed at the FernUniversität in Hagen, based on the model of Güting et al. [1]. Along similar lines, Hermes is a moving object database introduced by Pelekis et al. [20]. The Hermes system is described in [21]. In spite of their ability to handle MO data, neither SECONDO nor Hermes is oriented toward addressing the problem of mobility DWs. Among other reasons, integrating both prototypes into existing relational databases is not straightforward. Hermes, for example, extends Oracle through a so-called cartridge with the MO types. However, to build an application on top of the database, the application developer must write a source program (for example, in Java) and embed PL/SQL scripts that invoke object constructors and methods from Hermes. SECONDO is a packaged system; therefore, integration with existing databases is even more complicated. On the contrary, being coded in the C programming language (like PostgreSQL), MobilityDB seamlessly extends the PostGIS library with temporal data types, not requiring any additional software architecture. Thus, building a mobility DW with MobilityDB turns out to be a natural extension of a spatial relational DW. On the downside, MobilityDB at this time only supports moving points, while SECONDO and Hermes provide (although limited) support for other MOs. However, for traffic analysis, moving points are the most used kind of MOs. The data type system of the temporal types presented in this paper follows the approach of [1]. Also, an SQL extension for spatiotemporal data is proposed in [22].

The work by Orlando et al. [6] introduces the concept of trajectory DWs (TDWs), aimed at providing the infrastructure needed to deliver advanced reporting capabilities and facilitating the use of mining algorithms on aggregate data. This work is based on the Hermes system. A relevant feature of this proposal is the treatment given to the extraction, transformation, and loading (ETL) process, which transforms the raw location data and loads them into the trajectory DW. However, regarding mobility analysis, this proposal does not address MO data. Measures in this trajectory DW are related to the number of MOs present in a cell defined by spatiotemporal coordinates. Therefore, queries that can be addressed using this DW are of the form "compute the number of cars between latitudes *l1* and *l2*, and longitudes *lo1* and *lo2*, on 3 November 2018". However, actual mobility analysis queries like the ones presented in Section 5 cannot be addressed. Also, the authors themselves state in their proposal that MO data analysis remains outside the TDW.

Along similar lines, Wagner et al. [7] proposed the Mob-Warehouse, a conceptual model to support mobility analysis using a TDW. The authors propose to enrich trajectory data with semantic information about the domain. The paper builds on the previously defined notion of semantic trajectory [23]. Unlike the model presented in [6], where the measures are pre-aggregated (e.g., the number of cars in a cell), in this model the measures are kept at the finest level of detail. No implementation details are given, and MO operations are not addressed. Instead, the proposal is aimed at analyzing semantic trajectories, as the example queries provided suggest.

Based on [7,23] and the work on semantic trajectories, another model for TDWs is proposed in [5], accounting for semantic information. In this case, again, MO data are only partially considered: MO trajectories are partitioned into segments, to which semantic information is associated. No implementation is described, and the authors report that experiments are being carried out, suggesting that storing segments in the fact table is not efficient. This is to be expected, given that many joins are required to reconstruct a trajectory from such segments.
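The cost of reconstructing a trajectory from a segment-based fact table can be sketched as follows; `SegmentFacts` and its columns are hypothetical, while `ST_MakeLine` is the standard PostGIS aggregate.

```sql
-- Hypothetical segment-based fact table in the style of [5]:
-- one row per trajectory segment. Recovering a full trajectory
-- requires grouping and ordering all of its segments:
SELECT TrajectoryId,
       ST_MakeLine(SegmentGeom ORDER BY SegmentNo) AS Trajectory
FROM SegmentFacts
GROUP BY TrajectoryId;
```

Every analysis over whole trajectories must pay this grouping (or, with normalized dimension tables, a chain of joins) up front, whereas a temporal data type stores the trajectory as a single value.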

Note that the proposal presented in the present paper does not exclude semantic information. On the contrary, semantic information about MOs can be naturally included in the mobility DW proposed here, and, in fact, this is what dimensions are defined for. Also, all of the features of the proposals above are supported by the model described in this paper (e.g., segmentation, pre-aggregation, semantic information).

An overall perspective of the state of the art in mobility data management can be found in [2,24–26]. A survey on spatiotemporal data warehousing, OLAP, and mining is given in [27]. A discussion about the meaning of spatiotemporal data warehousing is given in [4]. Also, mobility DWs are discussed in [28–30]. Analysis tools for mobility DWs can be found in [31]. Finally, a survey on spatiotemporal aggregation is given in [32], while a survey on trajectory aggregation is provided in [33].

#### **7. Conclusions**

This paper studied how DWs can be extended with spatial and mobility data. For this, temporal types were defined to capture the variation of a value over time. The paper discussed how these data types, representing MOs, were implemented as an extension to PostGIS. This extension makes it possible to define a conceptual model for a trajectory DW which includes MO data as measures that can be analyzed with respect to contextual dimensions. Therefore, the conceptual model can actually be implemented with MO data, not only with segmented trajectories or pre-aggregated measures. This is a key difference from previous proposals, and it would not be possible without a data type system that seamlessly integrates with the (spatial) database supporting the warehouse. A concrete case study that extends the Northwind DW with mobility data was presented, illustrating the use of the mobility DW with a comprehensive set of queries. Future work includes the development of more kinds of temporal types, supporting not only moving points but also moving lines and moving regions, and, as mentioned in Section 5, the development of an appropriate benchmark to allow a fair evaluation of mobility DWs. Another research direction will address the extension of the work presented in this paper to big data environments, along the lines mentioned in Section 1.

**Author Contributions:** Both authors contributed equally to the writing of the paper.

**Funding:** This research received no external funding.

**Acknowledgments:** Alejandro Vaisman was partially supported by the Argentinian National Scientific Agency, PICT-2014 Project 0787, and PICT-2017 Project 1054.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
