Skip Content
You are currently on the new version of our website. Access the old version .
AlgorithmsAlgorithms
  • Article
  • Open Access

28 July 2020

Trajectory Clustering and k-NN for Robust Privacy Preserving k-NN Query Processing in GeoSpark

,
,
,
,
and
Computer Engineering and Informatics Department, University of Patras, 265 04 Patras, Greece
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Algorithmic Data Management

Abstract

Privacy Preserving and Anonymity have gained significant concern from the big data perspective. We have the view that the forthcoming frameworks and theories will establish several solutions for privacy protection. The k-anonymity is considered a key solution that has been widely employed to prevent data re-identifcation and concerns us in the context of this work. Data modeling has also gained significant attention from the big data perspective. It is believed that the advancing distributed environments will provide users with several solutions for efficient spatio-temporal data management. GeoSpark will be utilized in the current work as it is a key solution that has been widely employed for spatial data. Specifically, it works on the top of Apache Spark, the main framework leveraged from the research community and organizations for big data transformation, processing and visualization. To this end, we focused on trajectory data representation so as to be applicable to the GeoSpark environment, and a GeoSpark-based approach is designed for the efficient management of real spatio-temporal data. Th next step is to gain deeper understanding of the data through the application of k nearest neighbor (k-NN) queries either using indexing methods or otherwise. The k-anonymity set computation, which is the main component for privacy preservation evaluation and the main issue of our previous works, is evaluated in the GeoSpark environment. More to the point, the focus here is on the time cost of k-anonymity set computation along with vulnerability measurement. The extracted results are presented into tables and figures for visual inspection.

1. Introduction

There is no doubt that we live in the era of Big Data. Over the last decade, thanks to technological advances, information systems have favored automatic and effective data gathering, thus resulting in a considerable increase in the amount of available data. Daily, a wide range of data is produced: scientific, financial, health data, as well as from social media, are just some examples of sources. However, this data is useless without the extraction of the underlying knowledge, a major challenge for the researchers as classical machine learning methods cannot deal with the volume, value, veracity and variety that big data brings [1]. Therefore, existing machine learning techniques, which deal with 4 Vs [2], have been or need to be redefined for efficiently processing and managing such data, as well as to obtain valuable information that can benefit not only scientists, but also businesses and organizations. Actually, the recent advances in distributed technologies can be utilized in order to enable scientists to rapidly find out hidden or unknown patterns from the 4 Vs [3]. Nonetheless, most of the existing methods fail to directly tackle the increased number of attributes and records of databases, due to its computational complexity. Hence, the data mining techniques should be able to encounter data scalability, dimensionality, uncertain data and/or data preprocessing. Data preprocessing constitutes an important step before the data mining process and, as a result, data mining algorithms are designed to accept specific data formats that are best suited for them.
With the advances in wireless and mobile technologies, moving objects are equipped with location positioning sensors and wireless communicating capabilities, thus, large-scale spatio-temporal data is being generated and an urgent need for efficient query processing approaches, to deal with both the spatial and temporal attributes, has been arisen [4]. In particular, with the advent of mobile and ubiquitous computing, spatio-temporal queries processing on moving objects databases has become a necessity for many applications, such as traffic control systems, geographical information systems, and location-aware advertisement. Hence, time-dependent versions of k nearest neighbor (k-NN) queries need to be studied. According to [5], the k-NN queries can be distinguished into four categories: (i) both query and data objects are static, (ii) moving query but static data objects, (iii) static query but moving data objects and (iv) both query and data objects are moving. When a mobile object’s location data changes, a Snapshot k-NN ( S k N N ) query is issued at each location. However, in highly dynamic spatio-temporal applications, where the moving objects data varies frequently over time, a fundamental query is the so-called Continuous k-NN ( C k N N ) [6]. A C k N N query can be subjected to the fourth category, as it instantly retrieves the k nearest neighbor objects of a moving query object at each time within a given time interval. Practically, the C k N N spatio-temporal query is evaluated on a high number of consecutive timestamps and can be considered as a series of frequent S k N N queries. To the best of our knowledge, none of the existing approaches, such as [6,7,8] to name a few, address the problem of C k N N queries processing on non-linear trajectories under the support of Hough transformation. It is worth to mention that in previous works the process of C k N N query has been considered in road networks with linear trajectory moving objects.
In this paper, our efforts are devoted to processing a C k N N query assuming that each object keeps track of non-linear trajectory. Such a trajectory is approximated as piece-wise linear among timestamps where objects (small) velocity changes; a storage efficient strategy adopted in our previous work [9]. Some previous works investigate the problem of C k N N for moving objects with fixed velocity while others with uncertain velocity in road networks [10,11]. Nonetheless, once an object’s velocity changes, the C k N N query has to be re-executed. Such a case is quite frequent in highly dynamic environments, thus, the performance of these techniques would be significantly degraded, resulting in increased query cost (because of re-evaluation). The focus of this work is on k nearest neighbor queries, which are considered under both a spatio-temporal and a Hough transformed database. The Hough method can mitigate such issues since it is piece-wise constant [9], thus, the C k N N query is not evaluated as frequent as in the Euclidean space with non-linear trajectory. Actually, the application of k-NN queries in Hough space facilitates the maintenance and consistency of continuous queries result [12], even if the moving objects location data updates. The aforementioned issue relates to the fact that the query result needs to be only updated upon specific changes, namely, whenever moving angle direction and/or velocity vary dramatically. Nevertheless, while an object (including the query object) is moving on a linear part of its trajectory (maintenance phase) the query result will not change, meaning that the query result of all the objects will not need be revised until the objects transit to the next linear part of its trajectory, which indicates that the maintenance phase expires.
Moreover, with the extensive adoption of the Cloud, research on privacy preserving has appealed to lots of researchers [13,14]. Let us recall that the ultimate purpose of the processing is to formulate the k-anonymity set to protect users’ privacy in the cloud computing environment. Especially, in the context of this work, we are challenged with the necessity to apply temporal continuous spatial k-NN queries in a distributed architecture with the ultimate goal of forming the anonymity set for a group of moving objects in a fast and efficient manner.
In summary, the main contributions of this paper are as follows:
  • A continuous query processing algorithm is considered that efficiently answers the spatial k-NN query with the aid of an indexing method or otherwise. From the query result, in each timestamp and for all moving objects, the desired anonymity set is formulated.
  • The adopted method is designed with the aid of GeoSpark spatial data operations to compute the k-anonymity set which is used to measure the possibility of each object identity being unveiled as it moves from one location to another. To be more specific, vulnerability evaluation is conducted in two-dimensional space (both Euclidean and Hough space), considering different pairs of features, as it will be demonstrated in Section 4.
  • A comprehensive set of experiments is conducted to evaluate the time performance of the proposed method in the GeoSpark environment under different data sizes.
The rest of this paper is organized as follows. In Section 2, previous related works are presented in relation to our approach. In Section 3, the following are described: (a) the problem definition, (b) the system model and (c) the proposed GeoSpark framework for the k-anonymity set computation. Section 4 presents the experiments conducted for the evaluation of the studying problem, while Section 5 evaluates relevant discussion of the studying problem results in relation to previous works. Finally, in Section 6, conclusions and future directions of this work are recorded.

3. Materials and Methods

This section provides the necessary background knowledge for the remainder of the paper. Initially, the k-NN algorithm, a core component of the adopted privacy preserving methodology along with its weaknesses to tackle big data problems, is presented. In following, useful definitions and notations will be recorded under the problem definition, with the most characteristic being the spatial indexing and partitioning methods. Moreover, the GeoSpark components for the C k N N queries processing with the aim to formulate the k-anonymity set, will be in detail described.

3.1. Operations on Spatial Data

Querying on spatial data is considered an operation that is usually coupled with indexing methods. Several indexing methods have been considered in literature as they are crucial for the performance of spatial data query processing algorithms, since they are used in order to reduce the query run time. The most representative are the ones based on R-tree and Quad-tree.
Further, to support efficient query processing on moving objects, grid-based space partitioning methods can be adopted [27]. Moving objects data are indexed in the grid cells they belong to for facilitating the queries and avoiding the check of all the objects. The aim of a spatial partitioning technique is to improve the query time as well as to keep all the partitions balanced in terms of memory and computations, which is known as the term “load balancing” [28].
The equal grids partitioning uniformly divides the whole region, providing thus a good data locality but not load balancing. Also, Quad-tree is another data structure based on the principle of “divide and rule” that recursively divides two-dimensional space into four quadrants and needs a merging operation to construct the specified number of partitions. On the other hand, the R-tree provides an efficient data partitioning strategy to efficiently index spatial data. It is considered a balanced search tree that improves both search speed and storage utilization. Another space partitioning strategy is based on K D B -tree, a balanced binary tree, which has been used in load balancing of spatial databases for fast querying. It is worth mentioning that the locality of data and load balancing are important to speedup queries performance [29].

3.2. The k-NN Classifier from Big Spatial Data Perspective

The k-NN algorithm is a popular non-parametric method that can be used for both classification and regression tasks. In following, we will discuss the k-NN classification problem from the big spatial data viewpoint. The main components of k-NN classifier are:
  • T R is a training mobile objects dataset of size N,
  • T S is a testing mobile objects dataset of size M,
  • o n is a mobile object represented as a tuple in the form of ( f n 1 , f n 2 , , f n p , c l ) , where f n p is the value of the p-th feature of the n-th object and c l is the class it belongs to, denoted as o n c l and
  • c l is only known to T R dataset.
In the classification process, for each test object t T S , the k-NN algorithm searches the k closest objects in the T R set, computing the distances (specifically the Euclidean distance) between the test mobile object and all the mobile objects in T R . The distances from all training objects are ranked in ascending order and then, the k nearest objects ( k n n 1 , k n n 2 , , k n n k ) are kept to find the dominant class c l . Despite its remarkable performance in real-world applications, it lacks the scalability to manage large-scale datasets. Initially, the time complexity to find the k nearest neighbors for a single test mobile object is O ( N p ) , where an extra complexity O ( N l o g ( N ) ) for distances sorting process is involved; in this additional complexity, N is the size of training dataset and p is the number of object features. The classification process needs to be restated for all the test mobile objects. Additionally, the k-NN model requires the training data to be stored in memory in order to achieve a fast computation of the distances. However, large-scale T R and T S sets may not fit in the RAM memory.

3.3. Problem Definition

A spatio-temporal database, whose records are moving objects with geolocation data attributes in two-dimensional space D , is assumed. In real world examples objects move arbitrarily which oppose to fixed velocity assumption that some works suppose. Hence, it is considered that the objects velocity is low with values between a minimal and a maximal, i.e., u [ u m i n , u m a x ] . In addition to, we consider low data sampling rate, since applying high data sampling rate it would be difficult to apply linear interpolation between sampled points, thus, the adopted dual methods should be redesigned adopting appropriate curve fitting methods. As a result, the velocity calculated from two consecutive way points could not represent the real velocity well and Euclidean distance would not work. In the following, an overview of the studying problem definitions along with notations is presented.
Definition 1.
The trajectory of a moving object is assumed to be a continuous piece-wise linear function, which maps the temporal dimension to the two-dimensional Euclidean space, connecting a sequence of points ( x 1 , y 1 , t 1 ) , ( x 2 , y 2 , t 2 ) , , ( x L , y L , t L ) for t 1 < t 2 < < t L .
Such a representation entails that the n-th object position at time t i is p o s n t i = ( x i , y i ) for n = 1 , 2 , , N , and that during each time interval ( t i , t i + 1 ) , the object moves along a straight line from ( x i , y i ) to ( x i + 1 , y i + 1 ) with a constant velocity (amplitude and direction).
To ensure an efficient storage and handling of the queries, the expected position of the object at any time t ( t i , t i + 1 ) , where 1 i L 1 , is obtained by a linear interpolation between ( x i , y i ) and ( x i + 1 , y i + 1 ) . In this way, a number of additional features can be extracted, such as velocity u, angle direction θ , Hough-X and/or Hough-Y transformation of ( x , y ) [9].
Definition 2.
A moving object trajectory (spatio-temporal) snapshot is defined as the location data of that object in a specific timestamp; thus, a single trajectory is stored as a collection of location snapshots denoted as { p o s n t i } i = 1 L .
Definition 3.
Snapshot Distance: Given two objects o 1 and o 2 with location snapshots p o s 1 t i and p o s 2 t i respectively, the l 2 n o r m distance (Euclidean) (Without loss of generality, both other distance measures can be considered, such as the Manhattan distance ( l 1 n o r m ), as well as the maximum distance ( l n o r m )) between o 1 and o 2 in D at timestamp t i is computed as
d i s t o 1 , o 2 t i = j = 1 D ( p o s 1 t i [ j ] p o s 2 t i [ j ] ) 2 .
The distance between two objects in our model is represented in Euclidean distance.
The spatial k-NN queries are considered as one of the most common search problems that will be employed in our study. Generally, given a spatial region, a k-NN query on spatial data identifies all the spatial points that lie inside this region. For spatio-temporal k-NN queries, a time interval is also given, where the timestamps of the resulting trajectories in that region need to also fall in that time interval. A spatial k-NN query takes as input a query center point along with a set of spatial objects location data in order to find the k nearest neighbors around the center point.
Definition 4.
k-NN: Given a moving object m o , a dataset O and an integer k, the k nearest neighbors of m o from O, denoted as k N N ( m o , O ) , is a set of k objects from O that o k N N ( m o , O ) , s O k N N ( m o , O ) , d i s t o , m o d i s t s , m o .
Definition 5.
Snapshot Trajectory Point k-NN ( S T P k N N ) query: Given a query center point q at timestamp t i , an integer k and a dataset of trajectory (spatio-temporal) points denoted as S t i = { t r a j 1 t i , t r a j 2 t i , t r a j 3 t i , t r a j N t i } a k-NN query, denoted as S T P k N N ( q , S t i , k ) , asks for the k spatial points of S t i whose l 2 distance from the query point q is less than that of the rest points of S t i .
Definition 6.
Continuous Trajectory Point k-NN ( C T P k N N ) query: Given a query center point q at timestamp t i , an integer k and a dataset of trajectory (spatio-temporal) points denoted as S t i = { t r a j 1 t i , t r a j 2 t i , t r a j 3 t i , t r a j N t i } at a time-interval t = { t i } i = 1 L for L (practically large enough), a continuous k-NN query, denoted as C T P k N N ( q , S t , k ) , asks for the k spatial points of S t i , whose distance from the query point q is less than that of the rest points of S t i , for all consecutive t i t .
Note that, t r a j 1 t i p o s 1 t i . Considering the previous definitions, the problem formulation for the application of robust C T P k N N queries (see Figure 1) will be considered in the following subsection. In the context of the proposed approach, all points q S t i are selected as query objects in order to acquire the k nearest neighbors of all mobile objects in each timestamp t i . If the process is repeated for a high number of consecutive timestamps in a selected time period, then, the collection of S T P k N N queries constitutes a C T P k N N query, thus similar to our case.
Figure 1. An Overview of Continuous Trajectory Point k Nearest Neighbor ( C T P k N N ) Query.

3.4. Problem Formulation

The problem of Robust Spatio-temporal Databases is addressed in an environment suitable for Big Spatial Data Management. The k-anonymization approach adopted in [30] for preventing mobile objects identity reveal, is related to k-NN algorithm as described in Section 3.2. Specifically, the k-anonymity set is formulated by the unique object identifier, denoted as i d , of the k nearest neighbors, exploiting as a result the spatio-temporal data of a set of mobile objects.
Through the SMaRT system, a set of mobile users’ trajectory data is recorded per timestamp, that is, the mobile user trajectory i d and the values of longitude and latitude are recorded. From these location features, the four attributes ( x , y , θ , u ) (as presented in Table 1) along with the values of Hough-X and Hough-Y of ( x , y ) [9], ( U x , a x , U y , a y , b x , w x , b y , w y ) (as presented in Table 2), are computed.
Table 1. An Overview of an Original Spatio-Temporal Database.
Table 2. An Overview of the Transformed Spatio-Temporal Database.
So, by employing the k-NN method for different pairs of features, this fact enables us to form the k-anonymity set of each mobile object per timestamp, as depicted in Table 3.
Table 3. k-Anonymity Sets for N Mobile Users in L = 5 Timestamps.
For each mobile user i and per timestamp l, the k nearest neighbors i d it is computed and in the following is kept in a vector form k n n s i l = [ i d i l 1 i d i l 2 i d i l k ] for l = 1 , 2 , , L as presented in Table 3. For each user, the number k s out of the k nearest neighbors, which remains the same from one timestamp to another, is computed in order to estimate the vulnerability ratio 1 k s . Hence, higher k s is associated with lower probability (i.e., lower vulnerability) a moving object’s identity being unveiled.
Definition 7.
k-anonymous database: A database is k-anonymous, i.e., its records are k-anonymous with respect to the selected features, if k 1 discrete records in the same specific timestamp τ, have at least the same nearest neighbors so that no record of k is distinguished from its k 1 neighboring records.

3.5. System Model

A spatio-temporal database is considered with N records, that is, N moving objects in the x y plane. Each record ( x i j , y i j ) represents the spatial coordinates of a mobile user j in timestamp t i j , or point i of its trajectory j [31]. From the location coordinates ( x , y ) , we can extract the corresponding velocity and angle direction features ( u , θ ) and the dual points ( U x , a x ) , ( U y , a y ) , ( b x , w x ) , ( b y , w y ) by employing the dual methods described in [9]. Let us assume a trajectories database T = { T 1 , , T N } of equal length L in which each trajectory is represented via a sequence of L triples.
For each point i in trajectory j, we define a two-dimensional feature vector F i j , which captures the selected features data. Hence, we can define and store the trajectory j as T j = { F 1 j , F 2 j , F 3 j , , F L j } .
In the context of this work, the time performance of the k-anonymity set formulation either with the aid of an indexing method or for spatial k-NN queries on trajectory data employing Snapshot k-NN query on Trajectory Points, as depicted in Figure 2, is evaluated. Given a set of trajectories represented as a sequence of spatio-temporal points along with a query point, the S T P k N N algorithm finds its k nearest spatial points from a set of trajectory points in corresponding timestamps. The S T P k N N query is issued over a set of moving objects, executing the classical k-NN in a time period, and updates the k-anonymity set from timestamp to timestamp for all objects.
Figure 2. An Overview of Spatio-Temporal Data Partitioning and Indexing.
However, some objects may not move with the same velocity, leading to changes in the query result, i.e., in the k-anonymity set. As a result, in Privacy Preserving, it is important for the k-anonymity set to remain the same or vary slowly, as in this case, the vulnerability measure of moving objects will be affected. Note that, for a given STPkNN query q, the k-anonymity set, denoted as A , should always satisfy the following conditions:
  • The first condition | A | = k ensures that the anonymity set contains the i d s of k objects.
  • The second condition a ( S A ) , d i s t ( q , a ) m a x d i s t ( q , a ) | a A ensures that these k objects’ i d are the k nearest ones to q.
Assuming a mobile object when considering a k-NN query relevant to the selected features in a specific timestamp, then, the spatial k-anonymity ensures that an attacker, as the query issuer, cannot identify a mobile object with probability larger than 1 k , where k is a user-defined anonymity parameter. Moreover, a spatio-temporal database is expected to handle a high number of moving objects location data, as well as, a large number of consecutive k-NN queries. Hence, an efficient consecutive S T P k N N processing algorithm is very important to be addressed. To this end, the S T P k N N query is investigated in the GeoSpark framework and experimental evaluations are conducted using a realistic dataset to demonstrate the time performance of the k-anonymity set computation under the S T P k N N query. Ultimately, the vulnerability behavior for different pairs of features is investigated.

3.6. GeoSpark System Overview

In this subsection, the necessary structures of GeoSpark-based approach is thoroughly presented. All parts were implemented using Apache Scala due to its compatibility with Apache Spark framework. The core components are the following implemented on GeoSpark Resilient Distributed Dataset (RDD) and the interaction with GeoSpark is utilized through Apache Zeppelin interface.

3.6.1. GeoSpark Architecture

GeoSpark [32] is an extension of the Apache Spark core and provides additional tools to manage geospatial data, e.g., geospatial datatypes, indexes and operations. The architecture of GeoSpark, as introduced in Figure 3, consists of the following three layers:
Figure 3. An Overview of GeoSpark Layers.
  • The Apache Spark Layer consists of all the components present in Spark and performs data loading and querying.
  • The Geospatial RDD Layer extends Spark and supports three types of RDD, i.e., Point, Rectangle and Polygon RDD. In addition, it contains a geometrical operations library for every RDD.
  • The Geospatial Query Processing Layer is used to perform different types of geospatial queries.
GeoSpark provides Spatial RDD that allows efficient loading, transformation, partitioning, in-memory storing as well as indexing of complex spatial data for different input data sources such as CSV, WKT, GeoJSON, Shapefile, etc., which, through the GeoSpark Spatial SQL interface, are compatible with Spark. It also provides eight types of spatial objects, namely Point, Multi-Point, Polygon, Multi-Polygon, Line String, Multi-Line String, GeometryCollection, and Circle. Hence, spatial objects in a Spatial RDD can consist of different geometry types. Furthermore, it supports three types of spatial objects, which are Point, Rectangle and Polygon, for which the corresponding spatial RDDs can be defined. These structures can be used as input to several spatial queries, such as spatial k-NN, spatial range and spatial join.
In following, taking advantage of the GeoSpark architecture, the main parts of our approach will be presented.

3.6.2. Spatial RDD Structures Preparation

Initially, the trajectory data of all mobile users is stored in a csv file. Specifically, this data consists of trajectory (spatio-temporal) points for a number of mobile objects, i.e., bike riders. For each bike trajectory (spatio-temporal) point, we have in our disposal the bike rider i d , the spatial coordinates ( x , y ) , the polar coordinates (direction, velocity) ( u , θ ) , the Hough- X / Y attributes as well as the timestamp, with the aim of computing the k nearest neighbors and selecting two features among them, at each time. Practically, all bike riders follow non-linear trajectories. However, in our case, the users’ trajectory is approximated by linear sub-trajectories, as, in our previous work [9], thus, the values of the attributes are not randomized such that groups of bikes have similar behavior. The user cannot control attribute values, but, they may ask to simulate a specific timestamp. In GeoSpark, firstly the corresponding S R D D is recovered and in following the temporal partitioning is implemented. Information about road network that describe the situation of the region are not taken into consideration.
In Apache Zeppelin, the input data are loaded from the csv file as a Dataframe and then SQL queries can be executed on the created Dataframe in order to recover spatial data of moving objects in specific timestamps. Then, these data are stored and transformed to PointRDD. In this way, several PointRDDs are created which, from now, will be called as BikesRDD, as they concern bike riders trajectories (spatio-temporal points). GeoSpark’s operator permits us to transform a set of raw data into a PointRDD, and selecting the columns that correspond to the desired features at a specific timestamp, as depicted in Table 4.
Table 4. The Different Types of Point Resilient Distributed Dataset (RDDs) According to Selected Features.

3.6.3. Spatio-Temporal Partitioning

The proposed approach considers a naive temporal partitioning from which a collection of timestamps is considered. These timestamps are related to the sampling period where the system recorded the spatio-temporal data with its usage. As a result, the size of the loaded data in BikesRDD may be different for different timestamps since some mobile objects may not have a spatio-temporal footprint for all recorded timestamps. Here, in our simulations, the sampling period is supposed to be similar for all bike riders trajectory data. From the trajectory (i.e., spatio-temporal) points of a number of bike riders objects at different timestamps, a collection of BikesRDD is constructed.
Following the previous aspects, GeoSpark can easily utilize spatial partitioning over the temporal partition of the data of these bikes. In each temporal partition, a bike rider may have not necessarily moved because random delays may be employed. Hence, in a specific time partition, the spatial data of all bikes may not necessarily participate.
BikesRDD constitutes a specialized GeoSpark PointRDD, which consists of individual bike riders’ records. The spatial partitioning on the BikesRDD concerns bike riders selected features, such as the location data. In several temporal partitions, the BikesRDD are simulated one by one.
More to the point, GeoSpark offers low overhead spatial partitioning approaches that take into consideration the spatial distribution of the data as well as repartitions a loaded Spatial RDD in an automatic way. In each Spatial RDD, several spatially proximal objects are grouped into the same partition and, as a result, the partitioning layer in GeoSpark partitions the workload in a spatial way and periodically repartitions this workload in order to keep balanced partitions. Also, it supports a variety of grid-type partitioning methods, such as equal-grids, R-tree, Quad-tree and K D B -tree, to name a few [33].

3.6.4. k-Anonymity Set in GeoSpark

Algorithm 1 presents the steps of a spatial k-NN in GeoSpark. It takes as input a non-indexed or indexed S R D D , a query center point as well as a parameter k, which indicates the number of nearest neighbors. The algorithm is separated in two phases, namely selection and sorting phase [33], and will be utilized in the simulations.
Algorithm 1 Spatial k-NN Query
1:
input The number of k nearest neighbors
2:
input A query center object A
3:
input A Spatial RDD B
4:
output A list of k spatial objects
5:
Step 1: Selection phase
6:
for all partition ∈ SRDD B do
7:
if an index exists then
8:
  Return k-neighbors of A by querying the index of this partition
9:
else
10:
  for all object ∈ this partition do
11:
   Check all the distance between the object A and Spatial RDD B
12:
  end for
13:
end if
14:
end for
15:
Maintain a priority queue that stores the top k nearest neighbors
16:
Step 2: Sorting phase
17:
Sort the spatial objects in the intermediate Spatial RDD based on their distances to A
18:
Return top k objects in C
Our aim is to exploit the above mentioned structures and Algorithm 1 so as to formulate the k-anonymity set for a number of moving objects based on their trajectory (spatio-temporal) data. From the trajectory points of a number of mobile objects at different timestamps, a collection of spatial BikesRDD is constructed. The goal of this paper is to provide an efficient and scalable framework for robust continuous k-NN querying of spatial objects in GeoSpark. A Scala RDD API for creating the desired PointRDD from a csv file is introduced in following Algorithm 2.
Algorithm 2 Spatial PointRDD Creation
1:
Define csv file location in pointRDDInputLocation
2:
Define the attributes start Column equal to 0 in pointRDDOffset
3:
Define pointRDDSplitter as FileDataSplitter.csv
4:
Define rest attributes in carryOtherAttributes = true
5:
Create PointRDD: PointRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)
An iterative application of a typical spatial k-NN query, which was presented in Algorithm 1, is introduced with use of Scala RDD API of Geospark in following Algorithm 3.
Algorithm 3 Iterative Spatial k-NN Query
1:
inputk, usingIndex, number of timestamps L
2:
for timestamp t = 1 to L do
3:
 Create the Dataframe d f with the selected features in t
4:
 Save d f as csv file f
5:
 Create from csv a PointRDD
6:
if usingIndex = true then
7:
  Build R-Tree index on PointRDD
8:
end if
9:
for all lines ∈ csv file f do
10:
  Read the selected features ( f 1 , f 2 )
11:
  Create querypoint
12:
  val fact = new GeometryFactory()
13:
  val querypoint = fact.createPoint(new Coordinate( f 1 , f 2 ))
14:
  Run Spatial k-NN Query to return k geometry points
15:
  val queryResultList = KNNQuery.SpatialKnnQuery(PointRDD, querypoint, k, usingIndex)
16:
end for
17:
end for
It is worth mentioning the fact that the output format of the spatial k-NN query (indexed or not) is a list of geometries as in Table 5, where the i d of the point geometry is related to the mobile object i d . In addition, the list holds the top-k geometry objects. In the following, the above methodology and algorithms will be considered for the performance evaluation of the proposed k-anonymity method based on spatial k-NN.
Table 5. Trajectory Representation in a PointRDD.

4. Results

4.1. Environment and Dataset

A local machine was used for the experimental evaluation. The experiments were executed using 1 VM, which consisted of 4 CPU Cores, 3.8 Gb RAM and 488 Gb Storage capacity each with Ubuntu 18.04.4 LTS, Apache Spark 2.1.0 , GeoSpark 1.2.0 and Apache-Zeppelin 0.8.2 . We conducted the experiments on one master node and assign 384.1 MB memory to the Spark driver program that is executed on the master local machine. The experimental data extracted from the SMaRT Database GIS Tool (http://www.bikerides.gr/thesis2/). The experiments were based on trajectory datasets of bike riders in the area of Corfu, Greece, so the case study area is located in the non-urban area of Corfu.

4.2. Time Performance of k-Anonymity Set

A comprehensive experimental evaluation that studies the performance of GeoSpark under the studying problem, lies in the corresponding subsection. Specifically, the GeoSpark spatial k-NN query performance, which impacts the computation of the k-anonymity set of 80, 500 as well as 2000 mobile objects in 10 consecutive timestamps, as shown in Table 6, is measured. The spatial k-NN queries are tested by varying k values equal to 8 and 16 on Bike Riders Trajectory (spatio-temporal points) dataset. The performance measure of the system for the k-anonymity set computation is the total run time that the system needs to execute the jobs. We compare the following spatial data processing approaches:
Table 6. Simulation Parameters.
  • Without Partitioning–Without Indexing: GeoSpark approach without spatial index in non-partitioned PointRDD.
  • Without Partitioning–With Indexing: GeoSpark approach with spatial index in non-partitioned PointRDD.
  • With Partitioning–Without Indexing: GeoSpark approach with spatial index on each partition of PointRDD after data partitioned according grids using (i) KDB-Tree and (ii) R-Tree.
The above cases will be taken into consideration in order to evaluate the GeoSpark performance for the computation of the k-anonymity set for a number of mobile users. To form the desired k-anonymity set, we apply spatial k-NN queries using the embedded function known as ”KNNQuery.SpatialKnnQuery”. The spatial index R-tree is only supported for the spatial k-NN Queries.
We consider the following results assuming that the trajectories are represented as a collection of spatio-temporal points in two-dimensional space. The experiments were conducted in Apache Zeppelin, which is an open web-based notebook that enables interactive data analytics. In Figure 4, an SQL query is employed, which returns the trajectory data (spatio-temporal points) of 40 mobile objects. Specifically, a notebook, created in Zeppelin through which the methods described in previous section were executed under GeoSpark framework, is utilized.
Figure 4. An Overview of 40 Trajectories through Zeppelin.
In this case, it should be noted that the experimentation depicts that, when mobile objects do not move at the same timestamps, then it is difficult for the k-anonymity set to be formed, as the desired set of mobile objects has very low cardinality. Also, we observed that from timestamp to timestamp, different objects participated in the formed dataset, meaning that the k nearest neighbors are time-varying. Such a scenario proves that the randomness in objects’ movements does not benefit or promote the formulation of a slow time-varying anonymity set, as we have proved in our previous work [30].

4.2.1. Impact of Data Size

In this subsection, the impact of data size in terms of the number of mobile objects N in Figure 5 and storage size in MB in Figure 6 with or without the aid of an R-Tree indexing on the computation time of the k-anonymity set are investigated. According to results in Table 7, Table 8 and Table 9, we observe that the use of indexing does not impact the total execution time of involved spatial k-NN queries. This happens due to the fact that GeoSpark caches the spatial RDD along with the corresponding indexes in each timestamp. Hence, for the upcoming mobile objects, it directly reads spatial PointRDD and index from the cache, in this way leading to time saving. It should be also noted that in each timestamp, all objects use the same (partitioned or not) PointRDD, which changes in the upcoming timestamps for all the objects.
Figure 5. Time Cost for k-Anonymity Set Computation with or without Indexing for N = 80 , 500 , 2000 Mobile Objects.
Figure 6. Time Cost for k-Anonymity Set Computation with or without Indexing for 3 Cases of Total Input Data in Executor.
Table 7. Time for 80 Mobile Objects without Indexing and with R-Tree Indexing.
Table 8. Time for 500 Mobile Objects without Indexing and with R-Tree Indexing.
Table 9. Time for 2000 Mobile Objects without Indexing and with R-Tree Indexing.
In terms of total execution time, as shown in Figure 5 including the time for index building on PointRDD, GeoSpark without index shows approximately similar search time performance as in the indexed version. Also, the execution time of a k-NN query in GeoSpark remains constant for different values of k in each case. That is mainly considered when the value of k is very small with respect to the input data and thus, most of the time is spent on input data processing. The time cost of loading data from csv to dataframe is 42 s whereas it is included in the total execution time.
Figure 6 illustrates the total execution time relative to the total input data size in executor in MB, which seems to increase in a non-linear way.
Ultimately, in Figure 7 let us focus only on higher number of mobile objects in order to understand the impact of data scalability in the time performance of the proposed approach for the k-anonymity set computation assuming 2-dimensional ( 2 D ) points RDD for 10 consecutive timestamps. Indeed, it seems to approach exponential increase which for actual Big Data will be more apparent.
Figure 7. Time Cost for Mobile Objects N = { 500 , 2000 , 8000 , 32.000 } without Indexing for k = 8 .

4.2.2. Impact of Spatial Partitioning

In this subsection, we experiment on the impact of the embedded spatial partitioning methods on the computation time of k-anonymity set, despite the fact that in GeoSpark the spatial partitioning method is usually used to optimize spatial join.
Initially, in Figure 8, the total number of records in PointRDDs for four partitioning methods in a specific timestamp is demonstrated. Obviously, the partition size of K B D -Tree partitioning is more balanced than that of Quad-Tree. Also, we observe that regarding R-Tree partitioning, an overflow in data partition (e.g., partition 2) occurs, which is much larger than other partitions because R-Tree does not consider the whole space [33].
Figure 8. Spatial PointRDD Data Distribution for 4 Spatial Partition Techniques for 2000 Mobile Objects.
Note that, in the third dataset, with size N = 2000 , the data distribution varies from one timestamp to another and thus, it is not recorded in Table 10.
Table 10. Impact of Spatial Partitioning in Time Performance.
In Table 10, we focus only on the impact of R-Tree and K D B -Tree spatial partitioning on k-anonymity set time computation without PointRDD indexing, as GeoSpark does not support building local indexing on spatially partitioned data for spatial k-NN queries.
We choose to evaluate, for a specific k with value equal to 16, the impact of spatial data partitioning on the time needed to formulate the k-anonymity set for 10 consecutive timestamps, as aforementioned. As the results show, the time cost of the k-anonymity method did not benefit from data partitioning. In addition, the time for the k-anonymity set computation is a little bit lower when using R-Tree for datasets having smaller size. However, in case of a larger dataset, the difference between the two types of partitioning is normalized. This means that most of the processing time, in the same machine, is focused on k-NN query computation.

4.3. Vulnerability Evaluation

In this subsection, some additional experiments based on a real dataset with parameters presented in Table 11 and Figure 9, Figure 10 and Figure 11, are conducted. The k-anonymity set computation is utilized with the aid of R-Tree indexing. As GeoSpark works with spatial data in two-dimensional space, our focus is on three different cases for the k-anonymity set computation in order to evaluate the vulnerability under the formulated anonymity set in each space, which are:
Table 11. Parameters when using N = 500 Trajectories, L = 100 Timestamps.
Figure 9. (a) Euclidean Space and (b) Polar Space for N = 500 Trajectories, L = 100 Timestamps.
Figure 10. Hough-X Space of x for (a) k = 5 and (b) k = 10 for N = 500 Trajectories, L = 100 Timestamps.
Figure 11. Hough-X Space of y for (a) k = 5 and (b) k = 10 for N = 500 Trajectories, L = 100 Timestamps.
  • Euclidean coordinates ( x , y )
  • Polar coordinates ( u , θ )
  • Hough-X coordinates of ( x , y ) denoted as ( U x , a x ) and ( U y , a y )
The experimentation has been carried out on a dataset of size N = 500 Trajectories (or moving objects).
Actually, Figure 9 on the left depicts that assuming only the x feature (or y one) for the anonymity set formulation, the vulnerability in both cases is about 2 times higher than in the case when both attributes ( x , y ) are considered. Similarly, in Figure 9 on the right, the vulnerability is measured in polar space, where the performance is worse than in Euclidean one. This is attributed to the fact that polar coordinates are linearly dependent on spatial data ( x , y ) , revealing the curse of dimensionality. Also, this figure reveals that the anonymity set varies from timestamp to timestamp, meaning that for given objects angle direction and velocity, more than one out of k nearest neighbors change.
Finally, Figure 10 and Figure 11 show the vulnerability performance in Hough-X space, which is compared with the corresponding one in Euclidean space. Obviously, the results are better in Euclidean space as Hough-X space is linearly dependent on the former. Finally, focusing on the case where k = 10 in Figure 12, the vulnerability varies slowly around 0.4 and 0.3 (red and orange dashed lines respectively) in Hough-X space, while in Euclidean space, it achieves values very close to 0.1 (purple solid lines).
Figure 12. Vulnerability Performance Comparison in Euclidean and Hough-X Space.

5. Discussion

In this section, two main issues will be discussed. At first, the time performance of the researched issue and in following the robustness of the suggested anonymity method. Actually, the analysis in Section 4 gives some useful insights and presents some key aspects of the results obtained from the evaluation of the proposed k-anonymity method in GeoSpark.

5.1. Performance Issues

Here, we should claim that the adopted partitioning method is simple. Hence, to improve the performance of the suggested privacy preserving C k N N s, we could employ the approach proposed in [34] and extend Algorithm 1 by applying the smart partitioning technique. Especially, we could consider to compute A k N N queries based on k d A N N or k d A N N + (where d stands for the dimensionality), instead of simple k-NN queries and study the effect of d on the performance of such queries and thus, in k-anonymity set computation performance, which is the main issue in the context of this research work. At this point, it should be noted that that approach exploits a space decomposition technique suitable for Big Data and processes the classification in a parallel and distributed manner considering multidimensional objects, as well. Through an extensive experimental evaluation it has been proved that the solution is efficient, robust and scalable in processing the k nearest neighbor queries.
Although the trajectory (spatio-temporal) data utilized in this article are 2-dimensional (2D), the methodology and techniques are general and applicable to higher dimensions. Hence, observing Figure 5 and Figure 6 in Section 4, the efficiency of the proposed approach, which now is investigated utilizing 2D trajectory (spatio-temporal) points datasets, is expected to scale up exponentially with respect to both data size, as shown in Figure 7, and dimensions, when d > 2 , in the same trend as in [34]. This results from the curse of dimensionality. As the number of dimensions increases, the number of distance computations as well as the number of searches. For further experimentation with the dimensionality parameter d > 2 in the context of GeoSpark, functions for data preprocessing to prepare Points RDD structure need to be redefined. Let us recall that GeoSpark RDD operations and k-NN queries processing work with 2D points.
Finally, as the previous finding shows in [34], it is expected that the increment of computing nodes (VMs) will improve scalability and time performance in support of both snapshot and continuous spatial queries over moving objects, having a great effect when the data follows a uniform distribution due to better load balancing. Indeed, the increment of number of computing nodes decreases the amount of distance calculations and update steps on k-NN lists that each computing node undertakes. A similar behavior we expect to see in our case if the processing is distributed in multiple nodes in the GeoSpark environment.

5.2. Vulnerability

Vulnerability is considered a major issue which arises in spatio-temporal data management as well, and simply measures how vulnerable various mobile objects are to revealing their identity to potential adversaries, when moving from one position to another. Moreover, reducing vulnerability and enhancing resilience in the face of adversity are considered to be essential steps. To date, less attention has been given to vulnerability resulting from k-NN querying on non-linear trajectory data.
This study investigates the vulnerability of mobile objects with respect to k-NN queries based on their geolocation data attributes. The proposed anonymity technique, as well as previous findings, are confirmed in the GeoSpark environment and, as a result, vulnerability is estimated for different risk conditions (i.e., for different number of nearest neighbors, namely k and pair of features, as presented in Table 4 and Table 11).
Furthermore, it is observed that for low values of k, the vulnerability remained at the medium level, and the more the nearest neighbors are utilized to form the anonymity set, the lower the vulnerability becomes. In other words, the results of the present study indicate a low level of vulnerability for high number of nearest neighbors and uncorrelated features data.
At this point, it should be noted that the results in Section 4 highlight the superiority, that is lower vulnerability, of Euclidean space in terms of privacy preserving, which is the main issue in k-anonymity and, thus, in applying robust k-NN queries. However, let us recall that, as it has been already investigated in [9], dual representative features, e.g., Hough-X of x and/or y, are considered an appropriate choice to form the k-anonymity set from both privacy preserving and storage efficiency perspective. So, priority should be given to the use of low dimension independent features data; in that case, the utilization of dual space against the Euclidean one.

6. Conclusions and Future Work

In conclusion, this work aimed at testing the hardware and software performance in terms of the number of operations performed and the time needed to obtain satisfactory anonymity. The real world example concerning moving bicycle riders proved satisfactory results using different additional ancillary methods of space partitioning and indexing. Also, a GeoSpark-based approach for evaluating the vulnerability of spatial k-anonymity in large-scale spatio-temporal data is provided. Experiments show that GeoSpark constitutes an appropriate Spark-based system for the evaluation of robust k-NN queries. The GeoSpark spatial RDDs are exploited to store trajectory data of the mobile objects as PointRDD in order to acquire the anonymity sets of all query mobile objects issuing iterative spatial k-NN queries for all mobile objects in consecutive timestamps. It should be noted that the experiments were conducted in one Virtual Machine (VM) in local mode, with specific capabilities as described in the experimental evaluation.
As a future work, our aim is to apply the proposed anonymity method in a fully distributed environment with many VMs so as to verify the expected results from previous findings about the execution time; in this case, the input data will be partitioned and in following distributed to different VMs. In this way, several parallel tasks will be executed and thus, the execution time for the anonymity set computation is expected to be much lower than in the current single VM case when considering large-scale databases. The locality of the data and load balancing are important when applying efficient k-NN queries. GeoSpark also provides a grid-type partitioning based on K D B -tree that can ensure workload balancing.
In conclusion, it should be pointed out that the current experimentation gave us an insight about the design issues and possible constraints of the proposed approach so as to optimize its performance in terms of execution time cost and memory utilization for real-time scenarios, where the mobile objects identity security, when issuing a k-NN query, is of high importance.

Author Contributions

E.D., A.K., M.T., G.V., S.S. and A.T. conceived of the idea, designed and performed the experiments, analyzed the results, drafted the initial manuscript and revised the final manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research work is funded by General Secretariat for Research and Technology (GSRT), Hellenic Foundation for Research and Innovation (HFRI), and supported by University of Patras.

Acknowledgments

The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved both content and presentation.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jiang, F.; Leung, C.K. A Data Analytic Algorithm for Managing, Querying, and Processing Uncertain Big Data in Cloud Environments. Algorithms 2015, 8, 1175–1194. [Google Scholar] [CrossRef]
  2. Emani, C.K.; Cullot, N.; Nicolle, C. Understandable Big Data: A Survey. Comput. Sci. Rev. 2015, 17, 70–81. [Google Scholar] [CrossRef]
  3. Yang, C.; Yu, M.; Hu, F.; Jiang, Y.; Li, Y. Utilizing Cloud Computing to Address Big Geospatial Data Challenges. Comput. Environ. Urban Syst. 2017, 61, 120–128. [Google Scholar] [CrossRef]
  4. Shekhar, S.; Jiang, Z.; Ali, R.Y.; Eftelioglu, E.; Tang, X.; Gunturi, V.M.V.; Zhou, X. Spatiotemporal Data Mining: A Computational Perspective. ISPRS Int. J. Geo-Inf. 2015, 4, 2306–2338. [Google Scholar] [CrossRef]
  5. Güting, R.H.; Behr, T.; Xu, J. Efficient k-Nearest Neighbor Search on Moving Object Trajectories. VLDB J. 2010, 19, 687–714. [Google Scholar] [CrossRef]
  6. Huang, Y.; Chen, Z.; Lee, C. Continuous K-Nearest Neighbor Query over Moving Objects in Road Networks. In Proceedings of the Joint International Conferences on Advances in Data and Web Management (APWeb/WAIM), Suzhou, China, 2–4 April 2009; pp. 27–38. [Google Scholar]
  7. Fan, P.; Li, G.; Yuan, L. Continuous K-Nearest Neighbor Processing based on Speed and Direction of Moving Objects in a Road Network. Telecommun. Syst. 2014, 55, 403–419. [Google Scholar] [CrossRef][Green Version]
  8. Zheng, B.; Zheng, K.; Xiao, X.; Su, H.; Yin, H.; Zhou, X.; Li, G. Keyword-Aware Continuous kNN Query on Road Networks. In Proceedings of the 32nd IEEE International Conference on Data Engineering (ICDE), Helsinki, Finland, 16–20 May 2016; pp. 871–882. [Google Scholar]
  9. Dritsas, E.; Kanavos, A.; Trigka, M.; Sioutas, S.; Tsakalidis, A.K. Storage Efficient Trajectory Clustering and k-NN for Robust Privacy Preserving Spatio-Temporal Databases. Algorithms 2019, 12, 266. [Google Scholar] [CrossRef]
  10. Huang, Y.; Lee, C. Efficient Evaluation of Continuous Spatio-temporal Queries on Moving Objects with Uncertain Velocity. GeoInformatica 2010, 14, 163–200. [Google Scholar] [CrossRef]
  11. Fan, P.; Li, G.; Yuan, L.; Li, Y. Vague Continuous K-Nearest Neighbor Queries over Moving Objects with Uncertain Velocity in Road Networks. Inf. Syst. 2012, 37, 13–32. [Google Scholar] [CrossRef]
  12. Heendaliya, L.; Lin, D.; Hurson, A.R. Continuous Predictive Line Queries for On-the-Go Traffic Estimation. Trans. Large-Scale Data Knowl.-Cent. Syst. 2015, 18, 80–114. [Google Scholar]
  13. Wu, W.; Parampalli, U.; Liu, J.; Xian, M. Privacy Preserving K-Nearest Neighbor Classification over Encrypted Database in Outsourced Cloud Environments. World Wide Web 2019, 22, 101–123. [Google Scholar] [CrossRef]
  14. Yang, S.; Tang, S.; Zhang, X. Privacy-Preserving K-Nearest Neighbor Query with Authentication on Road Networks. J. Parallel Distrib. Comput. 2019, 134, 25–36. [Google Scholar] [CrossRef]
  15. Hagedorn, S.; Götze, P.; Sattler, K. The STARK Framework for Spatio-Temporal Data Analytics on Spark. In Proceedings of the 17th Conference on Database Systems for Business, Technology, and Web (BTW), Stuttgart, Germany, 6–10 March 2017; Volume P-265, pp. 123–142. [Google Scholar]
  16. Hagedorn, S.; Räth, T. Efficient Spatio-Temporal Event Processing with STARK. In Proceedings of the 20th International Conference on Extending Database Technology (EDBT), Venice, Italy, 21–24 March 2017; pp. 570–573. [Google Scholar]
  17. Zhang, Z.; Jin, C.; Mao, J.; Yang, X.; Zhou, A. TrajSpark: A Scalable and Efficient In-Memory Management System for Big Trajectory Data. In Proceedings of the 1st (APWeb-WAIM) International Joint Conference on Web and Big Data, Beijing, China, 7–9 July 2017; pp. 11–26. [Google Scholar]
  18. Alarabi, L. Summit: A Scalable System for Massive Trajectory Data Management. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL), Seattle, WA, USA, 6–9 November 2018; pp. 612–613. [Google Scholar]
  19. Giannousis, K.; Bereta, K.; Karalis, N.; Koubarakis, M. Distributed Execution of Spatial SQL Queries. In Proceedings of the IEEE International Conference on Big Data, Seattle, WA, USA, 10–13 December 2018; pp. 528–533. [Google Scholar]
  20. Patrou, M.; Alam, M.M.; Memarzia, P.; Ray, S.; Bhavsar, V.C.; Kent, K.B.; Dueck, G.W. DISTIL: A Distributed In-Memory Data Processing System for Location-Based Services. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, 4–6 November 2018; pp. 496–499. [Google Scholar]
  21. Yang, C.; Yu, X.; Liu, Y. Continuous KNN Join Processing for Real-Time Recommendation. In Proceedings of the IEEE International Conference on Data Mining (ICDM), Shenzhen, China, 14–17 December 2014; pp. 640–649. [Google Scholar]
  22. Huang, Y. Processing KNN Queries in Grid-Based Sensor Networks. Algorithms 2014, 7, 582–596. [Google Scholar] [CrossRef]
  23. Dong, T.; Lulu, Y.; Shang, Y.; Ye, Y.; Zhang, L. Direction-Aware Continuous Moving K-Nearest-Neighbor Query in Road Networks. ISPRS Int. J. Geo-Inf. 2019, 8, 379. [Google Scholar] [CrossRef]
  24. Wang, Y.; Xia, Y.; Hou, J.; Gao, S.; Nie, X.; Wang, Q. A Fast Privacy-Preserving Framework for Continuous Location-based Queries in Road Networks. J. Netw. Comput. Appl. 2015, 53, 57–73. [Google Scholar] [CrossRef]
  25. Song, D.; Park, K. A Privacy-Preserving Location-Based System for Continuous Spatial Queries. Mob. Inf. Syst. 2016, 2016, 6182769:1–6182769:9. [Google Scholar] [CrossRef]
  26. Zhang, S.; Mao, X.; Choo, K.R.; Peng, T.; Wang, G. A Trajectory Privacy-Preserving Scheme Based on Dual-K Mechanism for Continuous Location-Based Services. Inf. Sci. 2020, 527, 406–419. [Google Scholar] [CrossRef]
  27. Eldawy, A.; Alarabi, L.; Mokbel, M.F. Spatial Partitioning Techniques in Spatial Hadoop. Proc. VLDB Endow. 2015, 8, 1602–1605. [Google Scholar] [CrossRef]
  28. Yu, Z.; Liu, Y.; Yu, X.; Pu, K.Q. Scalable Distributed Processing of K Nearest Neighbor Queries over Moving Objects. IEEE Trans. Knowl. Data Eng. 2015, 27, 1383–1396. [Google Scholar] [CrossRef]
  29. García-García, F.; Corral, A.; Iribarne, L.; Vassilakopoulos, M. Improving Distance-Join Query processing with Voronoi-Diagram based partitioning in SpatialHadoop. Future Gener. Comput. Syst. 2019, 111, 723–740. [Google Scholar] [CrossRef]
  30. Dritsas, E.; Trigka, M.; Gerolymatos, P.; Sioutas, S. Trajectory Clustering and k-NN for Robust Privacy Preserving Spatiotemporal Databases. Algorithms 2018, 11, 207. [Google Scholar] [CrossRef]
  31. Yuan, G.; Sun, P.; Zhao, J.; Li, D.; Wang, C. A Review of Moving Object Trajectory Clustering Algorithms. Artif. Intell. Rev. 2017, 47, 123–144. [Google Scholar]
  32. Huang, Z.; Chen, Y.; Wan, L.; Peng, X. GeoSpark SQL: An Effective Framework Enabling Spatial Queries on Spark. ISPRS Int. J. Geo-Inf. 2017, 6, 285. [Google Scholar] [CrossRef]
  33. Yu, J.; Zhang, Z.; Sarwat, M. Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond. GeoInformatica 2019, 23, 37–78. [Google Scholar] [CrossRef]
  34. Nodarakis, N.; Pitoura, E.; Sioutas, S.; Tsakalidis, A.K.; Tsoumakos, D.; Tzimas, G. kdANN+: A Rapid AkNN Classifier for Big Data. Trans. Large-Scale Data Knowl.-Cent. Syst. 2016, 24, 139–168. [Google Scholar]

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.