1. Introduction
In recent years, several application contexts have required the analysis of huge amounts of data, and very frequently the dimensions of interest include spatial properties. Therefore, many research efforts have been devoted to the implementation of solutions for efficiently performing such computations. The MapReduce paradigm has also been successfully applied to implement parallel solutions for the spatial operations that are typically required in spatial data analysis. In particular, the well-known range query, spatial join and
k-nearest neighbor operations are currently available in many MapReduce frameworks. For instance, SpatialHadoop [
1], a spatial extension of Apache Hadoop [
2], provides all these operations, in some cases also in different variants, which can be combined with different partitioning techniques, usually called global indexing strategies.
The fundamental principle of the MapReduce paradigm is the subdivision of the input into independent chunks (dataset partitioning) on which the same operation can be performed in parallel (map phase), possibly combining the partial results at the end in a successive task (reduce phase). Therefore, one of the main aspects that can have a great impact on the effectiveness of the parallel execution is the partitioning of the input dataset. A good partitioning strategy has to produce uniform chunks in order to ensure balanced map tasks, that is, tasks that require more or less the same amount of resources to complete.
Many MapReduce frameworks, including Hadoop, natively support a default partitioning method based on data size: the goal in this case is to produce chunks, called splits, having almost the same size in bytes. This content-agnostic technique can be effective for traditional (textual) data processing, but may not be the best choice for partitioning spatial data. Indeed, it can produce poor performance whenever the input data have to be pruned or filtered based on their spatial location, since it may separate spatially nearby records into different partitions. Conversely, more appropriate partitioning techniques keep spatially nearby records in the same partition [3], favouring the subsequent data analysis.
For this reason, several spatial partitioning methods have been implemented in specific extensions of MapReduce frameworks in order to subdivide the geometries into splits according to their spatial properties. For instance, indexes based on a regular grid, Quad-trees and R-trees are available in SpatialHadoop [
4] and can be applied to partition a dataset before executing a given operation. However, in this paper we show that
not all spatial partitioning techniques behave in the same way and in different cases
the best technique to use can change according to the
spatial characteristics of the datasets at hand and possibly on the operation that has to be performed on the partitioned datasets.
As a first example of the kind of issue we want to consider in this paper, we show in
Table 1 the results of the execution in SpatialHadoop of the Distributed Join (DJ) [
5], the Range Query (RQ) and of the
k-Nearest Neighbor operation (
k-NN) when applied to different situations. We consider the following cases:
- the DJ operation applied to two datasets which are both uniformly distributed. In particular, the first one is partitioned using a regular grid (GR), while the second one has been partitioned using different techniques, namely regular grid (GR), Quadtree (QT) and R-tree (RT);
- the DJ operation applied to a uniformly distributed dataset and a skewed one. In particular, the former has been partitioned using a regular grid, while the latter has been partitioned with several possible partitioning techniques;
- the RQ operation applied to a skewed dataset partitioned with several possible partitioning techniques;
- the k-NN operation applied to a skewed dataset partitioned with several possible partitioning techniques.
In
Table 1, the second column reports the applied partitioning technique, the third column contains the total time in milliseconds for performing the operation, while the last column reports some statistics about the map tasks. In particular, we indicate the number of instantiated map tasks, the average time taken by them and the relative standard deviation (RSD) of their execution times. A high RSD value means that the execution times of the different map tasks differ considerably, while a low value means that they essentially take the same time to complete.
As expected, when both datasets are uniformly distributed, the response time of the DJ is similar regardless of the used index, while, when a skewed dataset is considered, the differences are significant and in this particular case are in favor of the R-tree. This is mainly due to the fact that, when the distribution is skewed, the partitioning of the geometries based on a regular grid does not produce balanced splits (i.e., RSD is about 90%), while the Quadtree and the R-tree indexes perform better and produce more balanced splits (i.e., RSD is about 28% for the R-tree). A similar behaviour can be observed for the other two operations, RQ and k-NN, where again the Quadtree and R-tree partitioning techniques perform better. In particular, also in this case the best technique is the one with the lowest deviation between the execution times of the map tasks. Therefore, we can observe that balancing the cost of the map tasks is crucial for effectively exploiting the parallel processing potential provided by a MapReduce framework.
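For clarity, the RSD used throughout the paper is simply the standard deviation of the measured values (here, the map-task execution times) divided by their mean and expressed as a percentage. The following minimal sketch illustrates the computation on made-up task times; the values are hypothetical and not taken from Table 1.

```python
import statistics

def relative_std_dev(values):
    """Relative standard deviation (%RSD): standard deviation divided by the mean, as a percentage."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical map-task execution times (ms) for two partitionings of the same dataset.
balanced_tasks = [5100, 4900, 5050, 4950]      # e.g., splits produced by an R-tree based grid
unbalanced_tasks = [1200, 900, 15300, 2600]    # e.g., a regular grid applied to skewed data

print(f"balanced   RSD = {relative_std_dev(balanced_tasks):.1f}%")
print(f"unbalanced RSD = {relative_std_dev(unbalanced_tasks):.1f}%")
```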
The aim of this paper is to provide a way to easily detect some hints about the dataset distribution and, based on them, choose the most effective partitioning technique to apply. In this manner, the partitioning will be mainly based on the spatial properties of the geometries contained in the dataset and not only on its size in bytes.
In literature, many statistical techniques have been proposed to provide a summarized description of a dataset. These descriptors, often called
sketches, are used to speed up the query processing by providing approximate answers based on them [
6]. One of their main uses in spatial big data analysis can be the estimation of the selectivity of a join operation. We can classify relevant sketching techniques into two main categories—
sampling-based methods [
1,
4,
7,
8] and
histogram-based methods [
9,
10]. In general, histogram-based methods are shown to be superior for accurate spatial selectivity estimation [
11,
12].
This paper proposes a new technique which internally constructs multiple uniform histograms. However, it does not require storing and maintaining all of them; rather, it aggregates them into a regression model. The regression model produces only two numbers that are used as indicators of the dataset distribution in the reference space. More specifically, the proposed technique is based on the concept of box-counting, which was first proposed in Reference [
13] for computing the fractal dimension of a dataset of points. The behavior of the box-counting function measured in a restricted range of values (representing the cell side of the grid) can be described by a power law and it was used in Reference [
14] for estimating the selectivity of self-join and range query, then extended in Reference [
15] to the spatial join on distinct datasets. We propose its application in the context of big spatial data for the following reasons: (i) it is an efficient technique for detecting information about the dataset distribution; (ii) it produces just one number characterizing the data distribution and does not require storing auxiliary structures, such as histograms; (iii) the box-counting function can be computed in parallel, since it calculates a uniform histogram storing the counts, and this can be easily implemented in MapReduce.
The main contributions of the paper are: (i) the extension of the box-counting function to all types of geometries; previously, it could be applied only to a set of points; (ii) the definition of a MapReduce algorithm for its efficient computation, possibly on a sample of the considered dataset; (iii) the definition of heuristics for the choice of the best partitioning technique for a given dataset whose distribution has been described by means of the box-counting function. A subset of these results has already been presented in Reference [
16]. In particular, this paper considerably extends the work done in Reference [
16] by providing:
- (a)
an analytical presentation of the box-counting function together with some examples that justify the proposed approach (
Section 3),
- (b)
a detailed description of a MapReduce implementation of the algorithm for computing the exponents of the power law that characterizes the box-counting function (
Section 4),
- (c)
an extended presentation of the proposed heuristic for choosing the right partitioning techniques together with the proofs of the defined properties (
Section 5),
- (d)
a comprehensive set of experiments performed both on real and synthetic datasets (
Section 6) which includes three types of operation: the spatial join, the range query and the
k-nearest neighbor.
2. Background
In this section we summarize the main characteristics of the MapReduce implementation of spatial operations like spatial join, range query and k-nearest neighbor, together with the main partitioning techniques usually available in cluster systems dedicated to spatial data, such as SpatialHadoop.
2.1. Partitioning Techniques in SpatialHadoop
This section briefly describes the partitioning techniques available in SpatialHadoop and shows their effect on skewed datasets.
At the basis of the MapReduce paradigm is the idea of dividing the input dataset into fixed-size blocks, called splits, on which the same operation (map task) can be instantiated and executed in parallel. If all map tasks can be executed in parallel, then the total execution time depends on the map task that takes the longest. Therefore, the fastest parallel executions can be obtained when the map tasks are well balanced. It follows that the partitioning of data into splits is a crucial operation for obtaining well balanced map tasks and ensuring faster executions. Hadoop traditionally applies a random division of the input data: during split generation, the only prescribed constraint regards the size in bytes of such splits on the HDFS (Hadoop Distributed File System). However, this naïve partitioning is rarely the right choice for spatial analysis, in which some filtering or pruning based on spatial predicates is always performed.
A partitioning technique that takes into account the spatial correlation (locality property) of the input data is therefore necessary, that is, geometries that are close in space should be placed in the same split.
Table 2 compares different partitioning techniques when applied to different datasets. For each dataset and applied technique, we report the number of produced splits and the relative standard deviation (RSD) between their cardinalities (i.e., the degree of balancing in terms of number of geometries). The simplest technique relies on the use of a uniform grid (first row). However, such a technique produces balanced tasks only for uniformly distributed datasets, not in general. Indeed, as we have shown in
Table 1, when skewed datasets are considered the cost of the map tasks is often unbalanced, causing a performance degradation (% RSD is 1% for the uniform distribution, while it is more than 100% for the other datasets). Therefore, in order to guarantee the generation of balanced splits, different types of grids should be used for data partitioning. In SpatialHadoop three main types of grid exist for global data partitioning; the first two are space-based, while the last one is data-based:
Regular grid: it identifies the cells by dividing the 2D space along both axes with a constant step; it is suitable only for uniformly distributed datasets;
Quadtree-based grid: it identifies the cells by recursively subdividing a cell (starting from the whole space) into 4 equal cells until the number of geometries per cell does not exceed a given threshold (a minimal sketch of this subdivision is reported after this list);
Rtree-based grid: it identifies the cells by recursively aggregating the geometries of the dataset until the number of geometries per cell reaches a threshold.
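To make the Quadtree-based subdivision concrete, the following is a minimal sketch of the recursive splitting rule applied to point-like records; it is not SpatialHadoop code, and the cell representation and threshold value are illustrative.

```python
import random
from dataclasses import dataclass

@dataclass
class Cell:
    x0: float; y0: float; x1: float; y1: float   # cell boundaries
    points: list                                  # (x, y) records falling inside the cell

def quadtree_cells(cell, threshold):
    """Recursively split a cell into 4 equal sub-cells until each holds at most `threshold` records."""
    if len(cell.points) <= threshold:
        return [cell]
    mx, my = (cell.x0 + cell.x1) / 2, (cell.y0 + cell.y1) / 2
    quads = [Cell(cell.x0, cell.y0, mx, my, []), Cell(mx, cell.y0, cell.x1, my, []),
             Cell(cell.x0, my, mx, cell.y1, []), Cell(mx, my, cell.x1, cell.y1, [])]
    for (x, y) in cell.points:
        quads[(0 if y < my else 2) + (0 if x < mx else 1)].points.append((x, y))
    result = []
    for q in quads:
        result.extend(quadtree_cells(q, threshold))
    return result

# Usage: partition 1000 points clustered towards the origin of the unit square.
pts = [(random.random() ** 3, random.random() ** 3) for _ in range(1000)]
cells = quadtree_cells(Cell(0.0, 0.0, 1.0, 1.0, pts), threshold=50)
print(len(cells), "cells, max geometries per cell:", max(len(c.points) for c in cells))
```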
We partition synthetic and real datasets using the above listed techniques and we obtain the grids shown in
Table 2. As can be noticed, the R-tree is the technique that ensures the best balancing in terms of number of geometries per cell (i.e., % RSD), but in some cases it can produce very unbalanced cells in terms of covered space.
2.2. Spatial Operations in SpatialHadoop
In order to evaluate the effectiveness of the proposed method in choosing the right partitioning technique for spatial datasets of any distribution, we consider three operations: spatial join, range query and k-nearest neighbor. In the following we briefly describe their implementation in MapReduce.
Distributed Spatial join (DJ)—The distributed spatial join operator (DJ) works on two input datasets. Both datasets can be partitioned according to one of the index techniques mentioned in the previous section. Therefore, each dataset is represented as a collection of partitions. Each partition is characterized by a key, representing the index cell, and a list of geometries belonging to that cell. Given the two datasets, the input of each map task is prepared by a binary reader which is able to generate a compound split starting from a pair of splits, one belonging to each dataset. In particular, a filter is used for preparing the compound splits, so that only the pairs of splits regarding intersecting cells are generated. Therefore, the number of generated compound splits, and consequently the number of map tasks, is equal to the number of pairs of intersecting cells. Given a compound split, each map task loads its content into two lists and then applies a plane-sweep algorithm for checking the intersection between their geometries. DJ is a map-only job, no reduce phase is necessary, and clearly it can be classified as a map-side join.
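The following is a minimal, single-machine sketch of the two steps just described: the filter that pairs intersecting index cells into compound splits, and the plane-sweep executed by a map task on one compound split. It is not the SpatialHadoop implementation; geometries are simplified to their MBRs, and all names are illustrative.

```python
def intersects(a, b):
    """Rectangles given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def compound_splits(cells1, cells2):
    """Filter step: keep only the pairs of index cells whose MBRs intersect."""
    return [(k1, k2) for k1, c1 in cells1.items() for k2, c2 in cells2.items() if intersects(c1, c2)]

def plane_sweep_join(recs1, recs2):
    """Map step on one compound split: report intersecting pairs, sweeping the rectangles by xmin."""
    recs1, recs2 = sorted(recs1), sorted(recs2)
    result, i, j = [], 0, 0
    while i < len(recs1) and j < len(recs2):
        if recs1[i][0] < recs2[j][0]:
            r, k = recs1[i], j
            while k < len(recs2) and recs2[k][0] <= r[2]:   # scan only while the x-ranges overlap
                if intersects(r, recs2[k]):
                    result.append((r, recs2[k]))
                k += 1
            i += 1
        else:
            r, k = recs2[j], i
            while k < len(recs1) and recs1[k][0] <= r[2]:
                if intersects(recs1[k], r):
                    result.append((recs1[k], r))
                k += 1
            j += 1
    return result
```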
Range query (RQ)—The range query (RQ) works on one file only, which can be partitioned into splits using one of the available techniques. Since also in this case the reader works on indexed data, given a query rectangle R, a filter is applied in order to select only the splits (cells) that intersect R. A map task is instantiated for each split (cell) having a non-empty intersection with R. Inside each map task, a test is performed in order to precisely determine which geometries of the split intersect R. RQ is also a map-only job, namely it has no reducers.
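A minimal sketch of this filter-and-refine logic (again illustrative, not SpatialHadoop's API): the global index prunes the splits whose cells do not intersect the query rectangle, and only the surviving splits are scanned geometry by geometry.

```python
def intersects(a, b):
    """Rectangles given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(index, query):
    """index: {cell_mbr: [geometry_mbr, ...]}; returns the geometries whose MBR intersects the query."""
    hits = []
    for cell_mbr, geometries in index.items():
        if not intersects(cell_mbr, query):
            continue                                                   # global filter: no map task for this split
        hits.extend(g for g in geometries if intersects(g, query))     # refinement inside the map task
    return hits
```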
k-Nearest Neighbor (k-NN)—Like the range query, k-NN reads only one dataset. This operation may be divided into more than one MapReduce job. In particular, given a dataset D and a query point Q, the first job tries to find the k nearest neighbor geometries in the split which corresponds to the cell containing the query point Q. This job is composed of a map and a reduce phase. During the map phase, the distance between each geometry and the point Q is computed. Then, a single reduce task receives the list of such geometries with their distance from Q and extracts the first k of them according to that distance. After this job, a circle with center Q and radius equal to the distance of the k-th geometry (or the last retrieved one) is computed. If the circle is contained in the cell of Q, the execution terminates; otherwise, the circle is used as a filter for the splits to be processed by a second job. More iterations can be needed if the first job does not retrieve k geometries; of course, if k is small, the first job is very often sufficient.
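A minimal sketch of the first k-NN job and of its termination test (the circle containment check); it works on points rather than generic geometries, and the structure is illustrative, not the actual SpatialHadoop code.

```python
import math

def knn_first_job(cell, points, q, k):
    """cell: (xmin, ymin, xmax, ymax) of the cell containing Q; points: the split content; q: query point.
    Returns (neighbours, done): done is True when no further MapReduce job is needed."""
    # "map" phase: distance of every geometry of the split from Q
    by_dist = sorted((math.dist(p, q), p) for p in points)
    # "reduce" phase: keep the k closest geometries
    neighbours = [p for _, p in by_dist[:k]]
    if len(neighbours) < k:
        return neighbours, False                       # not enough geometries: another job is needed
    radius = by_dist[k - 1][0]
    # the answer is final only if the circle of that radius centered in Q is contained in the cell
    done = (q[0] - radius >= cell[0] and q[0] + radius <= cell[2] and
            q[1] - radius >= cell[1] and q[1] + radius <= cell[3])
    return neighbours, done
```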
3. Evaluation of Dataset Skewness
Considering the case study shown in
Table 1, it is clear that an easy and efficient way of evaluating the skewness of a spatial dataset can be crucial for choosing the right partitioning technique. This section presents the definition of the box-counting function $BC^q_D(r)$ for a given dataset D containing 2D geometries of type point, line or polygon embedded in the Euclidean plane. This is the first contribution of the paper, since it is an extension of the box-counting function proposed in Reference [14], which originally applies only to finite sets of points. The implications of such an extension on the estimation correctness are discussed and the possible alternatives are evaluated. We will see later how the analysis of this function can provide some hints about the skewness of the dataset.
Definition 1 (Box-counting function).
Given a dataset D, containing 2D geometries (i.e., points, lines or polygons), and a scale r, representing the cell size of a grid covering the reference space of D (i.e., the MBR of the whole dataset D), the Box-counting function is defined as follows:

$$BC^q_D(r) = \sum_i \delta_i(r)^q \quad \text{with } q \neq 1 \qquad (1)$$

where $\delta_i(r) =$ count(geometries of D intersecting the cell i). The case $q = 1$ is excluded, since it always returns the total number of geometries in the input.
In Reference [14] the author shows that the box-counting function is useful for computing the generalized fractal dimension of a finite set of points, where q represents the exponent in Equation (1) and r is the considered scale (i.e., the cell side of the grid). Intuitively, given a grid with cells of side r, the box-counting function with $q = 0$ proposed in Reference [14] counts the number of cells that are intersected by at least one point of D; similarly, here the function $BC^0_D(r)$ counts the number of cells that are intersected by at least one geometry of D. In this way, when the exponent q is greater than 1, the box-counting is the sum of the number of geometries intersecting a cell raised to q. As we will show, this function can be used to detect the skewness of a dataset by computing it for $q = 0$ and $q = 2$ while varying the value of r. More specifically, the level of skewness of a dataset depends on how this value changes while increasing r.
Notice that with respect to the definition given in Reference [
14], which applies only to sets of points and counts the number of points contained in a cell, in Definition 1 we propose to count the number of geometries that intersect a cell. This extension of the box-counting function from sets of points to generic geometries could be obtained in three ways, as discussed below.
- (i)
a first option could be to choose a representative point for each geometry of the dataset (e.g., its centroid) and then apply the classical box-counting function. This could be the simplest solution, but it does not always capture the real behavior of the dataset. This is for instance the case of a dataset containing a set of big polygons covering the whole reference space: if they are approximated by their centroids, the resulting points are all clustered in a small region of the same space, thus producing a point set that does not describe the original dataset layout at all.
- (ii)
Another solution could be to substitute the geometries with their vertices; again, there could be regions covered by geometries that are not covered by their vertices, with the same effect described in the previous case.
- (iii)
the last solution is the one adopted by this paper, namely to count the number of geometries intersecting a cell, which is equivalent to supposing that each geometry g is converted into a set of points covering the same space with a granularity that satisfies the following hypothesis: if g intersects a cell i, then there exists at least one of these points that intersects the cell i. With this hypothesis, we can extend the box-counting function from point sets to sets of geometries and apply to our case the results of Reference [14]. Notice that even if this solution can produce an over-estimation of the selectivity, it does not have negative effects for the problem considered in this paper, namely the estimation of the dataset distribution; on the contrary, it leads to a more precise result. A minimal computational sketch of this choice is given right after this list.
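The sketch below implements Definition 1 under the above hypothesis on a single machine, representing each geometry by its MBR and testing cell intersection on MBRs; the grid handling and the function name are illustrative, not the implementation used in the paper.

```python
import math

def box_counting(geometries, mbr, r, q):
    """BC^q_D(r): sum over the grid cells of (number of geometries of D intersecting the cell)^q, q != 1.
    geometries: list of MBRs (xmin, ymin, xmax, ymax); mbr: reference space of D; r: cell side."""
    x0, y0, x1, y1 = mbr
    cols = max(1, math.ceil((x1 - x0) / r))
    rows = max(1, math.ceil((y1 - y0) / r))
    counts = {}                                    # only non-empty cells are materialized
    for gx0, gy0, gx1, gy1 in geometries:
        ci0 = max(0, int((gx0 - x0) // r)); ci1 = min(cols - 1, int((gx1 - x0) // r))
        cj0 = max(0, int((gy0 - y0) // r)); cj1 = min(rows - 1, int((gy1 - y0) // r))
        for ci in range(ci0, ci1 + 1):             # every cell touched by the geometry's MBR counts it once
            for cj in range(cj0, cj1 + 1):
                counts[(ci, cj)] = counts.get((ci, cj), 0) + 1
    return sum(c ** q for c in counts.values())
```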
Definition 2. Given a dataset D, containing 2D geometries (i.e., points, lines or polygons), the Box-counting plot is the plot of $BC^q_D(r)$ versus r in logarithmic scale.
Now, we can consider such a plot and exploit the following observation of Reference [14]: for real datasets the box-counting plot reveals a trend of the box-counting function that, in a large interval of scale values r, behaves as a power law:

$$BC^q_D(r) = \alpha \cdot r^{E_q} \qquad (2)$$

where α is a constant of proportionality and $E_q$ is a fixed exponent that characterizes the power law. As we will see in Example 1, the Box-counting plot is vital for the computation of the exponent $E_q$ for a given dataset D, since this exponent becomes the slope of the straight line that approximates the plot in a range of scales, thus it can be computed by a linear regression procedure (a minimal sketch of this computation is reported after the properties below). The exponent $E_q$ characterizes the dataset distribution as explained by the following set of properties.
- for $q = 0$, $E_0$ is negative and the power law, given the length of the cell side r, computes the number of cells that are intersected by the dataset D. Notice that, if D is uniformly distributed in the reference space (the Euclidean plane in our case), then the number of cells intersecting D coincides with the total number of cells of the grid, thus the more r increases the more this number decreases according to the area of the cells. As a consequence, in the case of a uniform distribution, $E_0$ is equal to minus the dimension of the embedding space, in our case $E_0 = -2$.
- for datasets that represent fractals (like the Sierpinski’s triangle), it is known from the theory that $E_q$ coincides (in absolute value) with the fractal dimension of the fractal for any q (it is a consequence of the self-similarity property), thus for the Sierpinski’s triangle $|E_q| \approx 1.58$.
Finally, we can observe that $E_0$ and $E_2$ could be chosen as reference descriptors for a dataset D with the aim of obtaining some hints about the distribution of the geometries in the reference space of D. Indeed, $E_0$ can be an indicator of the cases where the dataset leaves some areas of the reference space empty, while $E_2$ can also be affected by the concentration of the dataset in some areas with respect to others, that is, the situations where there are no empty areas, but different concentrations in different areas.
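As anticipated, the exponent $E_q$ is the least-squares slope of the box-counting plot in logarithmic scale. A minimal sketch follows (the function names are illustrative); it can be combined with a box_counting() function like the one sketched earlier in this section.

```python
import math

def estimate_exponent(bc_series):
    """Least-squares slope of log(BC) versus log(r); bc_series is a list of (r, BC^q_D(r)) pairs."""
    xs = [math.log(r) for r, bc in bc_series]
    ys = [math.log(bc) for r, bc in bc_series]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# Illustrative usage: E0 and E2 estimated from box counts computed at doubling cell sides.
# sides = [r0 * 2 ** i for i in range(10)]
# E0 = estimate_exponent([(r, box_counting(geoms, mbr, r, 0)) for r in sides])
# E2 = estimate_exponent([(r, box_counting(geoms, mbr, r, 2)) for r in sides])
```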
Example 1. Two examples of Box-counting plots with $q = 0$ and $q = 2$ are reported in Figure 1, where the considered dataset is shown on the left and the corresponding plots on the right. In particular, the first dataset (Figure 1a) is synthetically produced by a generator of points belonging to the Sierpinski’s triangle [17], which is a well known fractal, and by substituting the points with small polygons. The Box-counting plots for $q = 0$ and $q = 2$ are represented by diamonds in Figure 1b,c, respectively. The second dataset (Figure 1d) is a real dataset containing a set of lines representing the main roads of Australia; again the Box-counting plots for $q = 0$ and $q = 2$ are represented by diamonds in Figure 1e,f, respectively. The Box-counting plots also report, inside blue and red rectangles, the exponents $E_0$ (Figure 1b,e) and $E_2$ (Figure 1c,f), computed by applying the MapReduce algorithm illustrated in the next section. For the first dataset, two intervals of r values with constant slope are detected in the plot. (i) In the first one, with very small values of $\log(r)$, from −8.3 to −5.5, the slope is close to zero; this behavior is due to the fact that, having very small cells and being the dataset finite, the number of cells intersected by the geometries tends to be rather constant (one polygon for each cell), but, having polygons instead of points, it slightly decreases. (ii) In the second interval, from −4.1 to −0.7, the behavior of the fractal emerges and $E_0$ is very close to the expected theoretical value (minus the fractal dimension of the Sierpinski’s triangle). For the plot with $q = 2$ similar considerations can be made, since, as theoretically proved, fractals have equal exponents for every parameter q considered in the power law.
For the Australian roads dataset, again two intervals are detected: in the first one $E_0$ is around −1. This means that the dataset has a skewed distribution (indeed, $E_0$ is far from −2), and the more we reduce the size of r, the more the local behavior of the dataset as a line emerges: indeed, it is composed of linestrings. In the second interval $E_0$ is around −2, which means that for greater values of r the dataset is present in almost all cells, so the diffusion is close to the uniform one and thus $E_0$ is close to −2. In such cases $E_2$ has to be considered, since it is able to capture the real distribution of the dataset, which is actually different from its diffusion. In particular, for this dataset we can observe that most of the roads are concentrated in the south east and south west of Australia.
In order to practically use $E_0$ and $E_2$ as indicators of distribution for real datasets, it is necessary to find an easy and efficient way of computing them given a dataset D. The next section presents a MapReduce implementation in SpatialHadoop of an algorithm for efficiently computing $E_0$ and $E_2$.
4. A MapReduce Algorithm for $E_0$ and $E_2$
This section presents the second contribution of the paper, namely a MapReduce algorithm for the computation of the exponents $E_0$ and $E_2$ of the power laws introduced in Equation (2). It was implemented on top of SpatialHadoop, thus some basic spatial functions are assumed to be available in the target system.
Given a dataset D containing geometries of different types, the required exponents can be obtained by first computing the box-counting function $BC^q_D(r)$ for different values of r, and then by using linear regression to determine the slope of the line representing the plot of $BC^q_D(r)$ in logarithmic scale, as shown in Figure 1. Such slope is equal to the parameter $E_q$ which we need to estimate. It follows that the main goal is the computation of the Box-counting plot of D for $q = 0$ and $q = 2$; the successive linear regression can be applied in constant time, since the computed plots will always have the same number of pairs.
In order to compute the required Box-counting plots, we need to know the reference space of D, which is represented by its MBR. The user may already know this MBR, and thus pass it as a parameter, or it can be unknown. In the latter case, a preliminary MapReduce job can be performed to compute it, for instance by invoking the method FILEMBR() of SpatialHadoop. Given the MBR of D, it is now necessary to choose a list of grids with increasing cell sizes, to be used for computing the series $BC^0_D(r_i)$ and $BC^2_D(r_i)$. The most convenient choice for the grids is the one that guarantees Box-counting plots with homogeneously distributed values in a logarithmic scale. Thus, starting from a value $r_0$, at each step i we derive the value $r_i$ by multiplying $r_{i-1}$ by 2. The initial value $r_0$ is chosen so that the largest cell side is less than the side of the MBR (to simplify the presentation we suppose here that the MBR is a square) and that a minimum number of grids can be generated.
In the MapReduce procedure, each map task processes one split of D and, for each geometry g of the split, it identifies the cells of the finest grid (the one with cell side $r_0$) intersected by g. It counts the number of geometries intersecting each cell of this grid and, at the end, it writes as result a set of pairs $\langle i, \delta_i \rangle$, where i identifies a cell of the finest grid, while $\delta_i$ is the count of geometries in D intersecting the cell i. Notice that we do not raise such count to the power q at this point, because the value is not final: it needs to be merged with the results of the other map tasks during a subsequent reduce phase.
The map task is presented in Algorithm 1. The finest grid is generated in the SETUP() method (line 4). Additionally, a hash map is created (line 5) that will store the counts of (only) the non-empty cells of this grid. In this way we avoid storing all its cells. In the MAP() method (lines 6–9), for each input geometry, a function is executed (line 7) that returns the list of identifiers representing the cells of the finest grid intersected by the geometry. Finally, for each one of such cells, the hash map is updated (line 9). In the CLEANUP() method (lines 10–12), the content of the hash map is written to disk, producing the input of the successive reduce task.
Algorithm 1: Map Task (the pseudo-code listing is not reproduced here; a minimal sketch follows).
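The following is an illustrative single-process rendering of the map task just described (setup/map/cleanup structure); the class, method and variable names are assumptions and do not correspond to the actual SpatialHadoop code or to the line numbers of Algorithm 1.

```python
import math

class BoxCountMapper:
    """Map task sketch: per-cell counts of the geometries of one split over the finest grid."""
    def setup(self, mbr, r0):
        self.x0, self.y0, self.x1, self.y1 = mbr          # reference space of D
        self.r0 = r0                                       # cell side of the finest grid
        self.cols = max(1, math.ceil((self.x1 - self.x0) / self.r0))
        self.counts = {}                                   # hash map: only non-empty cells are stored

    def intersected_cells(self, geom_mbr):
        gx0, gy0, gx1, gy1 = geom_mbr
        ci0 = int((gx0 - self.x0) // self.r0); ci1 = int((gx1 - self.x0) // self.r0)
        cj0 = int((gy0 - self.y0) // self.r0); cj1 = int((gy1 - self.y0) // self.r0)
        return [cj * self.cols + ci for cj in range(cj0, cj1 + 1) for ci in range(ci0, ci1 + 1)]

    def map(self, geom_mbr):
        for cell in self.intersected_cells(geom_mbr):      # one increment per intersected cell
            self.counts[cell] = self.counts.get(cell, 0) + 1

    def cleanup(self):
        return list(self.counts.items())                   # (cell id, partial count) pairs for the reducer
```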
The complexity of each map task is proportional to the average number of geometries of D per split.
The pseudo-code of the reduce task is presented in Algorithm 2. Notice that its SETUP() method generates the list of grids and the corresponding list of hash maps. The grids are generated starting from the initial cell side $r_0$, while the successive values $r_i$ are obtained by multiplying $r_{i-1}$ by 2.
Algorithm 2: Reduce Task (the pseudo-code listing is not reproduced here; a minimal sketch follows the description below).
The REDUCE() method receives as input the partial results produced by the map tasks. In particular, for each cell of the finest grid it receives the list of partial counts computed by the various map tasks. Therefore, it initially sums such counts, producing the total count of the cell (line 13). These total values are stored into the hash map of the finest grid (line 16). The same total values are used for updating the cells of the other, coarser grids that contain the considered cell (line 20), so that every hash map can be filled during the same iteration.
In the CLEANUP() method the series $BC^0_D(r_i)$ and $BC^2_D(r_i)$ are computed starting from the hash maps (lines 23–30). Finally, the slopes representing $E_0$ and $E_2$ are computed by applying a linear regression procedure (lines 31–32). In this phase the range of scales can be split into intervals, when necessary, in order to obtain a constant slope.
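Again purely as an illustration (not the actual Algorithm 2), the reduce logic can be sketched as follows: it merges the partial counts of the finest grid, propagates each total to the containing cell of every coarser grid, and finally derives the two box-counting series, from which $E_0$ and $E_2$ are obtained by regression (e.g., with a function like estimate_exponent() above).

```python
import math
from collections import defaultdict

def reduce_box_counts(partial_counts, num_grids, cols0):
    """partial_counts: (cell id in the finest grid, partial count) pairs emitted by all map tasks.
    num_grids: number of grids with cell sides r0, 2*r0, 4*r0, ...; cols0: columns of the finest grid."""
    totals = defaultdict(int)
    for cell, count in partial_counts:                     # merge the partial counts of the finest grid
        totals[cell] += count
    grids = [defaultdict(int) for _ in range(num_grids)]
    for cell, count in totals.items():
        ci, cj = cell % cols0, cell // cols0
        for level in range(num_grids):                     # propagate to the containing cell of each grid
            cols = math.ceil(cols0 / (1 << level))         # each grid doubles the cell side of the previous one
            grids[level][(cj >> level) * cols + (ci >> level)] += count
    bc0 = [len(g) for g in grids]                          # BC^0: number of non-empty cells per grid
    bc2 = [sum(c * c for c in g.values()) for g in grids]  # BC^2: sum of squared counts per grid
    return bc0, bc2
```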
The complexity of each reduce task is proportional to the total number of cells of the finest grid intersected by the dataset D, multiplied by the number of considered grids.
Table 3 illustrates the results of the application of the proposed MapReduce job to both synthetic and real datasets. The first column describes the dataset in terms of its distribution and number of geometries (i.e., cardinality), while the second column reports the time taken in milliseconds and the last column contains the obtained values.
5. Heuristics for Choosing the Best Partitioning Technique
Given the MapReduce procedure illustrated in the previous section, it is now time to introduce some criteria for choosing the right partitioning technique based on the computed $E_0$ and $E_2$ values. In particular, we need to introduce some criteria assessing the quality of the partitioning techniques with respect to the operations that we want to perform on the indexed datasets. Considering the spatial join, range query and k-NN, we define two quality descriptors that have to be minimized in order to improve the effectiveness of an index on a dataset D:
- the balancing descriptor: the %RSD (relative standard deviation with respect to the mean) of the split cardinality (i.e., the number of geometries per split);
- the dead space descriptor: the percentage of the reference space covered by the grid that represents dead space, that is, space containing no data.
Notice that the balancing descriptor affects the cost of a single map task, while the dead space descriptor has an impact on the total number of map tasks to be instantiated by the desired operation.
Let us consider again the partitions produced by applying the techniques illustrated in Section 2 on synthetic and real datasets, and in particular the grids obtained in Table 2. For each of them, Table 3 reports the computed values of the two descriptors. The obtained results confirm the ability of the descriptors to distinguish cases that need different partitioning techniques. Indeed, for the uniform distribution the obtained partitions are very similar for all techniques; in this case the regular grid can be the best choice, since its creation cost is lower. For the diagonal with buffer, the Quadtree-based and the Rtree-based grids adapt best to the dataset distribution. However, the partitioning produced by the Rtree-based grid has more balanced cells w.r.t. the Quadtree-based partition, in terms of number of geometries per cell. Finally, when a clustered dataset is considered, we obtain the best partitioning with the Quadtree-based grid, while the Rtree-based grid produces a partition with lots of dead space.
At this point the idea is to exploit $E_0$ and $E_2$ in order to choose the right grid for data partitioning, without building all kinds of indexes. This goal can be achieved thanks to the following properties of $E_0$ and $E_2$.
Property 1 (Dataset diffusion). Given a dataset D, when its exponent $E_0$ is close to −2, then the dead space descriptor for any index is close to zero, that is, no dead space exists.
Proof. If $E_0$ is equal to −2, then the dataset is distributed in the reference space in such a way that, for each scale r of the grid G that is considered during the box-counting computations (see Definition 1), all the cells of G are intersected by the geometries of D. Considering a normalized reference space $[0,1] \times [0,1]$:

$$BC^0_D(r) = \alpha \cdot r^{-2} \approx (1/r)^2 = |G|$$

that is, the number of non-empty cells coincides with the total number of cells of the grid. Therefore, the geometries of D are spread throughout the reference space and no dead space exists. □
Property 2 (Dataset distribution). Given a dataset D, when its exponent $E_2$ is close to 2, then the balancing descriptor of a regular grid is close to zero, that is, every cell of the grid contains the same number of geometries belonging to D.
Proof. If $E_2$ is equal to 2, then the dataset is distributed in the reference space so that the box-counting computations (see Definition 1) produce the following result:

$$BC^2_D(r) = \sum_i \delta_i(r)^2 = \alpha \cdot r^{2}$$

Since, for a fixed total number of geometries, the term $\sum_i \delta_i(r)^2$ is minimized when the $\delta_i(r)$ are all equal, we obtain the uniform distribution of the dataset, which is our thesis. □
Now we add to the above presented formal properties some experimental observations. Given a dataset D:
- When the computed $E_0$ is around −1, then the dataset is skewed, has some dead space and is located around a curve, thus it is usually connected. In this case, the regular grid will have high values for both descriptors, since the dataset cannot be uniformly partitioned into cells of equal area, while the Rtree-based grid will have the lowest value for the balancing descriptor, because it is the technique that clusters data starting from the geometries and not from the space, but also a good value for the dead space descriptor, since data cover a connected region. Finally, the Quadtree-based grid will have a good value for the dead space descriptor, but a higher value for the balancing descriptor w.r.t. the Rtree-based grid.
- When the computed $E_0$ is close to 0, then the dataset is skewed, has lots of dead space and is located around two or more points, thus it is usually not connected. In this case, the regular grid will have high values for both descriptors, as before; the Rtree-based grid will have a low value for the balancing descriptor but a higher value for the dead space descriptor, due to the fact that the connectivity is lost, while the Quadtree-based grid will have better values for both descriptors than the Rtree-based one, since it adapts better to clustered datasets.
Similar considerations are valid for the values of $E_2$, which, instead of dead space, detects regions of lower or higher concentration, thus affecting more deeply the balancing descriptor.
Using Properties 1 and 2 and the above listed observations, we propose the following heuristic for choosing the best partitioning technique customized to a dataset
D. It is based on the decision tree shown in
Figure 2.
The first test, on $E_0$, determines if the dataset is spread throughout the reference space or it is concentrated in some way. If the test is true (i.e., $E_0$ is close to −2), we need to verify the kind of distribution the dataset has using $E_2$. In particular, if $E_2$ is close to 2, then the dataset has a uniform distribution and the best index is the regular grid. Conversely, if this is false, we need an additional test on $E_2$ to check if the occupied space is connected or not. If this test is true, then the dataset can be considered connected and the Rtree-based grid is chosen; otherwise, the Quadtree-based grid will be the best choice.
On the contrary, if the first test is false, then the dataset has a skewed distribution in the reference space. In this case a further test on $E_0$ is applied to determine if the dataset is clusterized in some way. In particular, if this test is true, then the Quadtree-based grid will be chosen; conversely, the final test on $E_2$ is used to determine some degree of connectivity and to choose between a Quadtree or an Rtree index.
Notice that in the tree we use threshold values near −2 for the choices regarding $E_0$, while we use threshold values near 2 for $E_2$, since we consider $E_2$ only when the dataset is spread throughout the reference space or when $E_0$ is around −2, a situation in which only part of the $E_2$ range is actually reachable. The effectiveness of the proposed heuristic has been tested by means of some experiments illustrated in the next section.
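Purely as an illustration of the shape of this decision tree (the actual threshold values of Figure 2 are not reproduced here, so every cut-off below is a hypothetical placeholder), the heuristic could be coded as follows.

```python
def choose_partitioning(e0, e2,
                        spread_thr=-1.8, uniform_thr=1.8,
                        connected_thr=1.2, clustered_thr=-0.5):
    """Sketch of the decision tree of Section 5; all threshold values are hypothetical placeholders."""
    if e0 <= spread_thr:                  # dataset spread throughout the reference space
        if e2 >= uniform_thr:             # nearly uniform distribution
            return "regular grid"
        if e2 >= connected_thr:           # occupied space looks connected
            return "R-tree based grid"
        return "Quadtree based grid"
    # skewed distribution
    if e0 >= clustered_thr:               # concentrated around a few clusters
        return "Quadtree based grid"
    # skewed along a curve-like region: a connectivity test decides between the two indexes
    return "R-tree based grid" if e2 >= connected_thr else "Quadtree based grid"
```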
6. Experiments
This section presents some experiments performed on a Hadoop cluster composed of 10 slave nodes and 1 master node, on which Hadoop 2.8 and the SpatialHadoop 2.4 extension have been installed. Each node is characterized by 4 cores, 8 GB of RAM and 1 TB of SSD storage; all nodes are connected through an InfiniBand network. The experiments consider a collection of datasets containing both synthetic and real spatial data.
Table 4 illustrates their characteristics.
For each dataset, we computed $E_0$ and $E_2$ (columns 4 and 5 of the table) by applying the algorithm presented in
Section 4, and we determined the best index (column 6) according to the decision tree presented in the previous section. We considered three kinds of operations on such datasets: range query, spatial join and k-NN, using all the three analyzed indexes, that is, we applied three partitioning techniques: regular grid (RG), Quadtree-based grid (QT) and Rtree-based grid (RT). As regards the range query operation, we performed 200 range queries for all datasets and for all kinds of indexes by varying the side length of the query region with respect to a normalized reference space.
Table 5, where the reported times are average values over about 200 queries for each case. Notice that the best performance is always obtained when the used index is the one suggested by the proposed heuristic. In particular, when the dataset is uniformly distributed (
UD), all techniques have similar performance; indeed, the difference between the best and the worst solution is at most 5%. For the other synthetic datasets, we obtained, as expected, that for both datasets having a distribution around a line the best index is the Rtree-based grid, with a higher gain in the case of the
DL dataset. Instead, when we observed a higher similarity with a clustered distribution, like for the
DC dataset, the best solution becomes the Quadtree-based grid. Real datasets showed the same behavior as the synthetic ones, confirming the quality of the proposed heuristic. In particular, when the dataset is nearly uniformly distributed, as for
WA or
ST the differences induced by the various partitioning techniques are very low.
For the spatial join operation, we performed some experiments on synthetic datasets by joining a uniformly distributed dataset, called
UN, with another dataset
D having one of the other kinds of considered distributions. In particular, for each kind of distribution in
Table 4, 10 synthetic datasets are randomly generated and joined with
UN. The obtained results are reported in
Table 6. As for the previous tests, except for the uniformly distributed case where all indexes have the same performance, the best results are obtained when D is partitioned using the suggested kind of index, also achieving a considerable reduction of the execution times.
For real datasets we performed a first join between datasets
PR and
WA and a second join between
PR and
ST. Both joins were performed in 9 different conditions, obtained by considering all the possible combinations of index types on the first and second dataset. The results are reported in
Table 7 and
Table 8. The best execution times are obtained when both datasets are partitioned with the suggested kind of index. Notice that the worst performance is obtained when a regular grid is built for both datasets, confirming that it is not a good index in case of skewed data.
Finally, for the
k-NN operation, we performed some experiments on both synthetic and real data. As regards the synthetic ones, we produced 10 different datasets for each kind of considered distribution and for each of them we performed 40 experiments by randomly choosing the query point; the first half of the experiments considers a value of
k equal to 100, while the second one a value of
k equal to 10,000. Similar experiments have also been performed for real datasets, by varying both the value of
k and the reference point. The results are reported in
Table 9. As for the previous operations, in the case of uniformly distributed datasets the difference between the best and the worst indexing technique is very low. Conversely, in the other cases the choice of the right partitioning technique has a significant impact on the overall performance.
7. Conclusions
This paper considers the impact of a skewed distribution on the performance of three spatial operations, that is, range query, spatial join and
k-NN, with particular attention to its effect on the balancing of the work performed by the map tasks during their parallel execution. We considered SpatialHadoop as the reference framework, together with its partitioning techniques: regular grid, Quadtree-based grid and R-tree based grid. Such partitioning techniques can produce different results depending on the distribution exposed by each dataset, and these results can lead to potentially great differences in the performance of spatial operations. Therefore, the choice of the right partitioning technique for each kind of dataset becomes an important activity in order to completely exploit the benefits of a MapReduce execution. For this reason, we proposed a new technique based on the Box-counting function [
14] for efficiently estimating a dataset distribution and accordingly choosing the most suitable partitioning technique. Some experiments on synthetic and real datasets have been performed to show the effectiveness of the proposed heuristic. Such experiments reveal that in all cases the suggested partitioning technique is able to improve the execution time of the subsequent spatial operations. In particular, while in the presence of a uniform distribution all partitioning techniques have essentially the same effect on the subsequent operations, when skewed datasets are considered the choice of the wrong partitioning technique can double the time required for a given analysis. Moreover, when the distribution resembles a linear concentration, data-based partitioning techniques (like the R-tree) are more suitable, since they are able to produce more balanced splits, while they are less suitable for clustered datasets, because they could produce partitions with lots of dead space. In such cases, space-based partitioning techniques (like the Quadtree) are the right choice.
Future work regards the application of the knowledge about dataset distribution to reduce-side joins, the extension of the proposed approach to selectivity estimation, and the application of the box-counting function for estimating the skewness of multidimensional datasets, also outside the spatial context. Another important extension could be the application of the proposed method to the definition of new mixed partitioning techniques. More specifically, the technique could be applied to break a given dataset into homogeneous subsets on which different partitioning techniques could be applied. It will also be worthwhile in future work to explore the direction of deciding the partitioning without heuristics, based purely on the identified distribution and by means of machine learning models.