The project is implemented in Scala, aiming to extract clusters from massive trajectory logs stored in HDFS and to persist them in MongoDB. Spark is leveraged to conduct the work in a distributed manner: RDDs distributed across memory and disk are created, and the work’s stages are executed in parallel by distributing tasks across worker nodes. In particular, the system must provide the possibility to employ either the smart or the semi-naive algorithm for computing the transitive closure, and the choice of applying the max–min, max–delta, or max–product t-norm for transitivity. Moreover, the max–delta and max–product t-norms need a specific kind of handler.
Proof of Theorem 1. Let us put then . For to be in , must be true. But, in our case . Hence, by contradiction only in . □
Proof of Theorem 2. Let us suppose that . This is equivalent to . Hence, if , then . The last statement is equivalent to , and this holds only if because by our supposition . Consequently for , is not always true. □
These two theorems prove that, at each step of computing the t-norms, we must filter out the norms that fall below the chosen α-level. To provide all these possibilities, we present Figure 1 and Figure 2, explained in the next subsection.
4.1. The Project’s Class Diagrams
The project integrates the Scala Stackable Trait Pattern. In detail, we provide an abstraction called AbstractMatrix. It is an abstract class holding the abstract definitions of all the operations needed to achieve the transitive closure. Then, different traits override specific abstract definitions to customize the behavior. Traits are similar to Java interfaces but provide much richer functionality. In our case, traits extend the AbstractMatrix with specific behavior; they are called mixins, e.g., the MaxMinNorm mixin overrides the norm function to compute the minimum of its inputs, and the Seminaive mixin overrides the closure algorithm definition. They can only be extended by classes already extending the AbstractMatrix. Also, the AbstractMatrix has a companion object defining a specific KeyEntry type and a static ireverse operation, which reverses the Similarity class entries for the closure’s composition join. The Similarity class resembles the MatrixEntry case class in the Spark MLlib library. However, it is not a case class, and it overrides both the equals and hashCode operations to provide a specific comparison behavior. The rationale behind this class is that we observed an anomaly when computing the transitive closure. More specifically, the transitive closure computing time was very large. After observing the computing process, we found out that, when subtracting the results of the past and present iterations, the closure’s similarities were not considered equal because of their weights. However, the weights only differed from each other at a precision of . This fact led us to provide our own comparison strategy, with a custom precision of , integrated in the Similarity class.
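A minimal sketch of what this comparison strategy could look like is given below; the field names and the tolerance constant are hypothetical, since the exact precision is not reproduced in the text.

```scala
// Hypothetical sketch of the Similarity class: not a case class, with a
// custom equality that tolerates tiny weight differences between iterations.
class Similarity(val i: Long, val j: Long, val weight: Double) extends Serializable {

  // Assumed tolerance; the actual precision used in the project is not shown here.
  private val precision = 1e-6

  override def equals(other: Any): Boolean = other match {
    case that: Similarity =>
      i == that.i && j == that.j && math.abs(weight - that.weight) <= precision
    case _ => false
  }

  // hashCode ignores the weight; otherwise, nearly equal entries would still
  // land in different hash buckets when the RDDs of two iterations are subtracted.
  override def hashCode(): Int = (i, j).hashCode()
}
```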
The abstraction is extended by another trait
DefaultMatrix. As its name suggests, the trait provides default behavior for the abstract definitions. This makes it possible to leverage the abstract operations without defining a default behavior in each class. As
Figure 2 shows, the class SiMatrix can leverage the
AbstractMatrix operations without having to override them. This is our extension of the stackable trait approach. Furthermore, Scala traits support self types, meaning that a trait can only be extended by a subclass of the specified type. This is why we implemented a
MatrixImpl, a self type of the
DefaultMatrix, adding a restriction to the inheritance strategy. The concrete class
SiMatrix extends both the
DefaultMatrix and the
Serializable traits to be able, respectively, to leverage all the
AbstractMatrix traits and to execute Spark stages. Note that in the other traits, we did not override the composition operation, because we provided a concrete definition in the lowest level
SiMatrix class. To create the SiMatrix objects, we needed a factory pattern. Fortunately, the factory pattern is simplified in Scala by companion objects (A companion object of a specific class in Scala has the same name as the class and it can access the private variables and operations of the class). The
SiMatrix companion object provides the factory method for instantiating its related class. An extract of the
apply method is described in Algorithm 2. The function is a higher order function based on currying. The first function takes as input the distributed similarities and outputs a function that takes the alpha threshold used for the partitioning, the chosen t-norm, and the transitive closure algorithm. Then, depending on the t-norm and closure algorithm, it instantiates the
SiMatrix class while specifying the traits to be mixed in. Note that the order of the traits is important. Scala relies on the linearization principle: the last trait is the first to be extended. Hence the name stackable trait pattern: it is like a stack of traits where the first one to be extended is actually the last one.
Algorithm 2: The SiMatrix companion object apply method.
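The listing can be sketched roughly as follows; the constructor parameters, the string-based selection, and the trait names beyond those mentioned in the text (e.g., MaxProductNorm) are assumptions.

```scala
import org.apache.spark.rdd.RDD

object SiMatrix {
  // Hypothetical sketch of the curried factory: the first parameter list takes
  // the distributed similarities, the second the alpha threshold, the t-norm
  // and the closure algorithm; the mixin order decides which overrides win.
  // The flatten step that creates the reverse duplicates is omitted here.
  def apply(similarities: RDD[Similarity])
           (alpha: Double, norm: String, algorithm: String): AbstractMatrix =
    (norm, algorithm) match {
      case ("max-min", "seminaive") => new SiMatrix(similarities, alpha) with MaxMinNorm with Seminaive
      case ("max-min", "smart")     => new SiMatrix(similarities, alpha) with MaxMinNorm with Smart
      case ("max-product", "smart") => new SiMatrix(similarities, alpha) with MaxProductNorm with Smart
      // ... remaining t-norm / algorithm combinations
      case _                        => new SiMatrix(similarities, alpha) with MaxMinNorm with Seminaive
    }
}
```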
The apply method of the SiMatrix object leverages the flatten function to create a reverse duplicate of the entries and returns an iterator of the couple, which is then flattened by Spark’s flatMap operation. Now, for computing the closure, we specified two traits, Seminaive and Smart, related respectively to the semi-naive and the smart algorithm. The difference between the two is that, at each iteration, the first considers the initial relation when computing the composition, while the second considers the relation of the current iteration. This led us to define a curried function with default parameters for each trait. In the Seminaive trait (Algorithm 3), we specify that Rx equals the predefined Rx in the AbstractMatrix, which gets overridden in the SiMatrix class. The last function in the closure’s curried form computes the composition of the previously specified parameters Ri and Rx. This reflects the high potential of Scala’s higher order functions in simplifying highly complex computations. At the heart of the closure function, we check whether the current Rj equals the past iteration Ri. If the proposition holds, we return Ri; otherwise, we recursively call the closure with Rj as the new Ri and the other parameters as the default ones. As for the Smart trait (Algorithm 4), we change Rx into a KeyEntry value returned by calling the ireverse function on the first parameter Ri. Note that to compute the transitive closure, we consider the type KeyEntry, which reflects the RDD[(Long, (Long, Double))] type. The first relation gets its columns extracted and joined with the rows of the second relation, which are extracted by the ireverse function. The composition is overridden in the SiMatrix class (see Algorithm 5). In particular, the composition calls the handle function, which in the same manner calls the norm function. Hence, the order of the mixins is very important.
Algorithm 3: The SemiNaive trait.
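A minimal sketch of what such a curried trait could look like, assuming entries and Rx are members of AbstractMatrix holding, respectively, the current and the initial relation:

```scala
import AbstractMatrix.KeyEntry

trait Seminaive extends AbstractMatrix {
  // Hypothetical sketch: the second parameter list defaults Rx to the predefined
  // Rx of the AbstractMatrix, so every iteration composes Ri against the same
  // initial relation (the semi-naive strategy).
  override def closure(Ri: KeyEntry = entries)(Rx: KeyEntry = this.Rx): KeyEntry = {
    val Rj = composition(Ri, Rx)        // R_(i+1) = R_i o Rx
    if (Rj.subtract(Ri).isEmpty()) Ri   // fixpoint reached: the closure is stable
    else closure(Rj)()                  // otherwise Rj becomes the new Ri, defaults reused
  }
}
```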
Algorithm 4: The Smart trait.
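A corresponding sketch for the smart variant, under the same assumptions, where Rx is derived from Ri through ireverse:

```scala
import AbstractMatrix.{KeyEntry, ireverse}

trait Smart extends AbstractMatrix {
  // Hypothetical sketch: here Rx defaults to ireverse(Ri), so each iteration
  // composes the current relation with itself (the smart, squaring strategy).
  override def closure(Ri: KeyEntry = entries)(Rx: KeyEntry = ireverse(Ri)): KeyEntry = {
    val Rj = composition(Ri, Rx)        // R_(2k) = R_k o R_k
    if (Rj.subtract(Ri).isEmpty()) Ri   // fixpoint reached
    else closure(Rj)()                  // iterate on the squared relation
  }
}
```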
Algorithm 5: The overridden composition method in the SiMatrix class.
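A sketch of how such a composition could be written over the KeyEntry type, assuming alpha and handle are members of SiMatrix and that handle delegates the weight combination to the mixed-in norm; the final filter reflects the α-level cut required by the two theorems:

```scala
// Inside the SiMatrix class (hypothetical sketch). Ri is assumed to be keyed by
// its column and Rx (produced by ireverse) by its row, so the join matches every
// couple (i, j), (j, k) on the middle element j.
override def composition(Ri: KeyEntry, Rx: KeyEntry): KeyEntry =
  Ri.join(Rx)                                                             // (j, ((i, wIJ), (k, wJK)))
    .map { case (_, ((i, wIJ), (k, wJK))) => ((i, k), handle(wIJ, wJK)) } // handle calls the mixed-in norm
    .reduceByKey((a, b) => math.max(a, b))                                // max over all middle elements j
    .filter { case (_, w) => w >= alpha }                                 // drop weights below the alpha-level
    .map { case ((i, k), w) => (i, (k, w)) }
```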
To compute the similarities of the trajectories, we defined a trait GeoEntry that gets extended by the PointEntry and TrajectoryEntry case classes. Case classes in Scala are serializable by default, their constructor parameters become accessible fields, and functions like equals and hashCode are generated automatically. They are related to the Value Object Pattern. The TrajectoryEntry class is composed of objects of the PointEntry class. The similarities of the points get computed by the getSimilarity function, which calls the recursive function lcss; both are defined in the TrajectoryEntry class. The points get compared by their time and by their distance, respectively, with the functions defined in the PointEntry class. The order of the comparisons is important to reduce additional time overhead. At last, the closure can be accessed via the closure_edges immutable lazy variable. In particular, the laziness reflects the Lazy Initialization pattern, where the variable gets instantiated on access. The overall pipeline of the process is described in the next subsection with activity diagrams.
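Before turning to the pipeline, a condensed sketch of these value classes is given below, with assumed field names, matching thresholds, and a textbook LCSS-style recursion:

```scala
trait GeoEntry

// Hypothetical fields: timestamp and planar coordinates. Points are compared
// first on time, then on distance, so the cheaper temporal test short-circuits.
case class PointEntry(t: Double, x: Double, y: Double) extends GeoEntry {
  def closeInTime(o: PointEntry, eps: Double): Boolean = math.abs(t - o.t) <= eps
  def closeInSpace(o: PointEntry, delta: Double): Boolean =
    math.hypot(x - o.x, y - o.y) <= delta
}

case class TrajectoryEntry(points: List[PointEntry]) extends GeoEntry {

  // LCSS-style recursion: count matching points, skipping a point on either
  // side when the heads do not match (eps and delta are assumed thresholds).
  private def lcss(a: List[PointEntry], b: List[PointEntry],
                   eps: Double, delta: Double): Int =
    (a, b) match {
      case (Nil, _) | (_, Nil) => 0
      case (p :: as, q :: bs) if p.closeInTime(q, eps) && p.closeInSpace(q, delta) =>
        1 + lcss(as, bs, eps, delta)
      case (_ :: as, _ :: bs) =>
        math.max(lcss(a, bs, eps, delta), lcss(as, b, eps, delta))
    }

  // One plausible normalization to [0, 1], assuming non-empty trajectories.
  def getSimilarity(other: TrajectoryEntry, eps: Double, delta: Double): Double =
    lcss(points, other.points, eps, delta).toDouble /
      math.min(points.size, other.points.size)
}
```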
4.2. The Overall Execution Pipeline
The activity diagram in Figure 3 reflects the overall process. All the actions in the diagram are distributed. First, we leverage Spark’s wholeTextFiles operation to read the different directories. We also specify the number of partitions; while experimenting, we observed that if we specify n partitions, we get 2n of them. After reading the files, we use Spark’s map operation to retrieve the trajectory from each file. Spark’s map function is a higher order function taking as a parameter another function, which we specify.
Figure 4 illustrates the trajectories’ creation process. Each file is handled in a distributed manner; however, each line of the files is handled sequentially to create instances of the
PointEntry class.
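A sketch of this ingestion step, assuming sc is the active SparkContext, with placeholder paths and a hypothetical line format (timestamp, x, y):

```scala
// Reading whole files from HDFS; the requested minimum number of partitions
// is a placeholder (in our experiments, n requested partitions yielded 2n).
val files = sc.wholeTextFiles("hdfs:///trajectories/*", minPartitions = 8)

// One trajectory per file: each line is parsed sequentially into a PointEntry.
val trajectories = files.map { case (_, content) =>
  val points = content.split("\n").toList.filter(_.trim.nonEmpty).map { line =>
    val Array(t, x, y) = line.split(",").map(_.trim.toDouble) // hypothetical format
    PointEntry(t, x, y)
  }
  TrajectoryEntry(points)
}
```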
Afterward, we evaluated the system and we observed a relatively large time overhead. This led us to specify our own custom partitioning strategy based on the trajectories’ sizes. The strategy is implemented in the
LoadPartitioner class which extends Spark’s
Partitioner abstract class. The partitioner has to override the functions numPartitions and getPartition; the latter returns the id of the chosen partition. Details are given in Algorithm 6; the main idea behind the partitioning strategy is to return the partition with the lowest current size. The size of a partition is defined not by the number of trajectories it holds, but by the number of points in those trajectories. The efficiency of this logic is shown in
Section 5.
Algorithm 6: The LoadPartitioner class.
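A sketch of such a partitioner, assuming the key of each record carries the trajectory’s size expressed in points:

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch of the LoadPartitioner: each record is routed to the
// partition with the smallest accumulated load so far, where the load is
// measured in points rather than in trajectories.
class LoadPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts

  // Running load per partition.
  private val loads = Array.fill(parts)(0L)

  override def getPartition(key: Any): Int = {
    val size = key.asInstanceOf[Long]       // assumed: the key is the trajectory size
    val target = loads.indexOf(loads.min)   // least loaded partition wins
    loads(target) += size
    target
  }
}
```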
Now that the data has been repartitioned successfully, we zip each trajectory with a unique identifier. Then, we suppress the trajectories’ sizes and conduct the Cartesian product to assemble the similarities’ elements. After the Cartesian product, we apply a special filter. This filter enables us to reduce the load by more than half of the overall couples: instead of computing the similarities of n^2 couples, where n is the number of trajectories, we compute only n(n − 1)/2. This is achieved by considering only the couples where the first trajectory’s id is lower than the second’s. Because the relation is a fuzzy proximity relation, we only have to compute the similarity of a couple (ti, tj), not also that of (tj, ti). Also, we do not have to consider the similarity of a trajectory with itself, because it equals 1. Next, we compute the similarities and apply a filter to suppress the similarities whose weight is lower than the chosen α-level. The importance of repartitioning the similarities after the filter is proven in Section 5. After repartitioning, we construct the SiMatrix object and compute the closure. For the clustering, we consider the max–min transitivity and extract the clusters by computing the connected components of the created Graph object from Spark’s GraphX library. All the extracted clusters form maximum cliques. Now that the clusters are identified, we index each element by its relative cluster, then collect and broadcast the identifiers in the form of a Map. The rationale behind this broadcast is to reduce the time overhead of the join, because Spark’s joins require shuffling the data, which can cost a lot of time. The broadcast, instead, enables us to cache the identifiers on each worker. The join and persistence of the objects are explained in Figure 5.
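A sketch of the cluster extraction and broadcast, assuming siMatrix is the constructed SiMatrix object and closure_edges exposes the closure as (i, (j, weight)) entries:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Build a graph whose edges are the alpha-cut closure similarities.
val edges = siMatrix.closure_edges.map { case (i, (j, w)) => Edge(i, j, w) }
val graph = Graph.fromEdges(edges, 0.0)

// connectedComponents labels every vertex with the smallest vertex id of its
// component; that label plays the role of the cluster identifier.
val clusterIds = graph.connectedComponents().vertices.collectAsMap()

// Broadcasting caches the identifiers on every worker and avoids a shuffle join.
val clustersBc = sc.broadcast(clusterIds)
```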
Instead of handling each element, we handle each partition to reduce the number of database connections. For persisting the results, we leverage the Reactive MongoDB driver. First, a driver object, a database object, and a collection object get created. Their creation is conducted in an asynchronous manner, which means that we do not actually get the real objects; instead, we get Scala Futures. After the completion of the futures, we iterate over the trajectories. For each trajectory, we get its cluster id by looking up its id in the broadcast map. Then, we create a BSON object and call the insert operation of the collection. The operation is conducted asynchronously in a non-blocking manner. Consequently, other BSON objects are created and inserted at the same time. This is achieved by configuring the spark.task.cpus parameter to the desired number of threads per task. Then, we assemble the futures in a sequence and await its completion. In the end, we destroy the driver and return the final write result to declare the end of the task.
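A sketch of this partition-wise persistence, assuming a pre-1.0 ReactiveMongo API and placeholder database, collection, and field names; clusteredTrajectories is assumed to hold (id, trajectory) pairs and clustersBc the broadcast cluster map:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import reactivemongo.api.MongoDriver
import reactivemongo.api.collections.bson.BSONCollection
import reactivemongo.bson.BSONDocument

clusteredTrajectories.foreachPartition { partition =>
  // One driver and one connection per partition, not per element.
  val driver = new MongoDriver()
  val connection = driver.connection(List("localhost:27017"))

  // The database and collection are obtained asynchronously as Futures.
  val collection: Future[BSONCollection] =
    connection.database("trajectoriesDb").map(_.collection[BSONCollection]("clusters"))

  val inserts = partition.map { case (trajectoryId, trajectory) =>
    // The cluster id is looked up in the broadcast map, avoiding a shuffle join.
    val clusterId = clustersBc.value(trajectoryId)
    val doc = BSONDocument(
      "trajectoryId" -> trajectoryId,
      "clusterId"    -> clusterId,
      "size"         -> trajectory.points.size)
    collection.flatMap(_.insert(doc))   // non-blocking insert
  }

  // Wait for all pending inserts of this partition, then release the driver.
  Await.result(Future.sequence(inserts.toList), 10.minutes)
  driver.close()
}
```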