Similarity Search on Semantic Trajectories Using Text Processing

Ribeiro de Almeida, Damião; de Souza Baptista, Cláudio; de Andrade, Fabio Gomes

doi:10.3390/ijgi11070412


by Damião Ribeiro de Almeida ¹,*, Cláudio de Souza Baptista ¹ and Fabio Gomes de Andrade ²

¹ Department of Computer Science, Federal University of Campina Grande, Campina Grande 58429-900, Brazil

² Federal Institute of Paraíba, Cajazeiras 58900-000, Brazil

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2022, 11(7), 412; https://doi.org/10.3390/ijgi11070412

Submission received: 23 May 2022 / Revised: 8 July 2022 / Accepted: 15 July 2022 / Published: 21 July 2022


Abstract:

The use of location-based sensors has increased exponentially. Tracking moving objects has become increasingly common, consolidating a new field of research that focuses on trajectory data management. Such trajectories may be semantically enriched using sensors and social media, which enables a detailed analysis of trajectory behavior patterns. One of the problems in this field is searching a semantic trajectory database in a flexible and adaptable way. Flexibility means retrieving the trajectories that are closest to the user’s query rather than relying only on exact matching, and adaptability means adjusting to different types of semantic trajectories. This article proposes a new approach for representing and querying semantic trajectories based on text-processing techniques. Furthermore, we describe a framework, called SETHE (SEmantic Trajectory HuntEr), that performs similarity queries on semantically enriched trajectory databases. SETHE can be adapted according to the aspect types posed in user queries. We also present an evaluation of the proposed framework using a real dataset and compare our results with those of state-of-the-art approaches.

Keywords:

semantic trajectories; textual search; similarity measuring

1. Introduction

The proliferation of smartphones, low-cost sensors, and wireless communication devices has enabled the monitoring in geographic spaces of mobile entities such as people, animals, cars, ships, and natural phenomena. Currently, there are several ways to obtain the location of moving objects, with the Global Positioning System (GPS) being the most straightforward and common way to construct raw trajectories [1]. A raw trajectory consists of a sequence of geospatial points (latitude, longitude, and altitude coordinates) ordered by timestamps [2,3,4]. Trajectory data are important for analyzing and understanding the behavior of moving objects. For example, trajectory analysis may identify traffic jams, people’s behavior patterns, navigation routes, fishing areas, animal migration, and hurricane trajectories [5].

Many studies have enriched trajectory data by including context-based information. Emmanouilidis et al. [6] defined context as a synonym for the range of information that may influence service adaptation. Such information may arise from the environment, user, or other systems. Context-based information enables the enrichment of trajectory analysis and improves the understanding of moving-object behavior [7]. The use of context information can provide insights into the behavioral aspects of mobile objects that would not be possible using only raw trajectories, such as which point of interest (POI) was visited, the type of activities performed, and the trajectory purpose.

Ubiquitous computing and Internet of Things help obtain context-based information [8]. Various devices, such as smartwatches, medical sensors, radio frequency identification (RFID) devices, and environmental sensors, can capture context-based information. Another way of implicitly obtaining trajectory data and context information is by volunteered geographic information (VGI) [9], which consists of geographic data provided by citizens through location-based social networks, such as LinkedGeoData (http://linkedgeodata.org/, accessed on 22 May 2022) and OpenStreetMap (https://www.openstreetmap.org/, accessed on 22 May 2022). In addition, some social media platforms, such as Flickr (https://www.flickr.com/, accessed on 22 May 2022), Twitter (https://twitter.com/, accessed on 22 May 2022), Facebook (https://www.facebook.com/, accessed on 22 May 2022), and Foursquare (https://foursquare.com/, accessed on 22 May 2022), provide geolocation from their posts. Other behavioral information may be extracted from social media, such as the user’s activity and POI evaluation.

Adding context information to trajectory data creates a semantically enriched trajectory, or simply a semantic trajectory [10,11]. In a semantic trajectory dataset, trajectories contain annotations. The waypoints are enriched with information regarding either the environmental or mobile object context, such as the POI name or user heartbeat. An aspect is any type of information that can be annotated to the trajectory POIs. Examples of POI aspects include their name, category, weather, means of transport, and rating [12]. Hence, the trajectory becomes a complex object with several contextual data dimensions associated with its movement [13].

To better understand what semantic trajectories are, consider the example represented in Figure 1, which depicts the short trajectory of a tourist in the city of Pisa in Italy. Each stop is semantically enriched with four aspects: POI name, category, means of transport used to reach the POI, and environmental temperature. The route starts at the POI Cappella dal Pozzo, which belongs to the Chapel category. The tourist is walking, and the local temperature is 22 °C. Then, the tourist moves by bus to the Museo delle Sinopie, where the temperature is 21 °C. Finally, the route ends at Teatro Sant’Andrea, where the tourist arrives by taxi, and the temperature is 23 °C.

When dealing with semantic trajectories, we need to decide how to represent context information. For example, Noel et al. [14] represented trajectories in a multidimensional manner, in which each dimension focused on a single aspect, and each aspect was represented by a trajectory. Table 1 presents a representation of the trajectory in Figure 1. In Table 1, the transition event is the displacement between the stopping points. The first line is the POI name trajectory, which begins at Cappella dal Pozzo and ends at Teatro Sant’Andrea. We then have the category trajectory following the same direction, starting at a chapel and ending at a theater. Finally, we obtain the means of transport and temperature trajectories. Hence, it is possible to analyze trajectories from different viewpoints and solve queries concerning certain aspects. For example, it is possible to search for trajectories in which a person travels using only a bus as the means of transport, or trajectories that start at a mall and end at a theater.

Current search engines on semantic trajectory datasets retrieve only the set of trajectories that exactly matches each constraint defined in the user query. For example, suppose someone is looking for the trajectories of a person arriving at a given church by bus. In this case, the query result will only contain trajectories with the category attribute equal to church and the transport mean attribute equal to bus. Occasionally, it becomes challenging to find trajectories that perfectly match all query constraints. The more constraints a query has, the more difficult it becomes to find compatible trajectories. For example, a query that looks for trajectories of people who went by taxi to a chapel, then by bus to a museum, and then walked to a theater would not return the trajectory shown in Figure 1. Among all the constraints defined in the query, only the transport mean aspects of the first and last locations are not satisfied by that trajectory. Hence, even though it satisfies almost all query restrictions, the trajectory depicted in Figure 1 cannot be retrieved as a result.
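The limitation of exact matching can be illustrated with a minimal Python sketch. It encodes only the category and means-of-transport aspects of the Figure 1 trajectory and of the example query, and it assumes that query stops and trajectory stops are aligned one to one; the function names and the dictionary layout are illustrative choices, not part of the framework.

```python
from typing import Dict, List, Tuple

# Figure 1 trajectory, restricted to the two aspects used by the example query.
trajectory: Dict[str, List[str]] = {
    "category": ["chapel", "museum", "theater"],
    "transport": ["walk", "bus", "taxi"],
}

# Example query: taxi to a chapel, bus to a museum, walk to a theater.
query: Dict[str, List[str]] = {
    "category": ["chapel", "museum", "theater"],
    "transport": ["taxi", "bus", "walk"],
}


def exact_match(traj: Dict[str, List[str]], q: Dict[str, List[str]]) -> bool:
    """True only when every constraint of the query is satisfied."""
    return all(traj[aspect] == values for aspect, values in q.items())


def satisfied_constraints(traj: Dict[str, List[str]],
                          q: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (number of satisfied constraints, total number of constraints)."""
    total = sum(len(values) for values in q.values())
    hits = sum(
        t == v
        for aspect, values in q.items()
        for t, v in zip(traj[aspect], values)
    )
    return hits, total


print(exact_match(trajectory, query))            # False
print(satisfied_constraints(trajectory, query))  # (4, 6)
```

The trajectory satisfies four of the six constraints, yet an exact-match engine discards it, which is the behavior a ranking approach is meant to avoid.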

A query on a semantic trajectory database expresses the disposition of stop points along the trajectory [15]. Examples include searching for trajectories that start at the Leaning Tower of Pisa, trajectories that end at a museum, or trajectories that visit a church and then a theater.

Aiming to solve the aforementioned limitations, this study proposes a semantic trajectory framework that represents multi-aspect trajectories and can search for the most similar trajectories according to the aspect values contained in the query through a ranking approach. The framework queries a semantic trajectory database using text processing techniques. Hence, the trajectory is represented as a string vector. Each string may represent the POI’s name, category, or other aspects. A query is also represented by the vectors. The distance between the query and trajectory vectors determines the matching and ranking of the result set. As a baseline, we used the semantic trajectory search framework developed by Izquierdo et al. [15], which describes a formal framework for semantic trajectories using description logic (DL) and SPARQL.

Thus, the main contributions of this article are as follows:

The proposition of a new approach to represent trajectory data based on text.
The development of a search engine for querying semantic trajectories that takes into account not only POI names and categories, but also the other semantic trajectory aspects.
The specification of a new ranking algorithm that enables searching for trajectory similarity.
The implementation of an approach to querying semantic trajectories that is simpler and more efficient, in terms of execution time and storage requirements, than the SPARQL-based approach.

To validate our approach, we implemented a case study using TripBuilder [16], a trajectory dataset built from Flickr data, combined with Wikipedia data.

The remainder of the article is structured as follows. Section 2 discusses related work. Section 3 presents the fundamental concepts and the formal definition of the semantic trajectory query framework. Section 4 presents a running example to instantiate the SETHE framework. Section 5 describes the experiments performed. Finally, Section 6 concludes the paper and discusses further work to be undertaken.

2. Related Work

Usually, raw trajectory data are captured and stored in a spatial database known as a moving object database (MOD) [17]. However, once these data are captured and analyzed, it is necessary to enrich them with context-based information to increase analytics processing and to enable users to perform tasks such as identifying traffic jams, finding people’s behavior patterns, observing navigation routes, identifying fishing areas, studying animal migration, understanding hurricane trajectories, and so on [5]. SeMiTri [18] is a trajectory enrichment system that uses semantic annotation to identify trajectory stops and moves [19]. SeMiTri semantically describes trajectories with information about the POI, means of transport, and type of geographic region (residential, business, market, etc.).

CONSTAnT is a conceptual data model that represents the main aspects of a semantic trajectory [20]. The model is divided into two parts. The first part describes simple entities, providing information about the mobile object, trajectory, sub-trajectory, semantic points, environment, places, and events. The second part describes complex entities in which data mining techniques are utilized to identify information such as the purpose of movement, means of transport used, and behavior of the moving object.

Nöel et al. [14] proposed a semantic trajectory model composed of multiple aspects, where each aspect has a group of related attributes. The authors argue that a semantic trajectory can be analyzed from different points of view, such as residential and professional. The city name where the user stayed, the type of place (house or apartment), and rent value are attributes when looking at the trajectory from the residential point of view. Work, occupation, and salary are semantic attributes when examining trajectories from a professional perspective.

RDF graphs and ontologies have also emerged as solutions to enrich semantic trajectories [13,21]. The representation of semantic trajectory data in RDF enables the inference of new knowledge and the publication of data as linked open data (LoD). The CRISIS system is an example of an application that deals with trajectory data streams and uses an RDF graph that semantically represents the marine data received from several sensors [22]. Baquara² is another example of a conceptual framework that analyzes and semantically enriches trajectories by using a customizable process [7]. The MASTER project models the trajectory and its context using RDF and uses the rendezvous database to store the RDF data [13].

Alvares et al. [23] proposed a model that represents essential parts of a trajectory (stop, move, and semantics) and uses the SQL language to perform queries on places visited, types of places, and other analyses. Izquierdo et al. [15] addressed the problem of queries on semantic trajectories using a stop-and-move representation. The authors described a formal framework using DL to formally introduce the syntax and semantics of trajectories and the mechanisms needed to express queries in their database. As a proof of concept, the authors used the TripBuilder [16] dataset with georeferenced photos captured by Flickr users. The stop points were enriched with the POI name, category, and movements with the means of transport. The concepts described in DL were implemented in RDF, and the queries were expressed using SPARQL.

The aforementioned studies used complex data models to represent the semantic trajectories. These models contain many entities and relationships that make the entire scenario challenging to understand. These become more complex as new aspects are added to the data model. Consequently, queries become more difficult to express and are based on exact matching without ranking. This study proposes a text representation of semantic trajectories, resulting in a simpler data model that optimizes memory size requirements and query performance. Consequently, a similarity query returns more results, reducing the frustration of the user due to empty answers.

3. SETHE: A Semantic Trajectory Retrieval Approach

In this section, we present the SETHE (SEmantic Trajectory HuntEr) framework, a new approach for representing semantic trajectories and their aspects using text processing. In SETHE, POI names, categories, and other aspects are represented as a sequence of terms. We then perform queries using text processing and rank the result set according to how close each retrieved trajectory is to the query. SETHE searches for trajectories containing at least one sub-sequence corresponding to the query, and whose semantic values are closest to the aspects specified in a user query. In this section, we formalize the trajectory model that underpins SETHE.

3.1. Basic Concepts

There are several similar definitions of trajectories [18,20,23]. However, we present a new approach for representing and dealing with trajectory data, in which a semantic trajectory is represented as a text vector. Hence, a query may be expressed using regular expressions, and a ranking approach is used to return not only exact matches, but also similar results. Considering that a POI is a specific location in which someone may be interested [24], we present a new aspect-based semantic trajectory definition.

Definition 1.

An aspect-based semantic trajectory is a sequence of POIs $T = \langle p_{1}, p_{2}, \dots, p_{n} \rangle$ ordered chronologically and represented by a set of tuples $ST = \{\langle name, st_{1} \rangle, \langle category, st_{2} \rangle, \langle aspectName_{3}, st_{3} \rangle, \dots, \langle aspectName_{a}, st_{a} \rangle\}$, where $a \geq 2$. Every $st_{i}$ is a sequence of $n$ text values, one for each point $p_{i} \in T$. A POI of $ST$ can be simple, with only a name and a category, so that $ST = \{\langle name, st_{1} \rangle, \langle category, st_{2} \rangle\}$, or more complex, with semantic aspects other than the POI name and category. Hence, names and categories are mandatory, and other aspects are optional.

Applying Definition 1, we can represent the Figure 1 trajectory as:

$$\begin{matrix} ST_{f1} = \{\langle name, \langle \text{Cappella dal Pozzo}, \text{Sinopie Museum}, \text{Teatro Sant'Andrea} \rangle\rangle, \\ \langle category, \langle Chapel, Museum, Theater \rangle\rangle, \\ \langle transportmean, \langle Walk, Bus, Taxi \rangle\rangle, \\ \langle temperature, \langle 22, 21, 23 \rangle\rangle\} \end{matrix}$$
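As a rough illustration of Definition 1, the trajectory of Figure 1 can be held in a plain Python dictionary that maps each aspect name to its sequence of text values. The sketch below is only one possible encoding; the `validate_trajectory` helper and the choice of a dictionary are assumptions made for illustration.

```python
from typing import Dict, List

# Aspects that Definition 1 treats as mandatory.
MANDATORY = ("name", "category")


def validate_trajectory(st: Dict[str, List[str]]) -> int:
    """Check the constraints of Definition 1 and return n, the number of POIs."""
    for aspect in MANDATORY:
        if aspect not in st:
            raise ValueError(f"missing mandatory aspect: {aspect}")
    lengths = {len(values) for values in st.values()}
    if len(lengths) != 1:
        raise ValueError("every aspect sequence needs exactly one value per POI")
    return lengths.pop()


# The trajectory of Figure 1, with every aspect value stored as text.
st_f1: Dict[str, List[str]] = {
    "name": ["Cappella dal Pozzo", "Sinopie Museum", "Teatro Sant'Andrea"],
    "category": ["Chapel", "Museum", "Theater"],
    "transportmean": ["Walk", "Bus", "Taxi"],
    "temperature": ["22", "21", "23"],
}

n = validate_trajectory(st_f1)
print(n)  # 3
```

Keeping every aspect as a sequence of the same length means that position $i$ in any aspect sequence always refers to the same POI.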

A sub-sequence is another important concept for understanding semantic trajectory query processing. According to Gusfield [25], while a sub-trajectory represents consecutive points of a trajectory T, the POIs do not need to be consecutive in a sub-sequence of T. Sub-sequences are used during query processing to retrieve POI sequences from T that match the user’s query.

Definition 2.

An aspect-based semantic sub-sequence trajectory $SST = \{\langle name, sst_{1} \rangle, \langle category, sst_{2} \rangle, \langle aspectName_{3}, sst_{3} \rangle, \dots, \langle aspectName_{a}, sst_{a} \rangle\}$ represents a sub-sequence of POIs of $T$, where $a \geq 2$.

According to Definition 2, the sub-sequence $SST_{f1}$ represents a sub-sequence of $ST_{f1}$. We can see that the $SST_{f1}$ POIs are sequential points of $ST_{f1}$, but they are not necessarily consecutive.

$$\begin{matrix} SST_{f1} = \{\langle name, \langle \text{Cappella dal Pozzo}, \text{Teatro Sant'Andrea} \rangle\rangle, \\ \langle category, \langle Chapel, Theater \rangle\rangle, \\ \langle transportmean, \langle Walk, Taxi \rangle\rangle, \\ \langle temperature, \langle 22, 23 \rangle\rangle\} \end{matrix}$$
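The relationship between a trajectory and one of its sub-sequences can be sketched in Python as a selection of strictly increasing positions applied to every aspect at once. The `subsequence` helper and the 0-based positions are illustrative assumptions; the data reproduce the Figure 1 example.

```python
from typing import Dict, List

st_f1: Dict[str, List[str]] = {
    "name": ["Cappella dal Pozzo", "Sinopie Museum", "Teatro Sant'Andrea"],
    "category": ["Chapel", "Museum", "Theater"],
    "transportmean": ["Walk", "Bus", "Taxi"],
    "temperature": ["22", "21", "23"],
}


def subsequence(st: Dict[str, List[str]], positions: List[int]) -> Dict[str, List[str]]:
    """Keep the POIs at the given 0-based positions in every aspect sequence.

    Positions must be strictly increasing so that chronological order is kept,
    but they do not have to be consecutive.
    """
    if any(later <= earlier for earlier, later in zip(positions, positions[1:])):
        raise ValueError("positions must be strictly increasing")
    return {aspect: [values[i] for i in positions] for aspect, values in st.items()}


# Skipping the museum (position 1) yields the sub-sequence of the example.
sst_f1 = subsequence(st_f1, [0, 2])
print(sst_f1["name"])      # ['Cappella dal Pozzo', "Teatro Sant'Andrea"]
print(sst_f1["category"])  # ['Chapel', 'Theater']
```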

3.2. Query Processing

To perform a search in a semantic trajectory dataset, users must provide some information, such as POI names, categories, and/or other aspects. The stops can be identified based on the POI name or category. Figure 2 shows a graphical example of a query in which a user is searching for trajectories that pass through a museum and end up at the Leaning Tower of Pisa.

In addition to specifying the stops, the user may also specify what aspects are associated with each specific point. In Figure 2, the user wants to search for trajectories of people who took a taxi to a museum that received a rating score of four, in rainy weather, and who then traveled by bus to the Leaning Tower of Pisa, which received a rating score of five, in clear weather.

The use of exact matching for the query depicted in Figure 2 may result in few or no results. To overcome this limitation, SETHE uses a ranking algorithm to search for the trajectories that most closely match the query. For this, the user must also provide a distance function and a weight for each aspect type. The distance function calculates the distance between an aspect value in the query and the corresponding value in a given trajectory, and the weight represents the degree of importance of that aspect in the user query. Both the distance and the weight influence the final rank of the results. Table 2 lists some examples of these functions, where random values are compared to the aspect values shown in Figure 2. In this example, we used a $word2vec$ function for the means of transport, the $equals$ function for the weather aspect, and the $Euclidean$ function for the rating aspect. The $word2vec$ function calculates the semantic distance between the terms. The $equals$ function returns only one of two values: 0 (zero) when the terms are different and 1 (one) when the terms are equal. The $Euclidean$ function is computed as $Euclidean(a, b) = |a - b|$.
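The two simpler functions can be sketched in a few lines of Python. This is a minimal illustration that follows the descriptions in the text; the case-sensitive comparison, the dictionary of functions, and the candidate values are assumptions, and the $word2vec$ function is left out because it depends on a trained embedding model.

```python
from typing import Callable, Dict


def equals(a: str, b: str) -> int:
    """Return 1 when the two terms are equal and 0 when they differ."""
    return 1 if a == b else 0


def euclidean(a: float, b: float) -> float:
    """One-dimensional Euclidean function: |a - b|."""
    return abs(a - b)


# One function per aspect type, as chosen by the user in the example.
functions: Dict[str, Callable] = {"weather": equals, "rating": euclidean}

query_values = {"weather": "rain", "rating": 4}
candidate_values = {"weather": "clear", "rating": 5}  # hypothetical trajectory stop

scores = {
    aspect: fn(query_values[aspect], candidate_values[aspect])
    for aspect, fn in functions.items()
}
print(scores)  # {'weather': 0, 'rating': 1}
```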

The following subsections detail the SETHE querying process. This process is accomplished in several steps: building a query to be interpreted by the framework, building a vector representation for the query, retrieving the sub-sequence with the same stop points specified in the query, building a vector representation for each retrieved sub-sequence and its aspects, and calculating the similarity between the query and sub-sequence vectors.

3.2.1. Query Building

A query is a sequence of expressions that may contain POI names, categories, and other aspects that indicate the semantic trajectories in which the user is interested. For example, using the category sequence ($museum$; $tower$) and the sequence of transport mean aspects ($Bus$; $Taxi$), SETHE looks for routes that use a bus to arrive at a museum and a taxi to arrive at a tower.

During the searching process, SETHE considers either the name or the category of the POIs at any position in the trajectory. However, it is possible to use features of regular expressions to specify the position of the POI in the textual path. For example, the symbol ^ indicates that the POI must be at the beginning of a trajectory, and the symbol $ indicates that the POI must be at the end of a trajectory. Therefore, when performing a query using the sequence (^museum; tower$), SETHE looks for trajectories that start at a museum and end at a tower.
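The effect of the two anchors can be sketched with Python's `re` module. The sketch assumes, purely for illustration, that a category sequence is serialized as lowercase terms joined by semicolons; the actual textual encoding used by the framework is not specified at this point in the text.

```python
import re
from typing import List


def to_text(values: List[str]) -> str:
    """Hypothetical encoding: lowercase terms joined by ';'."""
    return ";".join(value.lower() for value in values)


starts_and_ends = to_text(["Museum", "Church", "Tower"])  # "museum;church;tower"
museum_in_middle = to_text(["Shop", "Museum", "Tower"])   # "shop;museum;tower"

# ^ pins the first POI to the start of the path, $ pins the last POI to the end.
anchored = re.compile(r"^museum.*tower$")
# Without anchors, the POIs may appear at any position, in this order.
unanchored = re.compile(r"museum.*tower")

print(bool(anchored.search(starts_and_ends)))     # True
print(bool(anchored.search(museum_in_middle)))    # False
print(bool(unanchored.search(museum_in_middle)))  # True
```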

Table 3 shows five regular expression symbols used in the query process and two new symbols ((?-) and ∼) that help build query expressions.

Inspired by [12], we define the query as follows.

Definition 3.

Query $Q$ is a tuple $Q = (E, A, W, D, L)$, where:

$E$ represents the POI sequence as a nonempty set of tuples $E = \{\langle name, e_{1} \rangle, \langle category, e_{2} \rangle\}$. Each $e_{i}$ is a sequence of $m$ regular expressions, one for each POI.
$A$ is another representation of the POI sequence, given by a set of tuples $A = \{\langle aspectName_{1}, \alpha_{1} \rangle, \dots, \langle aspectName_{b}, \alpha_{b} \rangle\}$, where $b \geq 0$ is the number of optional aspects. Each $\alpha_{i}$ is a sequence of $m$ regular expressions, one for each POI.
$W = \{w_{1}, w_{2}, \dots, w_{b}\}$ is a set of weights, where each weight is associated with an optional aspect, and $\sum_{i = 1}^{b} w_{i} = 1$. If $A = \emptyset$, then $W = \emptyset$.
$D = \{d_{1}, d_{2}, \dots, d_{b}\}$ denotes a set of distance functions. A distance function exists for each optional aspect. If $A = \emptyset$, then $D = \emptyset$.
$L = \{thr_{1}, thr_{2}, \dots, thr_{b}\}$ is the set of thresholds, that is, the maximum value that each distance function may reach. There is a threshold for each function, and if $A = \emptyset$, then $L = \emptyset$.

To facilitate query comparison, we apply distance functions only to the optional aspects. Following Definition 3, we can use the tuple $Q_{f3} = (E_{f3}, A_{f3}, W_{f3}, D_{f3}, L_{f3})$ to express the query in Figure 2. This query specifies two points. The first is a category ($museum$), and the second is a POI name ($\text{Torre Pendente di Pisa}$). Therefore, $E$ must contain two tuples ($name$ and $category$), and each sequence $e_{i}$ must have two elements, where

$$E_{f3} = \{\langle name, \langle .*;\ \text{Torre Pendente di Pisa} \rangle\rangle, \langle category, \langle museum;\ .* \rangle\rangle\}$$

The regular expression $.*$ is used when POI names and categories are unknown. The optional aspects in Figure 2 are means of transport, weather, and rating. The weight of each aspect depends on the user’s choice. In this example, we give the highest priority to the means of transport and the lowest priority to the weather aspect. Therefore, we assign the following weight sequence: $W_{f3} = \{0.5, 0.2, 0.3\}$. To calculate the distance between two aspect values, we use the $word2vec$ function for the means of transport, the $equals$ function for the weather aspect, and the $Euclidean$ function for the rating aspect; therefore, $D_{f3} = \{word2vec, equals, Euclidean\}$. We adopt the values $L_{f3} = \{1, 1, 5\}$ for the threshold sequence. The value of 1 (one) is the highest possible value for the $word2vec$ function. The maximum $equals$ function value is 1, and 5 is the maximum value for the $Euclidean$ function of the rating aspect, which varies between 1 and 5. Therefore, query $Q_{f3}$ is expressed as

$$\begin{matrix} Q_{f3} = (\{\langle name, \langle .*;\ \text{Torre Pendente di Pisa} \rangle\rangle, \\ \langle category, \langle museum;\ .* \rangle\rangle\}, \\ \{\langle transportmean, \langle Taxi, Bus \rangle\rangle, \\ \langle weather, \langle rain, clear \rangle\rangle, \\ \langle rating, \langle 4, 5 \rangle\rangle\}, \\ \{0.5, 0.2, 0.3\}, \\ \{word2vec, equals, Euclidean\}, \\ \{1, 1, 5\}) \end{matrix}$$
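A query tuple in the sense of Definition 3 can be sketched as a small Python data class that checks the stated constraints. The class, its `validate` method, and the use of function names instead of callables in `D` are illustrative assumptions. The transport sequence follows the order given in the description of Figure 2 (taxi to the museum, then bus to the tower).

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Query:
    E: Dict[str, List[str]]                               # name and category expressions
    A: Dict[str, List[str]] = field(default_factory=dict) # optional aspect expressions
    W: List[float] = field(default_factory=list)          # one weight per optional aspect
    D: List[str] = field(default_factory=list)            # one distance function name per aspect
    L: List[float] = field(default_factory=list)          # one threshold per distance function

    def validate(self) -> int:
        """Check the constraints of Definition 3 and return m, the number of POIs."""
        m = len(self.E["name"])
        if len(self.E["category"]) != m:
            raise ValueError("name and category need one expression per POI")
        if any(len(alpha) != m for alpha in self.A.values()):
            raise ValueError("each optional aspect needs one expression per POI")
        b = len(self.A)
        if not (len(self.W) == len(self.D) == len(self.L) == b):
            raise ValueError("W, D and L need one entry per optional aspect")
        if b > 0 and not math.isclose(sum(self.W), 1.0):
            raise ValueError("weights must sum to 1")
        return m


q_f3 = Query(
    E={"name": [".*", "Torre Pendente di Pisa"], "category": ["museum", ".*"]},
    A={"transportmean": ["Taxi", "Bus"], "weather": ["rain", "clear"], "rating": ["4", "5"]},
    W=[0.5, 0.2, 0.3],
    D=["word2vec", "equals", "Euclidean"],
    L=[1, 1, 5],
)
print(q_f3.validate())  # 2
```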

SETHE transforms a query into text to compare it with the textual trajectory database. This process is divided into four main steps.

Using the regex function to obtain the trajectories that pass through the POIs with the names and categories of the expressions.
Extracting the sub-sequences of trajectory T in which both the name and the category of the POIs match the regular expressions in E.
Using distance functions and aspect weights to calculate the similarity coefficient between the query and each sub-sequence.
Ranking the result according to the coefficient in descending order.

The function $regex(text, pattern)$ is used to determine the trajectories. The $text$ parameter can take one of two sentences: either the POI name sequence $st_{1}$ or the POI category sequence $st_{2}$. The $pattern$ parameter is a regular expression composed of the concatenation of the $e_{i}$ elements. The regular expression $(.*)$ is used to merge the $e_{i}$ expressions. The $regex$ function indicates whether $st_{i}$ has at least one sub-sequence that matches the $pattern$. Using the $Q_{f3}$ example, the $pattern$ value is $(.*)(.*)(\text{Torre Pendente di Pisa})$ for $text$ equal to $st_{1}$ and $(museum)(.*)(.*)$ for $text$ equal to $st_{2}$.

Finally, SETHE uses two $regex$ functions over the database to look for the semantic trajectories that passed through a museum and the Leaning Tower of Pisa. The final expression is as follows:

$$regex(st_{1}, \text{“(.*)(.*)(Torre Pendente di Pisa)”}) \ \text{and} \ regex(st_{2}, \text{“(museum)(.*)(.*)”})$$
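The construction of the two patterns can be sketched in Python as follows. Each expression is wrapped in parentheses and the results are joined with `(.*)`, which reproduces the patterns given for $Q_{f3}$. The semicolon-separated encoding of $st_{1}$ and $st_{2}$, the sample trajectory, and the helper names are assumptions made for illustration.

```python
import re
from typing import List


def build_pattern(expressions: List[str]) -> str:
    """Wrap each expression in a group and merge the groups with '(.*)'."""
    return "(.*)".join(f"({expression})" for expression in expressions)


def passes_filter(st1_text: str, st2_text: str, e1: List[str], e2: List[str]) -> bool:
    """Conjunction of the two regex tests, one on names and one on categories."""
    name_ok = re.search(build_pattern(e1), st1_text) is not None
    category_ok = re.search(build_pattern(e2), st2_text) is not None
    return name_ok and category_ok


e1 = [".*", "Torre Pendente di Pisa"]
e2 = ["museum", ".*"]

print(build_pattern(e1))  # (.*)(.*)(Torre Pendente di Pisa)
print(build_pattern(e2))  # (museum)(.*)(.*)

# Hypothetical trajectory texts, with POIs separated by ';'.
st1_text = "Museo delle Sinopie;Torre Pendente di Pisa"
st2_text = "museum;tower"
print(passes_filter(st1_text, st2_text, e1, e2))  # True
```

This filter only tells whether a trajectory is a candidate. Which POIs actually form the matching sub-sequences is determined in a later step.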

Before proceeding, it is necessary to transform a query into a vector representation. A query $Q$ is represented by a sentence $\delta_{q}$ and a vector $\vec{v_{q}}$. The sentence $\delta_{q} = (y_{1}^{1}, y_{1}^{2}, \dots, y_{1}^{b}, y_{2}^{1}, y_{2}^{2}, \dots, y_{2}^{b}, \dots, y_{m}^{1}, y_{m}^{2}, \dots, y_{m}^{b})$ is the interleaved concatenation, POI by POI, of the regular expressions $\alpha_{i}$ of the optional aspects. The coordinates of the vector $\vec{v_{q}} = (v_{1}, v_{2}, v_{3}, \dots, v_{z})$ are the interleaved weights associated with each POI aspect, where Equation (1) identifies the aspect weight in $W$.

$$v_{i} = (w_{k} \mid 1 \leq i \leq z), \quad \text{where } k = ((i - 1) \bmod b) + 1 \qquad (1)$$

Using the example $Q_{f3}$ and applying Equation (1) to $\delta_{q} = (taxi\ rain\ 4\ bus\ clear\ 5)$, we obtain the vector $\vec{v_{q}} = (0.5, 0.2, 0.3, 0.5, 0.2, 0.3)$.
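Equation (1) and the interleaving of the aspect values can be checked with a short Python sketch. The function names are illustrative, and the vector length $z = m \cdot b$ is an assumption inferred from the example (two POIs and three optional aspects give six coordinates).

```python
from typing import List


def query_sentence(aspect_sequences: List[List[str]], m: int) -> List[str]:
    """Interleave the optional aspect values POI by POI."""
    return [sequence[j] for j in range(m) for sequence in aspect_sequences]


def weight_vector(weights: List[float], m: int) -> List[float]:
    """Apply Equation (1): v_i = w_k with k = ((i - 1) mod b) + 1."""
    b = len(weights)
    z = m * b  # assumed vector length: one weight per POI and aspect
    # weights is 0-indexed, so w_k is weights[k - 1] = weights[(i - 1) % b].
    return [weights[(i - 1) % b] for i in range(1, z + 1)]


aspects = [["taxi", "bus"], ["rain", "clear"], ["4", "5"]]  # transport, weather, rating
delta_q = query_sentence(aspects, m=2)
v_q = weight_vector([0.5, 0.2, 0.3], m=2)

print(delta_q)  # ['taxi', 'rain', '4', 'bus', 'clear', '5']
print(v_q)      # [0.5, 0.2, 0.3, 0.5, 0.2, 0.3]
```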

3.2.2. Discovering Sub-Sequences

After retrieving the semantic trajectories using the regex query, SETHE calculates a vector representation for each $ST$ sub-sequence that matches all regular expressions defined in $e_{1}$ and $e_{2}$. We describe this process with four algorithms. Algorithm 1 (the main function) invokes the functions described in the other algorithms and demonstrates how to extract sub-sequences from an aspect-based semantic trajectory. The $calcSubsequences$ function receives the parameters $st_{1}$, $st_{2}$, $e_{1}$, and $e_{2}$. We use a tree as the data structure that helps determine the trajectory sub-sequences. The tree starts with an empty node, which is the tree root, and the children of the root are the POIs that start a sub-sequence. According to Definition 3, the regular expression sequences $e_{1}$ and $e_{2}$ must have the same size. For each index, Algorithm 1 first checks whether the expression in $e_{1}$ is equal to “.*”; if so, it takes the corresponding expression in $e_{2}$ instead. The $regex$ function looks for all matches in the text for each $e_{i}$ regular expression. Each item in the $matches$ variable has a POI text (name or category) and the POI position in the trajectory. A node is created for each match and added to the tree. The new node receives the text value of the match, the text position in $st_{i}$, and the index of the regular expression in $e_{i}$. If it is the first regular expression, the node is added as a child of the root of the tree. Otherwise, the recursive function $addNewNode$ looks for the correct position of the node in the tree. The $fixTree$ function removes all tree branches whose heights are less than the size of $e_{i}$. Thus, only the $ST$ sub-sequences that match $e_{1}$ and $e_{2}$ remain in the tree. The $extractSubsequence$ function traverses the tree nodes to extract the trajectory sub-sequences and stores them in the $listSubs$ variable. At the end of the $calcSubsequences$ function, the $listSubs$ variable contains all the matching sub-sequences of $ST$.

Algorithm 1 Extract Sub-sequence from ST

function calcSubsequences(st_1, st_2, e_1, e_2)
    tree ← empty tree
    root ← tree.root
    listSubs ← empty list
    for index = 1 to e_1.size do
        text ← st_1
        exp ← e_1[index]
        if exp == “.*” then
            text ← st_2
            exp ← e_2[index]
        end if
        matches ← regex(text, exp)
        for all ma in matches do
            node ← new node
            node.text ← ma.text
            node.textPosition ← ma.position
            node.expIndex ← index
            if index == 1 then
                root.addChildren(node)
            else
                for all nodeChild in root.children do
                    addNewNode(nodeChild, node)
                end for
            end if
        end for
    end for
    fixTree(tree, root, e_1.size)
    for all child in root.children do
        extractSubsequence(child, {}, listSubs)
    end for
    return listSubs
end function

Algorithm 2 describes the $addNewNode$ recursive function. This function receives two parameters: the father node and the new node created by Algorithm 1. To add the new node as a child of the father node, two constraints must hold: first, the position of the new node must be greater than the position of the father node; second, the index of the new node must be one unit above the index of the father node.

Algorithm 2 Insert a Node in the Tree

function addNewNode(father, newNode)
    if newNode.textPosition > father.textPosition then
        if newNode.expIndex == father.expIndex + 1 then
            father.addChildren(newNode)
        end if
    else
        for all childNode in father.children do
            addNewNode(childNode, newNode)
        end for
    end if
end function

Algorithm 3 describes the recursive function $fixTree$. This function receives three parameters: the tree, the node to be checked, and the leaf node height. Each node of the tree is visited recursively until the leaf nodes are reached. If the index of a leaf node is different from the height (the size of $e_{i}$), the node is removed from the tree. This process is repeated until all nodes that have no children and an index less than the height have been removed.

Algorithm 3 Remove Incomplete Sub-sequence from the Tree

1:: functionfixTree ( $t r e e$ , $n o d e$ , $h e i g h t$ )
2:: for all $c h i l d N o d e$ in $n o d e . c h i l d r e n$ do
3:: $f i x T r e e (t r e e, c h i l d N o d e, h e i g h t)$
4:: end for
5:: if $n o d e . c h i l d r e n$ is $e m p t y$ then
6:: if $n o d e . e x p I n d e x! = h e i g h t$ then
7:: $t r e e . r e m o v e N o d e (n o d e)$
8:: end if
9:: end if
10:: end function

Algorithm 4 describes the behavior of the function $extractSubsequence$. This recursive function receives three parameters: the tree node, the sub-sequence currently being processed, and the sub-sequence list, which is the variable that stores the final result. The algorithm iterates through all child nodes and adds the node value to the $subsequence$ variable. If a node has more than one child, more than one sub-sequence contains that node. Therefore, $subsequence$ is cloned, and $extractSubsequence$ is invoked again with the following parameters: the child node, the clone, and the list of sub-sequences $listSubs$. When the node has no children, there is nothing left to process; hence, the value of $subsequence$ is a complete $ST$ sub-sequence, and it is added to $listSubs$. At the end of the function, $listSubs$ holds all $ST$ sub-sequences that satisfy both $e_{1}$ and $e_{2}$.

Suppose a trajectory $ST$ has five POIs; for the sake of simplicity, we highlight only the trajectory $st_{2}$. Following Algorithm 1, two sub-sequences are extracted from $st_{2}$, which we call $SST_{1}$ and $SST_{2}$, containing the category sub-sequences $(c_{2}\, c_{5})$ and $(c_{4}\, c_{5})$, respectively.

	$c_{1}$		$c_{2}$		$c_{3}$		$c_{4}$		$c_{5}$
$s t_{2}$ =(	shop		museum		church		museum		tower	)

Algorithm 4 Extract Sub-sequences from the Tree

function extractSubsequence(node, subsequence, listSubs)
    numChildren ← node.children.size
    subsequenceSize ← subsequence.size
    for all child in node.children do
        if numChildren > 1 then
            clone ← copy of subsequence
            clone[subsequenceSize + 1] ← node.value
            extractSubsequence(child, clone, listSubs)
        else
            subsequence[subsequenceSize + 1] ← node.value
            extractSubsequence(child, subsequence, listSubs)
        end if
    end for
    if numChildren == 0 then
        subsequence[subsequenceSize + 1] ← node.value
        listSubs.add(subsequence)
    end if
end function
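
As an illustration, the tree construction, pruning, and extraction steps of Algorithms 1 to 4 can be condensed into the following Python sketch. It is not the SETHE implementation: the `Node` class, the function names, and the input format (one list of 1-based match positions per query expression) are assumptions made for this example. The sketch also assumes a virtual root with position 0 and index 0, so that a single insertion routine covers both branches of Algorithm 1, and it attaches a fresh copy of the node under every valid parent so that the structure remains a tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text_position: int  # 1-based position of the POI in the textual trajectory
    exp_index: int      # 1-based index of the query expression the POI matched
    children: list = field(default_factory=list)


def add_new_node(father, new_node):
    """Attach a copy of new_node below every node that may precede it."""
    if new_node.text_position > father.text_position:
        if new_node.exp_index == father.exp_index + 1:
            father.children.append(Node(new_node.text_position, new_node.exp_index))
        else:
            for child in father.children:
                add_new_node(child, new_node)


def fix_tree(node, height):
    """Prune branches that do not reach the last expression; True if node is kept."""
    node.children = [c for c in node.children if fix_tree(c, height)]
    return bool(node.children) or node.exp_index == height


def extract(node, prefix, out):
    """Collect every root-to-leaf path as a tuple of text positions."""
    path = prefix + [node.text_position]  # a new list per call replaces cloning
    if not node.children:
        out.append(tuple(path))
    for child in node.children:
        extract(child, path, out)


def subsequences(matches):
    """matches[i] lists the positions that satisfy expression i + 1."""
    root = Node(0, 0)  # virtual root (assumption of this sketch)
    for exp_index, positions in enumerate(matches, start=1):
        for pos in positions:
            add_new_node(root, Node(pos, exp_index))
    fix_tree(root, len(matches))
    result = []
    for child in root.children:
        extract(child, [], result)
    return result


# st_2 = (shop, museum, church, museum, tower): museums at 2 and 4, tower at 5
print(subsequences([[2, 4], [5]]))
```

For the five-POI trajectory $st_{2}$ above, the call yields the position pairs (2, 5) and (4, 5), which correspond to $SST_{1}$ and $SST_{2}$.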

3.2.3. Transforming a Sub-Sequence into a Vector

After identifying the $SST_{1}$ and $SST_{2}$ sub-sequences, SETHE creates a sentence $δ$ for each of them, similar to what was performed for query $Q_{f3}$. A new sentence $δ_{sst}$ and a vector $\vec{v}_{sst}$ are created for each sub-sequence, where $δ_{sst} = (r_{1}^{1}, r_{1}^{2}, \dots, r_{1}^{b}, r_{2}^{1}, r_{2}^{2}, \dots, r_{2}^{b}, \dots, r_{n}^{1}, r_{n}^{2}, \dots, r_{n}^{b})$ and $\vec{v}_{sst} = (ν_{1}^{1}, ν_{1}^{2}, \dots, ν_{1}^{b}, ν_{2}^{1}, ν_{2}^{2}, \dots, ν_{2}^{b}, \dots, ν_{n}^{1}, ν_{n}^{2}, \dots, ν_{n}^{b})$, such that $r$ corresponds to the $SST$ optional aspects and $ν_{i}^{j} = score(y_{i}^{j}, r_{i}^{j})$.
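
The interleaving that produces $δ_{sst}$ can be sketched in Python as follows. The function name `build_sentence` and the list-based representation of the aspect trajectories are assumptions made for illustration; the sample values are those of the $ST_{F}$ trajectory used in the running example of Section 4.

```python
def build_sentence(positions, aspect_trajectories):
    """Interleave the optional aspects of the POIs in a sub-sequence.

    positions: 1-based positions of the sub-sequence POIs in the trajectory.
    aspect_trajectories: one list per optional aspect, in a fixed order.
    """
    return [traj[p - 1] for p in positions for traj in aspect_trajectories]


# Transport mean and rating trajectories of ST_F (Section 4)
st_f3 = ["Subway", "Subway", "Bus", "Walk", "Subway", "Subway", "Walk"]
st_f4 = [4, 4, 4, 5, 4, 3, 3]

# Sub-sequence <f1, f4, f7>
print(build_sentence([1, 4, 7], [st_f3, st_f4]))
```

The result lists, for each POI of the sub-sequence, its transport mean followed by its rating, which is the term order that the score vector $\vec{v}_{sst}$ follows.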

The score is calculated for each term of $δ_{sst}$, and its value is stored in the corresponding coordinate of the vector $\vec{v}_{sst}$. The score function uses a distance function to calculate the closeness of a term of $δ_{sst}$ to the term with the same index in $δ_{q}$.

The smaller the distance, the higher the score for the term of $δ_{sst}$. If there are two terms, one belonging to query $Q$ and the other belonging to a sub-sequence of $ST$, such that $y_{i}^{j} \in δ_{q}$ and $r_{i}^{j} \in δ_{sst}$, the equation to calculate the score between the two terms is

$$score(y_{i}^{j}, r_{i}^{j}) = \begin{cases} 0, & \text{if } d_{j}(y_{i}^{j}, r_{i}^{j}) > thr_{j}, \\ w_{j}, & \text{if } y_{i}^{j} = (.*), \\ \left| \left| \frac{w_{j} \cdot d_{j}(y_{i}^{j}, r_{i}^{j})}{thr_{j}} \right| - w_{j} \right|, & \text{otherwise} \end{cases} \qquad (2)$$

where $w_{j} \in W$, $d_{j} \in D$, and $thr_{j} \in L$.
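
A minimal Python sketch of Equation (2) is given below. The function and parameter names are assumptions made for this example, and the distance function is restricted to the scalar Euclidean distance of Table 2. The wildcard case is tested first, which is an implementation choice of the sketch: a wildcard query term has no value on which a distance could be computed.

```python
def euclidean(a, b):
    """Distance between two scalar values, as in Table 2."""
    return abs(a - b)


def score(y, r, weight, distance, threshold, wildcard=".*"):
    """Score of a sub-sequence term r against the query term y (Equation (2))."""
    if y == wildcard:
        # A wildcard accepts any value and earns the full weight.
        return weight
    d = distance(y, r)
    if d > threshold:
        # Terms farther apart than the threshold do not contribute.
        return 0.0
    return abs(abs(weight * d / threshold) - weight)


# Rating aspect with weight 0.3 and threshold 5
print(round(score(5, 4, 0.3, euclidean, 5), 2))
```

With a weight of 0.3 and a threshold of 5, a rating of 4 compared against a query rating of 5 scores 0.24, and an identical rating scores the full weight of 0.3.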

The trajectory coefficient of $ST$ is the highest similarity value between the query vector of $Q$ and the vectors of $ST$. The similarity function must return a value between zero and one. Examples of such similarity functions include the Jaccard and cosine functions. Let $V_{ts}$ be the set of all vectors created from the $ST$ sub-sequences. The coefficient is calculated as

$$coef(ST) = \max(similarity(\vec{v}_{q}, \vec{x})) \mid \vec{v}_{q} \in V_{q}, \vec{x} \in V_{ts} \qquad (3)$$

where $0 \leq similarity(\cdot, \cdot) \leq 1$.

A composite query $CQ = (Q_{1}, Q_{2}, \dots, Q_{n})$ consists of multiple queries gathered into a single query: SETHE executes each query separately and merges their results into a single result set.

4. Running Example

This section presents an example of how SETHE works using a query and a database composed of three trajectories. Suppose a query $Q$ that looks for trajectories in which a mobile object initially goes to the Leaning Tower of Pisa, then visits a chapel or a church, and later on stops at a museum.

Let the query $Q = (E, A, W, D, L)$, where

$E$ = {〈name, 〈Leaning_Tower_of_Pisa; .*; .*〉〉, 〈category, 〈.*; (church|chapel); museum〉〉}. As we define the first point as the Leaning Tower of Pisa, we do not need to specify the category of the first point.
$A$ = {〈transportmean, 〈taxi, bus, walk〉〉, 〈rating, 〈5, 5, 4〉〉}.
$W = \{0.7, 0.3\}$.
$D = \{equals, Euclidean\}$.
$L = \{1, 5\}$.

In this example, the transport mean has a higher weight than the rating. Transforming the query into the sentence $δ_{q}$ and applying Equation (1), we obtain the query vector $\vec{v}_{q} = (0.7, 0.3, 0.7, 0.3, 0.7, 0.3)$.

Suppose that we have a semantic trajectory database $ST$, as described in Section 3.2.1. SETHE performs a search using the $regex$ function, which receives as parameters the textual trajectory and the regular expression formed by the elements of $e_{1}$ and $e_{2}$. In this case, two $regex$ functions were used: one for $st_{1}$ and the other for $st_{2}$.

regex($st_{1}$, "Leaning_Tower_of_Pisa") and regex($st_{2}$, "(church|chapel)(.*)(museum)")
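
The filtering step can be illustrated with Python's standard `re` module. This is only a sketch under stated assumptions: textual trajectories are taken to be comma-separated strings (as in the Value column of Tables 13 to 15), the helper name `regex_filter` is hypothetical, and the category trajectories are those of the running example plus one invented non-matching trajectory.

```python
import re


def regex_filter(textual_trajectory: str, pattern: str) -> bool:
    """Return True when the textual trajectory contains a match for pattern."""
    return re.search(pattern, textual_trajectory) is not None


CATEGORY_PATTERN = r"(church|chapel)(.*)(museum)"

candidates = {
    "st_F2": "tower, gate, tower, chapel, church, campanile, museum",
    "st_G2": "tower, chapel, campanile, chapel, museum",
    "st_H2": "tower, piazza, chapel, museum, chapel, museum",
    # Hypothetical trajectory: no museum appears after the church.
    "other": "shop, museum, church, square, tower",
}

matching = [name for name, text in candidates.items()
            if regex_filter(text, CATEGORY_PATTERN)]
print(matching)
```

Only the trajectories in which a church or chapel is eventually followed by a museum pass the filter; the sub-sequence identification step then works on this reduced set.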

After searching the collection of semantic trajectories using regex, SETHE retrieves three trajectories: $ST_{F}$ = {〈name, $st_{F1}$〉, 〈category, $st_{F2}$〉, 〈transportmean, $st_{F3}$〉, 〈rating, $st_{F4}$〉}, $ST_{G}$ = {〈name, $st_{G1}$〉, 〈category, $st_{G2}$〉, 〈transportmean, $st_{G3}$〉, 〈rating, $st_{G4}$〉}, and $ST_{H}$ = {〈name, $st_{H1}$〉, 〈category, $st_{H2}$〉, 〈transportmean, $st_{H3}$〉, 〈rating, $st_{H4}$〉}. For simplicity, we highlight only the category trajectory and the aspect trajectories, shown below. In addition, for ease of explanation, we place an index on each term of the trajectory:

	$f_{1}$	$f_{2}$	$f_{3}$	$f_{4}$	$f_{5}$	$f_{6}$	$f_{7}$
$st_{F2}$ = 〈	tower	gate	tower	chapel	church	campanile	museum	〉

	$g_{1}$	$g_{2}$	$g_{3}$	$g_{4}$	$g_{5}$
$st_{G2}$ = 〈	tower	chapel	campanile	chapel	museum	〉

	$h_{1}$	$h_{2}$	$h_{3}$	$h_{4}$	$h_{5}$	$h_{6}$
$st_{H2}$ = 〈	tower	piazza	chapel	museum	chapel	museum	〉

The transport mean and rating trajectories of $ST_{F}$, $ST_{G}$, and $ST_{H}$ are, respectively,

$st_{F3} = \langle Subway, Subway, Bus, Walk, Subway, Subway, Walk \rangle$

$st_{F4} = \langle 4, 4, 4, 5, 4, 3, 3 \rangle$

$st_{G3} = \langle Walk, Walk, Subway, Bus, Subway \rangle$

$st_{G4} = \langle 5, 3, 4, 3, 3 \rangle$

$st_{H3} = \langle Taxi, Taxi, Bus, Subway, Bus, Walk \rangle$

$st_{H4} = \langle 5, 2, 3, 4, 3, 3 \rangle$

The second step is to identify the sub-sequences that satisfy the $e_{1}$ and $e_{2}$ expressions. Following Algorithm 1, a tree of paths is constructed, as depicted in Figure 3. Each level of the tree, except the root, represents an expression in either $e_{1}$ or $e_{2}$. Each POI of $ST$ that satisfies $regex(st_{1}, e_{1})$ or $regex(st_{2}, e_{2})$ is added to the tree as a child of the nodes with the immediately lower expression index. For example, using the function regex($f_{4}$, "(church|chapel)"), the category $f_{4}$ matches the regular expression (church|chapel), so the algorithm adds $f_{4}$ as a child of both $f_{1}$ and $f_{3}$ in the tree. Category $h_{4}$, for example, is added only as a child of $h_{3}$ and not of $h_{5}$, as $h_{4}$ occurs before $h_{5}$ in the trajectory.

The sub-sequences extracted from the POI tree are:

$σ_{Fa} = \langle f_{1}, f_{4}, f_{7} \rangle$

$σ_{Fb} = \langle f_{1}, f_{5}, f_{7} \rangle$

$σ_{Fc} = \langle f_{3}, f_{4}, f_{7} \rangle$

$σ_{Fd} = \langle f_{3}, f_{5}, f_{7} \rangle$

$σ_{Ga} = \langle g_{1}, g_{2}, g_{5} \rangle$

$σ_{Gb} = \langle g_{1}, g_{4}, g_{5} \rangle$

$σ_{Ha} = \langle h_{1}, h_{3}, h_{4} \rangle$

$σ_{Hb} = \langle h_{1}, h_{3}, h_{6} \rangle$

$σ_{Hc} = \langle h_{1}, h_{5}, h_{6} \rangle$

After identifying the trajectory sub-sequences, the next step is to calculate the score of each optional aspect to compose the vector that will be compared with the $\vec{v}_{q}$ vector.

Table 4, Table 5, Table 6 and Table 7 present the sentences of each $ST_{F}$ sub-sequence, which were created by interleaving the optional aspects. Each term corresponds to an optional aspect of the sub-sequence: for each POI, the transport mean comes first, followed by the rating. As specified in the query $Q$, the $equals$ function is used to calculate the similarity between two aspects of the transport mean type. Therefore, there are only two possible values: 1 (one), if the values are the same, and 0 (zero), if they are different. For the rating aspect, we use the Euclidean distance function. Applying Equation (2) to each word, we find the score for each aspect. For example, the score of a rating of 4 (four) against the query rating of 5 (five) is calculated as follows:

$score_{rating}(5, 4) = \left| \left| \frac{0.3\,(4 - 5)}{5} \right| - 0.3 \right| = 0.24$

Table 8 and Table 9 present the scores for the $ST_{G}$ sub-sequences, and Table 10, Table 11 and Table 12 contain the scores for the $ST_{H}$ sub-sequences.

The coordinates of each trajectory vector contain the scores calculated for each sub-sequence. Thus, we have the following vectors for the $ST_{F}$ sub-sequences:

$\vec{v}_{Fa} = (0, 0.24, 0, 0.3, 0.7, 0.24)$, $\vec{v}_{Fb} = (0, 0.24, 0, 0.24, 0.7, 0.24)$,

$\vec{v}_{Fc} = (0, 0.24, 0, 0.3, 0.7, 0.24)$, $\vec{v}_{Fd} = (0, 0.24, 0, 0.24, 0.7, 0.24)$

In this example, we use the Jaccard index to calculate the similarity between the vectors of the trajectories and the query vector $\vec{v}_{q}$. Equation (4) calculates the Jaccard index:

$$J(\vec{v}_{i}, \vec{v}_{q}) = \frac{\sum_{k} \min(\vec{v}_{i}[k], \vec{v}_{q}[k])}{\sum_{k} \max(\vec{v}_{i}[k], \vec{v}_{q}[k])} \qquad (4)$$

Applying Equation (4) to the $ST_{F}$ vectors and $\vec{v}_{q}$, we have the following values:

$J(\vec{v}_{Fa}, \vec{v}_{q}) = 0.493$, $J(\vec{v}_{Fb}, \vec{v}_{q}) = 0.473$,

$J(\vec{v}_{Fc}, \vec{v}_{q}) = 0.493$, $J(\vec{v}_{Fd}, \vec{v}_{q}) = 0.473$

Applying Equation (3), the coefficient for the $ST_{F}$ trajectory is 0.493. After performing the same process for the $ST_{G}$ and $ST_{H}$ trajectories, we have the final result:

$coef(ST_{H}) = 0.94$, $coef(ST_{F}) = 0.493$, $coef(ST_{G}) = 0.47$
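
The coefficients above can be reproduced with a short Python sketch of Equations (3) and (4). The function names and the dictionary layout are assumptions made for illustration; the score vectors are copied from the $ST_{F}$ vectors listed in the text and from Tables 8 to 12.

```python
def jaccard(v, q):
    """Weighted Jaccard index of two equal-length vectors (Equation (4))."""
    return (sum(min(a, b) for a, b in zip(v, q))
            / sum(max(a, b) for a, b in zip(v, q)))


def coef(vectors, q, similarity=jaccard):
    """Highest similarity between the query vector and any sub-sequence vector."""
    return max(similarity(v, q) for v in vectors)


v_q = (0.7, 0.3, 0.7, 0.3, 0.7, 0.3)

trajectories = {
    "ST_F": [(0, 0.24, 0, 0.3, 0.7, 0.24), (0, 0.24, 0, 0.24, 0.7, 0.24),
             (0, 0.24, 0, 0.3, 0.7, 0.24), (0, 0.24, 0, 0.24, 0.7, 0.24)],
    "ST_G": [(0, 0.3, 0, 0.18, 0, 0.24), (0, 0.3, 0.7, 0.18, 0, 0.24)],
    "ST_H": [(0.7, 0.3, 0.7, 0.18, 0, 0.3), (0.7, 0.3, 0.7, 0.18, 0.7, 0.24),
             (0.7, 0.3, 0.7, 0.18, 0.7, 0.24)],
}

coefficients = {name: coef(vs, v_q) for name, vs in trajectories.items()}
# Rank trajectories from the most to the least similar to the query.
ranking = sorted(coefficients, key=coefficients.get, reverse=True)
print(ranking)
```

Sorting by coefficient places $ST_{H}$ first, followed by $ST_{F}$ and $ST_{G}$.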

Therefore, the $ST_{H}$ trajectory has the highest coefficient and is thus the closest to the user query. $ST_{F}$ comes second, and $ST_{G}$ is the trajectory least similar to the query $Q$.

5. Experiments and Results

We used the TripBuilder dataset [16] to evaluate the performance of our solution. We evaluated the performance of SETHE based on the framework described in [15], which presents a set of 10 queries for the city of Pisa, Italy.

5.1. Dataset

The TripBuilder RDF dataset contains 1,617,582 triples and 55,474 trajectories, modeled into Trajectory, Stop, Move, Transportation, and POI classes. Figure 4 shows the UML representation of TripBuilder. A trajectory can have several stops and moves, as represented by the * symbol in the relationship. Each trajectory has start and end points, and each point represents a POI. The Move class represents the transition between two stops and is semantically enriched by the Transportation class.

To conduct the experiments with SETHE, we transformed the RDF TripBuilder dataset into a text dataset. We modeled a database to store the textual trajectories, as depicted in Figure 5. The Trajectory entity has a one-to-many relationship to each entity representing a different trajectory type. The POI entity represents the textual trajectory where each point contains the name of the place where the moving object stopped. The Category entity contains the category textual trajectories. The Move entity contains the textual trajectories of the transport mean utilized to reach each stop. The LocatedIn entity contains the trajectories of the regions where the moving object has stopped.

Table 13 presents a sample of the POI name trajectories stored in the $Value$ column. Table 14 shows category trajectories. Table 15 presents some examples of transport mean trajectories. The first move does not have an associated transport mean; therefore, it receives the value N/A. Indeed, SETHE supports trajectories with missing information for a given aspect. When a given POI lacks an aspect, we use the special value N/A. For example, consider a trajectory in which the first and third POIs do not have a particular aspect, say transportation. The transportation aspect for that trajectory would then be represented as $\langle N/A, Subway, N/A, Taxi, Bus, Subway \rangle$.

5.2. Results and Discussion

The experiments were carried out on a computer with a Core i7-7700 3.60 GHz processor, 32 GB of RAM, and a 500 GB hard disk, running the GNU/Linux Ubuntu 18.04 operating system. We installed the RDF dataset in the tuple database (TDB) of the Apache Jena Fuseki 4.3.2 server running on the Java platform jdk-16.0.2. We used PostgreSQL 13.2 to store the textual trajectory database. We converted the RDF triples to CSV spreadsheets, removed accents and special characters, and then loaded the data into the text database. Izquierdo et al. [15] described a semantic trajectory search framework and specified ten queries for the city of Pisa to evaluate the performance of their framework. The same queries were used to evaluate the SETHE framework. The queries are listed in Table 16.

Table 17 shows how to use the SETHE framework to answer the aforementioned queries. Some queries require POIs to be visited consecutively. Thus, the proximity between the stopping points was also used as a trajectory aspect. In this case, proximity refers to the number of stops between two POIs. In SETHE, when a proximity attribute is set to the tilde symbol (∼), the closer two POIs are, the higher their score. Izquierdo et al. [15] use the $equals$ operator to compare means of transport; therefore, we used the distance function $equals$ for the transport mean aspect. In queries that use two aspects (e.g., transport mean and proximity), we assigned the same weight of 0.5 to each aspect.

As shown in Table 17, when the query does not have a value of $e_{1}$, it is assumed that this value is empty. All queries are simple, except for query Q4, which is a composite query.

We compared the performance of the SETHE PostgreSQL queries to that of the SPARQL queries. Regular expressions were extended with the ∼ and (?-) operators. Each query was executed ten times for both SPARQL and SETHE. Figure 6 shows the average execution time for each query on a logarithmic scale. We observed that SETHE has a better response time than the SPARQL queries [15] in most cases. The Q10 SPARQL query was not charted because it took approximately one hour to run.

Owing to its ranking algorithm, SETHE returns more results than the SPARQL queries, which implies better recall, as shown in Figure 7. The top-ranked SETHE results are the same as those of the SPARQL query; they fit the user query perfectly. The blue bars in the graph in Figure 7 represent the results returned by SETHE that were not retrieved by SPARQL. The trajectories in the blue area are similar to the user query specifications but do not satisfy the SPARQL queries exactly.

Another important issue to be analyzed is the storage space required by each investigated approach. The Apache Jena server uses TDB to store the RDF graph. Figure 8 compares the storage space used by TDB and PostgreSQL. We observed that TDB required more than five times the storage space required by our textual approach.

The TripBuilder database was obtained from the repository https://figshare.com/articles/online_resource/Trajectories_RDF_Dataset_From_TripBuilder/11559090 (accessed on 22 May 2022). The source code for our project can be downloaded from https://github.com/DamiaoRA/SETHE (accessed on 22 May 2022).

6. Conclusions

The insertion of context-based information into trajectory data results in semantically enriched trajectories. Thus, trajectories may be analyzed from different perspectives, also known as aspects. Each perspective enables spatiotemporal context-based information analytics. In this study, these trajectories were called aspect-based semantic trajectories. Depending on the application, the trajectory aspects may vary significantly in terms of quantity and type. Some related approaches represent semantic trajectories using RDF graphs, ontologies, or conceptual models in which the search process is based only on an exact match. Depending on the complexity of the query, exact matches may yield few or no results.

This article proposes the SETHE framework, a search engine for querying aspect-based semantic trajectory datasets using text processing. SETHE implements partial matching using a similarity coefficient between the aspect-based semantic trajectory and the user’s query to rank the result set. In traditional semantic trajectory search tools, there is no weight related to a given aspect; hence, all aspects have the same priority for the user. Our approach uses a distance function and a weight assigned to each aspect, both of which affect the ranking algorithm. The result set contains trajectories ranked by their coefficients, calculated from the distance functions and weights. Using a ranking approach, the trajectories closest to the user query may be returned.

We also present a new approach to representing aspect-based semantic trajectory data, where each trajectory is represented only by text. The experiments using this approach demonstrated that the memory consumption for storing trajectories and their aspects is lower than that of an approach using an RDF graph, one of the main semantic trajectory representations used.

To assess the relevance of our work, we compared the results with those of one of the most recent studies in the field of semantic trajectory search. The results demonstrated that SETHE had a better average response time. Furthermore, a SETHE query usually returns more results, as we use a partial-match ranked approach. In future work, we intend to use the normalized discounted cumulative gain (NDCG) to measure ranking quality. We plan to extend the SETHE framework to encompass multidimensional modeling so that users can run roll-up and drill-down operators over trajectory aspects. Finally, we will work on implementing a graphical user interface (GUI) and perform a user assessment using the ISO 9241 standard (parts 14, 16, and 17).

Author Contributions

Conceptualization, Damião Ribeiro de Almeida and Cláudio de Souza Baptista; Methodology, Damião Ribeiro de Almeida, Fabio Gomes de Andrade and Cláudio de Souza Baptista; Software, Damião Ribeiro de Almeida; Writing—original draft, Damião Ribeiro de Almeida; Writing—review and editing, Cláudio de Souza Baptista and Fabio Gomes de Andrade; Supervision, Cláudio de Souza Baptista. All authors have read and agreed to the published version of the manuscript.

Funding

This research received external funding from the Computing Department of the Federal University of Campina Grande (UFCG).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this research can be extracted from: https://figshare.com/articles/online_resource/Trajectories_RDF_Dataset_From_TripBuilder/11559090 (accessed on 22 May 2022).

Acknowledgments

The second author would like to thank the National Council for Scientific and Technological Development (CNPQ), Brazil, for partially funding this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Kong, X.; Li, M.; Ma, K.; Tian, K.; Wang, M.; Ning, Z.; Xia, F. Big trajectory data: A survey of applications and services. IEEE Access 2018, 6, 58295–58306.
2. Fileto, R.; Raffaetà, A.; Roncato, A.; Sacenti, J.A.; May, C.; Klein, D. A semantic model for movement data warehouses. In Proceedings of the 17th International Workshop on Data Warehousing and OLAP, Shanghai, China, 3–7 November 2014; pp. 47–56.
3. Nardini, F.M.; Orlando, S.; Perego, R.; Raffaetà, A.; Renso, C.; Silvestri, C. Analysing trajectories of mobile users: From data warehouses to recommender systems. In A Comprehensive Guide through the Italian Database Research over the Last 25 Years; Springer: Berlin/Heidelberg, Germany, 2018; pp. 407–421.
4. Wagner, R.; Macedo, J.A.F.d.; Raffaetà, A.; Renso, C.; Roncato, A.; Trasarti, R. Mob-warehouse: A semantic approach for mobility analysis with a trajectory data warehouse. In Proceedings of the International Conference on Conceptual Modeling, Hong Kong, China, 11–13 November 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 127–136.
5. Alsahfi, T.; Almotairi, M.; Elmasri, R. A survey on trajectory data warehouse. Spat. Inf. Res. 2020, 28, 53–66.
6. Emmanouilidis, C.; Koutsiamanis, R.A.; Tasidou, A. Mobile guides: Taxonomy of architectures, context awareness, technologies and applications. J. Netw. Comput. Appl. 2013, 36, 103–125.
7. Fileto, R.; May, C.; Renso, C.; Pelekis, N.; Klein, D.; Theodoridis, Y. The Baquara2 Knowledge-Based Framework for Semantic Enrichment and Analysis of Movement Data. Data Knowl. Eng. 2015, 98, 104–122.
8. Qin, Y.; Sheng, Q.Z.; Falkner, N.J.; Dustdar, S.; Wang, H.; Vasilakos, A.V. When things matter: A survey on data-centric internet of things. J. Netw. Comput. Appl. 2016, 64, 137–153.
9. Goodchild, M.F. Citizens as sensors: The world of volunteered geography. GeoJournal 2007, 69, 211–221.
10. Parent, C.; Spaccapietra, S.; Renso, C.; Andrienko, G.; Andrienko, N.; Bogorny, V.; Damiani, M.L.; Gkoulalas-Divanis, A.; Macedo, J.; Pelekis, N.; et al. Semantic trajectories modeling and analysis. ACM Comput. Surv. CSUR 2013, 45, 42.
11. Almeida, D.R.d.; Baptista, C.d.S.; Andrade, F.G.d.; Soares, A. A Survey on Big Data for Trajectory Analytics. ISPRS Int. J. Geo-Inf. 2020, 9, 88.
12. Petry, L.M.; Ferrero, C.A.; Alvares, L.O.; Renso, C.; Bogorny, V. Towards semantic-aware multiple-aspect trajectory similarity measuring. Trans. GIS 2019, 23, 960–975.
13. Mello, R.d.S.; Bogorny, V.; Alvares, L.O.; Santana, L.H.Z.; Ferrero, C.A.; Frozza, A.A.; Schreiner, G.A.; Renso, C. MASTER: A multiple aspect view on trajectories. Trans. GIS 2019, 23, 805–822.
14. Noël, D.; Villanova-Oliver, M.; Gensel, J.; Le Quéau, P. Modeling semantic trajectories including multiple viewpoints and explanatory factors: Application to life trajectories. In Proceedings of the 1st International ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics, Bellevue, WA, USA, 3–6 November 2015; pp. 107–113.
15. Izquierdo, Y.T.; Monteagudo Garcia, G.; Casanova, M.A.; Paes Leme, L.A.P.; Sardianos, C.; Tserpes, K.; Varlamis, I.; Ruback Rodrigues, L.C. Stop-and-move sequence expressions over semantic trajectories. Int. J. Geogr. Inf. Sci. 2021, 35, 793–818.
16. Brilhante, I.; Macedo, J.A.; Nardini, F.M.; Perego, R.; Renso, C. Tripbuilder: A tool for recommending sightseeing tours. In Proceedings of the European Conference on Information Retrieval, Amsterdam, The Netherlands, 13–16 April 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 771–774.
17. Güting, R.H.; Schneider, M. Moving Objects Databases; Elsevier: Amsterdam, The Netherlands, 2005.
18. Yan, Z.; Chakraborty, D.; Parent, C.; Spaccapietra, S.; Aberer, K. SeMiTri: A framework for semantic annotation of heterogeneous trajectories. In Proceedings of the 14th International Conference on Extending Database Technology, Uppsala, Sweden, 21–24 March 2011; ACM: New York, NY, USA, 2011; pp. 259–270.
19. Spaccapietra, S.; Parent, C.; Damiani, M.L.; de Macedo, J.A.; Porto, F.; Vangenot, C. A conceptual view on trajectories. Data Knowl. Eng. 2008, 65, 126–146.
20. Bogorny, V.; Renso, C.; de Aquino, A.R.; de Lucca Siqueira, F.; Alvares, L.O. Constant—A Conceptual Data Model for Semantic Trajectories of Moving Objects. Trans. GIS 2014, 18, 66–88.
21. Nikitopoulos, P.; Vlachou, A.; Doulkeridis, C.; Vouros, G.A. DiStRDF: Distributed Spatio-temporal RDF Queries on Spark. In Proceedings of the EDBT/ICDT Workshops, Vienna, Austria, 26 March 2018; pp. 125–132.
22. Dividino, R.; Soares, A.; Matwin, S.; Isenor, A.W.; Webb, S.; Brousseau, M. Semantic Integration of Real-Time Heterogeneous Data Streams for Ocean-Related Decision Making. In Proceedings of the Big Data and Artificial Intelligence for Military Decision Making, Bordeaux, France, 30 May–1 June 2018.
23. Alvares, L.O.; Bogorny, V.; Kuijpers, B.; de Macedo, J.A.F.; Moelans, B.; Vaisman, A. A model for enriching trajectories with semantic geographical information. In Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic Information Systems, Seattle, WA, USA, 7–9 November 2007; pp. 1–8.
24. Chang, B.; Park, Y.; Kim, S.; Kang, J. DeepPIM: A deep neural point-of-interest imputation model. Inf. Sci. 2018, 465, 61–71.
25. Gusfield, D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology; Cambridge University Press: New York, NY, USA, 1997.

Figure 1. Example of tourist trajectory in the city of Pisa.

Figure 2. Example of a trajectory query with information on stops and aspects.

Figure 3. POI tree.

Figure 4. Representation of the TripBuilder dataset.

Figure 5. ER diagram of the textual trajectories database.

Figure 6. Performance graph.

Figure 7. Number of results per query.

Figure 8. Memory space consumption.

Table 1. Example of user semantic trajectory.

POI Name	Cappella Dal Pozzo	Museo Delle Sinopie	Teatro Sant’Andrea
Category	Chapel	Museum	Theater
Transport means	Walk	Bus	Taxi
Temperature	22 °C	21 °C	23 °C

Table 2. Example of distance functions for aspects.

word2vec Function	Equals Function	Euclidean Function
word2vec(taxi, bus) = 0.74	equals(rain, fog) = 0	Euclidean(4.0, 5.0) = 1.0
word2vec(bus, walk) = 0.67	equals(clear, clear) = 1	Euclidean(5.0, 1.0) = 4.0

Table 3. List of regular expressions for the semantic path query filter.

Expression	Explanation	Query Example	Result Example
^	begins with	(^museum; .*)	museum, chapel, tower
$	ends with	(.*; museum$)	chapel, tower, museum
|	“or” operator	((museum|chapel); tower)	chapel, square, tower
.*	any text	(.*; chapel$)	square, tower, chapel
(exp)*	the expression may be repeated zero or more times	((tower)*; museum)	tower, tower, museum
(?-)	The value ahead is repeated in the previous positions until the last expression of the trajectory begins	(?-) (Taxi; (?-) Bus)	(?-) Taxi, Bus, Bus, Bus
∼	proximity aspect	(museum; tower) ( .*; ∼)	museum, tower, square

Table 4. Scores for the sub-sequence $σ_{Fa}$.

$σ_{Fa}$	$f_{1}$		$f_{4}$		$f_{7}$
$δ_{Fa}$	Subway	4	Walk	5	Walk	3
score $δ_{Fa}$	0	0.24	0	0.3	0.7	0.24

Table 5. Scores for the sub-sequence $σ_{Fb}$.

$σ_{Fb}$	$f_{1}$		$f_{5}$		$f_{7}$
$δ_{Fb}$	Subway	4	Subway	4	Walk	3
score $δ_{Fb}$	0	0.24	0	0.24	0.7	0.24

Table 6. Scores for the sub-sequence $σ_{Fc}$.

$σ_{Fc}$	$f_{3}$		$f_{4}$		$f_{7}$
$δ_{Fc}$	Bus	4	Walk	5	Walk	3
score $δ_{Fc}$	0	0.24	0	0.3	0.7	0.24

Table 7. Scores for the sub-sequence $σ_{Fd}$.

$σ_{Fd}$	$f_{3}$		$f_{5}$		$f_{7}$
$δ_{Fd}$	Bus	4	Subway	4	Walk	3
score $δ_{Fd}$	0	0.24	0	0.24	0.7	0.24

Table 8. Scores for the sub-sequence $σ_{Ga}$.

$σ_{Ga}$	$g_{1}$		$g_{2}$		$g_{5}$
$δ_{Ga}$	Walk	5	Walk	3	Subway	3
score $δ_{Ga}$	0	0.3	0	0.18	0	0.24

Table 9. Scores for the sub-sequence $σ_{Gb}$.

$σ_{Gb}$	$g_{1}$		$g_{4}$		$g_{5}$
$δ_{Gb}$	Walk	5	Bus	3	Subway	3
score $δ_{Gb}$	0	0.3	0.7	0.18	0	0.24

Table 10. Scores for the sub-sequence $σ_{Ha}$.

$σ_{Ha}$	$h_{1}$		$h_{3}$		$h_{4}$
$δ_{Ha}$	Taxi	5	Bus	3	Subway	4
score $δ_{Ha}$	0.7	0.3	0.7	0.18	0	0.3

Table 11. Scores for the sub-sequence $σ_{Hb}$.

$σ_{Hb}$	$h_{1}$		$h_{3}$		$h_{6}$
$δ_{Hb}$	Taxi	5	Bus	3	Walk	3
score $δ_{Hb}$	0.7	0.3	0.7	0.18	0.7	0.24

Table 12. Scores for the sub-sequence $σ_{Hc}$.

$σ_{Hc}$	$h_{1}$		$h_{5}$		$h_{6}$
$δ_{Hc}$	Taxi	5	Bus	3	Walk	3
score $δ_{Hc}$	0.7	0.3	0.7	0.18	0.7	0.24

Table 13. Example of POIs name trajectory.

IdTraj	Value
TF10018	Statua_equestre_di_Cosimo_I_de_Medici, Loggia_della_Signoria, Castello_dAltafronte, Torre_della_Pagliazza, Palazzo_Bartolini-Torrigiani
TF10019	Palazzo_dei_Vescovi_a_San_Miniato_al_Monte, Basilica_di_San_Miniato_al_Monte
TF10027	Palazzo_Roffia, Porta_della_Mandorla, Campanile_di_Giotto, Battistero_di_San_Giovanni_(Firenze), Porta_della_Mandorla, Campanile_di_Giotto, Torre_dei_Caponsacchi, Palazzo_dei_Vescovi_a_San_Miniato_al_Monte, Basilica_di_Santa_Croce, Torre_dei_Caponsacchi

Table 14. Example of category trajectory.

IdTraj	Value
TF10018	scultureafirenze, loggedifirenze, castellidifirenze, torridifirenze, palazzidifirenze
TF10019	palazzidifirenze, basilichedifirenze
TF10027	palazzidifirenze, cattedralidellaprovinciadifirenze, campanili, battisteridellatoscana, cattedralidellaprovinciadifirenze, campanili, torridifirenze, palazzidifirenze, basilichedifirenze, torridifirenze

Table 15. Example of transport mean trajectory.

IdTraj	Value
TF10018	N/A, Subway, Taxi, Subway, Bus, Subway
TF10019	N/A, Subway
TF10027	N/A, Walk, Bus, Subway, Subway, Bus, Taxi, Taxi, Taxi, Taxi

Table 16. Semantic trajectories queries for the city of Pisa.

Qid	Free Text Query
Q1	Trajectories that stop at a museum and then at a chapel.
Q2	Trajectories that stop at a tower, then stop at a chapel or church, then stop at a chapel or church again, and then at a museum.
Q3	Trajectories that stop at least once at a tower, and then at a museum.
Q4	Trajectories that stop at the Lion Tower and then at the Leaning Tower, or stop at the Leaning Tower and then at the Lion Tower.
Q5	Trajectories that begin at a museum and then end at a chapel.
Q6	Trajectories that stop at a museum and, later on, optionally end at a chapel or a church.
Q7	Trajectories that begin at a chapel, stop at zero or more chapels, and end at a chapel.
Q8	Trajectories that stop at a museum and then take a bus to a chapel.
Q9	Trajectories that begin at a chapel or a church, always move by bus between stops, and end at the Leaning Tower.
Q10	Trajectories that begin at a tower, then walk to take a bus to a church, and then, using any transportation means, end at a palace.

Table 17. Expressing queries using SETHE.

Qid	SETHE Query
Q1	$e_{2} = 〈museidipisa; cappelledipisa〉$ $α_{1} = 〈.*; ∼〉$
Q2	$e_{2} = 〈torridipisa; (cappelledipisa | chiesedipisa); (cappelledipisa | chiesedipisa); museidipisa〉$ $α_{1} = 〈.; ∼; .; ∼〉$
Q3	$e_{2} = 〈torridipisa; museidipisa〉$ $α_{1} = 〈.*; ∼〉$
Q4	$Q4 = {Q4_{1} = {{〈name, 〈Torre_del_Leone; Torre_pendente_di_Pisa〉〉,$
	${〈proximity, 〈.*; ∼〉〉}},$
	$Q4_{2} = {{〈name, 〈Torre_pendente_di_Pisa; Torre_del_Leone〉〉,$
	${〈proximity, 〈.*; ∼〉〉}}}$
Q5	$e_{2} = 〈(museidipisa); (cappelledipisa)\$〉$ $α_{1} = 〈.*; ∼〉$
Q6	$e_{2} = 〈museidipisa; (cappelledipisa | chiesedipisa)*\$〉$ $α_{1} = 〈.; .〉$
Q7	$e_{2} = 〈⌃(cappelledipisa); (cappelledipisa)\$〉$ $α_{1} = 〈.; .〉$
Q8	$e_{2} = 〈museidipisa; cappelledipisa〉$ $α_{1} = 〈.; Bus〉$ $α_{2} = 〈.; ∼〉$
Q9	$e_{1} = 〈.; (Torre_pendente_di_Pisa)\$〉$ $e_{2} = 〈⌃(cappelledipisa | chiesedipisa); .〉$ $α_{1} = 〈.*; (?-)Bus〉$
Q10	$e_{2} = 〈⌃(torridipisa); .; chiesedipisa; (palazzidipisa)\$〉$ $α_{1} = 〈.; Walk; Bus; .*〉$
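
The expressions in Table 17 are written in SETHE's own notation. As a rough analogy only, an ordered-stop query of the kind listed in Table 16 (a stop in one category followed later by a stop in another) can be approximated with an ordinary regular expression over a space-separated category trajectory. The sketch below is not SETHE's matching or ranking procedure: it returns exact matches only, with no similarity score, and allowing any number of intermediate stops is an assumption of this example. The helpers `followed_by` and `search` are hypothetical, and the data are the Florence rows of Table 14, since no Pisa trajectories are listed here.

```python
import re
from typing import Dict, List

# Category trajectories copied from Table 14.
CATEGORY_TRAJ: Dict[str, str] = {
    "TF10018": "scultureafirenze, loggedifirenze, castellidifirenze, "
               "torridifirenze, palazzidifirenze",
    "TF10019": "palazzidifirenze, basilichedifirenze",
    "TF10027": "palazzidifirenze, cattedralidellaprovinciadifirenze, campanili, "
               "battisteridellatoscana, cattedralidellaprovinciadifirenze, "
               "campanili, torridifirenze, palazzidifirenze, "
               "basilichedifirenze, torridifirenze",
}


def followed_by(first: str, second: str) -> "re.Pattern":
    """Regex for: a stop `first`, then any number of stops, then `second`.

    Tokens are matched whole: each must be delimited by a space or by the
    start/end of the trajectory text.
    """
    return re.compile(
        rf"(?:^| ){re.escape(first)} (?:\S+ )*{re.escape(second)}(?: |$)"
    )


def search(pattern: "re.Pattern", trajectories: Dict[str, str]) -> List[str]:
    """Return the ids of trajectories whose category text matches `pattern`."""
    hits = []
    for traj_id, value in trajectories.items():
        # Normalise "a, b, c" into "a b c" so that stops are space-separated.
        text = " ".join(token.strip() for token in value.split(","))
        if pattern.search(text):
            hits.append(traj_id)
    return hits


palace_then_basilica = search(
    followed_by("palazzidifirenze", "basilichedifirenze"), CATEGORY_TRAJ
)
print(palace_then_basilica)
```

In this sketch a trajectory either matches or it does not, whereas the sub-sequence scores in Tables 9 to 12 show that SETHE grades candidate sub-sequences.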

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
