#### 3.2.1. Embedding Layer

To limit the problems of the curse of dimensionality, trajectory sparseness, and computational inefficiency, we replace traditional representations such as one-hot encoding by associating each discrete location with a low-dimensional dense vector (embedding). This is done by means of an embedding layer, which transforms sequences of discrete location identifiers into sequences of dense vectors before they are fed to the LSTM block, as depicted in Figure 2. In particular, each location is initially assigned a random vector of a pre-defined size, whose values are updated during the training process; just like other model parameters, embeddings are adjusted through backpropagation on the basis of the prediction outcomes. Over the course of training, they assume a meaningful mathematical representation as vectors of continuous values, whereby locations that often co-occur in the same traces share similar representations in the embedding space.

**Figure 2.** Embedding layer representation: from a sequence of discrete locations to a sequence of dense vectors.
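To make the mapping concrete, the following minimal sketch shows an embedding layer in Keras. The paper implements the model on TensorFlow but does not publish its code, so the layer usage and the toy location IDs below are illustrative assumptions.

```python
import tensorflow as tf

# Sizes taken from Sections 4.1 and 4.2: 5903 unique locations, 100-dim embeddings.
n_locations = 5903
embedding_dim = 100

# Trainable lookup table: each integer location ID maps to a dense vector
# that is updated by backpropagation like any other model parameter.
embedding = tf.keras.layers.Embedding(input_dim=n_locations, output_dim=embedding_dim)

# Two toy sequences of six hourly location IDs -> a (2, 6, 100) dense tensor.
location_sequences = tf.constant([[17, 42, 42, 7, 301, 17],
                                  [5, 5, 88, 88, 88, 12]])
dense_sequences = embedding(location_sequences)
print(dense_sequences.shape)  # (2, 6, 100)
```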

#### 3.2.2. LSTM Block

The next stage consists of the LSTM block. LSTM [73] is a recurrent neural network architecture whose repeating module is composed of four neural network structures interacting with each other. The network processes an input sequence one element at a time, receiving at each step two sources of input: the current vector of the data sequence concatenated with the output vector of the network module at the previous step. The information flows through the network modules, encoded in the cell state, and is modified by the four neural network structures until the end of the sequence is reached. The output at the last step is the final vector characterization of the sequence, which is subsequently used for the actual prediction task. If the LSTM block contains multiple LSTM layers, the final trajectory vector is the last-step output of the last layer. In general, the first LSTM layer is fed with the input sequence, the second layer is fed with the output of the first layer, and so on. Figure 3 displays a visual representation of the LSTM block; the example shows the last two steps of an embedding sequence and a block of two LSTM layers.

**Figure 3.** Visual representation of the last two steps of an LSTM block composed of two LSTM layers: the lower vectors represent the input embeddings; the vector on the upper right represents the final trajectory characterization.
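The stacking logic can be sketched with the Keras API; this is an assumption about the implementation, not the authors' published code, and the hidden size is reduced here for illustration (Section 4.2 uses 4000 units).

```python
import tensorflow as tf

hidden_size = 64  # toy value; the experiments use 4000

# Two stacked LSTM layers as in Figure 3: the first returns its full output
# sequence so the second can consume it step by step; the second returns
# only its last-step output, the final trajectory vector.
lstm_block = tf.keras.Sequential([
    tf.keras.layers.LSTM(hidden_size, return_sequences=True),
    tf.keras.layers.LSTM(hidden_size),
])

embeddings = tf.random.normal([32, 6, 100])  # (batch, steps, embedding size)
trajectory_vector = lstm_block(embeddings)   # shape (32, 64)
print(trajectory_vector.shape)
```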

Equations (1)–(6) describe the functioning of an LSTM repeating module, given an input vector *xt*: the forget gate (1) defines the information to be deleted from the cell state; the input gate (2) decides which values to update; the tanh network (3) determines a vector of new candidate values to be added to the state; the new cell state (4) is obtained by filtering the old cell state through the forget gate and adding the product of the input gate and the tanh network output; the output gate (5) defines which parts of the cell state to output; and the final LSTM output (6) results from the multiplication of the output gate with the tanh of the new cell state.

$$f\_t = \sigma(W\_f \cdot [h\_{t-1}, x\_t] + b\_f) \tag{1}$$

$$i\_t = \sigma(W\_i \cdot [h\_{t-1}, x\_t] + b\_i) \tag{2}$$

$$\tilde{C}\_t = \tanh(W\_C \cdot [h\_{t-1}, x\_t] + b\_C) \tag{3}$$

$$C\_t = f\_t \* C\_{t-1} + i\_t \* \tilde{C}\_t \tag{4}$$

$$o\_t = \sigma(W\_o \cdot [h\_{t-1}, x\_t] + b\_o) \tag{5}$$

$$h\_t = o\_t \* \tanh(C\_t) \tag{6}$$
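As a worked example of Equations (1)–(6), the following NumPy sketch implements a single repeating-module step and unrolls it over a toy sequence. The dimensions and random weight initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM repeating-module step following Equations (1)-(6).

    W and b hold one weight matrix / bias vector per gate, keyed by
    'f', 'i', 'C', and 'o'; each matrix acts on [h_{t-1}, x_t].
    """
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ hx + b["f"])      # (1) forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])      # (2) input gate
    c_tilde = np.tanh(W["C"] @ hx + b["C"])  # (3) candidate cell values
    c_t = f_t * c_prev + i_t * c_tilde       # (4) new cell state
    o_t = sigmoid(W["o"] @ hx + b["o"])      # (5) output gate
    h_t = o_t * np.tanh(c_t)                 # (6) module output
    return h_t, c_t

# Toy dimensions: 100-dim input embeddings, 8-dim hidden state.
rng = np.random.default_rng(0)
hidden, emb = 8, 100
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + emb)) for k in "fiCo"}
b = {k: np.zeros(hidden) for k in "fiCo"}

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(6, emb)):  # unroll over a six-step sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (8,) final sequence characterization for one layer
```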

#### 3.2.3. Softmax Layer

The predicted next location is obtained by means of a softmax layer on top of the LSTM block. The softmax layer is a fully connected neural network layer followed by a softmax activation function. It receives the final trajectory vector characterization as input and outputs the predicted probability distribution over the potential next locations, as shown in Figure 4.

**Figure 4.** Softmax layer representation: transforming the output vector of the LSTM block into a probability distribution over the potential next locations.

Equation (7) describes the softmax layer, where *hlast* represents the output of the last LSTM layer at the last step and *n*\_*LOC* is the total number of locations.

$$P(\text{LOC} = j \mid h\_{\text{last}}) = \frac{\exp\big(W\_j h\_{\text{last}} + b\_j\big)}{\sum\_{k=1}^{n\_{\text{LOC}}} \exp\big(W\_k h\_{\text{last}} + b\_k\big)} \tag{7}$$
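A minimal sketch of Equation (7) in Keras, assuming the dimensions of Sections 4.1 and 4.2; the random `h_last` below stands in for the actual LSTM output.

```python
import tensorflow as tf

n_locations = 5903  # n_LOC, the total number of locations (Section 4.1)

# Fully connected layer followed by a softmax activation: maps the final
# trajectory vector to a probability distribution over all locations.
softmax_layer = tf.keras.layers.Dense(n_locations, activation="softmax")

h_last = tf.random.normal([1, 4000])   # placeholder last-step LSTM output
probabilities = softmax_layer(h_last)  # shape (1, 5903); each row sums to 1
predicted_location = tf.argmax(probabilities, axis=-1)
print(int(predicted_location[0]))      # index of the most likely next location
```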

#### 3.2.4. Model Training

Prior to being fed into the neural network model, location sequences are scanned by a sliding window, determining the training features and the target variable. The window moves forward by one location until the end of each sequence, defining multiple segments of fixed length as input sequences to the deep learning model. The segment length represents the amount of past motion activity taken into account for learning to predict the future location (e.g., predicting the next location based on the last six hours of a user's mobility). Its choice, besides depending strongly on the application and dataset constraints, is closely related to the time resolution of the sequence, whereby a higher time resolution determines a larger number of locations describing the past motion activity.

The deep learning model is fed with a collection of these segments, where, for example, a window length equal to four locations would define a sequence (*LOCt*−3, *LOCt*−2, *LOCt*−1, *LOCt*) as input features to the model and the location *LOCt*+1 as the target variable. The model training maximizes the log probability, with respect to the weights of every layer (embedding, LSTM, and softmax), of observing the correct next location, given the sequence of past locations. The process relies on backpropagation and mini-batch stochastic training to determine in which direction the weights are adjusted.
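The windowing can be sketched as follows; the function name and the toy sequence are illustrative, but the logic matches the description above.

```python
import numpy as np

def sliding_window_segments(sequence, window=6):
    """Split one encoded location sequence into (features, target) pairs:
    `window` past locations are the input and the next location is the
    target; the window then advances by one step."""
    features, targets = [], []
    for start in range(len(sequence) - window):
        features.append(sequence[start:start + window])
        targets.append(sequence[start + window])
    return np.array(features), np.array(targets)

# With a window of four, a toy hourly sequence yields pairs of the form
# ([LOC_{t-3}, LOC_{t-2}, LOC_{t-1}, LOC_t], LOC_{t+1}).
X, y = sliding_window_segments([11, 11, 42, 42, 7, 301, 301], window=4)
print(X)  # three segments of four locations each
print(y)  # [7, 301, 301]
```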

The prediction for a given location sequence is therefore based on the collective historical mobility of people: the most likely next location is identified as the one having the highest probability in the model output.

#### **4. Experiment**

This section introduces the dataset used for the prediction task and reports the description and results of the experiments conducted. A particular focus is given to the evaluation of results, which are compared against traditional approaches and analyzed according to different motion characteristics. The proposed model was implemented and executed on TensorFlow (Google Brain, Mountain View, CA, USA), using an AWS EC2 p3.2xlarge GPU instance.

#### *4.1. Dataset*

To properly describe the general large-scale motion activity of foreign tourists, we used a real-world dataset comprising seven months of anonymized mobile phone call detail records (CDRs) of roamers in Italy. To present meaningful findings, it is important, especially when dealing with wide territories, to use a sufficiently large and complete dataset whose trajectories redundantly cover the study area. CDRs have been widely used in human mobility studies [74–77]; they report detected mobile phone activities enriched with a time stamp and the position of the device in terms of the coverage area of the principal antenna. We considered only short-term visitors, recorded in the country for a maximum of two weeks, and we discarded users that appeared to be completely stationary. Foreign visitors' mobility was therefore represented by short traces and non-repetitive behaviors.

The erratic profile of mobile activity, represented by sparse connection events, may critically fragment mobility traces, making it difficult to create continuous location sequences. To limit this fragmentation and define proper trajectories, we pre-processed traces into sequences unfolded at a 1 h time step; the prediction problem is thus formulated as predicting the location of a user in the next hour. In particular, if more than one track point was recorded in the same hour, the location associated with the majority of those recordings was chosen to identify the current position of the user. Given the wide territory, the choice of the time step unit, and our focus on large-scale movements, a minimum spatial resolution of 2 km was selected. Reference points were defined as the antennas with the highest number of connections within the minimum spatial resolution, and all other antennas were projected onto the closest reference point. Furthermore, we discarded very rare locations, identified by just a few tens of recorded events; being mostly randomly visited, they are not significantly involved in the overall travel behavior of foreign visitors in Italy. Nevertheless, specific characteristics of different datasets may influence parameters such as time and space resolution, and different values may be suitable for different applications.
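The hourly majority-vote step can be sketched with pandas; the column names of the raw CDR table are not given in the paper and are assumed here for illustration.

```python
import pandas as pd

# Assumed raw CDR schema: one row per connection event.
cdr = pd.DataFrame({
    "user_id":     [1, 1, 1, 1],
    "timestamp":   pd.to_datetime(["2017-06-01 09:05", "2017-06-01 09:40",
                                   "2017-06-01 09:55", "2017-06-01 10:20"]),
    "location_id": [42, 42, 7, 7],
})

# Bin track points into 1 h steps and keep, for each (user, hour), the
# location recorded most often within that hour.
cdr["hour"] = cdr["timestamp"].dt.floor("h")
hourly = (cdr.groupby(["user_id", "hour"])["location_id"]
             .agg(lambda s: s.mode().iloc[0])
             .reset_index())
print(hourly)  # user 1 is placed at location 42 at 09:00 and 7 at 10:00
```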

The final dataset consists of 1 h encoded sequences over 5903 possible unique locations across the Italian territory. To focus appropriately on short motion behaviors and to make full use of the dataset, represented by relatively short continuous traces, we set a window length equal to 6 h (6 locations), determining a total of 13 million trajectory segments (with a median displacement per segment of 36.1 km) generated by 1.4 million users. We believe this large amount of data is representative of the overall real motion behavior of foreign tourists.

#### *4.2. Experimental Settings*

We designed the neural network model using an embedding size of 100 dimensions and a block of two LSTM layers with a hidden size of 4000 neurons each. The training process was based on the cross-entropy cost function, mini-batches, and the Adam optimizer [78]. To evaluate the performance of the model on previously unseen data, we randomly split the dataset into a training set and a test set, containing 80% and 20% of the users, respectively.
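The following sketch assembles the full pipeline under these settings. Batch size, learning rate, and epoch count are not reported in the paper, so the commented `fit()` call uses placeholder values.

```python
import tensorflow as tf

n_locations, embedding_dim, hidden_size = 5903, 100, 4000

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(n_locations, embedding_dim),     # Section 3.2.1
    tf.keras.layers.LSTM(hidden_size, return_sequences=True),  # Section 3.2.2
    tf.keras.layers.LSTM(hidden_size),
    tf.keras.layers.Dense(n_locations, activation="softmax"),  # Section 3.2.3
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="sparse_categorical_crossentropy",  # cross-entropy over location IDs
              metrics=["accuracy"])

# X_train: (n_segments, 6) integer arrays of past locations;
# y_train: (n_segments,) next-location IDs. Placeholder hyperparameters:
# model.fit(X_train, y_train, batch_size=512, epochs=5)
```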

For a better evaluation of the results, we compared the achieved prediction accuracy with traditional approaches based on Markov modeling, which is widely applied in location prediction problems. Locations are represented as states and movements between locations as state transitions; building a transition matrix then identifies the most likely next destinations for each current location [33]. We report three different Markov model types as comparison baselines for our methodology:

