1. Introduction
Mobility has evolved as a hallmark of modernity around the world. A key policy objective in recent decades has been for citizens to move faster and further, more safely, and more comfortably. In this process, urban mobility has played a fundamental role, which is strongly related to mass motorization. European drivers spend around 20 min on average behind the wheel daily on personal trips on working days and between 25 and 30 min on business trips [1]. Barcelona is the second most populous city in Spain and the seventh in the European Union (EU) [2]. According to [3], there were more than 6 million daily trips in 2017; 35.3% corresponded to active mobility, 40.1% to public transport, and 24.6% to private transport (of which 67.7% and 29.8% correspond to car and motorcycle, respectively).
Unfortunately, mobility produces negative effects such as congestion, pollution, accidents, deaths from traffic accidents, and greenhouse gas emissions [4,5]. In order to satisfy its new climate objectives, the EU has launched a series of policy initiatives to reduce the negative effects of cars and, in turn, promote public transport [6].
In this context, accurate real-time prediction of traffic conditions using statistical and Artificial Intelligence (AI) methods is of vital importance for road users, the private sector, and governments. For example, identifying commuting trends may help to decide in which areas public transportation should be reinforced. Knowing mobility trends and urban design may enable the selection of the optimal number and location of public charging stations for electric vehicles, which may increase the use of these vehicles. For companies, being able to estimate their social and environmental impacts, combined with intelligent methods, helps them to design distribution processes that minimize these impacts, which tend to be correlated with economic costs [7]. Thus, high-quality data are potentially useful for modeling and addressing optimization problems related to urban mobility [8].
Regarding the availability of data, there is an increasing number of initiatives aiming to share data for the common good and for the benefit of anyone. For instance, the city of Chicago has an open data portal (
https://data.cityofchicago.org/, accessed on 2 April 2024) which enables the creation of maps and graphs to gain insights about plenty of facts (salaries, violence, vaccinations, etc.). One of the advantages of these types of portals is that data are frequently updated (faster than the statistics typically offered by national statistics institutes). There are also initiatives proposed by private companies such as Carbon Footprint Ltd. (
https://www.carbonfootprint.com/aboutus.html, accessed on 2 April 2024), which offers some online calculators to estimate the average carbon footprint of individuals and small businesses for free. Despite the emergence of these initiatives, ref. [
5] states that there is a lack of sufficiently granular indicators and related data on urban mobility (such as modal split, environmental impacts, congestion, energy use). The reason is that cities are not required to collect and provide these data. However, the increasing adoption of Internet of Things technologies (e.g., intelligent traffic lights, road sensors, or sensors in trash and recycling containers) and awareness of the potential of open data portals suggest that the availability of data will continue to increase during the next years.
Traffic modeling and prediction constitute an arduous endeavor because of the high non-linearity and complexity of traffic flow. Recently, classic statistical models have been challenged by machine learning and deep learning methods in traffic prediction tasks [9]. This is because traditional methods cannot make predictions in the medium–long term and fail to consider the spatial and temporal dependencies in the data [10].
In this context, the purpose of this article is to examine the traffic levels in Barcelona utilizing data from the Ajuntament de Barcelona’s open data service, employing both classic and machine learning methods. The primary contributions include conducting an exploratory analysis of the traffic data with visualization techniques, developing predictive models utilizing both time series analysis and the eXtreme Gradient Boosting (XGBoost) algorithm, and examining the results along with a discussion. The specific flow of the research methodology employed in this article is depicted in
Figure 1. To our knowledge, this study represents the first attempt to analyze open data on traffic levels utilizing visualization techniques, time series analysis, and the XGBoost algorithm.
Next, we outline the structure of the document. Section 2 reviews related work, focusing on traffic flow studies, prediction, and models in Barcelona. Section 3 details the methodology, encompassing visualization techniques, time series analysis, and the XGBoost algorithm. Section 4 outlines the application, detailing the dataset employed and presenting and discussing the results obtained through the various components of the methodology. Section 5 provides a discussion based on the results obtained in the previous sections. Finally, Section 6 draws pertinent conclusions and outlines potential lines of future research.
3. Methodology
To examine traffic levels in Barcelona utilizing data from the Ajuntament de Barcelona’s open data service, this paper employs visualization techniques, time series analysis, and the XGBoost algorithm.
3.1. Visualization Techniques
Visualizing traffic data is crucial for effective city traffic management. It aids in comprehending the dynamics of moving entities and uncovering patterns related to traffic, social dynamics, spatial geography, and economic trends [
22]. Some aspects of the potential application field of traffic visualization are identifying real-time traffic jams by monitoring traffic situations, discovering mobility patterns of vehicles and pedestrians, and improving route planning according to the traffic density. Commonly used visualization techniques include line charts, bar charts, heatmaps, histograms, and geospatial maps. Geospatial maps represent the most commonly employed method for visualizing traffic data, and 2D visualization is more commonly used in the literature than its 3D counterpart [
23].
While 2D geospatial maps provide a good representation of the spatial dimensions of data, they fail to present the temporal dimension of data, which is especially important in traffic data visualizations. There are different approaches for representing spatial data with temporal components. The most widely studied method in the literature is the space–time cube (STC) model [
24]. In an STC model, the spatial components of the data are depicted along the
x and
y axes, while the temporal dimension is represented along the
z axis. This model works well with small-to-medium-sized data; however, when the data are too large, the visualization becomes difficult.
Another common approach to represent space and time is to animate a map to display geospatial changes over time. The visualization technique used in this article relies on the HeatMapWithTime plugin of the Folium library [25]. A heatmap is a powerful visualization method for storytelling, especially for geospatial data. This plugin creates an interactive animation of heatmaps that end users can manipulate according to their needs. Some of the potential interaction options are overview, zoom, pause, play, loop, and playback at different speeds for a period of time. The heatmap allows us to visualize the traffic density of different sections of the city throughout the day. Every dot in the heatmap represents the location of a vehicle; by analyzing the location of vehicles at different time intervals, potential mobility patterns of vehicles and the traffic density of the streets can be identified. The dataset employed in this article (described in Section 4.1) does not contain vehicle information but only street section data; however, with the provided information, we can manually assign a certain number of vehicles to every street section according to its traffic density, i.e., higher-traffic-density sections receive a higher number of vehicles, and lower-density sections receive a lower number.
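The assignment of synthetic vehicles to sections can be sketched as follows. This is a minimal illustration, not the authors' code: the mapping `VEHICLES_PER_STATE`, the two sections, and their coordinates are hypothetical, since the article does not report the counts it used.

```python
import random

# Hypothetical mapping from traffic state to the number of synthetic vehicles
# placed on a section; the article does not report the counts it used.
VEHICLES_PER_STATE = {1: 1, 2: 2, 3: 4, 4: 6, 5: 8}


def section_points(start, end, state, rng):
    """Scatter synthetic vehicle positions along one street section.

    start and end are (lat, lon) pairs. States without an entry in the
    mapping (0 = no data, 6 = cut off) produce no points.
    """
    count = VEHICLES_PER_STATE.get(state, 0)
    points = []
    for _ in range(count):
        t = rng.random()  # relative position along the segment
        lat = start[0] + t * (end[0] - start[0])
        lon = start[1] + t * (end[1] - start[1])
        points.append([lat, lon])
    return points


def build_frames(sections, states_by_time, seed=0):
    """Return (index, frames): one list of [lat, lon] points per time step."""
    rng = random.Random(seed)
    index, frames = [], []
    for timestamp in sorted(states_by_time):
        frame = []
        for section_id, state in states_by_time[timestamp].items():
            start, end = sections[section_id]
            frame.extend(section_points(start, end, state, rng))
        index.append(timestamp)
        frames.append(frame)
    return index, frames


# Two made-up sections and two time steps.
sections = {
    1: ((41.3851, 2.1734), (41.3870, 2.1700)),
    2: ((41.3900, 2.1600), (41.3920, 2.1650)),
}
states_by_time = {
    "2022-03-25 09:00": {1: 5, 2: 3},
    "2022-03-25 15:00": {1: 2, 2: 1},
}
index, frames = build_frames(sections, states_by_time)
print(index, [len(frame) for frame in frames])
```

The nested list `frames` has the shape that `folium.plugins.HeatMapWithTime` accepts as its data argument, so, assuming Folium is installed, it could be passed as `HeatMapWithTime(frames, index=index)` and added to a `folium.Map` to obtain the animation.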
3.2. Time Series Analysis
The predictive models used are autoregressive integrated moving average (ARIMA) models [26]. The general case of ARIMA($p$, $d$, $q$) can be written as follows:
$$\left(1-\sum_{i=1}^{p}\phi_i B^i\right)(1-B)^d X_t=\left(1+\sum_{j=1}^{q}\theta_j B^j\right)\varepsilon_t,$$
where $B$ is the backshift operator ($B X_t = X_{t-1}$), $\varepsilon_t$ is a white-noise error term, the parameters $\phi_1,\dots,\phi_p$ are the autoregressive part (ar), and $\theta_1,\dots,\theta_q$ are the moving average part (ma). The models are built by section because the behavior is more stable. They are implemented using the free software R (version 4.0.3) [27] and the forecast package for R [28]. As these models have been widely utilized across various applications for an extended period, we direct interested readers to [29] for further elaboration.
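As a rough illustration of the simplest member of this family, the sketch below fits an AR(1) model, i.e., ARIMA(1, 0, 0), by conditional least squares and iterates the fitted recursion to forecast. It is written in Python for illustration only and is not the estimator used by the forecast package; the toy series of traffic states is made up.

```python
def fit_ar1(series):
    """Estimate the mean and the AR(1) coefficient by conditional least squares."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    # Regress the centered value at time t on the centered value at t - 1.
    num = sum(centered[t] * centered[t - 1] for t in range(1, n))
    den = sum(c * c for c in centered[:-1])
    phi = num / den if den else 0.0
    return mean, phi


def forecast_ar1(series, steps):
    """Iterate x_{t+1} - mean = phi * (x_t - mean) for the requested horizon."""
    mean, phi = fit_ar1(series)
    deviation = series[-1] - mean
    forecasts = []
    for _ in range(steps):
        deviation = phi * deviation
        forecasts.append(mean + deviation)
    return forecasts


# Made-up sequence of traffic states for one section.
states = [1, 2, 3, 3, 2, 1, 1, 2, 3, 3, 2, 1]
mean, phi = fit_ar1(states)
print(mean, phi, forecast_ar1(states, 3))
```

When the estimated coefficient lies strictly between -1 and 1, the forecasts decay geometrically toward the series mean, which is one reason such models struggle beyond short horizons.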
3.3. eXtreme Gradient Boosting
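eXtreme Gradient Boosting [30] (XGBoost) is a scalable machine learning technique for tree gradient boosting. XGBoost employs a methodology similar to other gradient boosting techniques, constructing a mathematical framework to predict $\hat{y}_i$ from input $x_i$. Specifically, XGBoost forms an ensemble comprising $K$ Classification and Regression Trees (CART), $f_1, \dots, f_K$. Each $f_k$ corresponds to a distinct tree $q$ with its own structure characterized by the number of leaves $T$ and the set of leaf weights $w$.
Given this established framework, the predicted output is determined by a $K$ additive function as follows:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i).$$
In this specific instance, the data values include factors such as the geographical location of the evaluated section, the time of data collection, and other data engineering variables discussed later. Rather than purely categorizing like conventional decision trees, this model incorporates continuous scores obtained from the weights associated with each leaf across all defined trees $q$.
To enable the learning process of the $K$ specified trees, a convex loss function acts as a regularized objective:
$$\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k).$$
Here, $l$ represents a differentiable loss function measuring the disparity between the target $y_i$ and the predicted value $\hat{y}_i$, while $\Omega$ is defined as a regularization term penalizing potential overfitting of the model expressed as follows:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2},$$
where $\gamma$ and $\lambda$ are non-negative regularization coefficients.
Instead of adhering to optimization methods based in Euclidean space, XGBoost undergoes training via an additive approach. In this method, for a given $i$-th instance at the $t$-th iteration, a distinct $f_t$ is introduced to minimize the objective function, represented as follows:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t).$$
Here, $\hat{y}_i^{(t)}$ denotes the prediction at that specific instance and iteration, so that $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$. This process means that $f_t$ is incrementally included based on its performance across optimization instances and iterations. The objective function, derived by expanding the loss function’s second-order Taylor series with respect to $\hat{y}_i^{(t-1)}$ for optimization purposes, is expressed using its first- and second-order gradient statistics $g_i$ and $h_i$, following
$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[l\left(y_i,\hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i)\right] + \Omega(f_t),$$
with $g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i,\hat{y}_i^{(t-1)}\right)$ and $h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\left(y_i,\hat{y}_i^{(t-1)}\right)$.
Furthermore, expanding the objective function through its regularization term $\Omega$ for a leaf set $I_j = \{i \mid q(x_i) = j\}$ within a given structure $q$ can be achieved as follows, where the constant term $l\left(y_i,\hat{y}_i^{(t-1)}\right)$ has been dropped:
$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T}\left[\left(\sum_{i\in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i\in I_j} h_i + \lambda\right) w_j^{2}\right] + \gamma T.$$
This formulation enables the determination of the optimal leaf weight $w_j^{*}$ for leaf $j$ and assesses the fit of the overall structure $q$ using the following:
$$w_j^{*} = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}, \qquad \tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T}\frac{\left(\sum_{i\in I_j} g_i\right)^{2}}{\sum_{i\in I_j} h_i + \lambda} + \gamma T.$$
XGBoost employs a greedy algorithm, initially starting from a single leaf and subsequently adding branches at each iteration using the scoring function in Equation (10) to evaluate improvements in the general tree structure $q$. The potential gain is assessed within an instance set $I = I_L \cup I_R$, where $I_L$ and $I_R$ denote the sets of nodes on each branch after splitting.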
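To make the leaf-weight and split-scoring step concrete, the following sketch evaluates the standard XGBoost expressions on a handful of made-up gradient statistics. It assumes a squared-error loss, for which the first-order statistic is the residual and the second-order statistic equals one; the data, the split, and the values of the regularization coefficients are hypothetical.

```python
def leaf_weight(g, h, lam):
    """Optimal weight -G / (H + lambda) for the instances falling in one leaf."""
    return -sum(g) / (sum(h) + lam)


def leaf_score(g, h, lam):
    """Contribution G^2 / (H + lambda) of one leaf to the structure score."""
    return sum(g) ** 2 / (sum(h) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction obtained by splitting one leaf into two children."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


# Squared-error loss l = 0.5 * (y - y_hat)^2 gives g = y_hat - y and h = 1.
y = [1.0, 1.2, 3.0, 3.4]
y_hat = [2.0, 2.0, 2.0, 2.0]
g = [pred - target for pred, target in zip(y_hat, y)]
h = [1.0] * len(y)

# Candidate split: first two instances go left, last two go right.
gain = split_gain(g[:2], h[:2], g[2:], h[2:], lam=1.0, gamma=0.0)
print(leaf_weight(g[:2], h[:2], 1.0), leaf_weight(g[2:], h[2:], 1.0), gain)
```

A positive gain indicates that the split lowers the regularized objective; raising `gamma` makes the algorithm more reluctant to add branches.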
3.3.1. Evaluation Metrics
The classification system’s evaluation is determined using the area under the ROC curve (AUC) with a One-vs-Rest (OvR) strategy, a widely recognized measure of performance for machine learning models. In the OvR strategy, any misclassification is considered erroneous regardless of which two classes are mistaken. The AUC is defined as follows:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\; d(\mathrm{FPR}).$$
Here, $\mathrm{FPR}$ represents the probability of a false positive, and $\mathrm{TPR}$ denotes the probability of a true positive.
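A one-vs-rest AUC can be sketched in a few lines. The version below uses the rank-based formulation (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative), which equals the area under the ROC curve; the labels and predicted probabilities are invented for illustration.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def ovr_auc(y_true, proba, classes):
    """One AUC per class: that class is the positive, all others the negative."""
    result = {}
    for column, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [row[column] for row in proba]
        result[cls] = binary_auc(labels, scores)
    return result


# Made-up labels and class probabilities for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.20, 0.60, 0.20],
    [0.40, 0.25, 0.35],
    [0.10, 0.20, 0.70],
    [0.20, 0.30, 0.50],
]
print(ovr_auc(y_true, proba, classes=[0, 1, 2]))
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch but would be replaced by a sort-based computation on a dataset of millions of rows.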
3.3.2. Shapley Additive Explanations
Shapley Additive Explanations (SHAP) is a tool rooted in game theory used to interpret machine learning models [31]. It assesses the contribution of each variable to the final output by iteratively introducing one variable into the model at a time and evaluating the expected value of the model’s output function. This method allows SHAP to calculate the average contribution while considering the effects of all potential variable orderings.
The average contribution of variable $x$ in model $f$ can be calculated as follows:
$$\phi_x(f) = \sum_{S \subseteq F \setminus \{x\}} \frac{|S|!\,(M-|S|-1)!}{M!}\left[f\left(S \cup \{x\}\right) - f(S)\right],$$
where $M$ represents the count of input variables, $F$ is the set of all input variables, and $f(S)$ denotes the expected model output when only the variables in $S$ are known.
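The subset-weighted average can be computed exactly for a small model by enumerating all subsets. The sketch below does so under a simplifying assumption that differs from SHAP proper: a variable that is absent from a subset is replaced by a fixed baseline value rather than being integrated out through a conditional expectation. The toy model and inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; absent variables are set to their baseline value."""
    m = len(x)

    def value(subset):
        # Evaluate the model with only the variables in `subset` taken from x.
        z = [x[i] if i in subset else baseline[i] for i in range(m)]
        return model(z)

    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        total = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Made-up model with one additive term and one interaction term."""
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions sum to the difference between the prediction and the baseline output, and the interaction term is shared equally between the two variables involved. The enumeration grows exponentially with the number of variables, which is the cost that Tree SHAP avoids for tree ensembles.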
SHAP, and, in particular, Tree SHAP [32], provides insights into how variables impact tree-based machine learning models. It satisfies three fundamental properties: local accuracy ensures that the function linking $x$ to $x'$, denoted as $h_x(x') = x$, matches the set of variables $x$, ensuring the approximation of $f$ corresponds to the output of $f$; missingness stipulates that absent values have no bearing on the model’s output ($x'_i = 0 \Rightarrow \phi_i = 0$); and consistency guarantees that if an input’s contribution to the model increases or stays constant, the Shapley value follows suit, regardless of other inputs. Tree SHAP analyzes the model using an input dataset $X$ of dimensions $N \times M$ ($N$ observations by $M$ variables), generating a matrix of SHAP values for each variable in every tuple in $X$, ensuring consistent explanations for individual predictions while highlighting contributions with sign indicators.
3.4. Variable Analysis
Once the model is computed, the importance scores for each variable can be extracted. This presents a score for how valuable each variable selected is in absolute terms and averaged across all the trees that form the XGBoost model. We study the importance of each parameter following three main scoring systems:
Gain: It denotes the relative contribution of each variable to the model, calculated by summing up the contribution of each variable for every tree generated. It shows the importance of a variable when generating a prediction;
Cover: It indicates the relative importance of each variable based on the number of observations associated with it. This is determined by summing the second derivative of the loss function over all training data points falling into each node defined by the variable;
Frequency: It represents the percentage indicating how often a specific variable appears in the trees of the model relative to the total number of trees.
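The three scores can be illustrated on a flat list of splits. In the sketch below, each tuple stands for one split node with its variable, gain, and cover; the numbers are made up, and each score is normalized so that it sums to one across variables. Frequency is normalized here by the total number of splits, which is an assumption about how the relative frequency is formed.

```python
from collections import defaultdict

# Each entry is one split node: (variable, gain, cover). Values are made up.
splits = [
    ("DailyHour", 12.0, 400.0),
    ("FromNorth", 8.0, 300.0),
    ("DailyHour", 4.0, 100.0),
    ("DayMonth", 6.0, 200.0),
]


def importance(split_nodes):
    """Return relative gain, cover, and frequency for every variable."""
    totals = defaultdict(lambda: {"gain": 0.0, "cover": 0.0, "count": 0})
    for variable, gain, cover in split_nodes:
        totals[variable]["gain"] += gain
        totals[variable]["cover"] += cover
        totals[variable]["count"] += 1
    all_gain = sum(t["gain"] for t in totals.values())
    all_cover = sum(t["cover"] for t in totals.values())
    all_count = sum(t["count"] for t in totals.values())
    return {
        variable: {
            "gain": t["gain"] / all_gain,
            "cover": t["cover"] / all_cover,
            "frequency": t["count"] / all_count,
        }
        for variable, t in totals.items()
    }


scores = importance(splits)
for variable, values in scores.items():
    print(variable, values)
```

A variable can rank low on frequency yet high on gain or cover when it is used in few splits that sit near the root, which is why the three scores are read together.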
The model is implemented using the standard Python 3.10.8 distribution [33] and the corresponding XGBoost library.
4. Application
This section describes the dataset explored and presents the results obtained from applying the visualization techniques and models introduced in the preceding section.
4.1. Description of the Dataset
The dataset is called “Traffic state information by sections of the city of Barcelona” and is freely accessible from the dataset catalogue of the Open Data BCN service (
https://opendata-ajuntament.barcelona.cat/en, accessed on 2 April 2024). According to its website, Open Data BCN constitutes “a movement driven by public administrations with the main objective of maximize available public resources, exposing the information generated or guarded by public bodies, allowing its access and use for the common good and for the benefit of anyone and any entity interested”.
The dataset contains historical data collected since December 2017 and is updated monthly. It describes the traffic state for a set of 527 sections and has an update frequency of 5 min. The traffic states are 0 = no data, 1 = very fluid, 2 = fluid, 3 = dense, 4 = very dense, 5 = congestion, and 6 = cut off (closed). The traffic state is assessed using various sensors embedded beneath the asphalt, including those detecting magnetic field changes caused by passing metal masses (vehicles), infrared sensors, and cameras equipped with image processing capabilities. Data from each detection station are qualitatively interpreted, typically using a scale ranging from 1 to 5, based on predefined thresholds specific to each station.
To explore the city of Barcelona’s traffic situation information in March 2022, the software used was R version 4.0.3 (10 October 2020) [27], with RStudio as the integrated development environment [34]. The dataset contains five variables: ID section (where section refers to a road segment in Barcelona), date (year, month, day, hour, minute, and second), current state, and expected state. There are 4,599,656 rows since the information of the 527 sections is updated approximately every 5 min. First, we transform the dataset to consider the current state every 5 min exactly. This process requires introducing new rows with a 0 (no data) both in current and expected states.
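The regularization to an exact 5 min grid could look like the sketch below, shown in Python for illustration rather than in the R code used for the analysis. It assumes that each reading is floored to the start of its 5 min slot and that the latest reading in a slot is kept; the article does not specify either choice, and the records are made up.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=5)


def floor_to_step(ts):
    """Round a timestamp down to the start of its 5 min slot."""
    return ts - timedelta(
        minutes=ts.minute % 5, seconds=ts.second, microseconds=ts.microsecond
    )


def regularize(records, start, end):
    """Place the records of one section on an exact 5 min grid.

    records is a list of (timestamp, current_state, expected_state).
    Slots without any reading receive state 0 (no data) in both fields.
    """
    by_slot = {}
    for ts, current, expected in sorted(records):
        by_slot[floor_to_step(ts)] = (current, expected)  # latest reading wins
    grid = []
    slot = start
    while slot <= end:
        current, expected = by_slot.get(slot, (0, 0))
        grid.append((slot, current, expected))
        slot += STEP
    return grid


# Made-up readings for one section; the 00:10 slot has no reading.
records = [
    (datetime(2022, 3, 1, 0, 0, 12), 2, 2),
    (datetime(2022, 3, 1, 0, 5, 40), 3, 2),
    (datetime(2022, 3, 1, 0, 15, 3), 1, 1),
]
grid = regularize(records, datetime(2022, 3, 1, 0, 0), datetime(2022, 3, 1, 0, 15))
for row in grid:
    print(row)
```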
The distribution of the seven current traffic states is shown in Figure 2. Overall, congestion and ‘no data’ have the minimum and maximum proportions, respectively. The rest of the states, sorted by proportion, are fluid, very fluid, dense, very dense, and cut off.
4.2. Visualization Techniques
Figure 3 represents the traffic state of the city of Barcelona for Friday 25 March 2022 at different time intervals. The lines represent the sections of the streets, with each color indicating the traffic condition of those sections: light blue for very fluid, dark blue for fluid, yellow for dense, orange for very dense, and red for congestion. As we can observe, most sections of the city have a traffic state of 1 to 2 (very fluid to fluid traffic), and rush hours (9:00 a.m.) have more sections with dense or higher traffic (3–5) than non-rush hours (3:00 p.m.).
Figure 4 shows the traffic state for the same time intervals as
Figure 3 but in heatmap form. The top two subfigures depict a broader area, whereas the bottom two subfigures concentrate on the central area. As we can observe, the dense area of the heatmap (that is, areas marked with an orange color) is located in sections with higher traffic or areas with a higher number of sections.
4.3. Time Series Analysis
The predictive models are applied to data from the city of Barcelona’s traffic situation information in March 2022. Due to the large number of monitored sections, it is important to know whether their behavior is similar. For this purpose, 10 sections belonging to the same street are chosen. We would like to highlight that, even though the sections belong to the same street, there are times when the traffic conditions are not similar, even with changes depending on whether it is a weekend or a working day.
Table 1 shows the output of the time series models. The first two columns identify the section and the fitted model. The subsequent five columns display the coefficients, while the last two columns present the performance measures AIC and BIC. It is not necessary to difference the series in any case because the series are stable.
Table 2 shows the Mean Error (ME) and the Root-Mean-Square Error (RMSE) of each model. The tests of independence (Box–Ljung test for residuals), homoscedasticity (Box–Ljung test for squared residuals), and normality (Shapiro–Wilk normality test) are applied to each of the models, and their
p-values are shown. None of the models complies with the initial assumptions, which leads us to confirm that these models do not explain the behavior of the data very well. In conclusion, due to the high non-linearity and complexity of the traffic flow, classical methods cannot make good predictions; moreover, they do not fit well with the space–time structure of the traffic data. For this reason, in the next subsection, a machine learning method is applied.
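For reference, the Box–Ljung statistic behind the independence test can be computed directly from the residual autocorrelations. The sketch below, in Python for illustration, returns only the statistic; the p-value is obtained by comparing it with a chi-square distribution. The residuals are an artificial alternating sequence chosen to show strong autocorrelation.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denominator = sum((value - mean) ** 2 for value in x)
    numerator = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return numerator / denominator


def ljung_box_q(residuals, max_lag):
    """Box-Ljung statistic Q = n (n + 2) * sum_k rho_k^2 / (n - k)."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Artificial residuals that alternate in sign, i.e., far from independent.
residuals = [1.0, -1.0] * 10
q = ljung_box_q(residuals, max_lag=1)
print(q)
```

With one lag, the 5% critical value of the chi-square distribution with one degree of freedom is about 3.84, so a statistic of this size rejects independence. Applying the same function to the squared residuals gives the check for homoscedasticity described above.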
4.4. eXtreme Gradient Boosting
For streamlined parameter importance analysis, we classify the diverse densities derived from the original dataset into separate classification categories: 0 = very fluid, 1 = fluid, 2 = dense, 3 = very dense, 4 = congestion. The dataset, organized according to
Table 3, contains information spanning the entirety of 2019. We partition the dataset into distinct sets for both training and testing stages during the evaluation process. Specifically, 70% of the dataset is earmarked for training purposes, while the remaining 30% is preserved for parameter evaluation. Data spanning January to March 2022 are processed for final analysis and predictions as presented in the model performance section.
To optimize XGBoost’s performance, parameter tuning is conducted using a grid search with five-fold cross-validation. This involves exploring various parameter configurations within sensible ranges, as detailed in Table 4.
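The tuning loop can be outlined as follows. This is a sketch under stated assumptions: the parameter names and values in `param_grid` are placeholders rather than the contents of Table 4, the folds are contiguous and unshuffled, and `toy_score` is a hypothetical stand-in for training the classifier on the training indices and scoring it on the validation indices.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def grid_search(param_grid, n_samples, score_fn, k=5):
    """Return the configuration with the best mean validation score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = [
            score_fn(params, train, validation)
            for train, validation in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train, validation):
    """Hypothetical score; a real one would fit and evaluate a model."""
    return -abs(params["max_depth"] - 6) - abs(params["eta"] - 0.1)


# Placeholder grid; the ranges actually explored are those of Table 4.
param_grid = {"max_depth": [4, 6, 8], "eta": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(param_grid, n_samples=12, score_fn=toy_score)
print(best_params, best_score)
```

Because every configuration is evaluated on every fold, the cost grows with the product of the grid sizes times the number of folds, which motivates keeping the ranges within sensible limits.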
The Area Under the Curve (AUC) depicted in
Figure 5 illustrates the consistently strong performance of the model, with AUC values exceeding 80% across all ROC curves. Notably, the model exhibits particularly strong performance in extreme cases, with a true positive rate exceeding 85%. However, there is a slight decline in performance observed in intermediate classes.
On average, this model achieves an accuracy rate of 74.48% for the 2019 dataset; therefore, we can affirm that the model captures the importance of the parameters accurately, with a relevant correlation between the true labels and the predicted ones.
Computing time once the model is trained can also be considered part of model performance, as shown in Table 5. For reference, average times for predicting a particular day, week, or month’s worth of data are presented, considering that this model’s training and execution were carried out on an 11th Gen Intel(R) i5-1135G7 running at 2.40 GHz with 16 GB of RAM and Windows 10 Pro 22H2 (64 bits) as the main operating system. It is concluded that this technology is perfectly adaptable for a dynamic interpretation of the data if it is to be implemented as a prediction tool by the Barcelona city council.
For this model, Figure 6a shows the parameters ranked by frequency of appearance, where FromNorth, DailyHour, and DayMonth account for most of the relevance when classifying the data rows. FromWest also shows relatively high relevance. These results are consistent with our initial hypotheses since the geographical location, the time, and the day of the month influence, and therefore help to explain, the behavior of the current traffic status. Variables like DailyMinute and Weekday appear to be less determinant in the model. Moreover, considering not only the frequency of appearance but also the cover (the relative number of observations linked in prediction with the variable) in Figure 6b and the gain (the relative contribution of each variable on each tree for every prediction) in Figure 6c, variables such as ToWest and ToNorth, whose frequency of appearance is not outstanding compared to others, have gain and cover values showing that they also shape the behavior of the model and therefore have a relevant impact on the outcome.
As Figure 7 implies, classes 0, 1, and 2 obtain their explainability from DayMonth in the majority of cases. For the rest of the classes, the largest explanatory impact comes from coordinates like ToWest and FromWest.
Lastly,
Figure 8 illustrates the model’s effect on various combinations of variables for each class. For class 0, the hour and day of the week have the greatest impact on the model. For class 1, the hour and North coordinates are the most important. For class 2, the hour and West coordinates have the greatest impact. For class 3, the hour and day of the week coordinates are the most important. In class 4, all four coordinates have the most impact on the model.
5. Discussion
Due to the increasing importance of traffic prediction for urban transportation management, numerous traffic flow prediction models have been studied over recent decades. This paper utilizes visualization techniques, such as heatmaps, to illustrate the traffic density across various city sections throughout the day. Analyzing vehicle locations at different time intervals reveals mobility patterns and street traffic density. An interactive animation plugin is identified, enabling end users to customize the visual representation with options including overview, zoom, pause, play, loop, and variable playback speeds.
Subsequently, traffic prediction is approached as a time series analysis problem, with ARIMA models being employed to forecast traffic flow across ten sections of the same street. Time series analysis, particularly ARIMA models, has historically been popular for traffic prediction due to ease of implementation and relatively high accuracy [
35]. Nevertheless, the intricate non-linear spatio-temporal dependencies, coupled with external factors such as weekends, holidays, and weather, present challenges that surpass the capabilities of ARIMA. Our results indicate that none of the ARIMA models built satisfies the assumptions of independence, homoscedasticity, and normality. Thus, classical methods struggle to make accurate predictions and are not well suited to the spatio-temporal structure of traffic data.
To tackle this challenge, an XGBoost model is employed for traffic prediction. XGBoost requires less prior knowledge of traffic patterns, handles non-linear variables effectively, and consistently achieves robust performance with the area under ROC curves exceeding 80%. Its use facilitates analysis of variable importance and enhances interpretability through Shapley Additive Explanations.
In summary, while visualization techniques aid real-time analysis of traffic density and interactive animations for clearer pattern recognition, the XGBoost model provides insights into variable importance, interpretable outputs, and accurate predictions.
By accurately predicting traffic patterns, decision-makers can make more informed decisions about transportation infrastructure, such as the location and number of roads, highways, and public transportation systems. This can help to optimize the use of limited resources and improve the efficiency of the transportation network [
36]. In terms of reducing congestion and emissions, accurate traffic prediction can help decision-makers to implement strategies to reduce congestion and improve air quality. For instance, they could implement demand-management strategies like variable tolls or congestion pricing. These measures aim to incentivize drivers to opt for alternative modes of transportation or travel during off-peak hours. Also, traffic prediction can help to identify areas where accidents are more likely to occur, allowing them to implement strategies to reduce the likelihood of accidents and improve road/street safety. Hence, by foreseeing and outlining the variables that contribute to the forecast, this paper’s contribution can assist a decision-maker in making quick and effective decisions in order to enhance traffic management and mitigate traffic congestion [
37]. In this study, the significant variables that contribute to the forecast are the geographical location (especially when the section starts in the North direction), the time, and the day of the month. However, some limitations need to be highlighted. In order to increase the usefulness of the data and analysis, data must be accessible, interconnected, and quantifiable to facilitate decision-making and drive innovation. In the case of the city council, this is not fully achieved, as the availability of information and the interconnection between datasets are severely restricted by the format and structure in which they are delivered. This hinders a broader perspective on some possible research questions and impedes the ability to unlock the full potential of the data. No clear descriptions were given for some of the data sources, a large amount of missing data was detected, and the quantification of terms like traffic density was not available. Possible data connections with other datasets on the same portal were also not available.
6. Conclusions and Further Research
Studying the mobility patterns in smart cities may provide useful insights to support decision-making related to urban mobility. Optimizing processes such as the placement of electric vehicle charging stations or designing efficient routes may minimize economic costs while contributing to reducing the environmental impacts and increasing social welfare. In this context, this paper explored mobility patterns in the city of Barcelona (Spain), relying on the open data service of the Ajuntament de Barcelona. An exploratory analysis, visualization techniques, and predictive models have been presented. In addition, a discussion regarding the potential of these tools for policymakers has been provided.
The primary limitation of this research lies in its reliance on data spanning only one month. While the selection of this period facilitated building models and yielded intriguing insights, examining a more extended period (spanning years) would enable exploration of trends, seasonal fluctuations, variability, and the influence of COVID-19 and lockdown policies on traffic patterns, among other factors. This work opens up several lines of future research. First, the use of more variables could lead to more robust, powerful, and interpretable models. Interesting variables to consider are related to weather (rainwater level, temperature, etc.), traffic accidents, and events that attract a lot of people such as football games and international conferences. In addition, comparing open data portals of smart cities across the world (regarding datasets related to urban mobility and characteristics such as update frequency, granularity, and documentation) and studying the use that different agents have for them is another promising research line. Further lines in this topic could involve comparing the results of XGBoost to those of other machine learning algorithms and examining ways to improve its interpretability. Research on the potential benefits of combining XGBoost with the use of more advanced spatio-temporal models, in order to capture the spatio-temporal dependence, is another promising option for implementing better predictive models.