**1. Introduction**

In recent years, Intelligent Transportation Systems (ITS) have experienced unparalleled expansion for many reasons. The availability of cost-effective sensor networks, pervasive computation in assorted flavors (distributed/edge/fog computing) and the so-called Internet of Things are all accelerating the evolution of ITS [1]. Moreover, Smart Cities cannot be understood without Smart Mobility and ITS as the technological pillars sustaining their operation [2]. Smartness springs from connectivity and intelligence, which implies that massive flows of information are acquired, processed, modeled and used to enable faster, better-informed decisions.

**Citation:** Laña, I.; Sanchez-Medina, J.J.; Vlahogianni, E.I.; Del Ser, J. From Data to Actions in Intelligent Transportation Systems: A Prescription of Functional Requirements for Model Actionability. *Sensors* **2021**, *21*, 1121. https://doi.org/10.3390/s21041121

Academic Editor: Rashid Mehmood. Received: 7 January 2021; Accepted: 2 February 2021; Published: 5 February 2021.

For the last couple of decades, ITS have grown enough to cross-pollinate with previously distant areas such as Machine Learning and its superset in the Artificial Intelligence taxonomy: Data Science. These days, Data Science sits at the methodological core of works ranging from traffic and safety analysis, modeling and simulation, to transit network optimization, autonomous and connected driving, and shared mobility. Since the early 1990s, most ITS relied exclusively on traditional statistics, econometric methods, Kalman filters, Bayesian regression, auto-regressive models for time series and Neural Networks, to mention a few [3,4]. What has changed dramatically over the years is the abundance of available data in ITS application scenarios, a result of new forms of sensing (e.g., crowd sensing) with unprecedented levels of heterogeneity and velocity. Zhang et al. [3] define this new form of data-driven ITS as systems driven by vision, multisource data and learning algorithms to optimize their performance and augment their privacy-aware, people-centric character.

The exploitation of this upsurge of data has been enabled by advances in computational structures for data storage, retrieval and analysis, which have made it feasible to train and maintain extremely complex data-based models. These baseline technologies have laid a solid substrate for the proliferation of studies dealing with powerful modeling approaches such as Deep Learning or bio-inspired computation [5], which currently stand out in the literature as the *de facto* modeling choice for a myriad of data-intensive applications.

However, careful consideration must be given to the systematic and myopic selection of complex data-based solutions over well-established modeling choices. The current research mainstream seems to be misleadingly focused on performance-biased studies, in a fast-paced race towards incorporating sophisticated data-based models into manifold research areas, leaving aside or completely disregarding the operational aspects required for the applicability of such models in ITS environments. The scope of this work is to review the existing literature on data-driven modeling and ITS, and to identify the functional elements and specific requirements of engineering solutions, which are the ultimate enablers for data-based models to become efficient means to operate ITS assets, systems and processes; in other words, for data-based models to fully become *actionable*. Bearing the above rationale in mind, this work underscores the need for formulating the requirements to be met by forthcoming research contributions around data-based modeling in ITS. To this end, we focus mainly on system-level on-line operations that hinge on data-based pipelines. However, ITS is a wide research field, encompassing operations held at longer time scales (e.g., long-term and mid-term planning) that may not demand some of the functional requirements discussed throughout our work. Furthermore, our discussions target system-level operations rather than user-level or vehicle-level applications, since in the latter the information flow from and to the system is scarce. Nevertheless, some of the described functional requirements for system-level real-time decisions can be seamlessly extrapolated to other levels and time scales. From this perspective, our ultimate goal is to prescribe, or at least set forth, the main guidelines for the design of models that rely heavily on the collection, analysis and exploitation of data. To this end, we delve into a series of contributions that are summarized below:


•The contributions of this section are twofold: on the one hand, we identify and define the holistically actionable ITS model along with its main features; on the other hand, we enumerate the requirements for each feature to be considered actionable, together with a review of the latest literature dealing with these features and requisites.

•Finally, on a prospective note, we elaborate on current research areas of Data Science that should progressively enter the ITS arena to bridge the identified gap to actionability. Once the challenges of modeling and the ITS requirements have been stated, we review emerging research areas in Artificial Intelligence and Data Science that can contribute to the fulfilment of such requirements. We expect this reflective analysis to serve as guiding material for the community to steer efforts towards modeling aspects of greater impact for the field than the performance of the model itself.

In summary, the contributions of this work consist of identifying the main actionability gaps in the data-based modeling workflow, gathering and describing the fundamental requirements for a system to be actionable, and, considering both the requirements and the usual data-based processing workflow, proposing solutions based on the most recent technologies. These contributions are organized throughout the rest of the paper as follows: Section 2 delves into the *actionable data-based modeling workflow*, i.e., the canonical data processing pipeline that should be considered by a fully actionable ITS system with data-based models in use. Section 3 follows by elaborating on the functional features that an ITS system should comply with so as to be regarded as *actionable*. Once these requirements are listed and argued in detail, Section 4 analyzes research paths under vibrant activity in areas related to Data Science that could bring profitable insights with regard to the actionability of data-based models for the ITS field, such as explainable AI, the inference of causality from data, online learning and adaptation to non-stationary data flows. Finally, Section 5 concludes the paper with summarizing remarks drawn from our prospective analysis.

### **2. From Data to Actions: An Actionable Data-Based Modeling Workflow**

ITS applications with data-driven modeling problems underneath range from the characterization of driving behavioral patterns to the inference of typical routes or traffic flow forecasting, among others. Data-driven modeling can be considered to comprise the family of problems where a computational model or system must be characterized or learned from a set of inputs and their expected outputs [6]. In the context of this definition, actionability complements the data-driven model by prescribing the actions (in the form of rules, optimized variable values or any other similar representation) that build upon the output knowledge enabled by the model.

In general, a design workflow for data-based modeling consists of four sequential stages: (1) data acquisition (*sensing*), which usually considers different sources; (2) data preprocessing, which aims at building consistent, complete, statistically robust datasets; (3) data modeling, where a model is learned for different purposes; and (4) model exploitation, which includes the definition of actions to be taken with respect to the insights provided by models in real-life application scenarios. These four stages can be regarded as the core of off-line data-driven modeling; however, when the time dimension comes into play, a fifth stage (adaptation) must be considered as an iterative stage of this data pipeline, aimed at keeping learned models updated and adapted to eventual changes in the data distribution. This adaptation is crucial for real-life scenarios, where changes can happen in all stages, from variations of the input data sources to interpretation adjustments and other sources of non-stationarity imprinting the so-called *concept drift* in the underlying phenomenon to be modelled [7]. We now delve into these five data processing stages in the context of their implementation in ITS applications, following the diagram in Figure 1.
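The five stages above can be sketched as a toy end-to-end pipeline. This is a minimal illustration, not an ITS library: all function names, the running-mean "model" and the thresholds are hypothetical choices made for the example.

```python
"""Minimal sketch of the five-stage data-based modeling workflow: sensing,
preprocessing, modeling, exploitation (prescription) and adaptation. All
names and values are illustrative assumptions."""

def sense():
    # Stage 1: data acquisition -- a toy series of 5-min vehicle counts.
    return [12, 15, None, 14, 90, 13, 16]

def preprocess(raw):
    # Stage 2: impute missing readings and clip implausible outliers.
    clean = []
    for x in raw:
        if x is None:
            x = clean[-1] if clean else 0   # last observation carried forward
        clean.append(min(x, 30))            # crude outlier cap (assumed bound)
    return clean

def model(data):
    # Stage 3: a trivial "model" -- predict the next count as the mean so far.
    return sum(data) / len(data)

def prescribe(prediction, threshold=20):
    # Stage 4: turn the model output into an action (e.g., signal timing).
    return "extend_green_phase" if prediction > threshold else "keep_plan"

def adapt(history, new_obs):
    # Stage 5: fold newly observed data back in, so the model tracks drift.
    return history + [new_obs]

data = preprocess(sense())
action = prescribe(model(data))
```

In a real deployment each stage would of course be far richer; the point of the sketch is only that prescription and adaptation are explicit stages of the pipeline, not afterthoughts.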

The stages provided in Figure 1 can be considered a standard workflow in any data-based work; however, although these steps are easily recognisable, they are not always observed, and it is common to see practitioners focus only on a subset of them, disregarding their interactions or omitting some of them entirely. For instance, the prescription stage is not frequently considered, even though it is the essential link between the modeling outcome and the final decision/action derived from the modeling result. Besides, each step can have implications for the final actionability of the model, which is why all of them are analyzed below.

**Figure 1.** Data-based modeling workflow showing its main processing stages and their principal technology areas.

### *2.1. Data Acquisition (Sensing)*

The path towards concrete data-based actions begins with the capture of available ITS information, which in this specific sector is plentiful and highly diverse. The advent of data science for ITS has come along with the unfolding of copious data sources related to transportation. Indeed, ITS are pouring out volumes of sensed data, from the environment perception layer of intelligent and connected vehicles, to human behaviour detection/estimation (drivers, passengers, pedestrians) and the multiple technologies deployed to sense traffic flow and behaviour. Concurrently, many other non-traditional sources that are useful to infer the behavioral needs and expectations of people who use transportation, such as social media, have become increasingly available and exploited, augmenting the more conventional sensing sources towards more efficient mobility solutions. Some of these data sources are currently used in almost every domain of ITS, from operational perspectives such as the estimation of future transportation demands, adaptive signaling or the discovery of mobility patterns, to the provision of practical solutions, such as the development of autonomous vehicles, although not all sources are suitable for all applications. Model actionability depends on this early stage too, which is why data selection (when possible) should not be neglected. For instance, a model that consumes speed data will probably require some other measurements (e.g., the maximum speed of a road segment) to ultimately provide something meaningful, while a model that consumes travel-time data will be more straightforward.
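The closing example above can be made concrete with a small sketch: a raw speed reading only becomes an actionable indicator when paired with contextual data (here, an assumed per-segment speed limit), whereas a travel-time reading can be compared directly against a free-flow baseline. Segment names, limits and thresholds are all invented for illustration.

```python
"""Toy illustration: speed data needs context to be actionable, while
travel-time data is more self-contained. All values are assumptions."""

SPEED_LIMIT_KMH = {"seg-1": 100, "seg-2": 50}   # hypothetical context data

def congestion_from_speed(segment, speed_kmh):
    # Speed alone is ambiguous: 45 km/h is fluid on seg-2, congested on seg-1.
    return speed_kmh < 0.5 * SPEED_LIMIT_KMH[segment]

def congestion_from_travel_time(observed_s, free_flow_s):
    # Travel time can be compared against its free-flow baseline directly.
    return observed_s > 1.5 * free_flow_s
```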

Five main categories can be established to describe the spectrum of ITS data sources:

1. Roadside sensing, which brings together tools and mechanisms that directly capture and convey data measurements from the road, obtaining valuable metrics such as speed, occupancy, flow or even which vehicles are traversing a given road segment. These are the most commonly used sensors in ITS, most frequently based on computer vision and radar, as they directly provide traffic information close to the point where it originates. This kind of sensed metrics is useful for traffic flow or speed modeling, allowing practitioners to identify mobility patterns and model them, so that future behavior in sensorized locations can be estimated. Counting vehicles or detecting their speed at a certain point of the road also makes it possible to obtain network-wide mobility patterns that can be compared to those provided by a simulation engine. This can help traffic managers and city planners make long-term decisions, such as which road should be extended or how a road cut could affect other segments. However, this information is tethered to the exact points where the sensors are placed, thus the actionability of a system built upon these data is subject to the geographical area where such sensing devices are deployed and their range.
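As a small sketch of how raw roadside detections become the point metrics mentioned above (flow and mean speed), consider a hypothetical detector that emits `(timestamp_s, speed_kmh)` tuples over a fixed aggregation interval; the event format and interval are assumptions for the example.

```python
"""Sketch: aggregating raw roadside detector events into interval metrics.
The (timestamp_s, speed_kmh) event format is a hypothetical convention."""

def aggregate_interval(events, interval_s=300):
    # Flow in vehicles/hour and time-mean speed over one 5-minute interval.
    if not events:
        return {"flow_veh_h": 0.0, "mean_speed_kmh": None}
    flow = len(events) * 3600 / interval_s          # counts scaled to veh/h
    mean_speed = sum(s for _, s in events) / len(events)
    return {"flow_veh_h": flow, "mean_speed_kmh": mean_speed}
```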


### *2.2. Data Preprocessing*

The variety of the above-mentioned sensing sources comes with promises and perils. These data are produced in various forms and formats, at various time resolutions, synchronously or asynchronously, and at different rates of accumulation. To leverage the full spectrum of knowledge these data can bring to informed decision making, the more sensing opportunities there are, the greater the need for powerful preprocessing capabilities and skills before reaching the modeling stage.

A principled data-driven modeling workflow requires more than just applying off-the-shelf tools. In this regard, preprocessing raw data is undoubtedly an elementary step of the modeling process [17], yet it persists to this day as a step frequently overlooked by researchers in the ITS field [18].

To begin with, when a model is to be built on real ITS data, an important fact to be taken into account is the proneness of real environments to present missing or corrupted data due to the many uncertain events that can affect the whole collection, transformation, transmission and storage process [19]. This issue needs to be assessed, controlled and suitably tackled before proceeding further with the next stages of the processing pipeline. Otherwise, missing and/or corrupted instances within the captured data may severely distort the outcome of data-based models, hindering their practical utility [20]. A wide range of missing data imputation strategies can be found in the literature [21,22], as well as methods to identify, correct or discriminate abnormal data inputs [23]. However, they are often loosely coupled to the rest of the modeling pipeline [24]. An actionable data preprocessing should focus not only on improving the quality of the captured data in terms of completeness and regularity, but also on providing valuable insights about the underlying phenomena yielding missing, corrupted and/or outlying data, along with their implications on modeling [25].
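A minimal sketch of the two operations discussed above, assuming an evenly spaced traffic series: linear interpolation of interior missing samples, and a z-score flag for outliers. The threshold and the assumption that the series starts and ends with valid samples are illustrative choices.

```python
"""Sketch: imputation of missing samples and outlier flagging for an evenly
spaced series. Thresholds are illustrative assumptions."""
from statistics import mean, stdev

def interpolate_missing(series):
    # Fill None gaps by linear interpolation between the nearest valid
    # neighbours (assumes the series starts and ends with valid samples).
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            lo = max(j for j in range(i) if filled[j] is not None)
            hi = min(j for j in range(i + 1, len(filled)) if filled[j] is not None)
            w = (i - lo) / (hi - lo)
            filled[i] = filled[lo] + w * (filled[hi] - filled[lo])
    return filled

def flag_outliers(series, z=3.0):
    # Return the indices whose z-score exceeds the (assumed) threshold.
    m, s = mean(series), stdev(series)
    return [i for i, v in enumerate(series) if s and abs(v - m) / s > z]
```

Real pipelines would replace both routines with the literature methods cited above [21–23]; the sketch only shows where they slot into the workflow.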

Next, the cleansed dataset can be engineered further to lay an enriched data substrate for the subsequent modeling [26,27]. A number of operations can be applied to improve the way in which data are further processed along the chain. For instance, data transformation methods can be applied for different purposes related to the representation and distribution of data (e.g., dimensionality reduction, standardization, normalization, discretization or binarization). Although these transformations are not mandatory in all cases, a deep knowledge of what the input data represent and how they contribute to modeling is a key aspect to be considered in this preprocessing stage.

Furthermore, data enrichment can be approached from two different perspectives, adopted depending on the characteristics of the dataset at this point. As such, feature selection/engineering refers to the implementation of methods to either discard irrelevant features for the modeling problem at hand, or to produce more valuable data descriptors by combining the original ones through different operations. Likewise, instance selection/generation implies a transformation of the original data in terms of its examples. Removing instances can be a straightforward solution for corrupted data and/or outliers, whereas the addition of synthetic instances can help train and validate models for which scarce real data instances are available. Besides, these approaches are among the most predominant techniques to cope with class imbalance [28], a very frequent problem in predictive modeling with real data. Whether each of these operations is required or not depends entirely on the input data, their quality, abundance and the relations among them. This entails a deep understanding of both data and domain, which is not always common ground among ITS practitioners [29].
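As an instance-generation sketch for the class-imbalance case mentioned above, consider random oversampling of the minority class until both classes are equally represented; this is a simple stand-in for more elaborate synthetic approaches such as SMOTE, and assumes binary labels.

```python
"""Sketch: random oversampling of the minority class (binary labels).
A stand-in for synthetic-instance methods such as SMOTE."""
import random

def oversample(instances, labels, minority, seed=0):
    rng = random.Random(seed)
    pool = [x for x, y in zip(instances, labels) if y == minority]
    # majority count minus minority count (binary-label assumption)
    deficit = (len(labels) - len(pool)) - len(pool)
    extra = [rng.choice(pool) for _ in range(deficit)]
    return instances + extra, labels + [minority] * deficit
```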

Finally, data fusion embodies one of the most promising research fields for data-driven ITS [3,30], yet it remains marginally studied with respect to other modeling stages despite its potential to boost the actionability of the overall data-based model. Indeed, an ITS model can hardly be actionable if it does not exploit interactions among different data sources. Upon their availability, ITS models can be enriched by fusing diverse data sources. A recent review on different operational aspects of data-driven ITS developments states that these models rarely count on more than one source of data [16]. This fact clearly unveils a niche of research when taking into account the increasing availability of data provided by the growing amount of sensors, devices and other data capturing mechanisms that are deployed in transportation networks, in all sorts of vehicles, or even in personal devices held by the infrastructure users. Despite the relative scarcity of contributions dealing with this part of the data-based modeling workflow, the combination of multiple sources of information has been proven to enrich the model output along different axes, from accuracy to interpretability [31–34].
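A minimal sketch of one data-fusion step: aligning two asynchronously sampled sources, say loop-detector flow and probe-vehicle speed, by matching each flow sample with the closest-in-time speed sample. The field names and the nearest-neighbour alignment rule are illustrative assumptions, not a prescribed fusion method.

```python
"""Sketch: fusing two asynchronous (timestamp, value) streams by nearest
timestamp. Field names are invented for illustration."""
from bisect import bisect_left

def fuse_nearest(flow, speed):
    # flow, speed: lists of (timestamp_s, value), each sorted by timestamp.
    times = [t for t, _ in speed]
    fused = []
    for t, f in flow:
        i = bisect_left(times, t)
        # pick the nearer of the two neighbouring speed samples
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        j = min(candidates, key=lambda k: abs(times[k] - t))
        fused.append({"t": t, "flow": f, "speed": speed[j][1]})
    return fused
```

Production fusion would also weigh source reliability and sampling uncertainty; the sketch only shows the temporal-alignment step that any such scheme needs first.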
