## 3.1. Usability

The way in which humans interact with information systems has been thoroughly studied in recent decades and formalized under the general term *usability* [63]. Although usability is a feature that can be associated with any system in which there is some kind of interaction with the user, most of its definitions to date gravitate around the design of software systems [64–66], which is not necessarily the case in ITS research. Usable designs imply defining a clear purpose for a system and helping users make use of it to reach their objectives [67]. Within ITS, there are domains where these definitions apply directly [68], such as vehicle user interfaces [69–71], the development of navigation systems [72], road signalization [73], or even the way in which public transportation system information is presented to users [74,75].

The aforementioned domains of application, and most systems lying at the core of Advanced Traveler Information Systems (ATIS), have an explicit interaction component. On the other hand, models developed for Advanced Transportation Management Systems (ATMS) are less related to user interaction (beyond the interface design of decision-making tools), hence this canonical definition of usability seems less applicable. However, the general concept of usability can also accommodate the notion of *utility*, the quality of a system of being useful for its purpose, or the concept of *effectiveness*, regarding how effective the information provided by the system is [76]. Since ITS are tools designed to help the different stakeholders that take part in transportation activities, the actionability of the data-based models used to this end depends strongly on this general idea of usability [77]. Nevertheless, models' usability is a feature largely disregarded in the literature. A clear example of this situation is traffic forecasting, a preeminent subfield of ATMS, in which the link between high-end deep learning models and the requirements of road operators for forecasts that support decision making is very weak [78].

Usability may relate to the person who is going to operate the model, and to the type and complexity of the model, which call for specific skills. Achieving usable ITS models does not entail the same effort for all ITS subdomains. Thus, while for research contributions related to ATIS there is a clear interest in this matter [79], for ATMS developments some extra considerations need to be made. Usability in ITS has, therefore, one facet oriented towards the user interface, where interfaces reflect at least one of the outputs of an ITS data-based model, and another facet oriented towards creating models that are more aware of the way their outputs are going to be consumed afterwards by the decision maker.

### 3.1.1. User Interface

For the first of these facets, Spyridakis et al. [77] propose general software usability measuring tools and scales, such as the System Usability Scale (SUS) [65], ethnographic field studies, or even questionnaires. These basic techniques are also proposed in [80] to evaluate navigation system interfaces. There are also many other evaluation measures that are more specific to the field, such as [81], or those defined by public authorities [82].
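To illustrate how such a scale works in practice, the SUS score can be computed directly from the ten questionnaire items using Brooke's original 0–100 scoring rule (odd items contribute the score minus 1, even items contribute 5 minus the score, and the sum is scaled by 2.5). The example responses below are, of course, hypothetical:

```python
def sus_score(responses):
    """Compute the System Usability Scale score from ten Likert responses (1-5)."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses in the range 1-5")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # indices 0,2,4,... are odd-numbered items
        for i, r in enumerate(responses)
    ]
    # Scale the 0-40 sum of contributions onto the canonical 0-100 range.
    return 2.5 * sum(contributions)

# Hypothetical, fairly positive questionnaire for an ITS interface
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 2]))  # 82.5
```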


### 3.1.2. Consumption of the Model's Output

For this second usability facet, there are no scales or measurements in the literature that provide an objective (or even subjective) usability assessment, but we propose some angles that should be considered when designing these kinds of models:

• *Confidence-based outputs:* data-driven models are often subject to stochasticity as a result of their learning procedure or of the uncertainty/randomness of their input data (as especially occurs with crowdsourced and Social Media data). This randomness imprints a certain degree of uncertainty on their outputs, which can be estimated values, predicted categories, solutions to an optimization problem, or the like. Such outcomes are often assessed in terms of their similarity to a ground truth in order to quantitatively evaluate the performance of the data-based model. Thus, a practitioner aiming to make decisions based on the model's output is informed with a nominal performance score (computed over test data) and the predicted output for a given input. However, when one of these data-based models is deployed in a real environment, there is no ground truth against which to evaluate the quality of the result it provides towards making a decision. For instance, a predictive model could score high on average as per the errors made during the testing phase, yet its predictions could be less reliable during peak hours than during the night, as a consequence of the variability of the data from which it was learned and/or the model's learning algorithm itself. For this reason, the confidence of the outputs of a data-based model must be estimated for the sake of its usability. For example, a public transportation model that provides outlooks of future demand could be more usable if, besides the estimation itself, some kind of confidence metric were provided. Elaborating on this aspect is not very frequent in academic research, mainly because confidence is not always easy to obtain and the estimation procedure is, in most cases, model-specific, requiring a prior statistical analysis of input data to properly understand their variability and characteristics. Unfortunately, such a confidence analysis is usually left out of the scope of research contributions, which rather focus on finding the best scoring model for a particular problem. Exceptions to this scarcity of related works are [84], in which the uncertainty inherent to artificial neural networks is analyzed in a real ITS context; [85], in which a committee of different models provides confidence intervals for predictions; or the more recent contribution in [86], which departs from previous findings in [87,88] to estimate the uncertainty of traffic demand. This uncertainty estimation is then used as an input to assess the confidence of traffic demand predictions. These few references exemplify good practices that should be universally considered in contributions to appear.
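A minimal sketch of how a confidence metric could accompany a point estimate is a bootstrap interval over historical observations. The figures below are hypothetical hourly boardings at a transit stop, not real data; the point is that the interval for the volatile peak-hour slot comes out wider than the night-time one, and that width is itself actionable information for the operator:

```python
import random
import statistics

def bootstrap_interval(samples, n_boot=1000, alpha=0.1, seed=42):
    """Point estimate plus an empirical (1 - alpha) interval via bootstrap resampling."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.mean(samples), (lo, hi)

# Hypothetical hourly boardings: peak-hour data is far more variable than night data
night = [12, 14, 13, 12, 15, 13, 14, 12]
peak = [180, 240, 150, 300, 210, 260, 170, 290]
for label, obs in (("night", night), ("peak", peak)):
    est, (lo, hi) = bootstrap_interval(obs)
    print(f"{label}: {est:.0f} boardings, 90% interval [{lo:.0f}, {hi:.0f}]")
```

A model-specific estimator (e.g. the committee of models in [85]) would replace the resampled mean, but the reporting pattern, estimate plus interval, stays the same.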

• *Interpretability:* a stream of work has lately concentrated around the noted need for *explaining* how complex models process input data and produce decisions/actions therefrom. Under the so-called XAI (eXplainable Artificial Intelligence) term, a torrent of techniques has been reported lately to explain the rationale behind traditionally black-box models, mainly built for prediction purposes [89,90]. Nowadays, Deep Learning is arguably the family of data-driven models most targeted by XAI-related studies [91,92].

The interest of transport researchers in interpretable data-driven models is not new; intuitively, any decision in transportation and traffic operations should be based on a solid understanding of the mechanism by which different factors interact and influence transportation phenomena [93]. In the transportation context, explainability is closely related to integrability when it comes to traffic managers: ensuring that data-based models can be understood by non-AI experts can make them appropriately trust such models and favor their inclusion in their decisional processes. When framed within ITS systems and processes, explainable data-based models can help decision makers understand how information is processed along the data modeling pipeline, including the quantification of insightful managerial aspects such as the relationship and sensitivity of a predicted output with respect to its inputs.
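One model-agnostic way to surface such input–output relationships is permutation importance: shuffle one input feature at a time and measure how much the model's error grows. The sketch below uses a toy demand model with hypothetical features; it is not taken from any of the cited works:

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Increase in mean absolute error when each input feature is shuffled:
    a post-hoc, model-agnostic indication of which inputs drive the output."""
    rng = random.Random(seed)

    def mae(rows):
        return sum(abs(predict(r) - t) for r, t in zip(rows, y)) / len(y)

    baseline = mae(X)
    importances = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the link between feature j and the target
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        importances.append(mae(X_perm) - baseline)
    return importances

# Toy demand model: the output depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2 (pure noise).
rng = random.Random(1)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
y = [10 * row[0] + 1 * row[1] for row in X]
predict = lambda row: 10 * row[0] + 1 * row[1]  # stands in for a fitted model
imp = permutation_importance(predict, X, y, n_features=3)
print([round(v, 2) for v in imp])  # feature 0 dominates, feature 2 is ~0
```

For a traffic manager, such a ranking ("hour-of-day matters ten times more than weather") is often more actionable than the raw prediction itself.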

• *Trade-off between accuracy and usability:* when ITS data-based models aim at superior performance, they often work in ideal scenarios where the real context of application is disregarded; should that context apply in practice, the claimed suitability of the developed model for its particular purpose could be compromised. For instance, the goodness of an ITS model devised to detect users' typical trajectories can be measured with regard to the accuracy of the detected trajectories. If the pursuit of superb performance relies on a constant stream of data (hence, eventually depleting the user's phone battery), it could be a pointless achievement when put into practice. This particular example has already been considered by plenty of researchers [94,95]. However, there is a long way to go in this regard, as most ITS research developments consider only ideal circumstances, without regard to the implications that an accurate design could have on its final usability.
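This trade-off can be made explicit by evaluating the same quality measure under progressively sparser sampling, a rough proxy for battery consumption. The sketch below is a deliberately crude illustration on a synthetic GPS trace, using trajectory length as a stand-in for trajectory fidelity:

```python
import math

def downsample(trace, keep_every):
    """Keep one GPS fix out of every `keep_every` (fewer fixes ~ less battery drain)."""
    return trace[::keep_every]

def path_length(trace):
    """Planar length of the polyline through the fixes."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(trace, trace[1:])
    )

# Synthetic, zig-zagging trip sampled densely (every 0.1 time units)
trace = [(0.1 * i, math.sin(0.1 * i)) for i in range(200)]
full = path_length(trace)
for k in (1, 5, 20):
    approx = path_length(downsample(trace, k))
    print(f"1/{k} of fixes: length error {100 * (full - approx) / full:.1f}%")
```

Plotting such an accuracy-versus-sampling curve makes it possible to choose the sparsest (cheapest) sampling rate whose error is still tolerable for the application, instead of defaulting to the most accurate and most battery-hungry configuration.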
