1. Introduction
According to the World Health Organization (WHO), air pollution is the second leading cause of noncommunicable diseases, such as stroke, cancer, and heart disease, and of pulmonary diseases, such as chronic obstructive pulmonary disease and lower respiratory infections. Ambient air pollution accounts for an estimated 4.2 million deaths per year [1]. A large share of the world's population lives in places where air-quality levels exceed WHO limits and the suggested standards for a healthy life [2,3,4]. A major component of air pollution is particulate matter 2.5 (PM2.5), tiny airborne particles two and one-half microns or less in width. Studies suggest that long-term exposure to fine particulate matter may be associated with increased rates of chronic bronchitis, reduced lung function, and increased mortality from lung cancer and heart disease. Furthermore, nitrogen dioxide (NO₂) is another of the main air-quality pollutants of concern and is typically associated with vehicle emissions. The annual EU limit for NO₂ was widely exceeded across Europe in 2017, with some 86% of these exceedances detected at roadside monitoring locations.
The red and violet colors in the map in Figure 1 show the areas of European countries in which the limits were exceeded multiple times in past years. Similar maps are available for the other main air pollutants. In many countries, pollution-related diseases can only be significantly reduced by improving air quality. Turning air-pollution-reduction goals into policies to combat noncommunicable diseases yields multiple benefits for the environment, the economy, and health. With this work, we address these concerns by putting data science at the service of public policies. According to the European Environment Agency, air pollution can be reduced by monitoring and modeling air quality, by collecting data with sensors on roads and on vehicles, and by maintaining emission inventories. Emission-control strategies should reduce the use of private transport; improve public transport and reduce its emissions; increase the use of renewable energy; and apply contingency measures, new policies, and rules that, for instance, encourage the planning of more compact cities.
In this work, we employ machine learning models, specifically Bayesian networks, to analyze data from sensors installed on the buses of a public transport company in a European city. The sensors collect data about the vehicle and its use (acceleration, braking, speed, stop durations with the engine on, etc.) together with some contextual information about the vehicle location (such as altitude). An analysis of the sensor data with machine learning algorithms, applied in predictive-maintenance procedures, can also improve vehicle equipment maintenance, reducing the costs due to stop times for faults and repairs. Several related works exist in the literature. The application of Bayesian networks for monitoring natural resources and applying policies was proposed in [5]. The majority of the works that monitor fuel consumption in vehicles applied predictive models. Schoen et al. [6] adopted artificial neural networks (ANN) to predict average fuel consumption in a fleet of heavy vehicles. They summarized consumption by distance rather than by time in order to avoid a scale conversion when predicting average fuel consumption. We apply a similar technique in this work because we build models that employ the fuel consumed per kilometer. Perrotta et al. [7] compared support vector regression (SVR), random forest (RF), and ANN models to predict fuel consumption in heavy vehicles. Moradi et al. [8] used multiple models in cascade and confirmed that ANN outperforms the other models. The goals of these works were to reduce costs and to obtain better routing of the fleets, even though they found it difficult to determine an accurate estimation of the fuel level. Yao et al. [9] used smartphones to collect vehicle mobility data from their global positioning system (GPS), combined with data from on-board diagnostics (OBD) terminals, to predict fuel consumption based on taxi drivers' driving styles. They compared ANN, SVR, and RF and showed that all of them reach satisfactory prediction performances, with random forest achieving superior accuracy. Rimpas et al. [10] selected some vehicle-monitoring parameters retrieved through the OBD-II diagnostics protocol and related them to vehicle operation and fuel consumption: the proportion of oxygen in exhaust gases measured by a lambda sensor; the short-term fuel trim (STFT), which tracks immediate changes in fuel flow and serves as a proxy for how hard the driver presses the accelerator pedal; the air flow measured by a mass air flow (MAF) sensor, used as an indicator of engine malfunction; the vehicle speed sensor (VSS); and the engine coolant temperature (ECT) sensor, since coolant temperature affects engine overheating and fuel consumption. The authors of [11] quantified the uncertainty in measuring fuel consumption in both light and heavy vehicles and showed that the uncertainty is largest in urban conditions. In [12], the authors considered the prediction of fuel consumption in public buses using a multivariate data set including several explanatory variables. They compared RF, gradient boosting (GB), and ANN; based on their analysis, RF produces a more accurate prediction than both GB and ANN. In [13], the authors included weather variables in the fuel-prediction task and found them useful for an accurate prediction. Quite often in the above studies, the sample vehicles were comparable (in terms of make, model, and age) so that the type and status of the vehicle does not influence fuel consumption. We made a similar choice in the selection of heavy vehicles (buses of the same model, type, mass, length, and age).
In this work, we used sensors with the sole purpose of collecting data about fuel consumption and of monitoring the drivers' usage of the bus's resources (fuel, brakes, acceleration, and air conditioning). The goal was to monitor fuel consumption and its contextual conditions, with the ultimate objective of providing a descriptive and explainable model of the variables that influence and cause fuel consumption and that ultimately produce air pollution. We employed Bayesian networks, which give us a single model serving multiple tasks: description, with a graph, of the dependence relationships between the variables; identification of the variables that are independent from the target; selection of the variables that have an impact on the target; quantification of the amount of that impact; prediction of the target; simulation of the variables in a scenario; and intervention in the scenario by changing some of the variables.
The first contribution of this work is to provide a public data set [14] from sensors installed on board public transport vehicles, with information about vehicle usage and fuel consumption. The sensors communicate their measures via the controller area network (CAN-bus), a specialized internal communications network that interconnects components inside a vehicle [15]. CAN is a robust vehicle standard designed to allow micro-controllers and devices to communicate with each other without a host computer. It is a message-based protocol, originally designed for multiplex electrical wiring within automobiles, but it can be applied in many other contexts. For each device (sensor or actuator), the data in a frame are transmitted sequentially. Thanks to this, the vehicle becomes an advanced, computerized control system available on board and capable of storing sensor data.
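To give a concrete feel for the message-based protocol, the sketch below decodes one hypothetical CAN frame payload. The identifier (0x0C1) and the signal layout (engine speed in the first two bytes, coolant temperature in the third, with the scale and offset shown) are invented for illustration; real layouts are specified per vehicle, typically in a DBC file.

```python
# Minimal sketch of decoding a classic CAN frame payload (max 8 bytes).
import struct

def decode_frame(can_id, data):
    """Decode one payload; returns a dict of signals or None if the
    identifier is not recognized."""
    if can_id == 0x0C1:  # hypothetical "engine status" message
        # big-endian: 16-bit raw rpm, then 8-bit raw coolant temperature
        rpm_raw, coolant_raw = struct.unpack_from(">HB", data)
        return {"rpm": rpm_raw * 0.25,         # assumed 0.25 rpm per bit
                "coolant_c": coolant_raw - 40}  # assumed 40 degree offset
    return None

frame = bytes([0x1F, 0x40, 0x5A, 0, 0, 0, 0, 0])
print(decode_frame(0x0C1, frame))  # {'rpm': 2000.0, 'coolant_c': 50}
```

In a deployment such as ours, frames like these are logged continuously and later aggregated into the per-trip variables analyzed in this paper.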
Thanks to the collected data, we assessed the sensor outcomes to support decision making. We employed Bayesian networks (BNs) as an essential tool able to provide descriptive and explainable models of the relationships between the monitored variables, dependence relations that might also represent cause–effect relationships [16]. In fact, a BN captures the independence and the conditional independence among the variables: in a BN, we represent variables with nodes and dependence relationships with edges. The presence of a path connecting a variable V with a target variable T makes it clear that we should change the values of V in order to modify the values of the target (query) variable T. Instead, a change in variables not connected to the target by any path should not cause any effect on it. The main contribution of this work is to provide a BN of the variables monitored by the sensors connected to the CAN-bus. This BN shows which variables we should change to control the fuel-consumption variable. Furthermore, BNs also support simulation of the behavior of the system. We use this feature of the BN model to generate synthetic data for a set of sensors that obey a known ground truth [14]. The purpose is to verify the correspondence between the cause–effect dependencies reconstructed from the data and the true ones. We made these synthetic data publicly available too [14].
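The mechanism behind such a synthetic, ground-truth data set can be sketched as ancestral sampling: fix a small network by hand and draw each variable after its parents. The three variables and all probabilities below are invented for illustration and are not the network or values used in our data set.

```python
# Ancestral sampling from a hand-specified toy network:
#   steep -> brake_usage -> high_fuel_per_km (with steep also a parent
#   of high_fuel_per_km). All probabilities are invented.
import random

random.seed(0)

def sample_one():
    steep = random.random() < 0.3                      # root node
    brake = random.random() < (0.7 if steep else 0.2)  # depends on slope
    p_high = (0.8 if (steep and brake) else
              0.5 if steep else
              0.3 if brake else 0.1)                   # depends on both
    return {"steep": steep, "brake_usage": brake,
            "high_fuel_per_km": random.random() < p_high}

data = [sample_one() for _ in range(1000)]
rate = sum(r["high_fuel_per_km"] for r in data) / len(data)
print(round(rate, 2))  # empirical marginal of the target
```

Because the generating edges are known, a structure-learning algorithm run on `data` can be checked for how well it recovers them, which is exactly how we use our synthetic data set.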
BNs are also employed to assess the observed phenomena and to perform an intervention analysis on the causal variables so that the monitored target can be improved. As a result, we can provide results and suggestions to drivers and policy-makers with the goal of improving air quality and reducing fuel costs. This is the third contribution of this work. One of the main results of this intervention analysis is to show that a change in the vehicle paths (longer but with a reduced slope) translates into a decrease in fuel consumption. Other results quantify the impact of air conditioning and of brake usage on fuel consumption.
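As a toy illustration of what an intervention computes (not our actual network), consider a hand-specified chain steep → brake → fuel with invented probabilities: P(fuel | do(brake = b)) is obtained by cutting the edge into the intervened variable, fixing its value, and summing out the rest.

```python
# Intervention on a tiny discrete network (all probabilities invented).
P_STEEP = 0.3
P_FUEL_GIVEN = {(True, True): 0.8, (True, False): 0.5,   # (steep, brake)
                (False, True): 0.3, (False, False): 0.1}

def p_fuel_do_brake(b):
    """P(high fuel | do(brake=b)): brake is set, not observed, so its
    own conditional distribution is ignored and steep is summed out."""
    total = 0.0
    for steep in (True, False):
        p_s = P_STEEP if steep else 1 - P_STEEP
        total += p_s * P_FUEL_GIVEN[(steep, b)]
    return total

print(p_fuel_do_brake(True))   # 0.3*0.8 + 0.7*0.3 = approx. 0.45
print(p_fuel_do_brake(False))  # 0.3*0.5 + 0.7*0.1 = approx. 0.22
```

The difference between the two values quantifies the expected effect on the target of forcing the cause, which is the kind of quantity our intervention analysis reports for the real variables.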
The main difficulty with BNs is that the search space of the possible alternative models grows super-exponentially in the number of variables (graph nodes) [17]. Therefore, it is customary to employ approximate algorithms [18,19,20] driven by heuristics that rank and evaluate the alternatives. As a result, the algorithms converge in tractable time but may reach different, suboptimal solutions; their results, as we experienced and show in this work, might differ. In this paper, we deal with some representative algorithms for BN synthesis from data that are popular in the BN community [19,20]. We use the BIC score [21], a derivation of the likelihood of the data under the assumed BN model, as a heuristic to evaluate the alternative networks. We revised these algorithms and compared their solutions on the sensor data against a brute-force alternative, which converges to the global optimum of the BIC score within the search space. The brute-force alternative is feasible (provided the number of variables is kept limited to a few units) thanks to high-performance computing, which makes the workload manageable by distributing the computation among multiple servers and CPUs executing in parallel. This is the fourth contribution of this work and one of the novelties of our approach: a comparison of the results of different algorithms for BN generation from data, which allows us to rank them and to evaluate how closely they approach the overall optimum found by brute force. This is not common in the BN community, since BNs are usually first provided by domain experts and later validated against evidence from data [22,23]. To overcome the discrepancies among BNs, we compared and ranked them by proposing and adopting an alternative method: Granger causality [24]. This is another novelty of our approach and the last, but not least, contribution of our work. Granger causality and its statistical test employ vector auto-regression (VAR) as a tool to predict the target in time with the aid of multiple variables (the variables on the pathway from causes to the effect). In essence, the statistical test in the Granger causality method verifies that the prediction of the target with the aid of the cause variables is better than without them. This criterion is applicable only when the flow of values of these variables is stored over time. Granger causality is commonly judged a weaker principle than the stricter principle of probabilistic dependency between cause and effect: with Granger causality, the existence of a causal relation between cause and effect is verified only in time, thanks to the ability of the cause to predict and anticipate the effect [25,26].
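The Granger idea can be made concrete with a deliberately simplified, single-lag sketch (a full test would use a VAR with several lags and a proper F-distribution cutoff): x "Granger-causes" y if past values of x improve the prediction of y beyond y's own past. The synthetic series below is generated so that x really does drive y.

```python
# Restricted model:   y_t ~ y_{t-1}
# Unrestricted model: y_t ~ y_{t-1} + x_{t-1}
# Compare residual sums of squares with an F-type statistic.
import random

random.seed(2)
T = 500
x = [random.gauss(0, 1)]
y = [random.gauss(0, 1)]
for _ in range(T):
    x.append(0.6 * x[-1] + random.gauss(0, 1))
    y.append(0.5 * y[-1] + 0.4 * x[-2] + random.gauss(0, 0.5))  # x leads y

a = y[:-1]; b = x[:-1]; t = y[1:]          # lagged predictors and target
Saa = sum(v * v for v in a); Sab = sum(u * v for u, v in zip(a, b))
Sbb = sum(v * v for v in b)
Say = sum(u * v for u, v in zip(a, t)); Sby = sum(u * v for u, v in zip(b, t))

c_r = Say / Saa                             # restricted least squares
rss_r = sum((ti - c_r * ai) ** 2 for ti, ai in zip(t, a))

det = Saa * Sbb - Sab * Sab                 # unrestricted (2x2 normal eqs)
c1 = (Sbb * Say - Sab * Sby) / det
c2 = (Saa * Sby - Sab * Say) / det
rss_u = sum((ti - c1 * ai - c2 * bi) ** 2
            for ti, ai, bi in zip(t, a, b))

f_stat = (rss_r - rss_u) / (rss_u / (len(t) - 2))
print(round(f_stat, 1))  # a large value means x helps predict y
```

The same comparison run with the roles of x and y swapped would yield a small statistic here, which is how the criterion distinguishes cause from effect in time.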
4. Conclusions
This work presents different contributions with the purpose of analyzing the conditions under which fuel consumption occurs in vehicles and of understanding how to reduce it by intervening in the scenario. We provided a collection of data from sensors installed on buses used for public transport. Thanks to the sensor data analysis, we discovered that, in some contextual conditions (with a fuel consumption per kilometer that does not exceed 0.75 L per kilometer), it is preferable to choose a longer but less steep path over a shorter one. As a consequence of the analysis of the cause–effect relationships between the variables and the target, we quantified the impact of each cause on the target: a decrease of one unit of air_cond_ptime (percentage of travel time with air conditioning) yields a corresponding expected decrease in fuel_per_km, as does a decrease of one unit of the percentage of time with brake_usage and a decrease of one unit of stop_ptime (stop percentage time with engine on). The important effect of this last variable is confirmed in the literature [6].
We tested both the approximate algorithms driven by the BIC score and the brute-force algorithm, with the purpose of comparing their ability to converge to the same resulting networks. We evaluated their results with Granger causality, a third-party criterion based on the time series formed over time by the observed variables. This is an original contribution to the scientific community of Bayesian networks, which are usually scored by BIC or K2. According to Granger causality, we are also able to rank the alternatives, even when multiple BNs share the same score. We also compared BNs by their ability to perform feature selection and to predict the target variable.
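The essence of the brute-force approach (a toy version, not our Algorithm 1 or our data) can be sketched on three binary variables: enumerate every DAG, score each with BIC, and keep the global optimum. With n variables the number of DAGs grows super-exponentially, which is why the exhaustive search needs high-performance computing beyond a handful of nodes; here, 3 nodes give only 25 DAGs.

```python
# Brute-force BIC structure search over all DAGs on 3 binary variables.
# Toy data are drawn from the invented chain steep -> brake -> fuel.
import itertools, math, random

random.seed(1)
VARS = ["steep", "brake", "fuel"]

def draw():
    s = random.random() < 0.3
    b = random.random() < (0.8 if s else 0.1)
    f = random.random() < (0.9 if b else 0.2)
    return {"steep": s, "brake": b, "fuel": f}
DATA = [draw() for _ in range(2000)]

def is_acyclic(parents):
    seen, done = set(), set()
    def visit(v):
        if v in done: return True
        if v in seen: return False          # back edge: cycle
        seen.add(v)
        ok = all(visit(p) for p in parents[v])
        done.add(v)
        return ok
    return all(visit(v) for v in VARS)

def bic(parents):
    """Log-likelihood of DATA under the DAG minus 0.5*log(N) per free
    parameter (a binary node with p parents has 2**p free parameters)."""
    score, n = 0.0, len(DATA)
    for v in VARS:
        counts = {}
        for row in DATA:
            key = tuple(row[p] for p in parents[v])
            cell = counts.setdefault(key, {True: 0, False: 0})
            cell[row[v]] += 1
        for cell in counts.values():
            tot = cell[True] + cell[False]
            for c in cell.values():
                if c:
                    score += c * math.log(c / tot)
        score -= 0.5 * math.log(n) * (2 ** len(parents[v]))
    return score

edges = list(itertools.permutations(VARS, 2))   # all 6 directed edges
best = None
for mask in itertools.product([0, 1], repeat=len(edges)):
    chosen = [e for e, m in zip(edges, mask) if m]
    parents = {v: [p for p, c in chosen if c == v] for v in VARS}
    if not is_acyclic(parents):
        continue
    s = bic(parents)
    if best is None or s > best[0]:
        best = (s, chosen)
print(best[1])  # globally BIC-optimal edge set
```

On data this strongly dependent, the winner lies in the Markov equivalence class of the generating chain (its skeleton links steep with brake and brake with fuel), although the edge directions within the class cannot be distinguished by the score alone, which is one motivation for our Granger-based ranking.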
We also provided a synthetic data set, created with a known ground truth, whose purpose is to test algorithms for the synthesis of BNs from data and to verify their convergence toward the ground truth. We discussed the comparison results: the networks sometimes agree and sometimes do not. This mismatch is perhaps due to the multiple maxima that can exist in the large search space of solutions; this occurs especially in the synthetic data, in which the ground truth is known and the data support similar links between cause and effect but in opposite directions. The observed mismatches on the edges might also be a consequence of the heuristics, which are used to break ties among rankings of the alternatives, to choose edge directions (a choice of cause and effect that often requires the experts' advice), and to avoid cycles in the BN graphs.
In summary, the contributions of our work are as follows:
Bayesian networks were applied to the analysis of fuel consumption. Past studies on fuel consumption in vehicles (reported in Section 1) applied only machine learning predictive models (based on SVR, ANN, random forest, or gradient boosting). All of them have the sole goal of predicting the target value. None provide machine learning models that are also able to perform the following:
- (a) describing and discovering the cause–effect relationships between the variables and the target (Section 3) and
- (b) performing an intervention analysis on the causes, with the goal of achieving a desired impact on the target and quantifying this impact (Section 3.1.3).
Bayesian networks are powerful, and we used them to reach multiple goals: feature selection (Section 3.1.2), whose outcomes we compared with another standard method (VIF [37]); predictive modeling (target estimation, with results shown in Table 3); scenario simulation (Section 3.3.1); intervention analysis (Section 3.1.3); and counterfactual (what-if) analysis.
Comparing the results of approximate (heuristic-driven) algorithms for Bayesian networks with a brute-force algorithm, an original one implemented for this work (Algorithm 1), was made possible by high-performance computing technology, which lets us afford the extremely high computational load of traversing the huge search space of possible networks by partitioning it and spreading the evaluation of the alternative graphs over many servers. The outcome of this comparison (Section 3.2) can help analysts facing uncertainty about which Bayesian network to use.
The use of the Granger causality concept was introduced and formalized for the evaluation of Bayesian networks (Section 2.4). Granger causality was used as an independent, third-party notion to compare, evaluate, and rank the different Bayesian networks generated from the same data by different algorithms.
Bayesian network discovery is customarily used to test domain knowledge previously distilled in the form of an already available graph [22,23,36,41]. Differently, in this paper, we did not start from an already available graph but directly from the collected (sensor) data, and we provided experts with assumptions about this knowledge (cause–effect relationships) in the form of a Bayesian network.
Last but not least, we provided two public data sets to the scientific community [14]: real data from buses and a synthetic data set with ground truths, useful for testing Bayesian network algorithms and time-series analyses.