1. Introduction
Exposure to high levels of particulate matter (
) is a major social problem [
1] due to its impacts on human health, including pulmonary and cardiovascular diseases [
2,
3]. One of the main challenges in decision making related to
control is that win–win solutions that also consider other pollutants, such as nitrogen oxides (NOx) and ozone (O3), are usually complex to identify and implement [
4,
5,
6,
7]. For this reason, having detailed information about the level of all of the significant air pollutants over a certain area is a key issue in decision-making processes. In this context, the use of integrated information coming from regional networks and novel/private networks supported by low-cost technology [
8,
9] has become increasingly important, mainly because such networks can provide suitable information for chemical transport models (CTMs), allowing them to compute concentrations far away from the official monitoring network stations [
10,
11,
12].
In principle, four main techniques for the measurement of
are presented in literature [
13]: (1) gravimetric analysis of pumped and filtered particles; (2) tapered element oscillating microbalance (TEOM); (3) beta-attenuation; (4) light scattering. The first three of these techniques are quite expensive, so their use is limited to regional authorities, private companies and research groups [
13]. Light scattering, by contrast, is a relatively low-cost technique, but it is often affected by significant biases [
14].
The objective of this work is to evaluate the possibility of implementing a virtual sensor for daily mean concentration starting from the data measured by sensors detecting other pollutants and meteorological variables. In particular, the virtual sensors applied in this work are based on daily mean concentration and meteorological variables, such as wind speed, rainfall, relative humidity and temperature.
As the name indicates, virtual sensors can be broadly described as software that computes the value of a certain variable without measuring it directly, using measurements that are physically or chemically related to the variable to be reproduced [
15]. They assume a key role when it is not possible to place a physical sensor due to any kind of limitations (e.g., unreachable position, high cost). There are two possible approaches to virtual sensor implementation:
- 1.
Data-driven: in this approach, time series of input and output variables are collected from direct measurements and are used to compute an approximate mathematical relationship between the measured variables and the sensor output [
16];
- 2.
Deterministic: in this approach, the (possibly approximated) physical/chemical relationships among input and output variables are used to compute the unmeasured variable through the virtual sensor [
17].
This work presents a data-driven approach based on wavenet models to implement a
virtual sensor using
and meteorological variables. All these variables are strictly related to the phenomena involved in the formation and accumulation of
in the atmosphere; they were chosen because low-cost sensors with adequate performance [
18] are reported in the literature, which makes it possible to identify a virtual sensor and therefore to define a low-cost
measuring network. Wavenets are data-driven models resulting from the integration of wavelet theory and neural network models [
19]. Their main applications are related to sound management/filtering [
20], although their nonlinear function approximation (and thus forecasting) properties have also been applied with good results in other fields, such as energy systems [
19,
21]. These approximation properties make them suitable for environmental monitoring and forecasting applications, but there is still no literature on their application to reproduce
or other air pollutants. Therefore, since artificial neural networks are widely used in this field [
4,
22,
23], wavenets could also be useful for the definition of a
virtual sensor. The paper is organized in two main parts, a methodological one (
Section 2), where the basics of artificial neural networks, wavelet theory and wavenets are introduced, and a second part presenting the evaluation of the results on a test case.
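As a rough illustration of what combining wavelet theory with a neural network can look like, the sketch below evaluates one common single-hidden-layer form of a wavelet network, in which each hidden unit applies a dilated and translated mother wavelet to a projection of the input. The Mexican-hat wavelet, the parameter layout and the names `mexican_hat` and `wavenet_output` are assumptions made for this example and are not taken from the paper's formulation.

```python
import math


def mexican_hat(z):
    """Mexican-hat (Ricker) mother wavelet, up to a normalization constant."""
    return (1.0 - z * z) * math.exp(-0.5 * z * z)


def wavenet_output(x, units, bias=0.0):
    """Evaluate a single-output wavelet network (illustrative form only).

    x     : list of input features
    units : list of (weight, direction, translation, dilation) tuples,
            one per hidden wavelet unit (hypothetical parameterization)
    bias  : constant offset added to the output
    """
    y = bias
    for weight, direction, translation, dilation in units:
        # Project the input onto the unit's direction, then shift and scale it.
        projection = sum(d * xi for d, xi in zip(direction, x))
        z = (projection - translation) / dilation
        y += weight * mexican_hat(z)
    return y
```

In a data-driven setting, the weights, translations and dilations would be identified from measured input and output time series; that identification step is not shown here.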
3. Results and Discussion
3.1. Case Study and Dataset Definition
The aim of this work is the definition of a virtual sensor to compute
daily average concentrations starting from the measured data of daily average
concentration and the measured values of four meteorological variables: average daily wind speed, total daily rainfall, average daily relative humidity and average daily temperature
T. The selection of
as the input variable is due to the fact that its levels are strongly related to
ones, as they share some emission drivers (e.g., road traffic, domestic heating) and chemical pathways (e.g., the formation of secondary inorganic aerosol in the form of ammonium nitrate). On the other hand, the selected meteorological variables can be related to general deposition or dispersion conditions (mainly rainfall and wind speed) or to the formation of secondary aerosol by condensation. Thus, the
in Equation (
4) is the daily
concentration computed by the model, which is referred to as n
from now on. Moreover, the input
x of the wavenet function is time dependent, so
, and it includes both
concentrations and meteorological variables for the day t and the previous days, as in:
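A minimal sketch of how such a time-dependent input vector could be assembled from daily series is given below. The function name `lagged_inputs`, the variable keys and the ordering of the entries are assumptions made for illustration; the text only states that the input includes the values for day t and for the previous days.

```python
def lagged_inputs(series, lags, t):
    """Build the input vector x(t) from daily time series.

    series : dict mapping a variable name to its list of daily values
    lags   : dict mapping a variable name to the number of previous days
             to include (0 means day t only); names are hypothetical
    t      : index of the current day
    """
    x = []
    for name, n_days in lags.items():
        if t - n_days < 0:
            raise ValueError("not enough history for " + name)
        # Append the values for day t, t-1, ..., t-n_days.
        x.extend(series[name][t - k] for k in range(n_days + 1))
    return x
```

For example, with a memory of two days on the concentration input and no memory on temperature, `lagged_inputs(series, {"conc": 2, "temperature": 0}, t)` returns four values: three concentrations followed by one temperature.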
In order to test the presented methodology, a series of models has been trained and validated to reproduce the
daily mean concentrations starting from different inputs, using data measured by 14 monitoring stations belonging to the Lombardy region (Italy) monitoring network (
Figure 4).
In more detail, the data from the year 2019 have been used ( available raw data tuples). The performance of the different models has been evaluated using a leave-p-out approach with p = 4, i.e., the 14 available stations minus the 10 used for identification. Following this approach, 100 tests have been performed for each model configuration, with 10 stations being used for the identification and the data of the remaining, randomly selected stations being concatenated to define the metastation used for the validation.
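The validation scheme described above can be sketched as follows, under stated assumptions: with 14 stations and 10 used for identification, 4 stations are held out in each test, and "metastation" is read here as the concatenation of the held-out stations' records. The function names, the seed and the data layout are hypothetical.

```python
import random


def leave_p_out_splits(stations, n_identification=10, n_tests=100, seed=0):
    """Draw random identification/validation station splits.

    Returns a list of (identification_stations, validation_stations) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    splits = []
    for _ in range(n_tests):
        shuffled = list(stations)
        rng.shuffle(shuffled)
        identification = sorted(shuffled[:n_identification])
        validation = sorted(shuffled[n_identification:])
        splits.append((identification, validation))
    return splits


def metastation(data, held_out):
    """Concatenate the records of the held-out stations into one series."""
    records = []
    for station in held_out:
        records.extend(data[station])
    return records
```

Each model configuration would then be identified on the first group of stations and scored on the metastation built from the second group, once per split.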
3.2. Configuration Tests
In order to evaluate the capability of the methodology presented in
Section 2 to compute
concentrations, all the possible configurations among the input variables have been considered, and the relative models
trained.
In principle, the different configurations can be grouped into three categories:
Configurations including only concentration as input;
Configurations including only meteorological variables as input;
Configurations including both concentrations and meteorological variables as input.
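As an illustration of how all the input combinations and their three categories could be enumerated, a short sketch follows. The variable names (including `"conc"` for the measured pollutant input) and the function names are placeholders chosen for this example.

```python
from itertools import combinations


def input_configurations(variables):
    """Return every non-empty combination of the candidate input variables."""
    configs = []
    for size in range(1, len(variables) + 1):
        configs.extend(combinations(variables, size))
    return configs


def category(config, pollutant="conc"):
    """Assign a configuration to one of the three categories in the text."""
    has_concentration = pollutant in config
    has_meteorology = any(name != pollutant for name in config)
    if has_concentration and has_meteorology:
        return "both"
    if has_concentration:
        return "concentration only"
    return "meteorology only"
```

With one concentration input and four meteorological variables, this enumeration yields 31 configurations before the memory of each input is varied.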
For each test, an analysis of the memory of the systems, i.e., an evaluation of the performances of varying , , , and , has been performed. On the basis of the knowledge of the phenomena related to the formation of in atmosphere, a maximum value of 5 days can be considered for these parameters. Each model has been evaluated on the basis of the following three different statistical indexes:
Normalized Root Mean Squared Deviation:
Root Mean Squared Error
Correlation Coefficient
where and are, respectively, the t-th values of the model output and of the validation dataset, and and are their mean values. From the huge set of performed tests, only the best-performing ones are presented in this context, in particular for the combination of multiple input.
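The three indexes can be computed as in the sketch below. The root mean squared error and the Pearson correlation coefficient follow their standard definitions; the normalization used for the normalized deviation is an assumption here (the range of the validation data), since other choices, such as the mean or the standard deviation, are also in common use.

```python
import math


def rmse(model, observed):
    """Root mean squared error between model output and validation data."""
    n = len(observed)
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, observed)) / n)


def nrmsd(model, observed):
    """RMSE divided by the observed range (assumed normalization)."""
    return rmse(model, observed) / (max(observed) - min(observed))


def correlation(model, observed):
    """Pearson correlation coefficient between the two series."""
    n = len(observed)
    mean_m = sum(model) / n
    mean_o = sum(observed) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, observed))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_m * var_o)
```

A constant offset between the two series leaves the correlation at 1 while the error-based indexes grow, which is why the indexes are read together.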
3.3. Validation Results
3.3.1. Models with as Input
This first class of models includes only
daily mean concentrations as input. This is due to the fact that
and
concentrations are generated by several common emitting activities (e.g., road transport) and that the secondary inorganic fraction of
is composed, in part, of nitrates, in particular ammonium nitrate, whose formation depends on the
concentration in atmosphere.
Table 1 highlights that the performances are quite good in terms of correlation, with values around 0.74, and acceptable in terms of root mean squared error, with a normalized root mean squared deviation (which allows the root mean squared error to be compared with the overall variability of the output time series) of around 0.1.
From these results, it is clear that an increase in the memory of the system does not lead to significant impacts on the performances and on the behavior of the model. The negligible increase in performances for the test with
does not justify the increasing number of parameters.
Table 2 shows the performances for the same configurations for the part of the time series where
concentrations higher than
have been measured. The table shows that the model has considerable difficulty in reproducing high concentrations, as highlighted by the marked decrease in the statistical indexes.
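Restricting the evaluation to the high-concentration part of the series amounts to keeping only the days on which the measured value exceeds the chosen threshold, as in the sketch below. The function name is hypothetical and the threshold is left as a parameter.

```python
def above_threshold(model, observed, threshold):
    """Keep only the pairs whose measured value exceeds the threshold.

    Returns the filtered model outputs and the filtered measurements,
    in the original time order.
    """
    pairs = [(m, o) for m, o in zip(model, observed) if o > threshold]
    kept_model = [m for m, _ in pairs]
    kept_observed = [o for _, o in pairs]
    return kept_model, kept_observed
```

The statistical indexes are then recomputed on the two filtered series. The selection is made on the measured values, not on the model output, so that days on which the model underestimates a peak are still included.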
3.3.2. Models with Meteorological Variables as Input
The second class of models considers only the meteorological variables as input. These tests allow an assessment of the relative “importance” between meteorology and
concentration for the computation of
levels.
Table 3 and
Table 4 show poor performances, with the limited exception of the cases with temperature
T as input. The results therefore suggest that meteorological conditions alone are not enough to estimate
concentrations and may at best be used to improve the performances in combination with the
concentrations.
3.3.3. Models with and Meteorological Variables as Input
The last class of models considers both the meteorological variables and the
daily mean concentration as input in order to evaluate if the joint use of these information sources leads to an increase in the performances.
Table 5 presents the results with
concentrations coupled with a meteorological variable at a certain time. The performances are in line with those of the models with only
as an input. Moreover, the combined use of more than one meteorological variable did not lead to a significant increase in performance (
Table 6,
Table 7 and
Table 8). The only slight improvement can be seen for high concentrations when the temperature is used as input (
Table 9,
Table 10,
Table 11 and
Table 12), but also in this case the performances do not seem good enough (correlation coefficient close to 0.52) in the reproduction of the peaks. These results suggest that, to reproduce mean
levels in this domain, only the
concentrations should be used, thus relying on cheaper sensors. Nevertheless, a limit on the performances exists, which did not allow the reconstruction of peak concentrations.
3.4. Comparison to State-of-the-Art Models
In this section, the wavenet approach used in this work is compared with two state-of-the-art models: (1) a K-nearest neighbors (KNN) model and (2) an artificial neural network (ANN)-based model, both of which are often used in this context to capture the dynamics of the
[
29]. The comparison (
Table 13) shows that the performances of the best-identified wavenet are considerably better than those of the KNN model and very similar (slightly better for high orders) to those of the ANN ones. Moreover, it should be stressed that the best wavenet model achieves these performances with limited complexity and with a limited number of variables (only
concentration) with respect to the other approaches.
Figure 5,
Figure 6 and
Figure 7 present the time series plots for the best configuration of the wavenet, artificial neural network and KNN models, respectively. As expected, the behaviour of the wavenet and ANN models is very similar, with the former showing slightly better performance for the low values close to sample no. 800. In general, the KNN model reproduces higher values but, as also indicated by its lower correlation coefficient, its time series rarely follows the values and the gradient of the measured data.
4. Conclusions
In this work, a data-driven, wavenet-based virtual sensor for daily mean concentration is presented and evaluated. Different model configurations have been tested on data measured by the Lombardy regional monitoring network. The results show good agreement between the output of the virtual sensor and the measured data used for validation when the daily mean concentration is used as input, in particular around the mean concentration values. However, the models fail to reproduce the peak concentrations, and this behaviour does not change even if other inputs, such as meteorological data, are used. Nevertheless, the performances show that this approach can be used to produce supporting information that integrates the regional monitoring network and that can be made available through app/web services thanks to its relatively fast computation.