3.1. Statistical Description of the Data
We performed PCA to investigate whether any variable can explain most of the dataset variability. PCA applied to all the available meteorological variables highlighted, as expected, that among the 14 variables there were pre-existing groups of highly correlated variables, such as the minimum, maximum and average temperature. This preliminary PCA allowed us to select the most informative variables among the highly correlated ones. Ultimately, 7 meteorological variables were used in the analysis: maximum temperature, minimum relative humidity, precipitation, maximum wind speed, prevailing wind direction sector, mean global radiation, and mean atmospheric pressure.
Meteorological and air quality data behave differently in different seasons of the year; see Table 1 and Table 2.
This seasonality is expected for meteorological data but not for air quality data. In fact, except for , all air quality variables show a significant increase during winter (Kruskal–Wallis test, p-value < ), an effect that could probably be explained by the increase in urban traffic or by heating during winter.
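As an illustration of the seasonal comparison, the following is a minimal sketch of a Kruskal–Wallis test for two groups (summer versus winter), written with the standard library only. The function names and the sample values are hypothetical. With two groups the H statistic follows a chi-square distribution with one degree of freedom, so the p-value can be obtained from the complementary error function.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_two_groups(summer, winter):
    """Return (H, p) for two samples; assumes not all values are identical."""
    pooled = list(summer) + list(winter)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_summer = sum(ranks[:len(summer)])
    rank_sum_winter = sum(ranks[len(summer):])
    h = 12 / (n * (n + 1)) * (
        rank_sum_summer ** 2 / len(summer) + rank_sum_winter ** 2 / len(winter)
    ) - 3 * (n + 1)
    # Standard correction for tied observations.
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    h = max(h / correction, 0.0)
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(h / 2))
    return h, p_value

# Hypothetical daily concentrations, for illustration only.
h_stat, p_val = kruskal_wallis_two_groups([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(h_stat, p_val)
```

For more than two groups the chi-square survival function with the appropriate degrees of freedom would be needed instead of the closed form used here.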
A correlation analysis was performed to assess the presence of redundant features within both datasets. Interestingly, the highest correlation value was found between the mean global radiation and the minimum relative humidity; otherwise, no statistically significant correlation was detected.
At this stage we performed PCA on both datasets, meteorological and air quality, to investigate whether any variable can explain most of the variability of each dataset. The PCA results for the 7 selected meteorological variables are presented in Figure 3, panel A, for the summer and winter seasons separately. The scree plot shows the presence of a principal component which dominates the data during the summer seasons. This component explains of the total variance of the data.
An analogous analysis was performed for air quality; see Figure 3B. The first PC of the air quality dataset explains and for the summer and winter seasons, respectively.
In both cases, at least the first 3 PCs have to be considered to account for at least of the variance. Since the weights of all variables in the first 3 PCs are not negligible, we conclude that there is no significant benefit in excluding any variable from further analyses.
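The scree-plot reasoning above can be sketched as follows. This is a minimal example, assuming that the variables are standardized before PCA (the text does not state this) and using synthetic data; the function names and the threshold passed to `components_needed` are hypothetical and do not reproduce the values of the analysis.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    # Assumption: standardize so that variables with different units are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def components_needed(ratio, threshold):
    """Smallest number of leading PCs whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratio)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))

# Synthetic stand-in for a season of 7 meteorological variables.
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 7))
ratio = explained_variance_ratio(demo)
print(np.round(ratio, 3), components_needed(ratio, 0.7))
```

The values returned by `explained_variance_ratio` are the heights of the bars in a scree plot, sorted from the first to the last component.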
3.2. Insights from Linear Models
We explored the existence of a linear relationship between meteorological and air quality data. To this aim, we first evaluated the pairwise Pearson's correlation between meteorological and air quality features; the results are shown in Table 3. Correlations were less than moderate, the highest value being 0.44. Accordingly, for both seasons, no linear relationship was identified between any single meteorological variable and any single air quality variable.
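A pairwise correlation table of this kind can be computed as in the sketch below, where rows index meteorological features and columns index air quality features. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def cross_correlation(met, air):
    """Pearson's r between every column of met and every column of air."""
    met = np.asarray(met, dtype=float)
    air = np.asarray(air, dtype=float)
    # Population z-scores; their mean product is Pearson's r.
    z_met = (met - met.mean(axis=0)) / met.std(axis=0)
    z_air = (air - air.mean(axis=0)) / air.std(axis=0)
    return z_met.T @ z_air / len(met)

# Toy data: one meteorological column, two air quality columns.
met_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
air_demo = np.column_stack([2 * met_demo[:, 0], -met_demo[:, 0]])
table = cross_correlation(met_demo, air_demo)
print(table)
```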
We also used a linear model to test for the presence of multivariate relationships between meteorological features, used as independent variables, and air quality features, each used as the dependent variable in 6 different models.
In Table 4 we report, for each air quality variable, the correlation between the values predicted by the linear model and the values measured at the monitoring station. The whole dataset was used for training and validation. Even in this case, we observe a poor relationship between the meteorological conditions and air quality.
A poor correlation is obtained for all air quality features. In terms of RMSE, the worst performance is obtained by with , while for the other pollutants, we obtained on average . These findings suggest that linear models cannot model the relationship between meteorological conditions and pollution, if any.
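The multivariate linear model can be sketched with ordinary least squares, fitting one model per air quality variable and scoring it on the same data by correlation and RMSE, as described above. This is an illustrative sketch on synthetic data; the function names are hypothetical and the fitting routine used in the analysis is not specified in the text.

```python
import numpy as np

def fit_linear_model(X, y):
    """Least-squares coefficients, with the intercept as first entry."""
    design = np.column_stack([np.ones(len(X)), X])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients

def predict(coefficients, X):
    return np.column_stack([np.ones(len(X)), X]) @ coefficients

def rmse(measured, predicted):
    return float(np.sqrt(np.mean((np.asarray(measured) - np.asarray(predicted)) ** 2)))

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic example: 7 predictors, one target that is exactly linear in them.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 7))
true_weights = np.arange(1.0, 8.0)
y_demo = 2.0 + X_demo @ true_weights
coef = fit_linear_model(X_demo, y_demo)
y_hat = predict(coef, X_demo)
print(pearson(y_demo, y_hat), rmse(y_demo, y_hat))
```

On real data the relationship is not exactly linear, so the correlation drops below 1 and the RMSE is strictly positive.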
Finally, we investigated the existence of a linear relationship using CCA. Combining 6 air quality and 7 meteorological features, we obtained 6 canonical components. They are the 6 most correlated pairs (x, y), where the first component x is a linear combination of air quality features and the second component y is a linear combination of meteorological features. The results are presented in Table 5. These results show that the best linear correlation value is during summer and during winter. Both results indicate that, even if we globally consider the meteorological and the air quality data, any relationship between the two datasets is non-linear.
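For reference, canonical correlations can be computed with a QR decomposition followed by an SVD, as in the sketch below. This is one standard numerical route to CCA and is not necessarily the implementation used in the analysis; the function name and the synthetic data are hypothetical, and both matrices are assumed to have full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the column spaces of X and Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases of the two centered column spaces.
    q_x, _ = np.linalg.qr(X)
    q_y, _ = np.linalg.qr(Y)
    # Singular values of the cross-product are the canonical correlations.
    singular_values = np.linalg.svd(q_x.T @ q_y, compute_uv=False)
    return np.clip(singular_values, 0.0, 1.0)

# Synthetic stand-in: 6 air quality and 7 meteorological features.
rng = np.random.default_rng(2)
air_demo = rng.normal(size=(200, 6))
met_demo = rng.normal(size=(200, 7))
# Make one meteorological column an exact linear combination of air quality.
met_demo[:, 0] = air_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0, 1.0])
rho = canonical_correlations(air_demo, met_demo)
print(np.round(rho, 3))
```

With 6 and 7 features the function returns min(6, 7) = 6 values in decreasing order, matching the 6 canonical components mentioned above.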
3.4. Air Quality Data Predictions
The result obtained with RF trained on the whole dataset suggests that an RF model could be used for prediction purposes. We therefore explored the possibility of modeling the air quality using meteorological predictors. In this case, we adopted a 5-fold cross-validation framework repeated 100 times. In order not to alter the configuration of the RF model investigated previously, we performed the analysis separately for the summer and winter periods.
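The resampling scheme can be sketched as a generator of train and test indices, using the standard library only. The function name and its parameters are hypothetical; the model fitted on each split (an RF in the analysis) is deliberately left out.

```python
import random

def repeated_kfold(n_samples, n_splits=5, n_repeats=100, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition at every repetition
        for fold in range(n_splits):
            test = indices[fold::n_splits]
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test

# 5 folds repeated 100 times give 500 train/test splits.
splits = list(repeated_kfold(n_samples=23))
print(len(splits))
```

Each split would then be used to fit the model on `train` and to score it on `test`, and the distribution of the 500 scores gives boxplots such as those described below.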
Figure 4 shows both the summer and winter results. As expected, we observe a performance deterioration with respect to Table 6. Except for and benzene, correlations exceed . In terms of , performance for all pollutants remains stable, except for . Interestingly, despite the cross-validation, the variability of the performance, measured as the interquartile range of the boxplots, is extremely small, suggesting particularly robust results. Additional figures and tables with the results of our analysis are reported in the Supplementary Materials section.
The previous results demonstrated how meteorological conditions at time t can be suitably used to predict the air quality at the same time t. Here, we address the problem of forecasting air quality at future times . In particular, using the numerical weather prediction model WRF, we investigated three distinct cases: h, 1, 2, and 3 days in advance, respectively. Furthermore, we compared results obtained using data with spatial resolution d01 (16 km) and d02 (4 km). Finally, as the geographical coordinates of the ground station do not coincide with a node of the grid, we used the GrADS software to allow the comparison and applied post-processing techniques to reduce the forecast error for the variables measured at ground level: 2 m temperature, 10 m temperature, 10 m wind direction and speed, and 2 m relative humidity. We assessed the performance in terms of MSE and Pearson's correlation.
The forecast error reduction approach consists of training a specific RF model on the 30 days prior to the day of interest in order to predict the WRF bias in the estimation of the weather variables. This approach proved effective in reducing the average forecast error to the point of making it close to zero. In the case of the 10 m wind speed, in addition to reducing the systematic error, this approach also reduced the error globally.
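The sliding-window logic of this correction can be illustrated with a simplified sketch. Note the substitution: to keep the example dependency-free, the RF regressor is replaced here by the mean bias over the previous 30 days, so the code shows only the windowing and the subtraction of the estimated bias, not the model used in the analysis. The function name and the toy series are hypothetical.

```python
def bias_corrected_forecast(forecasts, observations, day, window=30):
    """Correct forecasts[day] with the mean bias of the preceding window days.

    Simplification: the text trains an RF to predict the bias; here the bias
    estimate is just the mean of past forecast errors.
    """
    start = max(0, day - window)
    past_errors = [
        f - o for f, o in zip(forecasts[start:day], observations[start:day])
    ]
    if not past_errors:
        return forecasts[day]  # no history available, leave forecast as is
    estimated_bias = sum(past_errors) / len(past_errors)
    return forecasts[day] - estimated_bias

# Toy series: the forecast overestimates the observation by a constant 2 units.
observed = [10.0] * 40
forecast = [12.0] * 40
corrected = bias_corrected_forecast(forecast, observed, day=35)
print(corrected)
```

Only data available before the day of interest enter the bias estimate, which mirrors the 30-day training period described above.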
This error reduction occurred for all the considered variables, except for the 10 m wind direction (for the temperature at 2 m, no significant improvement was observed because the forecast was already excellent). However, for the 10 m wind direction, this approach effectively reduced the direction accuracy (DACC). For a detailed description of the forecast error reduction in the predicted WRF variables, see the Supporting Material section.
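The text does not define DACC. A minimal sketch is given below under the assumption that it is the fraction of forecasts whose circular angular difference from the observed wind direction does not exceed a tolerance; the 30 degree default, the function names and the sample values are all assumptions made for illustration.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(a_deg - b_deg) % 360
    return min(diff, 360 - diff)

def dacc(forecast_dirs, observed_dirs, tolerance_deg=30.0):
    """Fraction of forecasts within tolerance_deg of the observation."""
    hits = sum(
        angular_difference(f, o) <= tolerance_deg
        for f, o in zip(forecast_dirs, observed_dirs)
    )
    return hits / len(forecast_dirs)

# Hypothetical directions: 350 vs 10 degrees differ by 20, not by 340.
score = dacc([350.0, 90.0, 180.0], [10.0, 200.0, 180.0])
print(score)
```

The circular difference matters because wind direction wraps around at 360 degrees, so an ordinary subtraction would overstate the error near north.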
Once the WRF prediction mean error was reduced, a new RF model was trained for each of the air quality variables, in the same way as in the previous analysis, with the difference that the measured weather values were replaced by the WRF predicted and corrected weather values. Since the WRF model was used to obtain forecasts 1, 2 and 3 days in advance on two spatial domains with resolutions of 16 and 4 km, respectively, the same number of air quality variable predictions is available.
In panel A of Figure 5, Figure 6 and Figure 7, we show the performance of each of the 9 models in terms of relative RMSE and correlation coefficient on the lower-resolution domain d01 using 1-, 2- or 3-day forecasts, respectively.
In panel B of Figure 5, Figure 6 and Figure 7, we show the performance of each of the 9 models in terms of relative RMSE and correlation coefficient on the higher-resolution domain d02 using 1-, 2- or 3-day forecasts, respectively.
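The relative RMSE is not defined in this section. The sketch below assumes that it is the RMSE divided by the mean of the measured values, which is one common normalization; the function names and the sample values are hypothetical.

```python
import math

def rmse(measured, predicted):
    squared_errors = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_rmse(measured, predicted):
    """RMSE normalized by the mean measured value (assumed definition)."""
    mean_measured = sum(measured) / len(measured)
    return rmse(measured, predicted) / mean_measured

# Hypothetical concentrations: errors of +1 and -1 around a mean of 10.
value = relative_rmse([10.0, 10.0], [11.0, 9.0])
print(value)
```

Normalizing by the mean makes the error comparable across pollutants whose concentrations differ by orders of magnitude.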
The results show that, regardless of the spatial domain and the forecast horizon, the RF models for the prediction of air quality achieve good performance in terms of the correlation coefficient (higher than ) for , , , and . For , , and , the relative RMSE is excellent, as it is lower than , whereas for , despite the good correlation between the predicted and measured values, the relative RMSE is . In the Supporting Material section, we report scatter plots and time series for the four air quality variables that showed good performance. Given the length of the analyzed time window, the time series cover only the first four months of the considered period.