1. Introduction
Harmful air pollution, particularly in densely populated cities, is an unquestionable fact. The growing population of cities and the increasing use of motor vehicles among inhabitants are reasons for the ever-increasing traffic volume and resulting increase in exhaust gas emissions. The expansion of cities and a high density of buildings reduce ventilation in cities. Increased surface roughness results in a decrease in the impact of low wind speeds on the evacuation of pollution. Wrocław currently has 641,600 residents [
1]. It is estimated that about 15,000 vehicles move around the city streets every day [
2]. One of the main air pollutants emitted by cars’ combustion engines are nitrogen oxides: NO
2 and NO
x (sum of nitrogen oxide and nitrogen dioxide). Studying and measuring the impact of traffic intensity and meteorological factors on the nitrogen oxide concentration in the air of the urban agglomeration provides an opportunity to attempt to manipulate traffic so as to reduce the concentration of pollutants, thereby improving the air quality and the quality of life of the residents. Pollution models can support urban managers in taking action to improve air quality in the city [
3,
4,
5,
6].
In the literature, there are several methods for modelling air pollution concentrations. For example, multidimensional regression models are still in use [
7,
8,
9]. The main advantage of linear models is having an interpretable explicit function that can be used to determine the quantitative impact of each predictor on the value of the explaining variable. The extension of linear models includes polynomial forms of functions—in which each variable may be raised to some power [
10,
11] and non-linear behavior can be taken into account. On the other end of this spectrum, there are models that do not require any assumption about a specific analytical function, also called black-box models, such as artificial neural networks [
12,
13], or those combined with multiple regression [
14], single random trees [
15], or more complex structures like random forest (RF) [
16,
17,
18,
19,
20,
21] and boosted regression trees (BRT) [
22,
23]. These models are more computationally advanced and have been successfully used in pollution concentration modeling.
Zhu et al. [
17] developed an RF model based on NO
2 environmental monitoring data and geographical covariates to predict monthly average NO
2 concentration. The RF model shown in this paper reveals performance with a cross validation R
2 of 0.77 and a root mean square error (RMSE) of 11.0 µgm
−3. Araki et al. [
16] developed a spatiotemporal land use random forest model of the monthly mean NO
2 in metropolitan areas of Japan. The authors obtained an R
2 value of 0.79. A model using RF methodology in combination with spatiotemporal kriging was developed by Zhan et al. [
20] to estimate the daily ambient NO
2 concentrations across China based on satellite retrievals and geographic covariates. The model has a prediction performance R
2 of 0.62 (RMSE = 13.3 µgm
−3) for time (daily) and an R
2 of 0.73 (RMSE = 6.5 µgm
−3) for spatial predictions. The literature also contains examples of the use of the RF method in combination with techniques other than spatial modeling, for example with the evolving differential evolution method [
24]. Kamińska [
25] used an RF model to predict daily minimum, average, and maximum daily NO
2 concentrations. The best fit was obtained for a daily average model where
,
, and
. The accuracy of the models generally decreases as the data frequency increases and the research period extends. Increasing the variation in hourly values makes it difficult to predict their values effectively. For hourly data, the typical model fit drops to an R
2 of around 0.5 [
21]. Kamińska [
26] carried out a modification of the RF model which improved the quality of fit to R
2 = 0.82, although there were difficulties predicting future values, and the authors hypothesized that past predictor values may have a greater impact on the current pollution concentration than actual ones.
Studies on the impact of traffic flow and meteorological conditions in Wrocław have already been carried out. In [
18], the influence of nine predictors on nitrogen oxide and particulate matter (diameter less than 2.5 µm) concentrations was presented. Nine different models corresponding with nine time-sets of cases were defined, depending on the time for which the analysis was performed. Hourly data (in the years 2015–2016) were considered in one of them. The others were warm and cool (heating) seasons, working and non-working days, and the seasons of the year. The researcher showed that traffic flow intensity has the greatest impact on NO
2 and NO
x concentrations, with engines being the biggest source of emissions. The speed and direction of the wind, responsible for the evacuation of pollution, were factors with about half the importance of traffic flow. In a study by Zhang et al. [
9], however, the values of all predictors and explained variables were considered at the same time (
t).
Models taking into account the past as well as the current values of the predictors have been mainly used to study the impact of pollution concentration on human health and life, in which lagged variables take into account the harmful exposure duration. The health effect of exposure to high particulate matter species using lagged variables has been studied [
27,
28,
29,
30]. Chemical reactions in the atmosphere between primary and secondary pollutants are more intense the longer favorable conditions persist, regardless of the duration of these reactions. It can be assumed that in analyzing instantaneous concentration values, not only the current values of predictors, but also values from previous moments play an important role. Therefore, when modeling concentrations of pollutants, the values of ambient factors (predictors) should also be taken into account, not only at the current moment (
t) but also at previous moments (
t-1,
t-2,
t-3, etc.).
Classically, this problem is solved by simply adding new variables to the set of predictors with a delay of 1, 2, 3, etc. This method has two basic disadvantages: firstly, it is not known how far back the lagged variables should be created, and secondly, creating a set of variables for each delay multiplies the number of explanatory variables significantly, increasing the calculation time and lowering the quality of the interpretation. In [
31,
32], the authors proposed multi-objective optimization algorithms (MOAs) to take into account the temporal components of the data. In [
31], in particular, a three-objective optimization was developed for polynomial regression, in which the maximum exponent, the optimal delay, and the regression coefficient were simultaneously optimized for a better match. Allowing more than one degree of a polynomial makes the functions and models more flexible and allows more in-depth analysis, which can capture more complex underlying processes. To assess the influence of the variables’ delay, we used an RF algorithm with lagged variables whose delay was optimized, and we compared it to an RF developed with the original variables but without delay.
The goal of this paper is to determine the impact of accounting for past values of the explanatory variables on the accuracy of the model and their importance in the model. The comparison is made on the basis of hourly data covering three full years, from 2015 to 2017.
2. Materials
We performed numerical analyses using data from Wrocław (western Poland). The data covered a full three years, from 2015 to 2017.
Traffic data were provided by the Traffic and Public Transport Management Department of the Roads and City Maintenance Board in Wrocław. The data contain counts of all vehicles (cars, buses, trucks, etc.) passing through the measurement intersection in a given traffic lane. The numbers of vehicles are recorded with a camera at 15-min intervals. In order to maintain equal time intervals for all factors, the values were aggregated into hourly counts. This operation reduced the noise while maintaining the characteristics of the original distribution. Out of the set of 26,304 cases, 212 gaps and 8 clear outliers resulting from failures of the measuring system were removed. Traffic flow is characterized by daily periodicity (
Figure A1). There are two peaks: in the morning from 7:00 to 8:00 a.m., and in the afternoon between 3:00 and 5:00 p.m. The daily maximum of vehicles passing through the intersection during the three years under consideration was 6713 (
Table 1). The annual average daily traffic at this intersection amounts to 65,470 vehicles.
Meteorological hourly data were provided by the Institute of Meteorology and Water Management (IMGW) at only one station in Wrocław, located on the outskirts of the city 9 km from the intersection in a straight line (GPS coordinates: 51.1050N, 16.9000E; height 120 m a.s.l.). The meteorological data set contains air temperature, solar duration, wind speed, air pressure, and relative humidity. Clear seasonal variations in temperature can be observed which are characteristic of a transitional climate type subject to both oceanic and continental influences. The annual average air temperature in 2015–2017 in Wrocław was 10.7 °C (
Table 1). The air temperature in the study period varied from slightly below −10 °C in winter to just over 30 °C. The average wind speed in Wrocław was 3.1 ms
−1 during the study period. The city did not experience very strong winds, as the maximum observed was 19 ms
−1 for one hour at 10 p.m. on 23 July 2017. Prevailing westerly and northwesterly winds accounted for about 50% of all winds.
Air pollution data are collected by the Provincial Environment Protection Inspectorate and measured at hourly intervals. The measuring station (GPS coordinates: 51.0864N, 17.0127E; height 125 m a.s.l.) is located in the direct vicinity of the intersection with traffic measurement (30 m from the middle of the intersection). It can therefore be concluded that the distance between the pollutant measuring point and the source (intersection) does not play a significant role. There were 117 h (cases) of missing data during the study period, which were removed from the data set. The concentrations of NO
2 and NO
x reveal a clear daily (
Figure A2) and seasonal (
Figure 1) variation. The daily peaks roughly coincide with the peaks of traffic flow. The highest concentration variation occurs in autumn. This is particularly noticeable for NO
x. The nitrogen oxide concentrations in summer are the least differentiated, which means that the proportion and diversity of NO increases in summer. NO
2 concentration shows little annual seasonality, whereas NO
x compounds display significantly higher values in winter and autumn (hourly average: 160.6 and 163.8 µgm
−3, respectively) than in summer (116.7 µgm
−3). The maximum values of both of pollutants were reported in autumn (231.6 and 1728.0 µgm
−3, respectively).
4. Results and Discussion
4.1. Lag Determination
Using multi-objective optimization, we determined the function that described the dependence of NO
2 and NO
x concentrations on meteorological factors and traffic flow. We assumed a maximum allowable power of the variable to be 3. As part of the process, we determined the delay (lag), the regression coefficient, and the power of each variable to maximize the fit of the model to real data. One should remember that the power of the variable was designated as the lowest possible one guaranteeing the best fit. In other words, the algorithm allowed a higher power than the first one, but the simplest (for interpretation) form of the function was preferred. Based on a 10-fold cross-validation process and on the selection of the most appropriate interpretation (in terms of the phenomena that occur in the atmosphere), we obtained linear functions with the delays presented in
Table 2. The fact that a linear function was obtained proves that the relationship is indeed linear and not of a higher degree.
For both NO2 and NOx, traffic flow from one hour prior has the largest influence on current concentration. This is the time that elapses from the emission of pollutants in the vicinity of the intersection until the pollution cloud reaches the sensor which measures the concentration of nitrogen oxides. The lower the wind speed is, the longer this time is. It should be remembered that the time for analysis purposes is rounded off to one hour. Therefore, a delay of 31 min and a delay of 1 h and 29 min will be both identified in the model as a delay of 1 h.
Wind speed has an impact on the evacuation of pollution. The stronger the wind speed, the more intense the evacuation and the lower the pollution concentration. Due to the distance of 9 km between the intersection and the meteorological station, the effect of wind speed is delayed by 2–3 h. This is a consequence of the time needed for the air masses to reach the air quality measurement station. In Wrocław, westerly and northwesterly winds prevail, i.e., blowing from the meteorological station to the city center (
Figure A3). At an average wind speed of 3.1 ms
−1 covering a distance of 9 km, taking into account the roughness of urban buildings, takes from two to three hours. Delays are determined with an accuracy of one hour so the difference in values may be due to rounding. The actual air temperature is positively correlated with the current NO
2 and NO
x concentrations. In other words, high concentrations of pollutants occur at higher temperatures, and vice versa—low concentrations occur at lower temperatures. In addition, during the day—when the temperature is higher—the concentrations of pollutants are also higher. The occurrence of increased NO
2 concentrations at higher temperatures may be due to thermal decomposition of peroxyacetyl nitrates transported from other regions [
35,
36]. However, at night—when emissions and the temperature are lower—the NO
2 concentrations are lower. This state is the result of many processes and chemical reactions taking place at night; for example, chemical reactions transforming NO
2 into N
2O
5 or, conversely, NO oxidation with ozone to form NO
2.
Sunshine duration has a very important influence on chemical reactions with nitrogen oxides. Under the influence of sunlight, the Leighton relationship occurs [
37]. NO
2 disintegrates into NO and ozone. At the same time, the reverse reaction occurs. During daylight hours, NO, NO
2, and O
3 concentrations persist in the photostationary state. The time to reach a stationary state depends on the NO
2 concentration and ranges from several minutes to tens of minutes [
38]. This phenomenon is confirmed by the received impact of the two-hour delay in sunshine duration. The impact of the seven-hour delay in relative humidity on NO
2 concentration may result from a large inertia of humidity terms of the change in cloud cover. Air pressure is another meteorological condition that indirectly influences pollution concentration through the relationship with the type of cloud cover and precipitation (atmospheric fronts). With respect to air pressure, the value with the highest impact on nitrogen oxide concentration is the current one.
NOx is the sum of NO and NO2 concentrations. Consequently, the transformation of nitric oxide into nitrogen dioxide and vice versa does not change the NOx concentration. Unlike NO2, sunshine duration has the strongest impact on NOx with a ten-hour delay—the opposite part of the day. Thus, the night after a sunny day, a relatively high NOx concentration can be observed. The current relative humidity influences the current NOx concentration (delay equals 0). High humidity from cloud cover and or precipitation prevents the chemical reactions between NOx and volatile organic compounds (VOC) from occurring. Thus, nitrogen oxides float in the air, increasing the concentration.
4.2. Random Forest Fitting
In the next step, we built two RF models: (1) using
current values of predictors and (2) using
lagged values (values of predictors with a delay) for each pollutant: NO
2 and NO
x. Then, we compared the impact of lag variables on the quality of model fit (
Table 3) and assessed the impact (importance) of each of the factors in both models on the predicted values of the concentration of pollutants. The comparison of model fit quality including only current predictor values broken down by season is presented in
Table 4. For the entire 2015–2017 measurement period, the model containing
lagged variables turned out to be better suited to the empirical data, as indicated by the values of all goodness of fit measures. Both models,
current and
lag, best describe the relationship between pollutant concentrations and meteorological factors and traffic flow for winters. For the winter period (December–February) the greatest improvement in model fit also came after taking into account lag variables, probably because in winter wind speeds are low and the concentration is more influenced by local traffic than by emissions in the surroundings. In addition, stagnation processes and atmospheric inversions also have a significant impact on the retention of pollutants. A slightly smaller improvement in the quality of fit occurred in the autumn. There was no change in the quality of fit for NO
2 modeling during the warmer part of the year (June–August). This was probably caused by the lack of some important factor determining the NO
2 concentration during periods of strong sunshine and high air temperature.
Due to the greater variation in NO
x values (coefficient of variation for NO
2 is 46%, and for NO
x 73%) it is more difficult to predict its values effectively. This is generally indicated by less goodness of fit than for NO
2 (
Table 5). As in the case of nitrogen dioxide, for NO
x the greatest improvement in fit quality occurred in autumn and winter. It is worth noting that taking into account the delay of variables resulted in an improvement in the quality of NO
x model matching in the summer as well. This means that in a situation where the transformations of NO into NO
2 and vice versa are not relevant for the concentration of the pollutant (NO
x = NO + NO
2), determining the correct delay of the variable is an effective way to improve the model.
4.3. Random Forest Variable Importance
The method of determining the relationship between the concentration of pollution and environmental factors implemented in the paper was done to determine the impact of each of the factors. Therefore, feature importance values were used to assess the impact of the factor. In each model,
current and
lag, feature importance values for each predictor were determined (
Table 6). An analysis was also made by season. The highest influence came from traffic flow, the only predictor related with the source of pollution (emission). Wind speed has about half as much of an influence on the concentration of nitrogen oxides in the air—the only factor among those studied which directly affects the evacuation of pollution. These findings are in accordance with analyses performed for other regions, e.g., [
21] in Madrid or [
8] in Oslo. Other factors (air temperature, relative humidity, sunshine duration, and air pressure) indirectly affect the final concentration value. Their impact is therefore significantly smaller.
The change of feature importance as a result of taking into account variable delays is presented graphically in
Figure 2. It can be seen that delays in the predictors for NO
2 and NO
x have different impacts on the importance of dependent variables. The bars represent the values of the relevant differences in feature importance between the current and lag models. Optimal lags changed the significance of variables in the NO
2 and NO
x models. Only traffic flow with a one-hour delay is still the most important factor determining the concentration of both pollutants.
When drawing conclusions about changes in the importance of variables, remember that feature importance values are related to the most important variable that has been assigned the value of 100 (here, it was always traffic flow). Therefore, a change in feature importance value can only be assessed as a change in relation to the significance of traffic flow. Factoring in a three-hour delay in wind speed reduced the significance of this variable in relation to traffic flow for emissions in NO2 models. However, it cannot be unequivocally stated whether it was the load of pollutants with a one-hour delay that grew in importance, or whether the intensity of evacuation was less intense. Relative humidity increased significantly when it was introduced to the model with a delay of seven hours. This means that it is important in modeling to recognize the inertia of the response of individual meteorological factors to current weather conditions. An important role in NO2 concentration modeling is played by the chemical reactions that occur in the atmosphere, which vary during the day due to sunlight and at night or on a cloudy day. Determining the optimal delay allows one to identify these relationships. The importance of air temperature, sunshine duration, and air pressure does not change after factoring in delays. Markedly different observations of the dependencies between feature importance in the lag modeling of NOx do occur. The most visible changes were observed in the importance of wind speed. For each whole season—especially summer—wind speed became a more important variable. Therefore, correctly taking into account the lagged variables shows that in NOx modeling wind speed is more important than what was suggested by the basic current model. An interesting phenomenon appeared for the importance of sunshine duration. When analyzing the entire research period, the impact of this variable in both models (current and lag) was the same in relation to traffic flow, even when separated by season. It turns out that this is the outcome of a significantly higher impact in autumn (about twice as important) than in spring and summer. It can be supposed that extremely high NOx concentrations in the afternoon and evening in the autumn are related to the sunlight that day. A delay in relative humidity did not affect the importance change in relation to traffic flow.
Following the above argumentation, it can therefore be concluded that determining the optimal delays (lag variable) of environmental factors in the modeling of pollutant concentration significantly increases the accuracy of the model and is confirmed by the physicochemical developments on the atmosphere. The goodness of fit values of the presented models, as to orders of magnitude, are similar to those reported by other researchers. For example, for average monthly NO
2 concentration, Zhu et al. [
17] obtained an RMSE of 11.0 µgm
−3 and Zhan of 13.3 µgm
−3 for daily NO
2 modelling in China; Kamińska and Turek [
25] reported a daily average of 7.47 µgm
−3 for modelling in Wrocław; for the Polish capital—Warsaw—Holnicki et al. [
39] obtained a normalized mean square error of 0.070. The application of the optimal lag variable models presented herein enhanced the model fit, as measured by RMSE, from 16.2 µgm
−3 to 15.4 µgm
−3 for the entire three-year period and from 14.9 µgm
−3 to 13.6 µgm
−3 for winter. The R
2 of 0.56 (10% greater than without lag variables) for NO
2 confirms the effectiveness of the model and provides better accuracy than that presented by, for example, Laña [
21] (
).
In full generality it can be concluded that determining the optimal delay for environmental variables and including such lag variables as predictors increases the accuracy of the model. The only exception is the summer season in the modeling of NO2 concentration. Despite the changes identified in feature importance in comparison to the current model, the model fit did not increase. This is probably due to the lack of some other factors which significantly affect the concentration of NO2 during periods of intense sunlight and high temperature; these may include rainfall, solar intensity, radiation, or concentrations of pollutants that react intensively with nitrogen oxides, such as ozone. However, such analysis delves into the details of atmospheric chemistry, which was not the purpose of this study. Further studies should consider extending the set of predictors with new variables.
The method for determining the optimal delay for each independent variable and input of this lag variable model is a very general method; it may be utilized in any air pollution model regardless of the factors and type of pollution under study. Conclusions regarding the linearity of each of the factors in this paper, however, are local and refer only to the conditions considered herein. Nevertheless, the general, most important goal is to obtain ideally the first power of all variables. The impact of climatic conditions (here, the seasons) can also be regarded as significant in full generality. The results suggest that in similar studies for other locations or pollutants, the seasonal variability of meteorological conditions should be considered, and that the impact of factors determining air pollution should be assessed with this variability taken into account.
The idea of determining the effect of delayed values (from a few hours prior) of the key factors on the concentration of pollution is very general. As a mathematical method, it can be applied to any pollution and any set of explanatory variables. The example of the specific intersection in Wrocław presented herein leads to locally valid conclusions. The specific characteristics of the wind direction distribution (here, winds parallel to the intersection’s axis in the direction of the air pollution sensor predominate) suggests using a delay in relation to the time of air mass movement and pollution clouds. The consequence of the significant distance between the intersection and the meteorological station, as well as the city-specific buildings (terrain roughness), is a delay in the variable wind speed. These observations confirm the effectiveness of the model and the correctness of factoring in delays in such modeling. For a different location, the delays would likely be different because they would result from the specific nature of that location. Nevertheless, the interpretation and their impact on the quality of the model are equally valuable.
4.4. Limitations
The presented methodology is subject to certain limitations. In a RF regression problem, the range of predictions values is bound by the highest and lowest labels in the training data. Due to the RF construction consisting in averaging the predicted values from all constructed random trees, this model does not work in predicting extremely high values (concentration peaks) when we do not take into account the past values of the dependent variable (
Figure A4). Although the distance between the intersection and the meteorological station (9 km) was included in the lag model in the form of delays, which can be clearly seen in the example of wind speed the relationship between meteorological factors and air pollution concentration might be weakened by this long distance considering the footprint of sensors, especially when considering the heterogeneity of the urban fabric [
40]. It should be remembered that each of the mathematical methods used (MO and RF) does not take into account physical and chemical phenomena occurring in the atmosphere in the process of creating/optimizing. Nevertheless, the obtained results (optimal delays and importance of variables) are consistent with the phenomena actually taking place in the atmosphere.
5. Conclusions
The paper presents an assessment of the impact of finding the optimal delay of an independent variable on a model of air pollution concentration (NO2 and NOx) in a street canyon. Based on the supposition that the current concentration of pollutants can be more strongly influenced not by current, but past factor values, we proposed a method for determining delays for this factors. We have developed a three-object optimization method to determine a polynomial function (with the power limited to three) that guarantees the best possible fit to the empirical data. For both NO2 and NOx compounds, the linear function proved to be the most appropriate. For both pollutants, a one-hour delay for traffic flow and a two- or three-hour delay for wind speed had a larger impact than the current values. Likewise, past sunshine duration values had a greater impact on current pollutant concentration values. For NO2, this delay is two hours due to NO–NO2 transformations. For NOx, the delay is 10 h, which indicates that sunshine has a significant effect on nighttime concentration values. The delays determined by this analysis were used in a random forest model. The designated lag variables were taken as predictors. We developed four random forests: current models, using just the current values of the predictors, and lag models, using lagged variables, for NO2 and NOx separately. Considering individual seasons, the greatest changes in the importance of the delayed variables in NO2 forecasting occurred in autumn. Wind speed from 3 h ago turned out to be a relatively less important variable in the lag model than the actual wind speed in the current model. Relative humidity from before 7 h in the lag model has gained the most in importance. In forecasting NOx concentrations in autumn, the greatest increase in importance in the lag model compared to the current model occurred for the sunshine duration 10 h ago. In summer, the greatest increase in the impact of wind speed from 2 h ago was observed, compared to the actual values in the current model.
Taking into account factors in the form of lag variables also influenced the importance of variables, i.e., their impact on the level of pollution. For NO2, the wind speed delay decreased its importance and relative humidity increased its importance in relation to traffic flow. For NOx, a three-hour wind speed delay increased its importance in relation to traffic flow. Generally speaking, the method is universal. Detailed conclusions depend on local meteorological, topographical, and traffic conditions.