1. Introduction
Interest in air pollution is growing steadily and has led to the birth of the One Health paradigm. This paradigm studies the relationship between human, animal and environmental health and represents a new frontier for the study of complex diseases, in which the connections with environmental conditions, including pollution, are evaluated [1].
In response to the complex challenges posed by air quality, scientists have increasingly relied on satellite and climate reanalysis data. These data provide a global view of atmospheric conditions, which makes them indispensable for assessing the dispersion and density of pollutants, especially in areas where on-site monitoring is insufficient; however, they present some critical issues [2].
A notable discrepancy arises when comparing satellite data with ground-level measurements of air quality [
3]. The perspective from space offers a macroscopic view that may not capture the fine-grained variations in pollution levels experienced at ground level. This discrepancy between satellite and ground-based measurements raises questions about the accuracy and applicability of satellite data for air quality monitoring.
Some satellite missions are dedicated to the detection of pollutants, such as Sentinel-5P of the European Earth Observation Program Copernicus. The Copernicus Sentinel-5 Precursor mission, launched on 13 October 2017, is the first Copernicus mission dedicated to monitoring Earth’s atmosphere. Thanks to the TROPOspheric Monitoring Instrument (TROPOMI) spectrometer, the Sentinel-5P mission provides observations of key atmospheric constituents (such as O₃, NO₂, SO₂, CO, CH₄, CH₂O, aerosols and clouds) at the level of the troposphere [4]. However, the measurements provided by Sentinel-5P, which have a spatial resolution ranging from 1.1 to 5 km, are column concentrations and are therefore expressed in mol/m², unlike ground measurements, which are expressed in μg/m³. Furthermore, numerous effects related to traffic, the presence of industries and the nature of the territory alter the surface concentration of certain pollutants and are not directly observable from satellite [5]. Integrating Sentinel-5P measurements with an atmospheric reanalysis of the global climate, namely ERA5, can allow the creation of an improved model to estimate surface level concentrations.
In general, air quality monitoring at surface level is conducted by special government agencies. In Italy, the environmental monitoring is conducted by the Regional Environmental Protection Agency (ARPA) [
6]. ARPA has several hundred monitoring stations throughout Italy that are responsible for the hourly monitoring of various air pollutants, including
,
,
,
,
. The main problem of these monitoring stations is their insufficient number to cover the entire Italian territory.
Our work aimed to create a model for estimating the daily ground level concentrations of air pollutants at municipal scale, using satellite, meteorological and geographical data over the period 2019–2022. To this end, we used artificial intelligence techniques to build the model and Explainable Artificial Intelligence (XAI) to interpret the results. The model, based on ensemble algorithms, was trained using data from 337 ARPA control units, distributed over 4 different Italian regions and considered as the ground truth of our framework. Panel A of Figure 1 shows the Italian territory with all the considered municipalities; panel B displays the control units used in our analysis. Municipal areas range from 1 to 1287.24 km² with an average of 37.3 km².
We compared our findings with the predictions provided by Copernicus Atmosphere Monitoring Service (CAMS) global reanalysis dataset [
7]. CAMS [
8] is a service implemented by the European Centre for Medium-Range Weather Forecasts (ECMWF), based on a variety of ground level and satellite retrieved data. Its purpose is to provide continuous information on atmospheric composition including total column values for
and
and surface concentrations of
and
.
The proposed model, optimized at the municipal level, could facilitate One Health studies as well as support local and national stakeholders, agencies, and policymakers.
2. Materials and Methods
The goal of our study was to develop a model to estimate ground level air pollution in Italy at the municipal level from 2019 to 2022 through heterogeneous data such as satellite, meteorological, geographical and social data. In particular, we focused on the estimation of ground level concentrations of 4 air pollutants, namely
,
,
and
, through a machine learning approach, as summarized in
Figure 2. After a preprocessing phase, we selected the ML algorithm with the best performance among a linear model, Random Forest and XGBoost, by means of a five-fold cross-validation procedure. Then, we implemented a feature importance procedure based on SHapley Additive exPlanations (SHAP) values to assess the role of each feature in the model. We collected different types of data for the construction of the machine learning model: satellite, meteorological, ground pollution, geographical and social data. All data were preprocessed to a daily time granularity at the municipal scale, covering the years from 2019 to 2022 for a total of 8092 Italian municipalities.
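The selection step described above can be illustrated with a minimal sketch of five-fold cross-validation. This is not the authors' implementation: the two candidate models (`fit_mean`, `fit_linear`) are toy stand-ins for the linear model, Random Forest and XGBoost, and the data, function names and random seed are assumptions made only for the example.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def cross_val_rmse(fit, X, y, k=5, seed=0):
    """Average held-out RMSE of a model over k folds.

    `fit(X_train, y_train)` must return a function mapping one sample to a prediction.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        test = set(held_out)
        train = [i for i in range(len(X)) if i not in test]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(rmse([y[i] for i in held_out], [predict(X[i]) for i in held_out]))
    return sum(scores) / k


def fit_mean(X, y):
    """Toy baseline (assumption): always predict the training mean."""
    mean = sum(y) / len(y)
    return lambda x: mean


def fit_linear(X, y):
    """Toy one-feature least-squares regression (assumption)."""
    xs = [x[0] for x in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, y)) / sum((a - mx) ** 2 for a in xs)
    return lambda x: my + slope * (x[0] - mx)


# Synthetic data, for illustration only.
X = [[float(i)] for i in range(50)]
y = [2.0 * x[0] + 1.0 for x in X]

candidates = {"mean": fit_mean, "linear": fit_linear}
scores = {name: cross_val_rmse(fit, X, y) for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # model with the lowest cross-validated RMSE
```

The model with the lowest average held-out error is retained; in the study this role is played by XGBoost.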
2.1. Sentinel-5P Data
Satellite data refers to information collected from Earth-observing satellites orbiting around our planet. These satellites are equipped with various sensors and instruments that capture a wide range of data, including atmospheric composition, meteorological, and environmental parameters. Satellite data has become pivotal for monitoring and understanding Earth’s dynamic processes, climate change, and environmental trends.
Copernicus Sentinel-5P mission is part of the European Space Agency’s Copernicus Earth Observation Program, which aims to provide open and free access to environmental data for a multitude of applications [
9]. Sentinel-5P specifically focuses on monitoring the Earth’s atmosphere and plays a crucial role in tracking air quality and atmospheric composition. Sentinel-5P is equipped with a state-of-the-art spectrometer called TROPOMI (Tropospheric Monitoring Instrument). TROPOMI can measure a wide range of atmospheric gases with a spatial resolution of
×
km
2, a swath width of 2600 km and an overpass time over Italy of around 2 p.m. The trace gases it measures include nitrogen dioxide (NO₂), ozone (O₃), sulfur dioxide (SO₂), and carbon monoxide (CO), among others. These measurements are crucial to assess air pollution and greenhouse gas levels and to evaluate their impact on climate and human health [10].
For the construction of the model, we collected daily concentrations of pollutants, namely
and
from the Google Earth Engine [
11,
12]. From the same source we collected the Absorbing Aerosol Index (AAI) [
13], which can be used to determine the presence of UV-absorbing aerosols, such as dust and smoke. This index can be a proxy for the concentration of
and
, and positive values of this index indicate the presence of elevated absorbing aerosols in the Earth’s atmosphere.
The original spatial resolution of these data is × km²; however, Google Earth Engine converts the original L2 data to L3 images using a grid with a pixel size smaller than the actual resolution in order to avoid data loss. The final spatial resolution is then × km².
2.2. ERA5 Data
Climate reanalysis data combine models with past observations collected from a variety of sources (land, ocean, airplanes and satellites) and from instruments with different lifespans, quality and resolution, in order to generate consistent time series of multiple climate variables.
ERA5 stands for the “Fifth Generation European Reanalysis”. It is a project led by ECMWF that aims to create a comprehensive, high-quality dataset of historical and current weather and climate information [
14].
ERA5 utilizes a large amount of observational data including data from satellites, weather stations, aircraft, and more, to reconstruct the Earth’s atmospheric conditions and surface variables. This reanalysis dataset provides a consistent and detailed record of past weather and climate conditions on a global scale, allowing scientists and researchers to analyze long-term climate trends, investigate extreme weather events, and improve climate modeling and forecasting.
From the Google Earth Engine we collected ERA5 data [
15], namely the temperature 2 m above ground, the surface pressure, the u and v components of wind 10 m above the surface and the amount of precipitation. The spatial resolution of these data is
×
km
2 with a daily granularity. Also, from the wind components we calculated wind speed using the classical Euclidean norm.
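As a small illustration of the wind speed calculation mentioned above, the Euclidean norm of the two wind components can be computed as follows. The function name and the example values are assumptions; the original analysis was carried out in R.

```python
import math


def wind_speed(u, v):
    """Wind speed as the Euclidean norm of the eastward (u) and northward (v) components."""
    return math.hypot(u, v)


# Example with made-up components in m/s.
speed = wind_speed(3.0, 4.0)
```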
2.3. ISTAT Data
For a further spatial characterization of the municipalities, we collected social and geographical data from public repositories of the Italian National Institute of Statistics (ISTAT). ISTAT is the primary governmental agency responsible for collecting, analyzing, and disseminating statistical information in Italy [
16]. From the ISTAT repository we extracted 39 features from the 2011 census data [
17], reported in the
Supplementary Materials. Features include altitude, type (coastal, urban, etc…), population density, density of buildings, density of roads and number of workers for each municipality.
2.4. ARPA Ground Data
We used pollution data from ARPA ground stations as labels to train our machine learning model. The environmental quality monitoring conducted by ARPA involves the systematic assessment and measurement of various environmental parameters within individual regions. This monitoring process includes the collection and analysis of data related to air and water quality, soil conditions, noise levels, and other environmental factors.
To train our model, we collected air pollution data from 337 control units located in four regions: Puglia (60 control units) [
18], Lazio (53 control units) [
19], Emilia Romagna (54 control units) [
20] and Lombardy (170 control units) [
21], placed in the Southern, Central, Northeastern, and Northwestern part of Italy, respectively. The set of control units has been chosen to be as heterogeneous as possible. The types of stations are: Traffic, Industry and Background. The areas are: Urban, Suburban and Rural. The data are hourly or daily averages, cover the period between 2019 and 2022, and provide concentrations in
μg/m³ of four pollutants, namely
,
,
and
.
We chose these 4 regions for two reasons: (i) owing to their geographical location, they are representative of the territorial and climatic diversity of Italy; (ii) restricting the analysis reduces computational costs, since processing these data already required several days.
To improve the performance of our framework in estimating ground level concentrations of a pollutant, we also used satellite measurements of the other three pollutants (see
Section 2.1) as independent variables of the model, given the high correlation between the different pollutants (see
Figure S1 of the Supplementary Materials).
3. Data Preprocessing
We followed a preprocessing strategy to handle missing data, to reduce redundant information in our dataset and to address the co-location of data in time and space. Missing data are inherent to satellite observations, since not all Italian areas are crossed daily by the satellite’s orbit. The ARPA data also contained missing values, mainly due to malfunctions or temporary shutdowns of the control units. To overcome this problem, we removed all observations with missing ground level data from the control unit. The percentage of missing values in the ARPA data was 1% for , 19% for , 23% for and 3% for .
As for the data obtained by satellite, these variables were downloaded at level L3, i.e., with pixels that have a QA value >
. The percentages of missing values were
for
,
for
and
for AAI. To encode time-related information in the model we added three features, namely year, month and day of the week. With the exception of year, we converted time variables using cyclic encoding from R’s
lubridate package. Cyclic encoding represents time data in a circular, periodic manner and is common practice in machine learning, since it allows a model to capture recurring patterns within a dataset. For example, if an input feature of the model is the month of the year, ordinal encoding maps each month to an integer between 0 and 11, starting with January; in this encoding January (0) and December (11) are far apart even though they are temporally adjacent. Periodic functions such as sine and cosine are generally used to encode time variables such as day of the week and month [22]. This is often referred to as circular coding or circular representation. In our case, we represented the days of the week as angles:
$$\theta = \frac{2\pi d}{N} \qquad (1)$$
where d is the day of the week, an integer between 0 and 6 starting from Monday, and N is 7. Then we calculated:
$$\sin(\theta), \qquad \cos(\theta) \qquad (2)$$
We repeated the same procedure to encode the months of the year by replacing d in (1) with an integer between 0 and 11, starting with January, and N with 12. In the end, our dataset was composed of 68 features including satellite, meteorological, geographical and social variables.
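The sine and cosine encoding described above can be sketched in a few lines. This is an illustrative Python version, not the lubridate routine used by the authors; the function names are assumptions. The distance helper shows why the encoding is useful: January and December end up as close to each other as January and February.

```python
import math


def cyclic_encode(value, period):
    """Map an integer in 0..period-1 to a point (sin, cos) on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encoding_distance(a, b, period):
    """Euclidean distance between the cyclic encodings of two time values."""
    sa, ca = cyclic_encode(a, period)
    sb, cb = cyclic_encode(b, period)
    return math.hypot(sa - sb, ca - cb)


monday = cyclic_encode(0, 7)               # day of the week, Monday = 0
jan_dec = encoding_distance(0, 11, 12)     # January vs December
jan_feb = encoding_distance(0, 1, 12)      # January vs February
jan_jul = encoding_distance(0, 6, 12)      # January vs July (opposite side of the circle)
```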
A preliminary Pearson correlation analysis highlighted a strong correlation between some features. Therefore, to remove redundant information, we set a correlation threshold of 50% so that no two variables with a correlation greater than this threshold were included in the model, which reduced the final number of features to 32. We selected the threshold that minimized the error of the model. In the
Supplementary Materials we list all features used in the model, including redundant features.
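A correlation-based filter of this kind can be sketched as follows. The greedy order (keep a feature only if it is not strongly correlated with any feature already kept) and the use of the absolute value of the correlation are assumptions, since the text does not state how ties between correlated features were resolved; the feature names and values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features, threshold=0.5):
    """Greedily keep features whose |correlation| with every kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical features, for illustration only.
features = {
    "temperature": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temperature_like": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly collinear with the first
    "wind_speed": [1.0, -1.0, 0.0, -1.0, 1.0],       # uncorrelated with the first
}
kept = drop_correlated(features, threshold=0.5)
```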
When the data from the ARPA control units had an hourly time granularity, we averaged over a daily time window to achieve the granularity of satellite data. Our input data also had different spatial granularity. Since our analysis had the granularity of municipalities, when the input data had higher resolution, we averaged measurements covering the same municipality.
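The temporal alignment of ground data can be sketched as below: hourly records are grouped by calendar day, missing readings are skipped, and the remaining values are averaged. The record layout (ISO timestamp, value or `None`) and the function name are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def daily_means(hourly_records):
    """Average hourly (timestamp, value) records into one value per calendar day.

    Records whose value is None (missing) are ignored; days with no valid
    reading do not appear in the result.
    """
    buckets = defaultdict(list)
    for timestamp, value in hourly_records:
        if value is None:
            continue
        day = datetime.fromisoformat(timestamp).date().isoformat()
        buckets[day].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


# Made-up readings from a single control unit.
records = [
    ("2019-01-01T00:00:00", 10.0),
    ("2019-01-01T01:00:00", 20.0),
    ("2019-01-01T02:00:00", None),   # missing reading, skipped
    ("2019-01-02T00:00:00", 5.0),
]
daily = daily_means(records)
```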
The spatial analysis relied on the R packages gstat, raster, sf and exactextractr. Specifically, the satellite images were downloaded in .tif format from Google Earth Engine and read with the raster package. Each image was then re-projected into the same coordinate reference system (CRS) as the shapefile of the Italian municipalities. Finally, the image values were extracted with the exactextractr package, using the mean as the aggregation function.
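Conceptually, aggregating raster pixels to a municipality with a mean amounts to averaging the pixel values weighted by the fraction of each pixel that falls inside the municipal polygon. The sketch below illustrates only this arithmetic, under the assumption that the coverage fractions have already been computed by a GIS tool; it does not reproduce the exactextractr code used in the study, and the function name and numbers are hypothetical.

```python
def coverage_weighted_mean(values, coverage):
    """Mean of pixel values weighted by the fraction of each pixel inside the polygon.

    Pixels with a missing value (None) or zero coverage are ignored.
    Returns None when no valid pixel overlaps the polygon.
    """
    numerator = 0.0
    denominator = 0.0
    for value, fraction in zip(values, coverage):
        if value is None or fraction <= 0:
            continue
        numerator += value * fraction
        denominator += fraction
    return numerator / denominator if denominator > 0 else None


# Three pixels touching one municipality: fully inside, half inside, and missing.
municipal_value = coverage_weighted_mean([1.0, 3.0, None], [1.0, 0.5, 1.0])
```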
7. Discussion
Our model aims at estimating daily ground level air pollution in Italian municipalities. Our choice of granularity, namely the municipal level, is motivated both by a reduction in model complexity and by our intention to use our results in a future One Health study, where only the municipality of residence is known, as is usually the case in population studies. As we can see from
Table 1, XGBoost is the best model for estimating the four pollutants considered. In addition, this model has the highest computational performance.
The performance of our model appears comparable, or even superior, to that reported in the literature. Stafoggia et al. [
32] applied a multilevel approach to obtain daily maps of
and
in Italy by using the Random Forest algorithm as predictor and Institute for Environmental Protection and Research (ISPRA) monitoring stations as ground truth together with different meteorological, geographical and land use variables. Comparing the errors obtained with a cross validation procedure, we see that the RMSE of their model in predicting
is
μg/m³ in the best case, while our model reaches
μg/m³; for
their error is
μg/m³, while ours is 5.46 μg/m³. It is worth mentioning that the spatial resolution of their estimates is 1 km, which is finer than that of our model.
Cedeno et al. [
33] reported a RMSE value of
μg/m³ when predicting the daily concentration of
in the area of Milan and using ARPA’s control units as ground truth and Machine Learning models.
Silibello et al. [
34] estimated the daily ground level concentration of
and Ozone using the Random Forest algorithm, geographical variables and a model called FARM. Also in this case the spatial resolution of the model was 1 km. The best RMSE values found were
and
μg/m³ for
and
respectively.
Chen et al. [
35] obtained a RMSE of
μg/m³ for the prediction of surface Ozone in a large area of China, using meteorological data between September 2015 and August 2021. As regards
, Chu-Chih Chen et al. [
36] presented a machine learning framework to forecast the monthly
concentrations in Taiwan at different spatial resolutions, obtaining a
of
, comparable with the results of our XGBoost model. Peddle et al. [
37] used Aerosol Optical Depth data to predict concentrations of
and
for six US urban areas: Los Angeles, CA; Chicago, IL; St. Paul, MN; Baltimore, MD; New York, NY; Winston-Salem, NC. This study covered a period between 2000 and 2012 obtaining a performance in terms of
ranging from
to
.
Figure 5 highlights a seasonal trend of the concentration of
and
, pollutants that are particularly related to temperature and urban pollution, as shown by Nguyen et al. [
29], who also emphasised the relationship between the concentration of
and heating systems and population density.
In a study conducted by Di Bernardino et al. [
38] in Italy, the same seasonal behaviour of
and
was found when analyzing control sites in Rome. When analyzing weekly
concentrations, they also concluded that the decrease in
was related to the decrease in urban traffic that typically accompanies the weekend. Another study in Italy by Ravina et al. [
39] confirmed these results. They investigated the
concentrations of two stations in the Turin area and showed the influence of temperature on the
and
concentrations. In particular, by comparing the trends of
concentrations measured by two different stations, they found significant differences during the winter season. This behavior seems to be influenced by the increased traffic volume and home heating.
The connection among
and
concentrations, temperature and population density is confirmed by the results of our SHAP analysis displayed in
Figure 6, which shows the twenty features with the highest SHAP values. As we can see, population density plays a crucial role in predicting
on the ground, surpassing the influence of satellite-retrieved
. However, the urban context seems to be less influential for the prediction of
. Nevertheless, the pivotal role of temperature is confirmed: in particular, high temperatures are associated with higher values of
concentrations and the opposite is true for
, which is expected. In fact,
is a secondary pollutant whose formation is catalyzed by solar radiation [
30]. Satellite measures of
and
concentrations are anti-correlated with each other and play an important role in predicting ground level concentrations of both
and
, as expected [
40,
41].
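For readers less familiar with SHAP, the attributions discussed above are Shapley values: each feature's contribution is its average marginal effect on the prediction over all subsets of the other features. The sketch below computes exact Shapley values for a tiny toy model by brute-force enumeration. It is not the procedure used in this study (practical SHAP implementations for tree ensembles use much faster algorithms), and the model, the feature values and the choice of replacing absent features with a fixed baseline are assumptions for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of prediction `predict(x)` relative to `predict(baseline)`.

    Features outside a coalition are replaced by their baseline value
    (a simplifying assumption). Cost grows as 2^n, so this is only for tiny n.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Toy linear model with three hypothetical features:
# population density, temperature and an irrelevant feature.
def toy_model(z):
    return 2.0 * z[0] - 1.0 * z[1] + 0.0 * z[2]


x = [3.0, 2.0, 5.0]
baseline = [1.0, 1.0, 1.0]
phi = shapley_values(toy_model, x, baseline)
```

For a linear model each attribution reduces to the coefficient times the deviation of the feature from its baseline, and the attributions sum to the difference between the prediction and the baseline prediction.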
The SHAP diagrams for particulate matter, which are shown in Figure 7, indicate that wind speed plays a decisive role in addition to temperature and appears to be anti-correlated with the model results [42]. The role of the wind in moving the dust masses and reducing their concentrations is straightforward. An interesting result is the importance given by the model to the south-north component (wind_v_component) of the wind, which is positively correlated with particulate concentrations (except perhaps for
where positive and negative contributions are mixed) and could indicate the transport of dust from Africa to the Italian regions.
This result is confirmed by other studies. Calidonna et al. [
43] conducted medium-term observations at the GAW regional observatory in Lamezia Terme from 2015 to 2019 to identify dust outbreaks and investigate aerosol properties. They investigated an intense dust outbreak episode in April 2019 as a case study and performed a detailed analysis considering surface and column optical properties, chemical properties, air quality modeling, satellite products and back-trajectory analysis, confirming the role of wind speed as the main cause of dust transportation.
Other meteorological variables that emerge as important in the model are precipitation and pressure; their behaviour supports the soundness of our model. In particular, from the SHAP diagrams we can see that precipitation is negatively correlated with the concentrations of the different pollutants, while pressure is positively correlated. This is a reasonable result, since rain combined with low pressure causes air pollutants to be deposited on the surface and their concentration to decrease [
44].
Time-related features also seem to play an important role within the model. For example, in the estimation of
concentrations, variable
sin.week, which correlates positively with the prediction, could be related to the “weekend effect” [
45], which links the concentration of
with the traffic flow [
46]. The SHAP diagrams show that the use of satellite measurements of
and
in the model to estimate ground level concentrations of the four considered pollutants was important. In contrast, the Absorbing Aerosol Index was not among the most important variables for the prediction of
and
. This result is consistent with the literature. The Absorbing Aerosol Index does not in fact provide a quantitative measure of the concentration of aerosols, but is used for special events such as volcanic eruptions, large dust events and forest fires [
47].
Finally, as mentioned in the introduction, we compared the results of our model with the predictions of the CAMS model, which are available from 1 July 2021. For the comparison, we used linear correlation because
and
measurements provided by CAMS have different units of measurement than ground station measurements, namely (kg/m
2) versus (kg/m
3).
Table 2 shows the values of the linear correlations between the values provided by CAMS, the results of our model and the ground truth provided by ARPA. From these values we can see that, unlike our model’s predictions, CAMS predictions are not significantly correlated with ground station measurements. This low correlation may be due to the granularity of the CAMS model, which has a spatial resolution of
×
km
2; ground stations that are close to each other relative to the CAMS resolution are therefore assigned the same predicted value by CAMS, even if their geographical and meteorological conditions differ.
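The reason a linear correlation can be used when two data sources are expressed in different units is that the Pearson coefficient is unchanged by a positive rescaling of either series. The following sketch checks this numerically with made-up values; the numbers and variable names are illustrative assumptions and are not taken from Table 2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up ground measurements and model estimates in the same units.
ground = [12.0, 30.0, 18.0, 25.0, 40.0]
estimate = [10.0, 28.0, 20.0, 22.0, 41.0]

# The same estimates expressed in units a billion times larger.
estimate_rescaled = [v * 1e-9 for v in estimate]

r_same_units = pearson(ground, estimate)
r_other_units = pearson(ground, estimate_rescaled)
```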
8. Conclusions
We compared three different learning models for the daily prediction of concentrations of , , and of Italian municipalities at the surface level. Our framework incorporates information from heterogeneous data such as satellite, meteorological, geographic and social indicators, as well as control station measurements provided by the Regional Environmental Protection Agency for the period 2019–2022, which we used as ground truth. The XGBoost algorithm had the best performance with an average of . Our results are comparable with or better than those reported in the literature, although some studies present models with a higher resolution than the one used in this study.
Furthermore we evaluated the impact of the different features on the estimation of the concentration of each pollutant through an eXplainable Artificial Intelligence method using SHAP values to improve the interpretability and transparency of our Machine Learning models. The SHAP analysis confirmed some aspects already described in the literature, such as the anti-correlation between wind speed and and dust concentrations, or the positive correlation between temperature and concentrations.
A possible application of our model is the prediction of extreme air pollution events, by combining our procedure with the approach of Varotsos et al. [
48]. They developed a model to forecast pollution extremes in Athens given changes in the dynamics of pollution and using data from ground stations. Their approach was based on fitting the surface concentration of
,
and
to the Gutenberg-Richter law. In addition, they introduced the concept of natural time as opposed to clock time. This concept is based on the observation that temporal fluctuations in time series can be used to quantify long-term dependencies and to differentiate the type of self-similarity within the series. As a result, they calculated the average waiting time between successive extreme concentrations of these three pollutants.
Moreover, our model can be used in One Health cohort studies to assess the impact of air pollution on human health at the municipal level. Future improvements of this model could increase the spatial resolution going from municipalities to distances of the order of kilometers.