1. Introduction
Air pollution and global warming have a great impact on the environment, and have become a major global concern [1]. NO2, which is used as the case study in this paper, is one of the greenhouse gases, and an important indicator of air pollution. It is also a precursor for several harmful secondary air pollutants, such as ozone and particulate matter (PM2.5 and PM10). For this reason, networks of in situ Internet of Things (IoT) sensors were established for monitoring environmental variables [2,3]. IoT sensors are low cost, easy to install and can perform measurements with high temporal resolution [4,5,6,7]. However, today’s networks of IoT sensors do not cover larger areas. Space agencies have also addressed this issue by launching satellites equipped with instruments to observe air pollutants in the Earth’s atmosphere [8]. Satellite measurements ensure large coverage, but their temporal resolution is low. Satellites provide measurements in the form of raster images, associated with various environmental attributes [9]. In this context, the spatiotemporal alignment of IoT and satellite data sources represents the main challenge of low-level data fusion approaches, and can limit the efficiency of higher levels significantly. Past studies addressed the issue of spatiotemporal alignment either by interpolation or by simulation of monitored sensor values to match the spatial resolution of satellite images, and then fed the aligned features into higher-level analytics tools [10].
In comparison to simulations, interpolation approaches are usually less computationally demanding. The most commonly used approaches include simple aggregation of the closest sensor values (i.e., Voronoi Natural Neighbors’ Interpolation), linear and bilinear interpolation, Inverse Distance Weighting (IDW) and kriging [11,12]. The aggregation of the closest sensor values is the most straightforward, as it does not require any additional data processing for estimating target values. For example, a mixed-effects regression model of daily ground NO2 concentrations, built from Aura satellite measurements, demographic and thematic maps (e.g., roads and elevations), and an aggregation of sensor data from the weather station nearest to the given location, was examined in [13]. More recently, Zhan et al. [14] estimated daily NO2 concentrations by additionally considering the daily Planetary Boundary Layer Height (PBLH) and Normalized Difference Vegetation Index (NDVI), while applying co-kriging to interpolate the meteorological data. Alternatively, spatial data alignment of raster data was achieved using interpolation with area weighted averages, while temporal convolution with Gaussian kernels was used to fill the missing values within the satellite images. The interpolated values were used in a combination of random forest and spatiotemporal kriging to estimate the daily pollutant exposure. Improved accuracy, however, was reported by using bilinear interpolation for increasing the resolution of the satellite data. IDW was used for interpolation of missing satellite values, together with kriging-based interpolation of meteorological sensors’ data fed into XGBoost regression [15]. Alternatively, Araki et al. [16] used Aura satellite measurements to estimate monthly ground NO2 concentrations. The data of roads, demography, land use and positions of large combustion sources were considered, in addition to meteorological data. While the satellite data were re-gridded using bilinear interpolation, ordinary kriging was utilized to grid the values of the meteorological sensors’ data. The study proposed the combination of land use regression with the random forest to estimate the target variables, and confirmed that Land-Use Random Forest performs better than land-use regression. Nevertheless, interpolation approaches inevitably introduce inaccuracies into the definition of explanatory variables by neglecting the spatially-dependent variance in their behavior. As these may accumulate within the resulting NO2 data layer [10], significant effort was dedicated to the simulation-based approaches.
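To make the interpolation step concrete, the following is a minimal sketch of Inverse Distance Weighting in its textbook form. It is not taken from any of the cited studies; the function name `idw`, the planar coordinates and the default power of 2 are assumptions chosen for illustration only.

```python
import math

def idw(target, sensors, power=2.0):
    """Estimate the value at `target` from nearby sensor readings.

    target:  (x, y) location of the pixel to be estimated.
    sensors: list of ((x, y), value) pairs (hypothetical layout).
    power:   distance exponent; 2 is a common default, assumed here.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in sensors:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            # The target coincides with a sensor: return its reading directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total
```

Because the weights depend only on distance, such an estimate ignores terrain and other local effects, which is the kind of inaccuracy the paragraph above refers to.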
The numerical-based Chemical Transport Models (CTMs) are the most common amongst different simulation models [17]. They are employed to simulate atmospheric chemistry by dividing the atmosphere into grid cells and defining the behavior of chemical species of interest within them using a numerical model. The behavior of concentration levels may be dependent on various environmental parameters (e.g., wind direction, temperature or humidity), as well as on the characteristics of the considered pollutant [18]. Amongst many CTMs, the Goddard Earth Observing System–Chem (GEOS-Chem) and Weather Research and Forecasting (WRF) are the most popular [19] when considering the fusion of the meteorological sensors’ data. For example, Li et al. [20] estimated ground NO2 concentration levels using GEOS-Chem simulations of meteorological variables and nitric acid surface mass concentrations. They fed the raster layers, together with Sentinel-5 Precursor (Sentinel-5P) NO2 data, NDVI from Terra and Aqua data and a digital elevation model, to a geographically and temporally weighted generalized regression neural network. Alternatively, Qin et al. [21] used the same regression model for an estimation of the ground level NO2 concentrations, based on the simulated meteorological parameters from WRF, together with Aura satellite NO2 retrievals and the interpolated population data. Recent studies also examined the usage of these models for simulating the behavior of target variables directly. Beloconi and Vounatsou [22] examined daily NO2 estimation using GEOS-Chem. Here, the simulation model was constructed to simulate the vertical distribution of NO2 based on retrievals from the Aura satellite. The results were then improved additionally by a Bayesian geostatistical regression model using a variety of other predictors, including land cover, tree cover density, terrain elevation, night-time lights, land surface temperature both day and night, NDVI, data of roads, and meteorological data. Similarly, Yang et al. [23] applied regression on the results of a simulation model for estimations of ground NO2 levels. Here, retrievals of NO2 from the Aura satellite were combined with Aerosol Optical Depth data from the Terra, Aqua, OrbView-2, and CALIPSO satellites within the GEOS-Chem model, while meteorological data were simulated by WRF. Additional predictor variables feeding the supervised forward stepwise linear regression model included position, land use variables, traffic, and wind data. Other types of models and their combinations were studied (e.g., Community Multiscale Air Quality and GEOS-Chem) with a variety of regressions (neural network, random forest, gradient boosting algorithms and a generalized additive geographically weighted model) [24]. However, CTM-based simulation approaches are computationally demanding and difficult to implement, as they require precise definition of the behavior of chemical species within the given grid-cell. Consequently, the results are of low spatial and temporal resolutions [25].
The methods mentioned above perform interpolation in the pre-processing step to fill the missing spatial gaps. The proposed method omits interpolation during the initialization. Instead, the assessment of an environmental variable is performed by an ensemble of regression models, where each regression model performs the interpolation with its own parameters.
The whole process constructs a satellite-like raster image based on the in situ IoT measurements. The resulting image can, therefore, be constructed at any time when IoT measurements are available, whereas other methods offer lower temporal resolutions (typically on the daily, or even monthly, scale). For this, the observed area is partitioned into Voronoi cells, based on the locations of the IoT sensors active at the desired time. Each set of pixels located in a Voronoi cell has its own regression model, which allows better adaptation to the local characteristics. The parameters for the regression models are selected by using the measurements of the neighboring IoT sensors. The nearest neighbor, linear regression, and feed-forward neural network algorithms are used in this paper. Accordingly, the proposed approach brings the following novelties:
a strong theoretical foundation for modeling the relationship between the IoT and satellite data,
the integration of interpolation directly into regression models, yielding a more compact and consistent algorithm,
an ensemble of base regression models constructed by using measurements from the surrounding IoT sensors, and
an increased temporal resolution, dependent only on the sampling rate of the IoT sensors.
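The partitioning of the observed area into Voronoi cells can be illustrated with a minimal sketch. On a pixel grid, assigning every pixel to its nearest active sensor yields a discrete Voronoi partition. The function name `voronoi_cells`, the dictionary layouts and the use of planar Euclidean distance are assumptions made for illustration, and do not reproduce the implementation used in this paper.

```python
import math

def voronoi_cells(pixels, sensors):
    """Group pixel ids by their nearest active sensor.

    pixels:  dict mapping pixel id -> (x, y) centre (hypothetical layout).
    sensors: dict mapping sensor name -> (x, y) of sensors active at the
             desired time (hypothetical layout).
    Returns a dict mapping sensor name -> list of pixel ids in its cell.
    """
    cells = {name: [] for name in sensors}
    for pixel_id, (px, py) in pixels.items():
        nearest = min(
            sensors,
            key=lambda name: math.hypot(px - sensors[name][0],
                                        py - sensors[name][1]),
        )
        cells[nearest].append(pixel_id)
    return cells
```

Each resulting list of pixels would then be served by its own base regression model, so the partition changes whenever the set of active sensors changes.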
The rest of the paper is organized as follows: The details of the proposed method are explained in Section 2. Section 3 describes the observed area and data preparation. Section 4 provides the results of the approach and their evaluation. Section 5 discusses the obtained results and concludes the article.
3. Study Area and Data Preparation
In order to account for various testing conditions, the area of the Republic of Slovenia was used as the observed area. The country covers 20,271 square kilometers, and is known for its geomorphological diversity. This is due mostly to the various natural regions that meet there: the Alps, the Dinaric Alps, the Pannonian Basin and the Mediterranean Basin. Although the area is mostly covered by forests (almost 60%), it contains a variety of pollution sources [31,32]. As seen in Figure 5, the area is covered sparsely by IoT sensors monitoring NO2 emissions. One of the main pollution sources is traffic, as roads that are part of the Trans-European Transport Network, connecting major European and Slovenian cities, cross the country. Ljubljana and Maribor are the most densely populated places in Slovenia [33]. Another polluted area connected to road traffic is the port of Koper, which is an international connection from Continental Europe to the Mediterranean Sea [34,35]. Moreover, the Šoštanj Thermal Power Plant, located in the Šalek Valley, is a large source of air pollution, as it produces approximately 35% of the country’s electricity [36]. To obtain different testing scenarios, eight different areas across the whole country were selected as test cases.
Sentinel-5P is dedicated to the monitoring of the Earth’s atmosphere, and has a revisit rate of less than a day [37]. The measurements are provided to users in the Network Common Data Form (NetCDF) format [38]. Each pixel is equipped with a timestamp, the value of the tropospheric NO2 column, and a quality assurance value, which indicates the validity of the measurement [37]. A quality assurance value of 0 corresponds to an invalid measurement, while a value of 1 represents a measurement with no errors. Only pixels satisfying the quality assurance threshold recommended in [39] were used. Due to the inconsistencies in the satellite pixel position at each revisit time, the observed area was gridded, as seen in Figure 5. The size of the pixels in the grid (in km) was selected based on the typical satellite pixel sizes. The satellite image was cropped and re-gridded to match the observed gridded area. Additionally, data were used from the IoT ground sensors measuring NO2, seen in Figure 5. A unified interface providing various geo-biophysical parameters from sensor networks across Slovenia is available in [40]. The national data provider is the Slovenian Environment Agency [3]. However, other local providers, such as the Šoštanj Thermal Power Plant, also contribute their data to the mentioned platform.
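The quality filtering, cropping and re-gridding of the satellite pixels can be sketched as follows. This is a simplified illustration and not the procedure used in this paper: the function name `regrid`, the tuple layout, the planar coordinates in km, and the averaging of all valid pixels whose centre falls into a grid cell are assumptions. The quality threshold `qa_min` and the cell size `cell_km` are left as parameters, and keeping pixels at or above the threshold is likewise an assumption.

```python
from collections import defaultdict

def regrid(pixels, x0, y0, cell_km, n_cols, n_rows, qa_min):
    """Average valid satellite pixels per cell of a regular grid.

    pixels:  iterable of (x_km, y_km, value, qa) tuples (hypothetical layout).
    x0, y0:  lower-left corner of the gridded area, in km.
    cell_km: edge length of a grid cell, in km (parameter, not a paper value).
    qa_min:  minimum accepted quality assurance value (parameter).
    Returns a dict mapping (col, row) -> mean value of the accepted pixels.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, value, qa in pixels:
        if qa < qa_min:
            continue  # discard low-quality retrievals
        col = int((x - x0) // cell_km)
        row = int((y - y0) // cell_km)
        if not (0 <= col < n_cols and 0 <= row < n_rows):
            continue  # crop pixels outside the observed area
        sums[(col, row)] += value
        counts[(col, row)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}
```

Cells that receive no valid pixel are simply absent from the result, which corresponds to missing satellite values for that revisit.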
4. Results
The proposed approach was implemented on a personal computer with a 6-core CPU, 12 MB of cache and 32 GB of memory. The approach was tested with all three introduced machine learning algorithms.
The implementation of LR was taken from the programming library MLPACK [41]. Furthermore, the feed-forward neural network was implemented using TensorFlow [42]. The regression model was built by setting the hyperparameters epochs to 250, batch_size to 16, and optimizer to Adam. These hyperparameters, along with the selection of the activation function and the number of neurons in the hidden layer, were obtained by using cross-validation with the Grid Search algorithm [43]. The parameter cv, which determines the number of folds, was set to three. Additionally, the input data were scaled to a range between 0 and 1. The target variables in the training dataset, on the other hand, were standardized: the mean value was subtracted, and the result was divided by the Standard Deviation. The calculated outputs were then mapped back to their original range.
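The scaling of the inputs and the standardization of the targets can be sketched in a few lines. The helper names below are hypothetical, and the use of the population Standard Deviation (rather than the sample one) is an assumption, as the text does not specify which was used.

```python
from statistics import mean, pstdev

def fit_minmax(column):
    """Return the (min, max) of a training input column."""
    return min(column), max(column)

def minmax(x, lo, hi):
    """Scale an input value to the range [0, 1] using training statistics."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def fit_standard(targets):
    """Return the (mean, standard deviation) of the training targets."""
    return mean(targets), pstdev(targets)

def standardize(y, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return (y - mu) / sigma

def destandardize(z, mu, sigma):
    """Map a model output back to the original range of the target."""
    return z * sigma + mu
```

The statistics are fitted on the training dataset only and then reused for validation, testing and prediction, so that no information from the withheld data leaks into the model.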
The two tables were aligned temporally, and aggregated by applying the natural join (Equation (17)). The aggregated dataset was split in order to obtain the training, validation and testing datasets (see Figure 6).
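A natural join combines the rows of two tables that agree on all columns they have in common. The sketch below illustrates the operation on rows stored as dictionaries. The function name `natural_join` and the column names used in the usage note (`time`, `iot_no2`, `sat_no2`) are hypothetical and do not correspond to the notation of Equation (17).

```python
def natural_join(left, right):
    """Join two lists of dict rows on all column names they share."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    # Index the right table by the values of the shared columns.
    index = {}
    for row in right:
        key = tuple(row[name] for name in shared)
        index.setdefault(key, []).append(row)
    joined = []
    for row in left:
        key = tuple(row[name] for name in shared)
        for match in index.get(key, []):
            joined.append({**row, **match})
    return joined
```

For instance, joining a table with columns `time` and `iot_no2` to a table with columns `time` and `sat_no2` keeps only the timestamps present in both, which is what the temporal alignment requires.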
The training and validation datasets contained the samples between 1 January 2020 and 28 February 2021. As seen in Figure 7, the training dataset was used to build the prediction model [44]. The validation dataset was applied only during the tuning of the hyperparameters for the regression model built by the NN.
The testing dataset included the measurements from 1 March 2021 up to and including 1 June 2021. This dataset was withheld from the machine learning, and was used to evaluate the regression model’s performance, as performed in [28,45].
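The chronological split described above can be expressed as a simple date filter. The sketch assumes that both date ranges are inclusive and that each sample carries its date under a hypothetical `day` key. How the first range is further divided into training and validation parts is not specified here, so it is not shown.

```python
from datetime import date

TRAIN_VAL_START = date(2020, 1, 1)
TRAIN_VAL_END = date(2021, 2, 28)
TEST_START = date(2021, 3, 1)
TEST_END = date(2021, 6, 1)

def split_by_date(rows):
    """Split samples chronologically into (training + validation, testing)."""
    train_val = [r for r in rows if TRAIN_VAL_START <= r["day"] <= TRAIN_VAL_END]
    test = [r for r in rows if TEST_START <= r["day"] <= TEST_END]
    return train_val, test
```

Splitting by time rather than at random keeps the testing period strictly after the training period, so the evaluation reflects how the model would be used on future measurements.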
The evaluation of the machine learning model was performed by comparing the calculated values and the actual measurements of E (i.e., NO2) from the testing dataset using the Root Mean Square Error (RMSE) metric, defined in Equation (18). The testing dataset included a total of 72,813 valid E values from the satellite pixels in the observed area, which consists of a set of M pixels.
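For reference, the RMSE in its standard form is the square root of the mean squared difference between the calculated and the measured values. The sketch below follows this standard definition. The notation of Equation (18) may differ, and the function and argument names are illustrative.

```python
import math

def rmse(measured, calculated):
    """Root Mean Square Error between two equally long value sequences."""
    if len(measured) != len(calculated) or not measured:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(m - c) ** 2 for m, c in zip(measured, calculated)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Only pixels with a valid measured value should be passed in, so that invalid satellite pixels do not distort the error.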
The accuracy of the base regression models for each machine learning algorithm was evaluated on the test cases (seen in Figure 5) and the whole observed area. Additionally, the execution times were measured for calculating the pixel values within a single image (849 pixels). It should be noted that the measured times include only the prediction phase, whereas the time used for data preparation and the machine learning was excluded. The obtained results are given in Table 3.
As seen in Table 3, the RMSEs varied between test cases. However, the RMSE ratios between them were similar for each base regression model. Likewise, the execution times to calculate the values of pixels in each observed area were also dependent on the regression model. The 1-NN took the longest, while the fastest was LR. The results show that the processing time of the method is under 1 min, which makes the proposed method suitable for practical use. However, the time measurement excluded the data preparation and machine learning, which depend largely on the size of the training dataset and the machine learning algorithm used. Some of the image samples, which were calculated in the absence of the satellite data, are shown in Figure 8.
Additional analyses were conducted on the results obtained by the neural network, as it outperformed the other two algorithms. As seen in Table 4, the RMSE was compared for each test case, based on the number of considered IoT sensors. In some test cases, a specific number of IoT sensors was never considered. This occurred either due to invalid measurements or due to the positions of the Voronoi cells. These cases are marked with “/” in Table 4 and Table 5.
The accuracy of the algorithm’s performance on test cases also varied because of the distances between them and the considered IoT sensors. This is seen in Table 5, where the average distance is given depending on the number of considered IoT sensors.
Assessment of NO2 was also conducted using additional meteorological variables. They were considered by generating a vector of random meteorological parameters 50 times. In each new iteration, the proposed method calculated pixel values for the whole test dataset. The averages of the results are seen in Table 6. The best results were obtained when temperature, humidity, wind speed and NO2 were considered together.
5. Discussion
A new ensemble approach for the assessment of an arbitrary environmental variable is proposed in this article. It utilizes the data fusion paradigm to construct a satellite-like image, based on measurements from the IoT sensors. Unlike other methods, it omits the interpolation of input variables in the initialization, and performs it during the prediction.
The results showed that, when comparing the method’s performance using three different machine learning algorithms, the feed-forward neural network achieved the best performance. The main factor contributing to the performance of the neural network was its architecture, which determines the number of coefficients used to calculate the target variable. This may explain why it yielded more accurate results than linear regression. Other differences also contribute to the construction of the regression equation, including the number of iterations used to update the weights, and the algorithm that determines how the weights are updated.
The RMSEs varied between the selected test cases. The best results on average were obtained in test cases 5, 7 and 8 (Table 3), which are positioned on flat terrain (see Figure 5). These cases also had better results when fewer and nearer sensors were considered (Table 4 and Table 5). This was due to the lack of major geographical barriers between the considered IoT sensors that contributed to the calculated values [15]. Similarly, good results were also obtained in test case 6. This test case is positioned near the Šoštanj Thermal Power Plant and is surrounded by three or more IoT sensors. The test cases with the worst results had geographical barriers present between them and the considered sensors. However, the obtained results could be improved if more IoT sensors were considered.
The 1-NN took the longest time to perform the calculations. This is because 1-NN is an instance-based (lazy) method that carries out most of its processing in the prediction phase, while searching for the nearest instance. Furthermore, the approach performed slower in cases where the set of pixels with the smallest distance contained many invalid values. In this case, the algorithm had to search for the next set of pixels. Linear regression and the neural network, on the other hand, perform most of the calculations in the phase of machine learning.
Nevertheless, it was shown that the proposed approach runs in a relatively short amount of time, and provides good results. The main advantage of this approach is its assessment of the environmental variable at an arbitrary location within the observed area whenever IoT measurements are available, whereas other similar approaches assess the variables at lower temporal resolutions. The proposed approach increases the spatial resolution by defining the ensemble of base regression models, which are dependent on the neighboring IoT sensors. The proposed approach allows the integration of any machine learning algorithm. Furthermore, it can be applied to an arbitrary variable in any observed area, as long as sufficient and appropriately correlated training data are provided. It may also use other auxiliary meteorological variables in addition to the main predictor (NO2 in our case).
The main disadvantage of the proposed approach is its inability to include new sensors. This will be addressed by incremental machine learning algorithms [46]. Furthermore, the performance of the method can also be increased by utilizing different machine learning algorithms to model the relationship between the IoT and the satellite sensor data. We will try to improve efficiency by utilizing recurrent neural networks to make time series predictions [47].