1. Introduction
Today’s way of life, as shaped by prevailing social and economic conditions, continuously exposes people to an atmosphere with adverse effects on health. Research findings from all over the world converge on the destructive consequences that polluted atmospheric air has on humans, agriculture, and infrastructure. The continuous degradation of the climate makes it imperative to find a means of predicting the quality of the atmosphere, as well as of dealing with the production of atmospheric pollutants.
Air pollution is defined as the “contamination of the indoor or outdoor environment by any chemical, physical or biological agent that modifies the natural characteristics of the atmosphere” [
1]. The effects of air pollution have been studied in many countries, both within and outside the EU, with each work of research confirming its catalytic role in the degradation of quality of life. Exposure to PM
10 was found to increase the likelihood of hospitalization for bronchitis symptoms, both for adults and children [
2], while exposure to excess concentrations of PM
2.5 was found to be a cause of hospitalization for cardiovascular and respiratory causes, as well as the cause of death [
3,
4]. Also, exposure to this type of particle can negatively affect infants, even in the prenatal stage, increase the probability of birth defects, obesity, type I diabetes, neurological and behavioral dysfunctions, premature birth, and even neonatal death [
5,
6,
7,
8]. Equally important are the excessive concentrations of sulfur dioxide (SO
2), as an excess of 10 ppb can increase the number of hospitalizations by 1.7% [
8]. Prenatal and postnatal exposure to nitrogen dioxide (NO
2) leads to increased chances of bronchopulmonary infections, pneumonia, as well as obesity [
6,
9]. Research that was performed in areas of Athens (Thrakomakedones and Athens center) for the period 2001–2018 showed that 6% and 7.5% of deaths in these areas, respectively, are due to increased concentrations of ozone (O
3) [
10].
However, the effects of air pollution are not limited to human health. Extensive research carried out in the city of Beijing, China, showed a direct correlation between the concentration of ozone (O
3) as well as nitrogen oxides (NOx) and the reduction in the amount of grain yielded [
11]. These yield losses can reach up to 15% and are predicted to rise to 23% in the coming years [
11,
12,
Correspondingly, acid rain is harmful to infrastructure, as well as to the transportation sector. In particular, research carried out by Hernández et al. [
14] on its effect on car paint found that, of the two aging methods examined (Xenon exposure and an acid rain mixture), the latter had the more destructive consequences. Finally, Ibrahim et al. [
15] in their research on the surface treatment of concrete after exposure to an acidic environment, reported the phenomenon of gypsum formation and its destructive properties on building materials, gradually weakening the ability of cement to withstand compression.
The need to design a means of forecasting pollutant concentrations is imperative. Over time, many researchers have tried to design such predictive models. In 2015, Xiao Feng et al. [
16] created a model that combined a Multi-Layer Perceptron (MLP) neural network with a geographical model based on the air trajectory and the use of the wavelet transformation method to forecast the average daily PM
2.5 concentration for the next two days in Beijing, China. Similar efforts were made by Madhavi Anushka Elangasinghe et al. [
17] in 2014 to forecast the hourly concentration of nitrogen dioxide (NO
2) using an MLP neural network with one hidden layer, trained with the Levenberg–Marquardt algorithm for the Auckland, New Zealand region. In 2016, Yun Bai et al. [
18], using the stationary wavelet transform method and the training of an ANN with a backpropagation algorithm, tried to predict the daily concentrations of PM
10 particles and the concentrations of sulfur dioxide (SO
2) and nitrogen dioxide (NO
2). In 2017, Fabio Biancofiore et al. [
19] also tried to predict the average daily concentrations of PM
2.5 and PM
10 particles for the next one to three days, while in 2018, Fabiana Franceschi et al. [
20] tried through a statistical study and AΝΝ integration to predict the concentrations of PM
2.5 and PM
10 particles in Bogotá, Colombia. According to further research published in 2024, Quanchao Chen et al. [
21] designed an innovative model for the prediction of hourly concentrations for PM
2.5 and PM
10 as well as ozone (O
3) in the city of Beijing by creating an AAMGCRN (Adaptive Adjacency Matrix Graph Convolutional Recurrent Network). Accordingly, Sarmad Dashti Latif et al. [
22] applied multiple machine learning methods to predict the ozone (O
3) concentration for the next 1, 3, 5, and 7 h in the Klang Valley, Malaysia.
This research was conducted in an attempt to design a universal air pollution forecasting model. Notably, over the course of the past decade, no corresponding studies were found that use strictly meteorological variables, and no other pollutants, as training data. The most recent such reference dates to 2014, in the forecasting model created by Madhavi Anushka Elangasinghe et al. [
17], who argued that models in which the inputs consist, among other things, of pollutant concentrations have limited practical use.
2. Materials and Methods
The present study addresses a wide range of atmospheric pollutants, specifically the concentrations of suspended particulate matter with an aerodynamic diameter of 2.5 μm (PM2.5) and 10 μm (PM10), as well as the concentration of sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), and ozone (O3).
The area of interest is the city of Beijing, China, and specifically 12 sub-districts within it. ANNs were developed and used to achieve the requested results. ANNs are a sub-field of machine learning (ML), and what sets them apart from other ML techniques is the distinctive way in which they build their models. ANNs, as their name suggests, try to imitate both the structure and the way in which the human brain operates [
23]. Specifically, they consist of artificial neurons, with “nerves” providing them with the required information (input data). The inputs, when entering the neuron, are multiplied by some “weights” (factors that reduce or enhance the effect that each input has on the final outcome), and after passing through a transfer function, they exit from it (output data) [
23]. The schematic representation of the above process is presented in
Figure 1.
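To make the process of Figure 1 concrete, the following minimal Python sketch implements a single artificial neuron: the inputs are multiplied by their weights, summed with a bias, and passed through a transfer function. The function name and the choice of a logistic (sigmoid) transfer function are illustrative assumptions, not taken from the paper.

```python
import math

def neuron(inputs, weights, bias):
    """Single artificial neuron: weighted sum of inputs plus bias,
    passed through a logistic (sigmoid) transfer function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Example: two inputs with equal-and-opposite weights cancel out,
# so the output is the sigmoid of the bias alone.
print(neuron([1.0, 1.0], [0.5, -0.5], 0.0))  # 0.5
```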
The ANNs that were developed in this work are Multi-Layer Perceptrons (MLPs), which stand out for their use of hidden layers (
Figure 2). Simple perceptrons use no hidden layers, which limits their usefulness due to a lack of reliability. MLPs, however, by using one or more hidden layers, can process data much faster and more efficiently than their predecessors. Their topology and training parameters play a vital role, and if they are not adjusted properly, they can make the training process unsuccessful. If the number of hidden layers is insufficient, then the training process will be incomplete, and the reliability of the final ANN will be low. However, if the number of hidden layers exceeds the required amount, then the ANN may produce highly reliable results for the current dataset but will not work properly on different datasets. This is because of a phenomenon called overfitting or overtraining, in which the ANN training process fails, making the ANN replicate, rather than predict, the exact values of the testing and validation datasets [
23].
The way in which the input data are processed and correlated with the output data makes this particular method ideal for creating forecasting models. In particular, the present work aims to create and evaluate a sufficient number of forecasting ANN models in order to select those that present the best predictive ability.
2.1. Study Area and Data Availability
The study area is the city of Beijing, China (39°54′13″ N, 116°23′17″ E), for which hourly measurements of the concentrations of the pollutants of interest, as well as meteorological conditions, were collected for the time period 1 March 2013 to 28 February 2017. The corresponding values were obtained from a free dataset by Chen Song [
25] and were obtained through the following link:
https://archive.ics.uci.edu/dataset/501/beijing+multi+site+air+quality+data, accessed on 20 December 2023. The following data were given:
Concentration of particulate matter with an aerodynamic diameter of up to 2.5 micrometers, PM2.5.
Concentration of particulate matter with an aerodynamic diameter of up to 10 micrometers, PM10.
Concentration of sulfur dioxide, SO2.
Concentration of nitrogen dioxide, NO2.
Concentration of carbon monoxide, CO.
Ozone concentration, O3.
Ambient air temperature, TEMP.
Dew temperature, DEWP.
Atmospheric pressure, PRES.
Precipitation, RAIN.
Wind speed, WSPM.
Wind direction, WD.
The above values were available for nine (9) total locations of Beijing, including the following:
Location 1: Aotizhongxin;
Location 2: Changping;
Location 3: Dongsi;
Location 4: Guanyuan;
Location 5: Gucheng;
Location 6: Nongzhanguan;
Location 7: Tiantan;
Location 8: Wanliu;
Location 9: Wanshouxigong.
The above locations are also shown in
Figure 3 (locations with an asterisk). It was then decided that the distance between the examined locations should be no more than 20 km. This decision was based on the assumption that, within a distance of 20 km, the examined locations would significantly influence one another, and the selected locations would be closer to the city’s center. Before any data processing, it was necessary to check the available datasets for non-available/missing values.
All preliminary screenings were conducted using Microsoft Excel 2021. For example, in
Table 1, the data completeness for location 3 is presented. In total, across all nine (9) areas, more than 7000 missing values were found out of a total of 420,780 items per area. The standard methodology dictates that, when checking data for missing values, the entire row in which they are located should be deleted. However, such a treatment is only possible when there is a large amount of available data and/or a small number of missing values. For a dataset of insufficient size such as this one, applying that methodology can lead to a data reduction of 20%. Therefore, it was decided not to follow this directive and instead to attempt to cover the missing data through a combination of the methods analyzed below.
2.1.1. Method 1: Using Existing Values and Microsoft Excel Commands
In order to apply this method, a basic assumption is made that consecutive values within 5 h do not differ greatly from each other. Under this assumption, the procedure is quite simple. Initially, a check is made for the existence of values within the previous 5 h and within the next 4 h. From this check, four (4) individual scenarios emerge as follows:
The existence of both values within the time limits given. In this case, the missing value is the average of the two existing values.
The existence of only the previous value within the time limits. In this case, the missing value is equal to the previous value plus/minus a certain number, which is listed in
Table 2 below.
The existence of only the next value within the time limits. In this case, the missing value is equal to the next value plus/minus a specific number, which is listed in
Table 2 below.
The simultaneous absence of two values within the time limits. In this case, the element receives the value “NA” so that it can be processed later.
Although the above method is quite effective, it is not able to cover 100% of the missing values; however, it presents an average efficiency of more than 60%. The number of missing values covered is presented in detail in
Table 3.
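Under the stated assumptions, Method 1 can be sketched as follows in Python (the original was implemented with Microsoft Excel commands; the `offset` parameter stands in for the pollutant-specific adjustments of Table 2 and is a hypothetical placeholder here):

```python
def fill_gaps(series, back=5, fwd=4, offset=0.0):
    """Method 1 sketch: fill each missing value (None) using the nearest
    existing values within `back` hours before and `fwd` hours after.
    `offset` is a placeholder for the Table 2 adjustments."""
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        prev = next((series[j] for j in range(i - 1, max(i - back, 0) - 1, -1)
                     if series[j] is not None), None)
        nxt = next((series[j] for j in range(i + 1, min(i + fwd, len(series) - 1) + 1)
                    if series[j] is not None), None)
        if prev is not None and nxt is not None:
            out[i] = (prev + nxt) / 2          # scenario 1: average of both
        elif prev is not None:
            out[i] = prev + offset             # scenario 2: previous value
        elif nxt is not None:
            out[i] = nxt + offset              # scenario 3: next value
        else:
            out[i] = "NA"                      # scenario 4: processed later
    return out

print(fill_gaps([10.0, None, 20.0]))  # [10.0, 15.0, 20.0]
```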
2.1.2. Method 2: Development of a Code within the MATLAB Programming Environment
This method also requires an assumption: that the concentrations of pollutants and the values of meteorological factors do not change significantly within an area of about ten (10) kilometers or less. Initially, the distance between the nine (9) study areas is calculated, and for each one, the closest areas are selected. Let us take location 3 (Dongsi) as an example. If the distances of the other locations are calculated in relation to location 3, it is found that locations 6, 4, 8, and 1 (in order of proximity) lie within a range of 10 km or less. It is logical that the values of the areas closest to location 3 have a greater influence on the values of interest. In the next step, the relative position of locations 1, 4, 6, and 8 with respect to location 3 (north, east, north-east, etc.) needs to be determined. Finally, the appropriate code is written within the MATLAB environment, version R2022a, which takes the wind direction into account and selects the appropriate location. In each case, the missing value in location 3 is assigned the corresponding value of the selected area. This method, though much more complicated and detailed than the first one, does not cover a large number of missing values, with a success rate of less than 20%. The remaining values are once again set to “NA” so as to be processed at a later time.
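The wind-direction-based selection of Method 2 can be sketched as follows (the original code was written in MATLAB; this Python sketch, with hypothetical station bearings and values, is illustrative only):

```python
def angular_diff(a, b):
    """Smallest absolute difference between two bearings in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def fill_from_neighbor(wind_dir_deg, neighbors):
    """Method 2 sketch: choose the neighboring station whose bearing from
    the target station best matches the wind direction (the upwind side)
    and copy its value. `neighbors` maps station -> (bearing_deg, value)."""
    station = min(neighbors,
                  key=lambda s: angular_diff(wind_dir_deg, neighbors[s][0]))
    return station, neighbors[station][1]

# Hypothetical example: wind from the north-east (45 deg) selects the
# neighbour that lies to the north-east of the target station.
neighbors = {"loc6": (40.0, 33.0), "loc4": (250.0, 28.0), "loc8": (300.0, 31.0)}
print(fill_from_neighbor(45.0, neighbors))  # ('loc6', 33.0)
```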
2.1.3. Method 3: Applying the Linear Regression Methodology
The rationale of this method is directly related to the basic idea behind Method 2. As before, the basic assumption is that the meteorological and atmospheric conditions prevailing in nearby areas describe, with relative accuracy, the conditions in the study area. First, the areas are ranked in order of proximity. Then, the values of paired stations are plotted against each other, producing graphs in which the linear correlation between the values is evident. Finally, the resulting linear equations are extracted and applied, in order of proximity, to the missing values of each region. This method offers 100% coverage of the remaining “NAs”.
Figure 4 shows indicatively the graphs concerning the linear correlation between PM
2.5 concentrations for location 3 in relation to locations 1, 4, 6, and 8, respectively.
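Under the stated assumption, Method 3 reduces to an ordinary least-squares fit between paired stations. A minimal sketch with hypothetical values (not the paper's actual regressions):

```python
def linfit(x, y):
    """Ordinary least squares fit y ≈ a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

def regression_fill(target, neighbor):
    """Method 3 sketch: fit the target station's values against a nearby
    station's values on hours where both exist, then fill the target's
    remaining gaps from the fitted line."""
    pairs = [(nb, tg) for nb, tg in zip(neighbor, target) if tg is not None]
    a, b = linfit([p[0] for p in pairs], [p[1] for p in pairs])
    return [tg if tg is not None else a * nb + b
            for tg, nb in zip(target, neighbor)]

# Hypothetical PM2.5 series: location 3 with one gap, filled from a neighbor.
print(regression_fill([10.0, 12.0, None, 16.0], [5.0, 6.0, 7.0, 8.0]))
```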
2.2. Data Preparation
Before the training process of the ANNs can begin, the appropriate processing of the input data is necessary. Initially, it was necessary to check the correlation of the data variables in order to reveal any existing relationships among them. Therefore, the correlation coefficient between the variables was calculated for each region, and in the end, the average was taken into account, as presented in
Table 3. However, despite the attempt to extract these “hidden” relationships through the correlation coefficient,
Table 3 does not offer much insight. As an alternative method for determining optimal combinations, the Principal Components Analysis (PCA) technique was used, in combination with the k-means clustering method, which indicated that 40% of the information could be described solely by the temperature variables. In practice, however, this did not hold. Although, strictly following the methodology, we could have settled for the PCA results alone, in the present paper further scenarios were analyzed, which proved better than the one proposed by the PCA. After all, Fabiana Franceschi et al. [
20] found a positive correlation between PM
10 concentration and wind direction, while a negative correlation was observed between the concentration of the same pollutant in terms of temperature and wind speed. These findings are further reinforced by Zhang et al. [
26] in earlier research, in Beijing City, China. Accordingly, for PM
2.5, a positive correlation with relative humidity was observed [
20], which is also verified by this specific research work.
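As an illustration of the correlation screening described above, the Pearson correlation coefficient between two variables can be computed as follows (the study performed this step with its own tooling; this Python sketch, with hypothetical temperature and dew-point series, is purely illustrative):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

temp = [10.0, 15.0, 20.0, 25.0]
dewp = [5.0, 9.0, 13.0, 17.0]   # perfectly linear in temp for this example
print(round(pearson(temp, dewp), 6))  # 1.0
```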
2.3. Creation of Scenarios
The scenarios that were created were the same for each location; however, each of them was studied separately. The goal was to train ANN forecasting models for each location and merge them into a universal algorithm. Each location “includes” six (6) pollutants (PM
2.5, PM
10, SO
2, NO
2, CO, and O
3), where each pollutant consists of eight (8) scenarios, and each scenario includes ten (10) different ANN models. In total, 4320 forecasting ANN models were developed (9 locations × 6 pollutants × 8 scenarios × 10 models). The training data for each scenario consisted of a different combination of meteorological variables, using the hourly values of the three previous days as inputs and the next 24 hourly concentrations as outputs. More specifically, the developed ANN models are able to forecast the hourly air pollutant concentration for the next 24 h based on the hourly values of the previous three (3) days. To make this more understandable, let us assume that today is Sunday. At any hour on Sunday, the developed ANN models can give the concentration of each pollutant for the next 24 h (the hourly forecasting step) until Monday (the next day’s forecasting horizon). In any case, the hourly values of the necessary parameters (see the scenarios in
Table 4) of the three previous days, which are Thursday, Friday, and Saturday, are taken into account. The specific structure gives the developed ANN forecasting models an operational interest since the forecast can be made at any time during the day for the next day, giving an advantage in making correct, valid, and timely decisions by the competent agencies.
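The input/output structure described above can be sketched as a simple sliding-window routine. For illustration, a single hourly variable is assumed; the paper's actual models use several meteorological variables per scenario.

```python
def make_samples(hourly, in_hours=72, out_hours=24):
    """Pair each window of the previous three days of hourly values (72 h)
    with the next 24 hourly concentrations to be forecast."""
    samples = []
    for t in range(in_hours, len(hourly) - out_hours + 1):
        x = hourly[t - in_hours:t]      # previous 3 days (inputs)
        y = hourly[t:t + out_hours]     # next 24 h (forecast horizon)
        samples.append((x, y))
    return samples

series = list(range(200))               # stand-in for one hourly variable
samples = make_samples(series)
print(len(samples), len(samples[0][0]), len(samples[0][1]))  # 105 72 24
```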
For a better study and analysis of the data, it was considered appropriate to add one more variable, relative humidity (RH), as it combines dew point temperature (DEWP) and dry bulb temperature (TEMP) data. The developed scenarios are described in
Table 4. RH was calculated using Equation (1) [27].
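Equation (1) itself is not reproduced here; a common formulation relating RH to the dry-bulb (TEMP) and dew-point (DEWP) temperatures is the Magnus approximation, sketched below. The constants are those of a widely used parameterization and may differ from those of the paper's Equation (1).

```python
import math

def relative_humidity(temp_c, dewp_c):
    """Approximate RH (%) from dry-bulb and dew-point temperatures (deg C)
    via the Magnus formula; constants are one common parameterization."""
    a, b = 17.625, 243.04
    e_dew = math.exp(a * dewp_c / (b + dewp_c))   # vapor pressure term
    e_sat = math.exp(a * temp_c / (b + temp_c))   # saturation term
    return 100.0 * e_dew / e_sat

# Saturated air (dew point equals air temperature) gives RH = 100%.
print(round(relative_humidity(20.0, 20.0), 1))  # 100.0
```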
As mentioned, for each scenario, 10 different ANNs were trained, which were then evaluated, and the best one was selected. For reasons of repeatability in the experiment, the architecture of the developed ANNs is listed in
Table 5. The training functions were chosen so that a secondary evaluation of the ways in which the training process could be optimized could also be performed. Specifically, the software’s default training function (Levenberg–Marquardt) was chosen to create the first ANN, with its individual parameters adjusted to require “medium” computational power. The next four (4) ANNs were trained with the Bayesian Regularization training algorithm, which is suitable for training ANNs aimed at pattern recognition. ANNs number 6 to 10 are examples of other training functions, chosen in an effort to find possible functions with better performance. Among the ten (10) developed ANN models, the first five proved to be the most “demanding”, as they needed a lot of computational power. However, as presented in the next section, they consistently offered the most valid results. On the contrary, the second half of
Table 5, although less “demanding”, provided very unstable results, with close to zero practical utility. In fact, none of the last five (5) training functions proved suitable for training ANN models of this kind. The topologies/architectures of the developed ANN models were created in such a way that they could be evaluated by the same standards while still allowing the training process to be carried out on a conventional desktop computer.
Furthermore, the initial dataset was split (randomly) into three subsets: the training subset, containing 70% of the total data volume; the cross-validation subset, containing 15%; and the testing subset, containing the remaining 15%. Finally, for all of the ten developed ANNs in
Table 5, the number of training epochs was equal to 200.
2.4. Software and Infrastructure
The following software were used for the preparation of this work:
MATLAB R2022a;
Microsoft Excel 2021.
The training of ANNs was carried out on a home desktop computer with the following specifications:
CPU: AMD Ryzen 7 5700G (Advanced Micro Devices, Inc., Santa Clara, CA, USA);
RAM: G.Skill Ripjaws V 16GB DDR4-3200MHz (G.SKILL International Enterprise, Taipei, Taiwan);
GPU: N/A;
SSD Kingston NV1 500GB M.2 NVMe (SNVS/500G) (Kingston Technology Corporation, Fountain Valley, CA, USA).
In order to evaluate the above neural networks, it was deemed necessary to calculate eight (8) statistical indices [
28]. These indices and their equations are listed below: Equations (2) to (5) describe the statistical evaluation indices for the performance of the models in predicting the next 24 hourly concentrations, whereas Equations (6) to (9) represent the evaluation indices that determine the predictive ability of each developed ANN model with respect to a certain threshold. More specifically, they determine whether the developed model is able to predict the exceedances, in other words, the cases where the pollutant concentration exceeded a specific threshold value. These threshold concentrations are determined by the reference values for each pollutant according to the WHO [
29].
where (Oi) and (Pi) represent the observed and the predicted values, respectively, (Omean) is the mean of the observed values, and (n) is the number of observations in each case.
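The eight indices are not reproduced as equations here; the sketch below implements standard formulations (MAE, RMSE, Willmott's index of agreement, and threshold-based exceedance indices) consistent with the descriptions above. The exact definitions of the paper's Equations (2)–(9) may differ; in particular, SI is taken here, as an assumption, to be the overall fraction of correctly classified hours.

```python
import math

def mae(o, p):
    """Mean absolute error between observed and predicted series."""
    return sum(abs(a - b) for a, b in zip(o, p)) / len(o)

def rmse(o, p):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(o, p)) / len(o))

def ia(o, p):
    """Willmott's index of agreement (standard formulation; assumed here)."""
    om = sum(o) / len(o)
    num = sum((a - b) ** 2 for a, b in zip(o, p))
    den = sum((abs(b - om) + abs(a - om)) ** 2 for a, b in zip(o, p))
    return 1.0 - num / den

def exceedance_indices(o, p, threshold):
    """TPR, FPR, FAR, and SI (here: overall fraction of correctly
    classified hours) for a given concentration threshold."""
    tp = sum(a > threshold and b > threshold for a, b in zip(o, p))
    fn = sum(a > threshold and b <= threshold for a, b in zip(o, p))
    fp = sum(a <= threshold and b > threshold for a, b in zip(o, p))
    tn = sum(a <= threshold and b <= threshold for a, b in zip(o, p))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    far = fp / (tp + fp) if tp + fp else 0.0
    si = (tp + tn) / len(o)
    return tpr, fpr, far, si

# Hypothetical observed/predicted concentrations for four hours.
obs = [10.0, 30.0, 50.0, 70.0]
pred = [12.0, 28.0, 55.0, 66.0]
print(round(mae(obs, pred), 3), round(rmse(obs, pred), 3))  # 3.25 3.5
```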
3. Results and Discussion
Table 6 depicts the values of the statistical evaluation indices (MAE, RMSE, R, and IA) for the best-developed ANN model for each one of the six air pollutants among the nine examined locations and for all of the eight examined scenarios. According to
Table 6, it seems that for particulate matters and NO
2, ANN#5 has the best predictive performance. For SO
2, O
3, and CO, ANN#4 presents the best predictive ability. In all cases, the most suitable scenarios are S2 and S1 (see
Table 4). Location 8 seems to have the best forecasting ability, especially for air pollutants NO
2, SO
2, O
3, and CO. This may lead to the conclusion that sites with similar zoning to location 8 could provide more reliable data, thus assisting with the better training of ANN forecasting models, which could then serve as the basis for air pollution forecasting in other locations. Concerning the general forecasting ability, the correlation coefficient (R) lies between 0.911 and 0.954, and the index of agreement (IA) lies between 95.31% and 97.64%. Both indicate a very good forecasting ability for the next 24 h with an hourly forecasting step.
Table 7 shows the values of the exceedance statistical evaluation indices (TPR, FPR, FAR, and SI) for the best-developed ANN model and for each one of the six air pollutants among the nine examined locations and for all of the eight examined scenarios.
Concerning the forecasting of exceedances, i.e., the cases where the concentration of the examined pollutants is greater than a given threshold value based on the WHO directives, and whether these are forecasted correctly or not, it seems that in all cases, model ANN#5 gives the best prediction. Also, location 8 could again be considered as a reference location in future works. Furthermore, S1 and S2 were found to be the best training scenarios among the eight examined. Finally, TPR gives the rate of exceedances that are observed and correctly forecasted, lying between 88.16% and 97.75%, while SI shows the overall ability to forecast exceedances correctly, lying between 93.113% and 99.86%. Both indicate that the developed ANN models are able to provide very good and sufficient forecasting of the exceedances.
In an effort to derive the general behavior of the developed ANN models in terms of their forecasting ability, appropriate Box and Whisker graphs were created. The necessary data for the design of these graphs were the maximum, average, and minimum values of IA (
Figure 5) and SI (
Figure 6), respectively, concerning all of the developed ANNs for each air pollutant and for each one of the nine examined locations within the greater area of Beijing, China. More concretely, the data comprised the best performance values for each pollutant and the optimal training function; therefore, 72 values in total were used (9 locations × 8 input data training scenarios).
Figure 5 shows that the mean values of IA for all of the developed ANN models and all of the forecasted air pollutants lie between 0.92 and 0.95, indicating an extremely good forecasting performance. In addition, it seems that the developed ANN models are able to forecast next-day 24 h concentrations of SO
2, PM
2.5, CO, and O
3 in a more sufficient manner than NO
2 and PM
10.
In the same fashion as
Figure 5, but concerning the ability of the developed ANNs to forecast air pollutant concentration exceedances,
Figure 6 shows that the mean values of SI for all of the developed ANN models and all of the examined air pollutants lie between 0.88 and 0.95, indicating an extremely good forecasting performance. In addition, it seems that the developed ANN models are able to forecast the exceedances of next-day SO
2, PM
2.5, CO, and O
3 at a more sufficient level than NO
2 and much more than PM
10.
4. Conclusions
The aim of this work was to create a universal forecasting model of air pollutant concentrations, specifically PM2.5 and PM10, as well as SO2, NO2, CO, and O3. A basic condition for the training of these models is the appropriate pre-processing of the input data in order to extract any hidden relationships, which can facilitate the subsequent creation of scenarios. The gap-filling process was an innovation not found in any of the aforementioned literature and offered satisfactory results. The algorithms created for this process are also universal, and their application to any other dataset is feasible with minor adjustments. Additionally, eight possible input scenarios were compared, and each was trained with ten different training functions (ANN model architectures). The results offered valuable insight into optimizing the training of ANN models with the pollutant of interest held constant. The results were acceptable for the majority of pollutants, with average prediction values well above 80.0% accuracy. Further statistical analysis showed that the use of all of the available meteorological variables enhanced the training performance of the ANNs and did not “confuse” them. It was also observed that the BR4 (ANN#5) structure was the most suitable. The resulting model can be used by both public and private bodies, as the immediate information it provides can assist them in protecting themselves from harmful environmental conditions.
In conclusion, the developed ANN forecasting models show an operational interest since the forecast can be made at any time during the day or for the next day, giving an advantage to competent agencies in making correct, valid, and timely decisions.
Finally, further research is suggested in order to improve the prediction of air pollution so that the developed models achieve an optimal design and performance for public and private authorities’ decision making, aiming at the protection of public health and the avoidance of adverse health effects, as well as adverse effects on construction and infrastructure, while also taking the climate crisis into account.