1. Introduction
In recent years, with the development of urbanization and industrialization, air quality issues have come onto the agenda [1], and air pollution occupies an increasingly important position in policy formulation and implementation. Air pollution in cities is mainly caused by industrial emissions and transportation, which produce pollutants such as NOx, O3, and SO2 [2]. According to the 2021 China Ecological Environment Status Bulletin, only 64.3% of China's 339 cities at the prefecture level and above met the ambient air quality standards in 2021. In the Yangtze River Delta region, the average percentage of days with air quality exceeding standards was 13.3%, with O3 and PM2.5 as the primary pollutants, accounting for 55.4% and 30.7% of the total exceedance days, respectively. The synergistic treatment of multiple pollutants has become the focus of air pollution prevention and control in China [3].
Air pollution prediction refers to the extraction of information and characteristics from historical air pollution data to predict the future trend of air pollution [4]. Multivariate time series prediction means that many external factors may affect the prediction target; for example, the values of some pollutants are related to the historical values of the target pollutant, and closely related external factors include temperature, humidity, wind direction, etc. [5]. Many cities have established monitoring stations in various locations to measure ozone (O3), nitric oxide (NO), PM2.5, and other data. The main sources of air pollution are industrial emissions, human activities, transportation, and natural causes (e.g., wildfires); other factors, such as weather conditions (wind speed, temperature, and humidity), also affect the settlement of pollutants and, thus, their monitored values. One contaminant may also be a precursor of another, and Figure 1 shows the correlations among different pollutants. Thus, air pollution is affected by many complex factors, and these factors can interact with each other, making the prediction of air pollution a difficult problem. If it were possible to predict places with a high pollution probability one or two days in advance, more efficient actions could be taken to alleviate the potential regional pollution [6].
Traditional air pollution prediction research is based on statistical methods, such as the Autoregressive Model (AR), the Moving Average Model (MA), the Auto-Regression and Moving Average Model (ARMA), and the Autoregressive Integrated Moving Average (ARIMA) [7]. Although these methods can model time series well, they all require the series to be highly stationary, which places strict demands on the dataset. However, air pollution monitoring often suffers from problems such as missing data due to sensor failure, so pre-processing of the data is usually required in practice [8]. Moreover, most statistical methods focus only on air pollution values in their predictions, without considering how pollutant concentrations are changed by other factors, such as weather conditions and the effects of other pollutants. With the development of artificial intelligence approaches and big data, many research projects have utilized machine learning and deep learning techniques for air pollution prediction. Among traditional machine learning methods, Fan et al. [9] used a heuristic algorithm combined with an SVM to predict daily diffuse solar radiation in air-polluted regions. S. Gocheva-Ilieva et al. [10] proposed a novel machine-learning-based stacked regression framework to predict daily average concentrations of particulate matter (PM10), in which four base models were built and evaluated. Johansson, C. et al. [11] applied different machine learning (ML) algorithms, including Random Forest (RF), Extreme Gradient Boosting (XGB), and Long Short-Term Memory (LSTM), to improve deterministic predictions of PM10, NOx, and O3 for 1, 2, and 3 days ahead at different locations in Greater Stockholm, Sweden.
In the problem of air pollution prediction, the use of sample features in traditional machine learning methods mainly requires expert knowledge of air pollution, which is usually time-consuming and laborious. In particular, different regions have different environmental conditions and characteristics of air pollution change, and the different structural characteristics of atmospheric flow due to topography and population density [12] make it more difficult to extract relevant features. Moreover, air pollutants undergo very complex chemical reactions; for example, NOx are important precursor pollutants in O3 formation and take part in complex photochemical reactions, which makes it difficult to construct such complex nonlinear feature mappings with traditional machine learning.
Deep learning has made many promising advances in the field of air pollution prediction and analysis. Shikhovtsev A. Yu et al. [13] used a deep neural network based on GMDH to estimate and predict the characteristics of turbulence intensity in the stratosphere. M. Catalano et al. [14] used Autoregressive Integrated Moving Average with Explanatory Variable (ARIMAX) and Artificial Neural Network (ANN) models to compare predictions of air pollution peaks in urban transportation networks. The results showed that the neural network predicted peaks better than the ARIMAX model. However, the ANN does not reflect the temporal characteristics of air pollution variation well: air pollution variation is highly time-dependent and closely related both to recent air pollution observations and to the preceding period of variation. Recurrent Neural Networks (RNNs) can use the output of the previous moment as the input of the next moment to achieve feature extraction and learning on time series. B.T. Ong et al. [15] proposed a Deep Recurrent Neural Network (DRNN) with a novel pre-training method to predict PM2.5 in Japan. However, RNNs suffer from vanishing and exploding gradients when dealing with long sequences. M. Krishan et al. [16] used the Long Short-Term Memory (LSTM) approach to predict O3, PM2.5, NOx, and CO concentrations at a location in the NCT of Delhi. Li et al. [17] proposed a hybrid CNN-LSTM model, combining a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) neural network, to forecast the next 24 h of PM2.5 concentration in Beijing. In recent years, the Transformer has made remarkable progress in sequence processing by using a multi-headed self-attention mechanism to capture correlations between time points. Chen et al. [18] combined a CNN with a Transformer to predict O3 concentrations and achieved good results in both short- and long-term predictions. Wenfeng Zheng et al. [19,20,21,22,23,24] used various deep learning methods to achieve better results in both temporal and spatio-temporal haze prediction.
Several contributions have been made in combining genetic algorithms with neural network models to explore hyperparameters; for example, Rana Muhammad Adnan et al. used ALO to optimize the number of hidden layer neurons and the learning rate of an LSTM [25]. The ANFIS-GBO model used two operators to optimize the learning parameters and improve the prediction accuracy of the ANFIS [26]. In addition, the PSOGWO and PSOGSA algorithms have been used to optimize the control parameters of the ELM model [27]. The SVR-SAMOA model integrates the Simulated Annealing (SA) algorithm with the Mayfly Optimization Algorithm (MOA) to determine the optimal hyperparameters for Support Vector Regression (SVR) [28], and the ANN-EMPA combines mutation and crossover operators with an ANN to produce a robust hybrid prediction model [29]. The CNN-INFO is highly efficient in optimizing complex phenomena with unknown search spaces [30].
However, all these machine-learning-based methods are hard to interpret and require manual feature engineering based on a priori knowledge, which is prone to prediction errors. Although models built by deep learning achieve high prediction accuracy, their trained weights are of little use to us, because they carry little physical meaning for real-world problems. In the air pollution prediction problem, the external variables have a very strong correlation with the prediction target, as shown in Figure 1, where PM2.5, NO2, and O3 are significantly negatively correlated: as the PM2.5 concentration increases, the photochemical reaction is suppressed, reducing the rate of O3 production. In addition, the heterogeneous chemical reactions occurring on the surface of the particles as the PM2.5 concentration increases also affect the O3 concentration, while NO2 is a precursor of O3 and undergoes photochemical reactions to produce ozone. These correlations can help the authorities predict atmospheric pollution and develop effective policies to mitigate it, but this information is difficult to obtain from deep learning; a model is therefore needed that can explore the impact of current external factors on atmospheric pollution while also predicting it.
All these challenges inspire us to rethink the air pollution prediction problem using deep learning models with model interpretability. Specifically, a Hybrid Autoformer Network with a Genetic Algorithm Model (GA-Autoformer) is proposed to predict the temporal variation of air pollution, as well as to explore the relationship between external variables and the target pollution. The main contributions of the proposed method are summarized as follows:
- (1)
A Hybrid Autoformer Network with a Genetic Algorithm Model was proposed to predict air pollution variation, where the genetic algorithm was used to optimize the weighting of the external variables, since different variables have different effects on the target pollution.
- (2)
The Elite Variable Voting Operator was proposed, which votes at fixed intervals of generations to find the variables that have a greater impact on the target prediction; these are selected as elite variables and explored with a more refined search.
- (3)
The Archive Storage Operator was proposed to counteract deviations in the final results caused by the random initialization of individual models, where individuals with better weights may appear less effective due to initialization and vice versa. The archive mechanism stores the individuals with good results and filters them to retain the truly good ones.
- (4)
We conducted comprehensive experiments on the Ma’anshan air pollution dataset to verify the proposed model, where the prediction accuracy was greatly improved, and the selection of model influencing factors was more interpretable.
The rest of this paper is arranged as follows. We describe our study area and dataset in detail in Section 2. Section 3 and Section 4 review the related work and detail our model, respectively. The experiments and result analysis are presented in Section 5. Finally, we discuss and conclude the paper in Section 6 and Section 7.
4. Methodology
In this section, we give a detailed description of the Hybrid Autoformer Network with a Genetic Algorithm model (GA-autoformer). A genetic algorithm is used to explore the influence of external factors on the prediction target and determine the degree of influence of each factor, and the approximately optimal results are then fed into the neural network model. The backbone neural network model adopted in this paper is the Autoformer, and the best particles and the best prediction accuracy are obtained through t iterations. The overall structure of the model is shown in Figure 5. We explain the whole process in detail in Section 4.1, Section 4.2, Section 4.3, Section 4.4, Section 4.5, Section 4.6 and Section 4.7. The pseudo code for the whole algorithm is shown in Algorithm 1.
Algorithm 1 Frame process of the GA-autoformer model
Require: D: multivariate air pollution time series dataset used in this iteration,
t: number of iterations,
p: population size,
k: interval (in generations) for executing the operators,
M: neural network model
Ensure: w_best: best weight
  Randomly generate W = {w_1, w_2, ..., w_p}
  A_archive ← ∅
  F_archive ← ∅
  E ← ∅
  for i ← 1 to t do
    F ← fitness(M, D, W)
    W ← tournament_selection(W, F)
    if i % k == 0 && i != 0 then
      E ← elite_voting(W, F)
      W ← archive_storage(A_archive, F_archive, W, F)
    end if
    C ← crossover(W, E, i)
    C ← mutation(C, E, i)
    W ← generate_population(W, C)
  end for
  w_best ← select_best(A_archive)
  return w_best
4.1. Generate Random Individuals
Firstly, p individuals are randomly generated as the population, and each individual represents a candidate solution, namely, a weight vector over the external factors:

$w_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,L}), \quad i = 1, 2, \ldots, p$

where L is the number of external variables and p is the population size.
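As a minimal sketch, the population initialization can be implemented as follows; the uniform [0, 1) sampling range is an assumption, since the paper does not state the initialization distribution:

```python
import numpy as np

def generate_population(p, L, low=0.0, high=1.0):
    """Randomly generate p individuals, each a weight vector over the
    L external variables. The [low, high) range is an assumption."""
    return np.random.uniform(low, high, size=(p, L))
```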
4.2. Calculation of Fitness Values
The weight vector of each individual is multiplied element-wise with the multivariate variable values in the current air pollution dataset D, and the result is input into the neural network as a new dataset to obtain the corresponding prediction accuracy; this prediction accuracy is used as the fitness value of the current individual.
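A hedged sketch of this evaluation is shown below. The `build_model`, `fit`, and `predict` names are illustrative placeholders for the Autoformer training and inference steps, which are not shown here; mapping the prediction error to fitness via 1/(1 + RMSE) is likewise an assumption, since the paper only states that prediction accuracy serves as the fitness.

```python
import numpy as np

def fitness(weights, X_ext, y, build_model):
    """Evaluate one individual: weight the external variables element-wise
    (the new dataset), train/evaluate the forecasting model on it, and
    map the prediction error to a fitness value (higher is better)."""
    X_weighted = X_ext * weights          # X' = w * X, cf. Section 4.10
    model = build_model()                 # hypothetical Autoformer factory
    model.fit(X_weighted, y)              # assumed training interface
    y_hat = model.predict(X_weighted)
    rmse = np.sqrt(np.mean((y_hat - y) ** 2))
    return 1.0 / (1.0 + rmse)             # assumed accuracy-to-fitness mapping
```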
4.3. Selection
In order not to lose information, we first put the best particle into the new population. Then, we use the Tournament Selection Algorithm: n individuals are randomly selected from the rest of the population and compete, and the best of them is placed into the new population. This is repeated until the new population reaches the required size. The resulting population constitutes the offspring and has a higher average fitness value than before. Here, n is usually set to 2.
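The selection step can be sketched directly from this description (elitism plus size-n tournaments):

```python
import random

def tournament_selection(population, fitness_vals, n=2):
    """Keep the best individual, then fill the new population by letting
    n randomly drawn individuals compete and keeping each winner."""
    best = max(range(len(population)), key=lambda i: fitness_vals[i])
    new_pop = [population[best]]
    while len(new_pop) < len(population):
        contenders = random.sample(range(len(population)), n)
        winner = max(contenders, key=lambda i: fitness_vals[i])
        new_pop.append(population[winner])
    return new_pop
```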
4.4. Elite Variable Voting Operator/Archive Storage Operator
Elite Variable Voting and Archive Storage are performed every k generations. The Elite Variable Voting Operator finds elite variables among excellent individuals so that the subsequent mutation and crossover can search more finely. The Archive Storage Operator reduces the instability caused by the random initialization of the neural network model when a weight vector is fed into it. The specific implementation details are expanded in Section 4.8 and Section 4.9.
4.5. Crossover
Although the average fitness value is improved by selection, selection cannot produce new individuals. Crossover mimics biological hybridization to produce new varieties: it exchanges parts of chromosomes and uses random pairing to determine the parent individuals.
We adopted the following strategy for the crossover of elite variables. Elite variables tend to have high weight values, and crossing them with non-elite variables may lose the information preserved by the elite variables, leading to population non-convergence; on the other hand, moderate crossover between elite and non-elite variables can increase population diversity. We therefore let elite variables cross with non-elite variables in the early stage and only with elite variables in the later stage, thus maintaining population convergence. We treated the first 50% of the iterations as the early stage, where elite variables can cross with non-elite variables; in the later stage, elite variables can only cross with elite variables. We adopted shuffle crossover as our crossover method [41], as sketched below. The flow chart of the crossover is shown in Figure 6.
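The following is a minimal sketch of the staged shuffle crossover under our reading of the strategy above; the exact restriction used in the later stage (crossing elite and non-elite positions separately) is an interpretation, not the authors' published code.

```python
import numpy as np

def shuffle_crossover(a, b, idx):
    """One-point shuffle crossover restricted to positions idx: shuffle
    the positions, cut once, and swap the tail between the parents."""
    idx = np.random.permutation(idx)
    if len(idx) == 0:
        return a, b
    cut = np.random.randint(1, len(idx)) if len(idx) > 1 else 0
    for j in idx[cut:]:
        a[j], b[j] = b[j], a[j]
    return a, b

def staged_crossover(parent_a, parent_b, elite_idx, gen, total_gens):
    """Early stage (first 50% of iterations): elite and non-elite positions
    are crossed together. Later stage: elite positions are crossed only
    among themselves, preserving the information they carry."""
    a, b = parent_a.copy(), parent_b.copy()
    all_idx = np.arange(len(a))
    if gen < 0.5 * total_gens:
        return shuffle_crossover(a, b, all_idx)
    elite = np.asarray(elite_idx, dtype=int)
    non_elite = np.setdiff1d(all_idx, elite)
    a, b = shuffle_crossover(a, b, elite)
    return shuffle_crossover(a, b, non_elite)
```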
4.6. Mutation
Crossover and selection can ensure that excellent genes are kept in each generation, but this may drive the whole population into a local optimum. When a new chromosome is generated by crossover, we can randomly select several genes on the chromosome and randomly modify their values [42].
Furthermore, we performed a Gaussian mutation on the selected gene $w_{i,j}$, wherein a Gaussian random perturbation term $N(\mu, \sigma^2)$ was added to the original state $w_{i,j}$. Equation (3) is the general form of the Gaussian mutation, where $N(0,1)$ is the standard normal distribution, $\mu$ is 0, and $\sigma$ is 1:

$w_{i,j}' = w_{i,j} + N(\mu, \sigma^2)$ (3)

In order to reflect the difference between elite and non-elite variables, different weighting coefficients $\lambda$ were used, as in Equation (4):

$w_{i,j}' = w_{i,j} + \lambda \cdot N(0, 1)$ (4)

When non-elite variables are selected for mutation, $\lambda = 1$; when elite variables are selected for mutation, $\lambda = 1.2$, since the elite variables are more important for finding the global optimum. This allows the population to jump out of local optima and improves the convergence speed.
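A minimal sketch of this operator following Equations (3) and (4); the per-gene mutation probability of 0.1 is taken from the settings in Section 5.2:

```python
import numpy as np

def gaussian_mutation(individual, elite_idx, p_mut=0.1):
    """Add a Gaussian perturbation lambda * N(0, 1) to randomly selected
    genes; lambda = 1.2 for elite variables and 1.0 otherwise."""
    mutant = individual.copy()
    for j in range(len(mutant)):
        if np.random.rand() < p_mut:
            lam = 1.2 if j in elite_idx else 1.0
            mutant[j] += lam * np.random.randn()   # Equation (4)
    return mutant
```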
4.7. Iteration
After t iterations of the steps in Sections 4.1-4.6, the individual with the highest fitness value in the archive of the last generation is taken as the optimal weight and returned.
4.8. Elite Voting Operator
In the time series prediction of air pollution, certain variables have more influence on the forecast; we call them elite variables. The elite variables vary across problems. This operator automatically finds the elite variables and optimizes them more finely. The pseudo code for the whole algorithm is given in Algorithm 2, and a Python sketch follows it. The steps of the Elite Variable Voting Operator are as follows.
Elite variable voting is conducted every k generations.
- (1)
Candidates are selected from the top 30% of the population based on fitness; these individuals are the best candidates in the population and represent its evolutionary direction.
- (2)
Among the candidates, we want some particles that can lead the candidates toward the optimization direction more effectively; we call these particles the "chairmen". We appoint the two particles with the highest fitness values among the candidates as "chairmen". In order to maintain diversity, the candidate that differs most from the "chairmen" is also added to the "chairmen". We use the Euclidean distance, as in Equation (5), to measure the distance between two particles:

$d(w_a, w_b) = \sqrt{\sum_{j=1}^{L} (w_{a,j} - w_{b,j})^2}$ (5)

- (3)
The elite variables chosen by the vote are given special treatment in the process of mutation and crossover: they have a higher probability of becoming larger.
Algorithm 2 Elite Voting Operator
Require: W: population,
F: population fitness values
Ensure: E: elite variables
  C ← ∅
  H ← ∅
  C ← top 30% of W ranked by F
  H ← the two particles with the highest fitness values in C
  H ← H + the particle in C that is most different from the particles in H
  C ← C − H
  E ← voting(C, H)
  return E
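A Python sketch of Algorithm 2 is given below. The number of elected elite variables (`n_elite`) and the exact voting rule (each chairman nominates its largest-weight variable indices, and the most-nominated variables win) are assumptions; the paper leaves these details to the voting() call.

```python
import numpy as np

def elite_voting(population, fitness_vals, top_frac=0.3, n_elite=3):
    """Elect elite variables: take the top 30% as candidates, appoint the
    two fittest candidates plus the most distant candidate (Euclidean
    distance, Equation (5)) as 'chairmen', and let the chairmen vote for
    the variable indices carrying the largest weights."""
    pop = np.asarray(population)
    order = np.argsort(fitness_vals)[::-1]
    n_cand = max(3, int(top_frac * len(pop)))
    cand = pop[order[:n_cand]]
    chairmen = cand[:2]
    dists = np.linalg.norm(cand - chairmen.mean(axis=0), axis=1)
    chairmen = np.vstack([chairmen, cand[np.argmax(dists)]])
    # each chairman nominates its n_elite largest-weight variables
    votes = np.argsort(chairmen, axis=1)[:, ::-1][:, :n_elite].ravel()
    idx, counts = np.unique(votes, return_counts=True)
    return idx[np.argsort(counts)[::-1][:n_elite]]
```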
4.9. Archive Storage Operator
During neural network training, the result can be poor due to different initialization weights, which can interfere with the judgment of the effect of the weights of different external variables. Therefore, we introduced an archive storage mechanism to put potentially optimal solutions into an archive. The pseudocode for the whole algorithm is given in Algorithm 3, and the process is as follows:
Algorithm 3 Archive Storage Operator
Require: A: archive,
F_A: archive fitness values,
W: population,
F: population fitness values
Ensure: W: population
  d ← size(A)
  F_avg ← sum(F) / size(F)
  for i ← 1 to d do
    if F_A[i] > F_avg then
      W ← rejoin_population(A[i], W)
    else
      discard(A[i], A, F_A)
    end if
  end for
  return W
- (1)
Every k generations, the best particle over all k generations is copied into the archive.
- (2)
During the next k iterations, the individuals in the archive are not involved in the genetic algorithm process, but only in the calculation of fitness.
- (3)
Every k generations, an examination is performed: if the fitness value of an archived individual is greater than the average fitness value of the current population, this individual replaces the individual with the lowest fitness in the population; otherwise, it is discarded.
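A minimal sketch of the check in step (3), matching Algorithm 3. Lists are mutated in place; whether the archive is fully emptied after the check, and where the archived fitness is re-evaluated, are our assumptions.

```python
def archive_storage(archive, archive_fitness, population, fitness_vals):
    """Every k generations: an archived individual whose fitness beats the
    current population average replaces the weakest population member;
    otherwise it is discarded."""
    avg = sum(fitness_vals) / len(fitness_vals)
    for ind, fit in zip(archive, archive_fitness):
        if fit > avg:
            worst = min(range(len(population)), key=lambda i: fitness_vals[i])
            population[worst] = ind          # rejoin_population(A[i], W)
            fitness_vals[worst] = fit
        # else: discard(A[i], A, F_A)
    archive.clear()                          # assumed: archive reset each check
    archive_fitness.clear()
    return population
```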
4.10. Prediction and Optimization
For the populations obtained in Section 4.1, Section 4.2, Section 4.3, Section 4.4, Section 4.5, Section 4.6 and Section 4.7, each particle represents a candidate solution, denoted as $w = (w_1, \ldots, w_L)$, which represents the weight of each external factor. For the external factor input $X$, each weight is multiplied with its input counterpart, giving $X' = w \odot X$. $X'$ is then fed into the Transformer network along with the target series y.
Unlike traditional forecasting methods, which decompose the series into seasonal and trend parts in advance, we gradually decompose the trend and periodic parts from the hidden variables during the learning process. This is based on the idea of the moving average, as shown in Equation (6), to achieve progressive decomposition:

$X_t = \mathrm{AvgPool}(\mathrm{Padding}(X)), \quad X_s = X - X_t$ (6)
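A minimal PyTorch sketch of the moving-average decomposition block in Equation (6); the kernel size of 25 is an illustrative choice, not a value reported in the paper:

```python
import torch
import torch.nn as nn

class SeriesDecomposition(nn.Module):
    """Progressive decomposition: the trend X_t is a padded moving average
    over the time axis, and the seasonal part is X_s = X - X_t."""
    def __init__(self, kernel_size=25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel_size, stride=1, padding=0)
        self.pad = (kernel_size - 1) // 2

    def forward(self, x):                        # x: (batch, length, channels)
        front = x[:, :1, :].repeat(1, self.pad, 1)   # replicate the endpoints
        back = x[:, -1:, :].repeat(1, self.pad, 1)
        padded = torch.cat([front, x, back], dim=1)
        trend = self.avg(padded.transpose(1, 2)).transpose(1, 2)
        seasonal = x - trend
        return seasonal, trend
```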
After the decomposition of $X$ into the seasonal term $X_s$ and the trend term $X_t$, the similarities between different seasonal terms are further aggregated over periodicity by the encoder using an autocorrelation mechanism. The autocorrelation coefficients can be obtained with the fast Fourier transform, and the information of similar subsequences is finally aggregated as in Equations (7) and (8):

$\tau_1, \ldots, \tau_k = \mathop{\arg\mathrm{Topk}}_{\tau \in \{1, \ldots, L\}} \left( R_{Q,K}(\tau) \right)$ (7)

$\mathrm{AutoCorrelation}(Q, K, V) = \sum_{i=1}^{k} \mathrm{Roll}(V, \tau_i)\, \widehat{R}_{Q,K}(\tau_i)$ (8)

where $k = \lfloor c \times \log L \rfloor$ and $\widehat{R}_{Q,K}$ denotes the softmax-normalized autocorrelation coefficients. The multi-headed form of query, key, and value is still used here so that the self-attention mechanism can be replaced seamlessly. Furthermore, only the $k$ most probable cycle lengths are aggregated, to avoid the fusion of irrelevant or even opposite subsequences.
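A simplified single-head sketch of this mechanism: the coefficients are computed with the FFT (Wiener-Khinchin theorem) and the top-k delays are aggregated by rolling the value sequence, as in Equations (7) and (8). This is a didactic reduction of the Autoformer mechanism, not the authors' implementation.

```python
import math
import torch

def autocorrelation(q, k, v, c=2):
    """q, k, v: (batch, length, channels). Compute R(tau) via FFT, pick the
    k_top = floor(c * log L) most probable delays, and aggregate rolled
    copies of v weighted by softmax-normalized coefficients."""
    length = q.shape[1]
    fq = torch.fft.rfft(q, dim=1)
    fk = torch.fft.rfft(k, dim=1)
    corr = torch.fft.irfft(fq * torch.conj(fk), n=length, dim=1)  # R(tau)
    k_top = int(c * math.log(length))
    mean_corr = corr.mean(dim=(0, 2))               # average over batch/channels
    weights, delays = torch.topk(mean_corr, k_top)  # Equation (7)
    weights = torch.softmax(weights, dim=0)
    out = torch.zeros_like(v)
    for w, tau in zip(weights, delays):             # Equation (8)
        out = out + w * torch.roll(v, shifts=-int(tau), dims=1)
    return out
```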
In the decoder, the trend and seasonal terms are predicted separately. For the seasonal term, the feature information obtained by the encoder is aggregated into predicted seasonal values. For the trend term, the information is gradually extracted from the predicted hidden variables using an accumulation method. Finally, the predicted values of the trend and periodicity terms are summed to obtain the final prediction.
The entire model is trained by calculating the empirical loss between the predicted pollutant value $\hat{y}$ and the real pollutant value $y$. Our loss function is the Root Mean Square Error (RMSE); the loss is not only propagated back from the decoder's outputs across the entire Transformer model, it also serves as the fitness value in the selection of the new generation of the population.
5. Experiments and Results
In this section, we give the parameter settings and experimental results, compare them with some current baselines, and attempt to translate the results into interpretable conclusions. We also try to prove and explain the role of each operator through a series of ablation experiments.
5.1. Dataset Descriptions
The details of the datasets are shown in
Table 1. We split the dataset into a training set (70%) and a test set (30%) in chronological order. In particular, when making a prediction for one of the targets, the other two targets were entered into the model as external variables. To measure the importance of external factors, we normalized the dataset.
The unit of CO (carbon monoxide) is mg/m³; the units of O3, NO2, PM2.5, and PM10 are μg/m³; and TSP is total suspended particulate matter (mg/L). The unit of wind speed is m/s, given as the 10-min average wind speed. Wind direction is measured by an anemometer and projected onto the [0°, 360°] interval. Precipitation (mm) refers to the amount of precipitation per hour, visibility (m) is the 10-min average visibility, humidity (%) is the relative humidity, pressure (hPa) is the atmospheric pressure measured at the monitoring point, and temperature (°C) is measured in degrees Celsius. In addition, we performed a statistical analysis of the data in Table 2 to avoid the presence of extreme data.
The available time period was from 1 January 2020 to 6 October 2020, and
Figure 7 shows the distribution of the amount of data by season over this period.
5.2. Parameter Setting
The length of the input sequence in the autoformer was 96, the length of the predicted sequence was 24, the number of attention heads was 8, the dropout value was 0.05, the batch size was 32, and the learning rate was 0.0001. In the genetic algorithm, the population size was 20, the number of iterations was 50, the crossover probability was 0.8, the mutation probability was 0.1, and the archive size was 5. We set k to 10. These settings are collected in the sketch below.
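For reference, the hyperparameters listed above can be collected as follows; the dictionary keys are illustrative names, not the authors' configuration object:

```python
config = {
    # Autoformer settings
    "input_len": 96,
    "pred_len": 24,
    "n_heads": 8,
    "dropout": 0.05,
    "batch_size": 32,
    "learning_rate": 1e-4,
    # Genetic algorithm settings
    "pop_size": 20,
    "iterations": 50,
    "p_crossover": 0.8,
    "p_mutation": 0.1,
    "archive_size": 5,
    "k": 10,   # interval for the voting/archive operators
}
```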
5.3. Evaluation Metrics
We chose the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) as the criteria for evaluating the prediction performance:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}, \quad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \quad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$

where n is the length of the time series prediction, $\hat{y}_i$ is the value predicted by the model, and $y_i$ is the actual target value.
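These metrics translate directly into NumPy (a minimal sketch; MAPE assumes the target values are nonzero, which holds for pollutant concentrations):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))
```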
5.4. Baselines
To verify the performance of our proposed model, we compared GA-autoformer with the following baseline models.
- (1)
RNN: The RNN is a classical time series prediction model that is capable of extracting time series features. Unlike feed-forward neural networks, the RNN uses the output of the previous moment as input to the next moment, a structure that gives the network the ability to remember information about trends and cycles [15].
- (2)
LSTM: The LSTM belongs to the class of recurrent network models. It solves the problem that RNNs cannot capture long-term temporal dependence and uses multiple gate mechanisms to alleviate the exploding and vanishing gradient problems of RNNs [16].
- (3)
EA-LSTM: EA-LSTM is based on the attention LSTM and uses genetic-algorithm-based competitive random search (CRS) instead of a gradient-based approach to explore the attention layer weights; it thus better assigns the weights of features within the time window [
43].
- (4)
Transformer: Recently, the Transformer model has made a big breakthrough in time series prediction. Unlike the RNN and LSTM, Transformer is not a cyclic sequence model. Its prediction efficiency and its ability to predict long-term time series are greatly improved [
32].
- (5)
Informer: The authors designed an efficient Transformer-based long-sequence prediction model, named Informer, by proposing a ProbSparse self-attention mechanism. It utilizes self-attention distillation to highlight dominant attention by halving the cascading layer input, together with a generative decoder for one-shot prediction of long sequences, providing a new solution to the long-sequence prediction problem [34].
- (6)
Autoformer: The authors used a deep decomposition architecture, designing series decomposition units embedded in the deep model to implement progressive prediction. They discarded the point-wise connected self-attention mechanism in favor of a series-wise connected autocorrelation mechanism to break the information utilization bottleneck [35].
35].
5.5. Analysis of Prediction Result
We trained the GA-autoformer for 50 iterations. Table 3 shows the comparison of prediction accuracy with the different baselines, and the prediction line graphs are given in Figure 8. From Table 3, we can see that our model had higher prediction accuracy than the other baseline models: it ranked first in 23 of the 27 comparisons and second in the remaining four. The LSTM and RNN, which both have recurrent network structures, lagged far behind our model. The EA-LSTM uses a genetic algorithm to optimize the attention layer combined with an LSTM, but there was still a gap relative to the Transformer. Our model also outperformed the Transformer and most of its variants, which shows that external variables do affect the time series prediction and that the model successfully found an approximately optimal solution for the external variable weights by using a genetic algorithm to jump out of local optima. Furthermore, it can be seen from Figure 9 that the Archive Storage Operator can effectively reduce the impact of the neural network initialization on the prediction model.
Furthermore, we analyzed the effects of $\lambda$, the archive size, and k on the experiment. Here, we used the dataset of Location A, and the prediction target was O3. From Figure 10, we can see that the best results were obtained when $\lambda$ was set to 1.2, the archive size was set to 5, and k was set to 10.
The results of the training process are visualized in Figure 11. It can be clearly seen that, as the number of iterations increased, the population gradually converged and evolved in the right direction. We elaborate on the interpretability of the experimental results in Section 5.6.
5.6. Model Interpretability
Figure 12 shows the external factor optimization weights of the last generation of the population, as well as the elite variables selected in the final rounds across multiple experiments. The redder the color, the more important the factor; yellow indicates a factor less important to the target predicted pollution.
When selecting O3 as the predicted target pollution, it can be seen that the optimized individuals all had higher values for NOx, temperature, etc. The control of O3 pollution mainly involves the control of its precursors, chiefly nitrogen oxides and carbon monoxide. Nitrogen oxides react with surrounding atmospheric ozone and subsequently form nitric acid [44]. An increase in temperature reflects an accompanying increase in solar radiation, which leads to higher ozone levels, but high temperatures also increase vertical convective activity in the atmosphere, which facilitates the diffusion and dilution of local ozone and its precursors. High humidity facilitates the removal of O3 pollution; moreover, water vapor in the atmosphere attenuates solar ultraviolet radiation and thus slows down photochemical reactions, so humidity has a large negative correlation with O3. Furthermore, O3 pollution was mainly negatively correlated with wind speed, because wind enhances the horizontal diffusion of ozone and contributes to its dilution. The weights of these factors were relatively large in the experiments, and they were all selected as elite variables several times, which accords with our prior studies on O3 [45].
In terms of meteorological factors, PM2.5 was positively correlated with air temperature and relative humidity and negatively correlated with wind speed. When the wind speed is low and the humidity is high, the intensity of the temperature inversion increases, which is unfavorable to the diffusion of PM2.5 and other pollutants in the vertical and horizontal directions and aggravates the accumulation of particulate matter pollution, keeping its mass concentration high. When the temperature and relative humidity are both at high levels in autumn and winter, fog is easily produced; the suspended fog droplets readily adsorb and capture gaseous and particulate pollutants, which favors the formation of secondary particles. The hourly concentration of NO2 had a good positive correlation with the hourly concentration of PM2.5, indicating that traffic pollution emissions contribute substantially to PM2.5: traffic exhaust is transformed into secondary particles after a period of chemical reaction, which affects the PM2.5 concentration level. These results can also be clearly seen in the heat map [46].
When selecting the AQI as the predicted target, it is easy to see a strong relationship with SO2, O3, PM2.5, and PM10. As an indicator of air pollution, the AQI is closely related to the content of each pollutant. Moreover, wind speed, temperature, and humidity usually affect the diffusion rate of atmospheric pollutants and are strongly correlated with them, leading to strong correlations between the AQI and these factors. It can be seen that various variables were selected as elite variables several times [47].
In addition, the weights derived from the datasets at different locations differed when predicting the same target. In the industrial area (Location C), nitrogen oxide emissions were much larger than in the residential area (Location A), and nitrogen oxides received a greater weighting there, reflecting their greater impact on pollutants. The main sources of pollutants in residential areas are domestic stoves and winter heating, which mostly consume coal and produce carbon monoxide and sulfides, and their corresponding weights were accordingly higher.
From the above analysis, it can be seen that, in the prediction of different targets, our model successfully identified the relationships between external variables and the predicted target pollution, which is consistent with the knowledge of the relevant research, thus proving that the evolutionary direction of the final population of the genetic algorithm is correct. By exploring different external variables, we can analyze and identify the sources of pollutants and help the government develop effective pollution mitigation policies.
5.7. Ablation Experiment
To verify the effects of the different genetic algorithm operators, we compared the base model with variants combined with genetic algorithm optimization: the autoformer (base model), the GA-autoformer (our proposed model), the autoformer using only the unmodified genetic algorithm (denoted autoformer-GA(u)), and the models using only the Elite Variable Operator (autoformer-GA(elite)) or only the Archive Storage Operator (autoformer-GA(archive)). The results are presented in Table 4.
From Table 4, we can see that the autoformer using only the traditional genetic algorithm (autoformer-GA(u)) showed some improvement over the base autoformer. However, the improvement was not significant, and the effect was not as good as that of our proposed model (GA-autoformer). Furthermore, the two models using only one operator each (autoformer-GA(elite) and autoformer-GA(archive)) were also not as effective as the proposed model.
For the Elite Variable Operator, as can be seen in the heat map in Figure 12, the weights of variables selected as elite variables many times were larger than those of the other variables, indicating that the operator can find the variables with a greater impact on prediction accuracy and give them special treatment, making the search more refined during crossover and mutation.
For the Archive Storage Operator, it can be seen in Figure 13 that the model using this operator not only improved the overall prediction but also greatly reduced the variance. This is because the fluctuations caused by the randomly initialized weights of the neural network model could otherwise affect the accuracy of the air pollution prediction. Therefore, we stored the potentially good particles, evaluated them several times to filter out the individuals with better fitness, and put them back into the population.
6. Discussion
For the air pollution time series prediction problem, Johansson, C. et al. [11] used various machine learning methods (e.g., Random Forest (RF), Extreme Gradient Boosting (XGB), and Long Short-Term Memory (LSTM)) for multiple pollutants (PM10, NOx, and O3) at multiple locations with multi-horizon predictions. It can be seen that, in time series prediction, different pollutants and different locations involve different environmental conditions and air pollution change characteristics, which require certain a priori knowledge. Prediction models combining metaheuristics with neural networks are also evolving, such as the ANN-EMPA [29], which combines crossover and mutation operators with an ANN to enhance the prediction model. However, most of these models use metaheuristics to optimize the hyperparameters and the structure of the neural network, and the results obtained from the optimization are not interpretable: they do not explain why the optimized hyperparameters improve the pollution prediction model.
To address the above problems, we proposed a Hybrid Autoformer Network with a Genetic Algorithm model to predict the temporal variation of air pollution, as well as to explore the relationship between external variables and the target pollution. Unlike the above models, our model combines a genetic algorithm with the autoformer: the autoformer provides long-sequence prediction capability, and the genetic algorithm explores the influence of the external variables on the predicted target, which makes our model interpretable.
From Table 3 and Figure 8, we can find that the prediction accuracy of the GA-autoformer was higher than that of the other baseline models. As shown in Figure 9, the standard deviation of the GA-autoformer was also lower than that of the other models, which indicates that our Archive Storage Operator was able to preserve excellent particles across iterations. Meanwhile, Figure 12 shows the effects exhibited by different external variables for different prediction targets at different locations, such as industrial, residential, and suburban areas, which demonstrates the robustness and interpretability of our proposed model.