1. Introduction
In recent decades, air pollution has been a serious environmental issue, and several developed and developing countries have suffered from heavy air pollution [
1]. The identification of atypical pollution in the quantified concentrations of these compounds has been a significant problem for health [
2]. Compared to other pollution, air pollution has a direct impact on people’s health, and the major causes of air pollution are natural disasters, residential heating, exhaust from industries and factories, and the burning of fossil fuels [
3,
4]. Therefore, predicting the mass concentrations of air pollution is essential and plays a crucial role in atmospheric management decisions [
5]. Additionally, existing epidemiological research studies state that PM
2.5 causes negative human health effects, like respiratory diseases and cardiovascular diseases [
6,
7]. Therefore, effective forecasting of air pollutant concentrations strengthens the prevention of air pollution, which helps in achieving efficient environmental management [
8]. In addition, it has great significance for government decision making and people’s health [
9]. Poor air quality prediction (AQP) not only affects human health but also has a key impact on societal and economic controls [
10].
Recently, several research studies have been carried out on AQP, but the majority of the existing studies face difficulty in predicting future air quality for the monitoring stations [
11]. In this scenario, AQP is influenced by several factors, like dust, coal burning, industrial emissions, vehicle exhaust, spatial distribution, and time patterns [
12]. According to atmospheric science research, the major factors for the dissipation and accumulation of atmospheric pollutants are weather conditions, regional transport, and local emissions. These factors are categorized into indirect and direct factors based on their impact on air quality [
13]. Compared to traditional machine learning models, deep learning models have gained more attention among researchers, especially in time series analysis. Deep learning models are effective in exploring longer-term dependencies and implicit features from time series data for effective AQP. Yet, several deep learning models have problems like overfitting and the vanishing gradient problem in time series forecasting. Therefore, a new optimization-based regression model is proposed in this research in order to overcome the above-stated problems and to achieve better AQP.
The main contributions of our paper are given as follows:
Initially, this study implemented a Min-Max normalization technique that efficiently preserves the relationship between the data values with low standard deviation. In time series forecasting, the normalized data are used to forecast the next hour’s concentration, and the effect of outliers is reduced by using sliding windows of different sizes.
Then, we performed a correlation analysis to select the optimal meteorological variables (wind speed, temperature, dew point, wind direction, and historical PM2.5) from the collected datasets.
After that, an RSO algorithm was developed for selecting discriminative features from the selected meteorological variables. This step greatly reduces the model’s complexity and computational time. The RSO algorithm integrates the BSO algorithm with reinforcement learning, which overcomes optimization problems such as a poor convergence rate and local optima.
Finally, the Bi-GRU model was used for effective forecasting of air quality, and its efficacy was tested using six performance measures, including the mean squared error (MSE), symmetric mean absolute percentage error (SMAPE), mean absolute percentage error (MAPE), and coefficient of determination ($R^2$).
The manuscript is organized as follows: The existing papers on the topic of AQP are reviewed in
Section 2. The methodology details, numerical investigation, and conclusion of this research are mentioned in
Section 3,
Section 4 and
Section 5, respectively.
2. Literature Review
For predicting the air quality in Tripoli [
14], Esager and Ünlü proposed an evaluation of deep learning models for hourly PM
2.5 surface mass concentrations. Since the analyzed data are a time series, the Box–Jenkins methodology is generally used to model such a dataset. This study gave particular attention to the LSTM and GRU with CNN types of recurrent neural networks. The result analysis demonstrates the strong forecasting power of the used algorithms. This type of model’s key benefit is that it does not call for the same exact assumptions that other traditional models do. These algorithms were also quite effective in simulating the data’s nonlinear behavior.
Du et al. [
15] implemented a hybrid deep learning architecture for effective air pollution forecasting. The implemented hybrid architecture, Bi-LSTM and a convolutional neural network (CNN), learns multivariate, temporal, and spatial correlation features from the collected time series data for effective forecasting of air quality. The experiments conducted on the two real-world datasets demonstrated that the implemented hybrid architecture was effective in dealing with PM
2.5 air pollution prediction with better accuracy. The integration of deep learning models increased the time complexity and computational cost because it required an enormous amount of data to obtain satisfactory results.
Usually, the dynamics of air pollution are influenced by diverse factors, like rainfall, snowfall, wind speed, wind direction, humidity, and temperature. These factors increase the difficulty in understanding the changes that occur in the air pollutant concentration. Tao et al. [
16] integrated the CNN and Bi-GRU models for effective forecasting of air pollution. The experiments conducted on the UCI machine-learning repository Beijing PM
2.5 dataset demonstrated the effectiveness of the hybrid deep learning models, as they achieved better results than traditional models. As mentioned earlier, the integration of two deep learning models leads to high time complexity.
Ma et al. [
17] used a Bi-LSTM network with transfer learning for forecasting air pollution in Anhui, China. The numerical results showed that the Bi-LSTM network with transfer learning achieved a 35% lower error rate than the existing models on a real-time dataset. The developed Bi-LSTM network with transfer learning was not scalable and was time-consuming while performing experiments on a real-time dataset.
Chang et al. [
18] implemented a new aggregated LSTM network for effective air pollution forecasting. The aggregated LSTM network combines information about external pollution sources, stations near industrial areas, and the stations with local air pollution monitoring systems. Here, three LSTM models were aggregated in order to improve prediction accuracy, but it was a computationally complex process.
Castelli et al. [
19] employed a machine learning technique called support vector regression (SVR) for forecasting air quality index (AQI) and pollutant levels. After the acquisition of time series data, data preprocessing (data transformation, outlier removal, and imputation of missing data) and feature engineering were accomplished. Finally, the air pollution prediction was carried out by utilizing the SVR technique. However, the SVR will underperform when the number of feature vectors for every data point exceeds the number of training samples.
Xayasouk et al. [
20] integrated a deep autoencoder and an LSTM network for air pollution prediction. In addition to this, Wen et al. [
21] combined a CNN and an LSTM network for effective forecasting of air pollution in China. Wang et al. [
22] implemented a two-layer air pollution prediction model based on a GRU and an LSTM network. The numerical outcomes confirmed that the presented hybrid models obtained higher prediction performance than existing ones at different regional scales. The hybrid deep learning model has the ability to handle complex and large data, but it was computationally expensive.
Air pollution is becoming a serious problem due to the rapid growth of industrialization. In the present scenario, predicting air pollution is crucial in determining prevention measures for avoiding disasters. Zhang et al. [
23] utilized a light gradient boosting technique for selecting discriminative features from real-time datasets. Further, the selected 500 feature vectors were given to the eXtreme Gradient Boosting (XGBoost) technique for air pollution forecasting.
Wang et al. [
24] initially adopted the Hampel identifier and variational mode decomposition (VMD) technique for detecting and eliminating outliers from the acquired datasets. Then, the optimal feature vectors were selected from the denoised data by employing a sine-cosine algorithm, and finally, an extreme learning machine (ELM) was implemented for accurate forecasting of air pollution. Generally, standard machine learning techniques, such as XGBoost and ELM, are sensitive to outliers and prone to overfitting when analyzing complex time series data.
The PM of the Turkish city Ankara was modeled using a hybrid deep learning methodology, which was analyzed by Akbal and Ünlü [
25]. According to the WHO’s criteria, PM levels were categorized to provide a prediction problem. Further, by using the ensemble machine learning methodology of random forest regression (RFR), extra tree regression (ETR), and multiple linear regression (MLR), the impact of various contaminants and meteorological variables on the prediction of PM has been examined. The findings indicated that other substances, the Earth’s surface temperature, wind speed, and PM’s own lagged values were the most crucial predictor variables for PM.
Li et al. [
26] employed the Hampel filter and least square support vector machine (SVM) regression for AQI forecasting. Maleki et al. [
27] implemented an artificial neural network (ANN) for air pollution forecasting. However, the ANN was a simpler deep learning model and required more training data to obtain satisfactory results. Mao et al. [
28] implemented a temporal sliding LSTM network for effective prediction of air quality. The presented temporal sliding LSTM network achieved higher prediction results with strong atmospheric decision making.
Zhang et al. [
29] integrated empirical mode decomposition (EMD) and a Bi-LSTM network for effective forecasting of AQI. Firstly, the EMD technique was employed for decomposing PM
2.5 time series data and extracting the amplitude and frequency features. Secondly, the obtained features were given to the Bi-LSTM network for AQI forecasting. The experiments conducted on the PM
2.5 and Beijing hourly datasets demonstrated the efficacy of the developed EMD-Bi-LSTM model by means of error rate. In the time series analysis, the Bi-LSTM network was slower and consumed more time for model training.
Zeinalnezhad et al. [
30] integrated an adaptive neuro-fuzzy inference system (ANFIS) and semi-experimental nonlinear regression for predicting the concentration of important pollutants. However, the standard ANFIS models include a few problems, such as the curse of dimensionality, high computational expense, and loss of data interpretability.
Aarthi et al. [
31] initially used a Min-Max normalization technique for filling in the missing attributes in the collected dataset, and then, the optimal attributes were selected from the preprocessed data by implementing a balanced spider monkey optimization (BSMO) algorithm. Based on the balancing factor, the BSMO algorithm selects the relevant attributes, which are given to the Bi-LSTM network for AQP. The developed BSMO algorithm efficiently finds the optimal solution but has a poor convergence rate. To address the aforementioned concerns and to achieve precise AQP, an effective optimization-based regression model (RSO and Bi-GRU) is introduced in this paper.
Several models have been examined to improve air quality prediction, which is essential for preventing or reducing the consequences of pollution. Air quality forecasts prompt people to be careful, and they may even motivate individuals to carry out their daily activities in less polluted areas. However, it is still challenging to analyze the data and provide improved outcomes. Air pollution forecasting is among the fields in which deep learning technologies have had a substantial and growing impact. Existing studies use complex and advanced methods to accurately anticipate the air quality. External factors, such as weather, geographic features, and temporal characteristics, must be taken into account. For pollution reduction, human health monitoring, and sustainability, an accurate air quality prediction model is necessary. Due to overfitting in the prediction model and the local optima trap in feature selection, the current state-of-the-art air quality forecast models are inefficient.
3. Methodology
Eight pollutants, namely particulate matter (PM) 10, PM2.5, ozone (O3), sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), lead (Pb), and ammonia (NH3), act as major parameters in deriving the AQI of an area. While using the annual data, this research uses 24 lags in time series analysis. In this time series analysis, the proposed framework includes five phases:
- (1)
Dataset description—Beijing PM2.5 dataset and a real-time dataset;
- (2)
Data normalization—Min-Max normalization technique;
- (3)
Correlation analysis;
- (4)
Feature optimization—RSO algorithm;
- (5)
Prediction—Bi-GRU model.
The diagram of the developed regression model is shown in
Figure 1.
3.1. Dataset Description
The introduced optimization-based regression model’s (RSO and Bi-GRU) performance was validated on a Beijing PM
2.5 dataset and a newer real-time dataset. The Beijing PM
2.5 dataset comprised PM
2.5 meteorological data, which were recorded from 1 January 2010 to 31 December 2014 [
16]. In this dataset, 70% of the data are used for training, and the remaining 30% are used for testing. Under this 70:30 split, the data recorded until 2 July 2013 were used for training, and the data from then until 31 December 2014 were used for testing. This dataset has eight characteristics: wind speed, rainfall, wind direction, snowfall, dew point, PM
2.5 concentration, air pressure, and temperature. Among 43,800 rows, 30,000 rows were utilized as a training set, 8000 rows were utilized as a validation set, and the remaining 5800 rows were utilized as a testing set. In this dataset, the wind direction had four features (southwest, northwest, southeast, and northeast), which were encoded as float data (−10, 0, 10, and 20) [
32].
Additionally, a newer real-time dataset was acquired from the central pollution control board for four Indian cities: Cochin, Hyderabad, Chennai, and Bangalore. In this dataset, the pollutants were monitored twice a week over a 24 h period, which provided 104 observations annually [
31].
3.2. Data Normalization
The acquired time series data were normalized by implementing a Min-Max normalization technique. This helps in removing the units in the acquired data or the impact of differing scales [
33,
34]. The Min-Max normalization technique is used for scaling the data values within a fixed range (zero to one). The technique subtracts the minimum value from each data point $x$ and then divides by the range of the data. The formula of the Min-Max normalization technique, which gives the normalized value $x'$, is presented in Equation (1).
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (1)$$
In this scenario, the normalization parameters (the minimum and maximum values) are computed from the training set only, because the validation set and the testing set are treated as unseen data. The actual PM
2.5 concentration in the test set is shown in
Figure 2.
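The normalization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `fit_min_max` and `apply_min_max` and the toy values are hypothetical, and the statistics are fitted on the training rows only, as the text specifies.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and range, computed from training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with previously fitted statistics: (x - min) / (max - min)."""
    return (data - col_min) / col_range

# Hypothetical toy values: columns are PM2.5 concentration and temperature.
train = np.array([[10.0, -5.0], [50.0, 5.0], [90.0, 15.0]])
test = np.array([[30.0, 0.0], [110.0, 20.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
```

Because the minimum and maximum come from the training set, test values outside the training range map outside the interval from zero to one (here the second test row maps to 1.25).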
3.3. Correlation Analysis
The correlation between all potential pairs of values in a table is shown in the matrix. It is an effective tool for compiling a sizable dataset and for locating and displaying data patterns. A correlation matrix simplifies the process of selecting different assets by tabulating their correlation with one another. It is vital to identify the correlations between PM concentrations and influencing factors for developing a good prediction model. It guarantees that the proposed regression model utilizes the efficient features for AQP. PM
2.5 is affected by several factors, but not all the factors are important for effective AQP. On the other hand, the irrelevant/inactive factors affect the proposed model’s performance by increasing its time complexity. Therefore, it is important to compute the correlation coefficients (CCs) for every factor, which helps in selecting the optimal features for effective forecasting of air pollution. Let us consider characteristic time series data as $x = (x_1, \ldots, x_n)$ and other data as $y = (y_1, \ldots, y_n)$. The CC between the factors $x$ and $y$ is computed as described in Equation (2).
$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \quad (2)$$
where $r_{xy} > 0$ indicates a positive correlation, $r_{xy} < 0$ represents a negative correlation, $\bar{x}$ and $\bar{y}$ are the sample means, and $n$ represents the number of samples. The correlation is stronger and the distance between $x$ and $y$ is smaller if the absolute value of $r_{xy}$ is closer to 1.
The CCs between PM concentrations and every feature were calculated for the Beijing PM
2.5 dataset and a real-time dataset.
Table 1 shows that the snowfall, wind direction, and dew point have positive correlations with PM
2.5 concentration, whereas the wind speed, temperature, rainfall, and air pressure have negative correlations with PM
2.5 concentration.
Table 1 clearly shows that all variables are weakly correlated with each other, and it shows that there are no duplicate variables. The obtained meteorological variables are directly utilized as the input of the proposed optimization-based regression model. As specified in
Table 1, the CCs of rainfall, snowfall, and air pressure are small, and the unrelated input increases the difficulty in learning useful features and the model’s complexity. Therefore, the wind direction, temperature, dew point, wind speed, and historical PM
2.5 were chosen as the input for the proposed optimization-based regression model.
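The correlation-based variable selection can be sketched as below. This is an illustrative sketch only: it assumes the CC is the Pearson coefficient, the function names are hypothetical, the toy data are invented, and the cut-off of 0.3 is an assumed value, since the text says only that variables with small CCs were dropped.

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equally long series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

def select_variables(features, target, threshold):
    """Keep the variables whose absolute CC with the target reaches the threshold."""
    ccs = {name: pearson_cc(values, target) for name, values in features.items()}
    selected = [name for name, cc in ccs.items() if abs(cc) >= threshold]
    return ccs, selected

# Invented toy series, not values from either dataset.
pm25 = np.array([80.0, 60.0, 40.0, 20.0, 10.0])
features = {
    "wind_speed": np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    "rainfall": np.array([0.0, 1.0, 0.0, 1.0, 0.0]),
}

# threshold=0.3 is an assumption made for this example.
ccs, selected = select_variables(features, pm25, threshold=0.3)
```

In this toy case, wind speed is strongly negatively correlated with the PM2.5 series and is kept, while rainfall has a CC near zero and is dropped.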
3.4. Feature Optimization
From the selected variables, namely wind direction, temperature, dew point, wind speed, and historical PM2.5, the important features were chosen by implementing the RSO algorithm, which is a combination of the bee swarm optimization (BSO) algorithm and reinforcement learning.
3.4.1. BSO Algorithm
The BSO algorithm [
35] is one of the effective metaheuristic feature selection algorithms; it mimics the hierarchical task management, adaptation, and self-organization behavior of natural bees. The BSO is an iterative algorithm that resolves optimization problems by imitating the probabilistic decision-making mechanism and foraging behavior of bees for exploiting and selecting optimal food sources. First, the heuristic is utilized for generating the reference solution, which is considered as the reference for determining other solutions in the search space. In the BSO algorithm, the search space is defined as the distance that is inversely proportional to the flip parameter that helps in finding the convergence in the search process. In the local search, a bee agent is assigned to each of these solutions. Every bee’s search result is saved to the dance table when it is completed. One of the solutions is picked to serve as the new reference solution in the following iteration. In order to avoid cycles, the reference solutions are kept in the dance table. From the dance table, the fittest and best solutions are passed to the congeners, which are further utilized for selecting the next reference solution.
In order to avoid congestion problems, the selected reference solutions are placed in a table “Tab”. Then, a parameter (Chance-Max) is defined for avoiding the local optima problem. In the BSO algorithm, maximum chances are given to a bee agent in order to explore a reference solution. In the next step, intensification is performed if a better reference solution is found within the Chance-Max range; otherwise, diversification is carried out. The search stops after identifying the global optimal best solution or reaching the maximum number of iterations.
3.4.2. Reinforcement Learning
Reinforcement learning [
36] tackles the issue of autonomous entities needing to learn control techniques with little or no data. It strengthens or reinforces the behavior. Since positive reinforcement does not require taking something away or imposing a negative consequence, people frequently find it simpler to accept than other teaching techniques. Additionally, it is far simpler to reward behaviors than to penalize them, which makes reinforcement generally a more effective tool. In machine learning, a widely used form of reinforcement learning is Q learning, in which a software agent learns how to achieve a specific task. In this scenario, $S$ represents the set of states, and $A$ specifies the set of actions. For each action $a_t \in A$ performed in a state $s_t \in S$, a reward $r_t$ is received. The algorithm learns a mapping $\pi: S \rightarrow A$ that maximizes the reward function, which is mathematically specified in Equation (3).
$$V^{\pi}(s_t) = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \quad (3)$$
where the discount parameter is defined as $\gamma$, which ranges between 0 and 1. Generally, the search agents tend towards the longer-term rewards when $\gamma$ is close to 1. On the other hand, the search agents tend towards the immediate or shorter-term rewards when $\gamma$ is close to 0. In reinforcement learning, the temporal difference is an extensively utilized method that integrates the features of the Markov decision process and the Monte Carlo algorithm. In this scenario, the temporal difference method is used in the recursive Q learning together with the immediate reward $r_t$, and it is mathematically described in Equation (4).
$$Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s', a') \quad (4)$$
where $s'$ indicates the resulting state and $a'$ represents another action. Therefore, after modifying Equation (4), we obtain a new formula, expressed as Equation (5).
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s', a') - Q(s_t, a_t) \right] \quad (5)$$
where $0 < \alpha \le 1$ is the learning rate. The pseudocode (Algorithm 1) of the reinforcement learning process is as follows:
Algorithm 1 Reinforcement Learning |
1. | Initialize table elements $Q(s, a)$ |
2. | Initialize actions $A$ |
3. | Initialize states $S$ |
4. | For each iteration $t$ do |
5. | Present state $s_t$ |
6. | Present action $a_t$ |
7. | Execute $a_t$ over $s_t$ |
8. | Immediate reward $r_t$ and the new state $s'$ are obtained from the environment |
9. | |
10. | |
11. | Update $Q(s_t, a_t)$ |
12. | End for |
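A runnable sketch of tabular Q learning may make the loop in Algorithm 1 more concrete. Everything specific here is an assumption made for illustration: the five-state chain environment, the epsilon-greedy exploration, and the values of the learning rate, discount parameter, and exploration rate are not taken from the paper.

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL = 5, 2, 4  # toy chain: action 0 = left, 1 = right

def step(state, action):
    """Toy environment: move along the chain; reward 1 on reaching GOAL."""
    nxt = max(state - 1, 0) if action == 0 else min(state + 1, GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))          # initialize table elements
    for _ in range(episodes):
        state = 0
        for _ in range(100):                     # cap on steps per episode
            if rng.random() < epsilon:           # explore
                action = int(rng.integers(N_ACTIONS))
            else:                                # exploit, breaking ties at random
                best = np.flatnonzero(q[state] == q[state].max())
                action = int(rng.choice(best))
            nxt, reward, done = step(state, action)
            # Temporal-difference target: immediate reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * q[nxt].max())
            q[state, action] += alpha * (target - q[state, action])
            state = nxt
            if done:
                break
    return q

q_table = q_learning()
greedy_policy = q_table.argmax(axis=1)
```

After training, the greedy policy moves right in every non-terminal state, and the learned values decay with distance from the goal by a factor equal to the discount parameter.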
3.4.3. RSO Algorithm
As specified earlier, the RSO algorithm [
37] is the integration of the BSO algorithm and reinforcement learning, and it improves the learning process by making the agents learn from prior experiences. The main issue in the BSO algorithm is the absence of memory or intelligence in the local search, which results in local optima problems. This makes the BSO algorithm ineffective compared to other optimization algorithms. In order to address the aforementioned issue, the local search algorithm is replaced by Q learning.
In the context of feature selection, the deletion or inclusion of a feature in the current feature subset is treated as an action, the reward reflects the selection of optimal features, and the improvement of AQP is considered as a secondary constraint. Let us assume $a_t$ is an action performed in the $t$-th iteration. The reward $r_t$ obtained in the state $s_t$ leverages the prediction accuracy and the number of selected features in the feature subsets (selected variables). The reward is mathematically specified in Equation (6).
In this scenario, the RSO algorithm selects 4522 features from the selected variables: wind speed, wind direction, temperature, dew point, and historical PM2.5, which are given to the Bi-GRU model for effective forecasting of air quality. The parameters considered in the RSO algorithm are mentioned as follows: the maximum number of iterations is equal to 100, Chance-Max is equal to 5, flip is equal to 5, the number of bees is equal to 100, the learning rate is equal to 0.001, is set to 0.2, and is set to 0.1.
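The Q learning local search over feature subsets can be sketched as follows. This is not the RSO algorithm itself: the bee swarm part (reference solutions, dance table, Chance-Max) is omitted, the reward form (fit quality minus a small cost per selected feature) is an assumption standing in for Equation (6), the least-squares fit is a stand-in for the prediction model, and all names, data, and hyper-parameter values are hypothetical.

```python
import numpy as np

def fit_score(X, y, mask):
    """R^2 of a least-squares fit on the selected columns (stand-in for accuracy)."""
    if not any(mask):
        return 0.0
    cols = [i for i, keep in enumerate(mask) if keep]
    A = np.column_stack([X[:, cols], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))

def reward(X, y, mask, penalty=0.01):
    # ASSUMED reward form: fit quality minus a small cost per selected feature.
    return fit_score(X, y, mask) - penalty * sum(mask)

def q_feature_search(X, y, steps=300, alpha=0.1, gamma=0.2, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    q = {}                                   # Q table keyed by (mask, action)
    state = tuple([1] * n)                   # start with every feature selected
    best_mask, best_reward = state, reward(X, y, state)
    for _ in range(steps):
        values = [q.get((state, a), 0.0) for a in range(n)]
        if rng.random() < epsilon:
            action = int(rng.integers(n))    # explore
        else:
            action = int(np.argmax(values))  # exploit
        # An action flips one feature: include it if absent, delete it if present.
        nxt = tuple(b ^ 1 if i == action else b for i, b in enumerate(state))
        r = reward(X, y, nxt)
        future = max(q.get((nxt, a), 0.0) for a in range(n))
        q[(state, action)] = values[action] + alpha * (r + gamma * future - values[action])
        if r > best_reward:
            best_mask, best_reward = nxt, r
        state = nxt
    return best_mask, best_reward

# Synthetic data: only columns 0 and 2 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
best_mask, best_reward = q_feature_search(X, y)
```

The search keeps the best subset seen so far, so the returned reward is never worse than that of the starting subset with all features selected.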
3.5. Air Quality Prediction
GRU has fewer gates than LSTM, which makes it less complicated. Sequential data’s long-term dependencies can be successfully maintained using GRUs. They can also deal with the so-called short-term memory problem. The selected 4522 features were given to the Bi-GRU model for effective forecasting of air pollution [
38]. The GRU model has reset and update gates for effective AQP; these gates reduce computational loss and gradient dispersion and enable longer-term memory. The update gate $z_t$ replaces the forget and input gates of the LSTM network; it determines the retention degree of the prior information in the present forecasting, and it is mathematically presented in Equation (7).
$$z_t = \sigma\left(W_z [h_{t-1}, x_t] + b_z\right) \quad (7)$$
where $h_{t-1}$ represents the hidden state at the prior time step $t-1$; $\sigma$ denotes the sigmoid activation function, which ranges between 0 and 1; $x_t$ indicates the input matrix at time step $t$; and $W_z$ and $b_z$ denote the weight matrix and bias matrix of the update gate $z_t$ [39]. On the other hand, the reset gate $r_t$ controls the historical time series data, and it is mathematically specified in Equation (8).
$$r_t = \sigma\left(W_r [h_{t-1}, x_t] + b_r\right) \quad (8)$$
where $W_r$ and $b_r$ denote the weight matrix and bias matrix of the reset gate $r_t$ [40]. Further, the candidate hidden state $\tilde{h}_t$ is mathematically denoted in Equation (9).
$$\tilde{h}_t = \tanh\left(W_h [r_t \odot h_{t-1}, x_t] + b_h\right) \quad (9)$$
where $\odot$ indicates the element-wise (dot) multiplication operation, $W_h$ and $b_h$ represent the weight matrix and bias matrix of the memory cell state, and $\tanh$ denotes the hyperbolic tangent activation function. The linear interpolation between $h_{t-1}$ and $\tilde{h}_t$ results in output $h_t$, which is mathematically specified in Equation (10).
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \quad (10)$$
The flow diagram of the Bi-GRU model is shown in Figure 3.
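A single GRU step with the update gate, reset gate, candidate state, and interpolation described above can be sketched in NumPy. This is an illustrative sketch, not the Keras layer used in the experiments: the parameter names, the random initialization, and the convention that the update gate weights the previous hidden state (so it sets the retention degree) are assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, params):
    """One GRU step: gates, candidate state, then linear interpolation."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(params["W_z"] @ concat + params["b_z"])    # update gate
    r_t = sigmoid(params["W_r"] @ concat + params["b_r"])    # reset gate
    gated = np.concatenate([r_t * h_prev, x_t])
    h_cand = np.tanh(params["W_h"] @ gated + params["b_h"])  # candidate hidden state
    # Assumed convention: z_t is the share of the previous state that is retained.
    return z_t * h_prev + (1.0 - z_t) * h_cand

def init_params(n_in, n_hidden, seed=0):
    """Hypothetical random initialization, for illustration only."""
    rng = np.random.default_rng(seed)
    shape = (n_hidden, n_hidden + n_in)
    return {
        "W_z": rng.normal(scale=0.1, size=shape), "b_z": np.zeros(n_hidden),
        "W_r": rng.normal(scale=0.1, size=shape), "b_r": np.zeros(n_hidden),
        "W_h": rng.normal(scale=0.1, size=shape), "b_h": np.zeros(n_hidden),
    }

params = init_params(n_in=5, n_hidden=4)
h = np.zeros(4)
# Toy input: 8 time steps of 5 input variables.
sequence = np.random.default_rng(1).normal(size=(8, 5))
for x_t in sequence:
    h = gru_step(x_t, h, params)
```

Because each new state is a convex combination of the previous state and a tanh output, every component of the hidden state stays strictly between -1 and 1 when the initial state is zero.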
Generally, an effective prediction model is required for AQP for extracting complex variances and implicit features from sequence data. The conventional GRU model only extracts feature information from the forward direction, and it ignores the backward time series data. Therefore, the Bi-GRU model was used in this study, which proved to be effective in mining the knowledge between the meteorological variables from both backward and forward directions. The Bi-GRU model, composed of backward and forward GRUs, is shown in
Figure 3. The backward GRU obtains future information from the input data, and the forward GRU captures past information from the input data. The Bi-GRU model output $h_t$ is mathematically denoted in Equation (11).
$$h_t = f\left(\overrightarrow{h}_t, \overleftarrow{h}_t\right) \quad (11)$$
where $f$ indicates the function that combines the outputs of the two directions (summation function, average function, multiplication function, and so on), and $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ represent the hidden states of the forward and backward GRUs, respectively. The assumed parameters of the Bi-GRU model are represented as follows: the learning rate is equal to 0.001, the optimizer is set to Adam, the loss function is the MSE loss, the number of epochs is set to 100, the batch size is set to 50, the dropout rate is equal to 0.5, the number of neurons (NoN) is set to 80, and look-back is set to 8. The numerical analysis of the proposed regression model (RSO and Bi-GRU) is discussed in
Section 4.
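The bidirectional arrangement can be sketched by running one GRU over the sequence and a second GRU over the reversed sequence, then combining the two hidden states per time step. The class `TinyGRU`, the random weights, and the choice of combining functions offered here (concatenation, sum, average) are assumptions made for illustration; the text does not state which combining function was used.

```python
import numpy as np

class TinyGRU:
    """Minimal GRU with randomly initialized weights (hypothetical, for illustration)."""

    def __init__(self, n_in, n_hidden, seed):
        rng = np.random.default_rng(seed)
        self.n_hidden = n_hidden
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(scale=0.1, size=(3, n_hidden, n_hidden + n_in))
        self.b = np.zeros((3, n_hidden))

    def run(self, sequence):
        """Return the hidden state after every time step of the sequence."""
        h = np.zeros(self.n_hidden)
        states = []
        for x_t in sequence:
            hx = np.concatenate([h, x_t])
            z = 1.0 / (1.0 + np.exp(-(self.W[0] @ hx + self.b[0])))
            r = 1.0 / (1.0 + np.exp(-(self.W[1] @ hx + self.b[1])))
            cand = np.tanh(self.W[2] @ np.concatenate([r * h, x_t]) + self.b[2])
            h = z * h + (1.0 - z) * cand
            states.append(h)
        return np.stack(states)

def bi_gru(sequence, forward, backward, combine="concat"):
    h_fwd = forward.run(sequence)
    # Process the reversed sequence, then flip back so rows align with time steps.
    h_bwd = backward.run(sequence[::-1])[::-1]
    if combine == "sum":
        return h_fwd + h_bwd
    if combine == "average":
        return (h_fwd + h_bwd) / 2.0
    return np.concatenate([h_fwd, h_bwd], axis=1)

sequence = np.random.default_rng(0).normal(size=(8, 5))  # 8 steps, 5 variables
forward, backward = TinyGRU(5, 4, seed=1), TinyGRU(5, 4, seed=2)
out_concat = bi_gru(sequence, forward, backward)
out_sum = bi_gru(sequence, forward, backward, combine="sum")
```

At the final time step the backward state has seen only the last input, whereas the forward state summarizes the whole window, which is why the two directions carry complementary information.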
4. Numerical Analysis
The proposed regression model (RSO and Bi-GRU) was simulated using a custom-built Python 3.7 software tool and tested on a computer with an NVIDIA GeForce RTX 3080, 128 GB of random-access memory (RAM), a Linux operating system, and a 12th-generation Intel Core i5 processor. The Beijing PM2.5 dataset and a real-time dataset were utilized for evaluating the effectiveness of the proposed regression model (RSO and Bi-GRU), and the proposed model was compared with six existing regression models. For this article, all machine learning and deep learning models were trained using the scikit-learn, TensorFlow, and Keras libraries. All the regression models were trained with a learning rate of 0.001, the Adam optimizer, and an MSE loss function.
4.1. Performance Measures
The proposed model’s (RSO and Bi-GRU) efficacy was evaluated using different loss functions, such as
. The
performance measure effectively reflected the actual situation of the forecasting error. In addition, the other performance measures, such as
and
, effectively evaluate the degree of data change and measure the prediction quality of the proposed model. On the other hand, the
MSE is determined as the mean squared difference between the estimated and actual values. The mathematical formulas of the performance measures
,
,
,
,
, and
are stated in Equations (12)–(17).
where $n$ represents the number of samples, $\hat{y}_i$ indicates the predicted time series value, and $y_i$ denotes the measured time series value.
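Four of the performance measures named in the paper (MSE, MAPE, SMAPE, and the coefficient of determination) can be computed as in the sketch below. The definitions are the commonly used ones and are assumed, not confirmed, to match Equations (12)–(17); SMAPE in particular has several variants, and the one here divides by the mean of the absolute actual and predicted values. The toy numbers are invented.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared difference between measured and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error (undefined if any measured value is zero)."""
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

def smape(y_true, y_pred):
    """Symmetric MAPE, using the mean of |measured| and |predicted| as denominator."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / denom))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Invented toy values.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

scores = {
    "MSE": mse(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "SMAPE": smape(y_true, y_pred),
    "R2": r2(y_true, y_pred),
}
```

For these toy values the MSE is 375, the MAPE is 7.5%, and the coefficient of determination is 0.97.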
4.2. Experimental Setup with Ablation Analysis
In this scenario, six comparative regression models, namely SVR, random forest, recurrent neural network (RNN), LSTM, extra tree regression (ETR), and multiple linear regression (MLR), were developed for investigating the efficacy of the proposed model (RSO and Bi-GRU). Here, all regression models were trained for 100 epochs with a batch size of 50. Additionally, a dropout layer with a probability of 0.5 was applied between the layers in order to avoid the overfitting problem. The weight matrices were stored when the loss value of the previous epoch was higher compared to the present epoch. All regression models utilized an early stopping condition that stops the model training when the validation loss does not change within 10 training epochs. The Adam optimizer was utilized in the regression models because it iteratively updates its learning rate and effectively handles sparse gradients in noisy problems. Further, the Adam optimizer addresses two major concerns, namely local minima and convergence speed. Each data point in the testing set was evaluated using the performance measures after the trained models were obtained.
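The checkpointing and early stopping rule can be sketched in plain Python over a list of per-epoch validation losses. This is a simplified sketch: it assumes the common best-so-far convention (a checkpoint is taken whenever the loss improves on the best value seen, and training stops after 10 epochs without improvement), which may differ in detail from the rule used in the experiments. The function name and the loss values are hypothetical.

```python
def train_with_early_stopping(val_losses, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # In a real training loop, the model weights would be saved here.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Invented losses: improvement for three epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
best_epoch, stop_epoch = train_with_early_stopping(losses, patience=10)
```

With these losses the best weights come from epoch 2 (zero-based) and training stops at epoch 12, well before the configured limit of 100 epochs.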
Numerous ablation experiments were performed in terms of
and
as specified in
Table 2,
Table 3,
Table 4 and
Table 5. A couple of hyper-parameters, namely NoN and look-back, were tuned in the Bi-GRU model for achieving better prediction performance. The NoN indicates which neurons have a high prediction effect, and the look-back represents the previous time steps needed by the normalized data. Here, the NoNs were chosen from different candidate sets {256, 128, 80, 64, and 32}.
Table 2 and
Table 3 represent the effect of the NoN on the Bi-GRU model for a Beijing PM
2.5 dataset and a real-time dataset.
Table 2 indicates that the Bi-GRU model with 80 neurons obtained a minimum error rate with
of 9.11,
of 0.16,
of 9.82,
of 2.82,
of 13.76, and
of 2.45 on a Beijing PM
2.5 dataset. Correspondingly, as depicted in
Table 3, the Bi-GRU model with 80 neurons obtained a lower error rate with
of 0.19,
of 0.44,
of 0.48,
of 0.26,
of 16.59, and
of 1.86 on a real-time dataset.
Table 4 and
Table 5 represent the effect of look-back on the Bi-GRU model for a Beijing PM
2.5 dataset and a real-time dataset, respectively.
Table 4 and
Table 5 show that the Bi-GRU model with a look-back of 8 has obtained a minimum error rate in terms of
,
,
,
,
, and $R^2$. The optimal selection of look-back and NoN effectively fits the model on historical data for better AQP.
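The role of the look-back hyper-parameter can be illustrated by the sliding-window construction below, which turns a normalized multivariate series into model inputs and next-step targets. The function name, the assumption that column 0 holds the PM2.5 concentration, and the toy series are hypothetical.

```python
import numpy as np

def make_windows(series, look_back=8):
    """Split a (T, F) array into inputs of shape (N, look_back, F) and next-step targets."""
    series = np.asarray(series, dtype=float)
    n = len(series) - look_back
    if n <= 0:
        raise ValueError("series must be longer than look_back")
    # Each input holds the previous `look_back` time steps of all variables.
    X = np.stack([series[i:i + look_back] for i in range(n)])
    # Assumption: column 0 is the PM2.5 concentration to be forecast.
    y = series[look_back:, 0]
    return X, y

# Toy series with 12 time steps and 2 variables.
series = np.arange(24, dtype=float).reshape(12, 2)
X, y = make_windows(series, look_back=8)
```

With a look-back of 8, twelve time steps yield four training samples, each pairing eight consecutive observations with the PM2.5 value of the following step.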
4.3. Analysis on a Beijing PM2.5 Dataset
Quantitative analysis was performed on a Beijing PM
2.5 dataset by varying the regression models and the optimization algorithms. In this research, the importance of the normalization technique (Min-Max normalization technique) is specified in
Table 6. The table clearly shows that preprocessing data using the Min-Max normalization technique accomplished better results in terms of
(9.11),
(10.16),
(9.82),
(12.82),
(12.27), and R
2 (0.78) on a Beijing PM
2.5 dataset. For non-preprocessed data, an
of 12.37,
of 14.27,
of 20.38,
of 11.83,
of 19.92, and R
2 of 0.43 were obtained. The above values demonstrate that the preprocessing of the data using the Min-Max normalization technique effectively preserves the relation between the original data values with limited standard deviations that effectively suppress the effect of outliers. On the other hand, the numerical analysis of different deep learning models on a Beijing PM
2.5 dataset is represented in
Table 7. Compared to other regression models (SVR, random forest (RF), RNN, LSTM, ETR, MLR, GRU, and Bi-LSTM), the Bi-GRU model obtained better forecasting performance with the minimal
of 9.11,
of 0.16,
of 9.82,
of 2.82,
of 10.46, and R
2 of 0.84 on a Beijing PM
2.5 dataset. The graphical evaluation of different prediction models on a Beijing PM
2.5 dataset is shown in
Figure 4.
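The Min-Max normalization step discussed above can be sketched as follows. This is a minimal illustration that assumes the usual rescaling to the range [0, 1]; the function names and sample values are hypothetical, and the exact target range used in this research is not stated here.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so the minimum maps to `low` and the maximum to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant series carries no spread; map every value to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


def min_max_invert(scaled, v_min, v_max, low=0.0, high=1.0):
    """Map scaled values back to the original units, e.g. for reporting predictions."""
    return [v_min + (s - low) * (v_max - v_min) / (high - low) for s in scaled]


# Hypothetical PM2.5 readings, for illustration only.
raw = [10.0, 20.0, 30.0]
scaled = min_max_scale(raw)
restored = min_max_invert(scaled, min(raw), max(raw))
print(scaled, restored)
```

In a forecasting setting, the minimum and maximum would normally be taken from the training split only and reused for the test split, so that no information from the test period leaks into the scaling.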
The comparison of different feature selection algorithms on a Beijing PM
2.5 dataset is described in
Table 8. As mentioned in
Table 8, the RSO algorithm with the Bi-GRU model obtained higher forecasting performance with a minimal error rate compared to other optimization algorithms, such as the butterfly optimization algorithm (BOA), firefly optimization algorithm (FOA), whale optimization algorithm (WOA), genetic algorithm (GA), grey wolf optimization (GWO) algorithm, and particle swarm optimization (PSO) algorithm. The selection of optimal features by the RSO algorithm significantly decreases the computational time. The proposed regression model (RSO and Bi-GRU) consumed a computational time of 43.22 s, which is lower than that of the other combinations. The graphical evaluation of different optimization algorithms on a Beijing PM
2.5 dataset is shown in
Figure 5. Additionally, the plot of actual and predicted PM
2.5 values and the boxplots of actual and predicted PM
2.5 values are presented in
Figure 6,
Figure 7 and
Figure 8. The boxplot of prediction errors from deep learning models is illustrated in
Figure 9.
4.4. Analysis on a Real-Time Dataset
This subsection presents the results of the quantitative analysis performed on a real-time dataset by varying the deep learning models and optimization algorithms. The effect of the normalization technique is reported in
Table 9. The table shows that the non-preprocessed data yield an
of 0.80,
of 0.76,
of 0.78,
of 0.81,
of 17.93, and R
2 of 0.46. Once the data are preprocessed using the Min-Max technique, lower error values are obtained (
of 0.39,
of 0.44,
of 0.48,
of 0.41,
of 12.31, and R
2 of 0.74). As specified in
Table 10, the Bi-GRU model achieved the minimal
of 0.19,
of 0.44,
of 0.48,
of 0.26,
of 10.98, and R
2 of 0.83, which are better than those of the six other regression models. In time series forecasting, the Bi-GRU model uses special gates (reset and update gates) that reduce the computational loss, allow the model to retain long-term memory, and reduce the gradient dispersion.
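The role of the reset and update gates can be illustrated with a minimal sketch of a single GRU unit with a scalar input and a scalar hidden state, run once in each direction. It is not the network trained in this research: the weights in `params` and the input window are hypothetical, and the sketch assumes the convention in which the new state is (1 - z) times the previous state plus z times the candidate state (some references swap the roles of z and 1 - z).

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x, h, p):
    """One GRU update for scalar input x and scalar hidden state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate: how much new content to admit
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate: how much past state feeds the candidate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    # Assumed convention: blend the previous state and the candidate using z.
    return (1.0 - z) * h + z * h_cand


def run_gru(sequence, p):
    """Run the unit over a sequence, starting from a zero hidden state."""
    h = 0.0
    for x in sequence:
        h = gru_step(x, h, p)
    return h


def bi_gru(sequence, p_forward, p_backward):
    """Return the final states of a forward pass and a pass over the reversed sequence."""
    forward = run_gru(sequence, p_forward)
    backward = run_gru(sequence[::-1], p_backward)
    return forward, backward


# Hypothetical weights and a hypothetical normalized input window.
params = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
          "wr": 0.4, "ur": 0.2, "br": 0.0,
          "wh": 0.8, "uh": 0.3, "bh": 0.0}
scaled_window = [0.2, 0.4, 0.1, 0.7]
print(bi_gru(scaled_window, params, params))
```

When the update gate stays close to zero, the previous state is carried forward almost unchanged, which is the mechanism that lets information and gradients survive over many time steps.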
Correspondingly, the comparison of different feature optimization algorithms on a real-time dataset is described in
Table 11. As shown, the combination of the RSO algorithm with the Bi-GRU model obtained a minimal error rate compared to other optimization algorithms. The RSO algorithm effectively selects the optimal features from the highly correlated variables (wind speed, wind direction, temperature, dew point, and historical PM
2.5) with a better convergence rate. This improves the prediction performance while limiting the computational time. The proposed regression model (RSO and Bi-GRU) consumed a computational time of 19.28 s on a real-time dataset.
4.5. Comparative Analysis
As reviewed in the literature section, Tao et al. [
16] combined the CNN and Bi-GRU models for effective AQP. The experiments performed on a Beijing PM
2.5 dataset demonstrated the efficacy of the developed model. The developed CNN-Bi-GRU model obtained the
of 14.53,
of 10.47, and
of 0.20 on a Beijing PM
2.5 dataset. Compared to this existing model, the proposed regression model (RSO and Bi-GRU) obtained better AQP with the
of 9.82,
of 9.11, and
of 0.16 on a Beijing PM
2.5 dataset, as shown in
Table 12. Aarthi et al. [
31] used the Min-Max normalization technique, BSMO, and a Bi-LSTM network for AQP. The developed model proved to be effective in AQP with an
of 0.31,
of 0.56, and
of 0.22 on a real-time dataset. As specified in
Table 13, the proposed regression model (RSO and Bi-GRU) obtained the minimal
of 0.19,
of 0.48, and
of 0.26 on a real-time dataset, and these results are better than those of the existing model.
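For reference, the error measures named in this paper (MAE, SMAPE, RMSE, MSE, and R2) can be computed as in the sketch below. The definitions are the common textbook ones and are an assumption here; in particular, SMAPE is written as a fraction rather than a percentage, and the exact variant used in this research may differ. The sample values are hypothetical.

```python
import math


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))


def smape(actual, predicted):
    """Symmetric MAPE as a fraction; pairs where both values are zero contribute 0."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        if denom > 0:
            total += 2.0 * abs(a - p) / denom
    return total / len(actual)


def r2(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted concentrations, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted),
      smape(actual, predicted), r2(actual, predicted))
```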
4.6. Discussion
As discussed earlier, feature selection and prediction are the two integral parts of this research. Selecting the optimal features from the highly correlated variables, namely wind direction, temperature, dew point, wind speed, and historical PM2.5, increases the prediction performance while limiting the computational time and complexity. In this research, the RSO algorithm was utilized for feature selection, and the Bi-GRU model was implemented for AQP. The Bi-GRU model utilizes reset and update gates, which reduce the gradient dispersion and computational loss and allow the model to retain long-term memory. Correspondingly, the RSO algorithm selects the optimal features with a better convergence rate, owing to its exploration and exploitation abilities. Additionally, the Diebold-Mariano (DM) test was conducted to assess the superiority of the proposed regression model statistically. The DM test is based on the loss differential between two competing forecasts. Here, the p value of the DM test was equal to 0.01, which indicates that the difference in forecast accuracy is statistically significant. The numerical study revealed that the proposed model produced MAE values of 9.11 and 0.19 and MSE values of 2.82 and 0.26 on a Beijing PM2.5 dataset and a real-time dataset, respectively. On the same two datasets, the proposed regression model (RSO and Bi-GRU) required the least processing time, 43.22 and 19.28 s, respectively.
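The idea behind the DM test can be sketched as follows. This is a simplified illustration rather than the procedure used in this research: it assumes a squared-error loss, a one-step-ahead forecast horizon (so no autocorrelation correction is applied), and an asymptotic standard normal reference distribution. The function name `dm_test` and the sample data are hypothetical.

```python
import math


def dm_test(actual, pred_a, pred_b):
    """Return the DM statistic and a two-sided p value for two competing forecasts.

    Assumes squared-error loss and a one-step horizon. A negative statistic
    means forecast A has the lower average loss.
    """
    # Loss differential between the two forecasts at each time step.
    d = [(y - a) ** 2 - (y - b) ** 2 for y, a, b in zip(actual, pred_a, pred_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / n
    if var_d == 0:
        raise ValueError("loss differential has zero variance")
    stat = mean_d / math.sqrt(var_d / n)
    # Two-sided p value under the standard normal distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Hypothetical observations and forecasts, for illustration only.
actual = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]
pred_a = [10.5, 11.5, 11.2, 13.4, 11.8, 14.1, 12.6, 15.3]
pred_b = [12.0, 10.0, 13.0, 11.0, 14.5, 12.0, 15.0, 12.5]
stat, p_value = dm_test(actual, pred_a, pred_b)
print(stat, p_value)
```

A small p value indicates that the two forecasts differ in accuracy by more than would be expected by chance; the sign of the statistic shows which forecast has the lower loss.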
5. Conclusions
In this research, a new optimization-based regression model (RSO and Bi-GRU) was implemented for effective AQP. In the present scenario, effective AQP assists the government in controlling pollution. After collecting Beijing PM2.5 and real-time data, normalization and correlation analysis were carried out to eliminate the outliers and select the highly correlated variables: wind direction, temperature, dew point, wind speed, and historical PM2.5. From the selected variables, the optimal and relevant features were selected by implementing the RSO algorithm. Finally, the selected features were given to the Bi-GRU model for AQP. The performance of the proposed model (RSO and Bi-GRU) was validated on a Beijing PM2.5 dataset and a real-time dataset, and it was evaluated using different performance measures, such as MAE, SMAPE, RMSE, and MSE. The numerical analysis showed that the proposed model obtained MAE values of 9.11 and 0.19 and MSE values of 2.82 and 0.26 on a Beijing PM2.5 dataset and a real-time dataset, respectively. Additionally, the proposed regression model (RSO and Bi-GRU) consumed minimal computational time of 43.22 and 19.28 s on a Beijing PM2.5 dataset and a real-time dataset, respectively.
Still, the proposed regression model faces difficulty in analyzing real-time data due to their dynamic nature and high variability. Therefore, as an extension, hyper-parameter tuning can be performed in the Bi-GRU model to further enhance the prediction efficiency. In addition, the present research work can be extended by conducting both parametric and non-parametric statistical analyses using the Wilcoxon test, t-test, Z-test, etc. In upcoming research, high-pollution Indian cities (Delhi and Ghaziabad) will also be considered in the experiments.