An Intelligent Time Series Model Based on Hybrid Methodology for Forecasting Concentrations of Significant Air Pollutants

Cheng, Ching-Hsue; Tsai, Ming-Chi

doi:10.3390/atmos13071055


by Ching-Hsue Cheng ¹ and Ming-Chi Tsai ²,*

¹ Department of Information Management, National Yunlin University of Science and Technology, Douliu 640301, Taiwan
² Department of Business Administration, I-Shou University, Kaohsiung City 840301, Taiwan
* Author to whom correspondence should be addressed.

Atmosphere 2022, 13(7), 1055; https://doi.org/10.3390/atmos13071055

Submission received: 11 May 2022 / Revised: 24 June 2022 / Accepted: 30 June 2022 / Published: 2 July 2022

(This article belongs to the Section Meteorology)


Abstract:

Rapid industrialization and urban development are the main causes of air pollution, leading to daily air quality and health problems. To find significant pollutants and forecast their concentrations, in this study, we used a hybrid methodology that combines integrated variable selection, the autoregressive distributed lag model, and the deletion of multicollinear variables to reduce the number of variables, and then applied six intelligent time series models to forecast the concentrations of the top three pollution sources. We collected two air quality datasets from traffic and industrial monitoring stations and weather data to analyze and compare their results. The results show that a random forest based on selected key variables has better classification metrics (accuracy, AUC, recall, precision, and F1). After deleting the collinear independent variables and adding the lag periods using the autoregressive distributed lag model, the intelligent time-series support vector regression was found to have better forecasting performance (RMSE and MAE). Finally, the research results could be used as a reference by all relevant stakeholders and help respond to poor air quality.

Keywords:

air pollution; variable selection; autoregressive distributed lag model; air quality rules; time-series forecast model

1. Introduction

Pollution caused by rapid industrialization is the main reason for the deterioration of air quality. According to the global energy and carbon dioxide (CO₂) status report [1], global energy-related CO₂ emissions increased to a historic high of 33.1 gigatons in 2018. Air quality monitoring stations were set up in densely populated areas of Taiwan where high pollution was likely to occur or that reflect more significant air quality problems. The air quality data of the monitoring stations across Taiwan [2] show the number of days per year on which the air quality index (AQI) in western Taiwan exceeds 100 (unhealthy level). Northern Taiwan fares best, with 23 to 64 such days; the central region comes next, with 69 to 84 days; and southern Taiwan is the worst, with 97 to 160 days. As stated in [3], PM_2.5 is the main source of pollution in Taiwan, among which traffic pollution accounts for 36%, Chinese imports 27%, industrial pollution 25%, and natural diffusion 12%.

The negative effects of long-term exposure to air pollution on human health have been extensively researched [4,5]. According to the World Health Organization (WHO), air pollution causes 7 million deaths every year, and 4.2 million people died from environmental or outdoor pollution in 2016 alone [6]. Air pollution caused by traffic and industrial emissions is a significant problem in many cities, and the pollutants that are detrimental to human health include suspended particulate matter (PM), ozone (O₃), sulfur dioxide (SO₂), carbon monoxide (CO), volatile organic compounds (VOCs), and nitrogen oxides (NO_x).

Many studies have shown the unexpected health effects caused by air pollution, and it has been shown that long-term exposure to air pollution increases the risk of cardiovascular and respiratory diseases, type 2 diabetes, cancer, and premature death [7,8]. In terms of the global burden of disease [9], the main factors contributing to deteriorating health because of air pollution are mainly PM_2.5 and O₃. WHO also warned that 90% of the global population is currently affected by toxic air and that damage to children is particularly serious [10]. Therefore, all countries have begun to pay attention to the problems of health and national economic development affected by air pollution, and related policies have been introduced.

Air pollution research needs to consider natural environmental factors and the related knowledge of causality, and most previous studies analyzed it from a statistical perspective [11]. Air pollution data require a huge amount of climate information, and statistical analysis of air pollution data cannot effectively catch the interactions of environmental and air quality factors. Algorithms based on artificial intelligence (a broad definition that includes machine learning and deep learning) have big data classification and prediction capabilities, which can be applied to study air quality [12,13]. Data mining can extract and discover the truth that is hidden in large amounts of data. The use of classification rules is a data mining technique to extract frequent patterns embedded between classes and observations in a specific dataset. Variable selection is a method that can reduce input variables to a manageable size for processing and analysis, which reduces the number of variables used and predetermines the cut-off point for the number of variables considered when building a model [14]. As there are many environmental factors involved in building a model, this study used variable selection to effectively filter out the important variables that affect air quality.

The AQI can provide decision-makers with the information needed to implement pollution mitigation measures and make air quality management decisions; therefore, accurate forecasting is essential for early control of air pollution and the protection of public health. Based on this, in the present study, we used an artificial intelligence algorithm and variable selection to build a forecasting model for generating rules that meet air pollution conditions to predict air quality levels. Next, we made numerical forecasts for the top three pollution sources to understand the influence of environmental conditions on the concentrations of air pollutants. In summary, in this study, we carried out the following:

(1): Applied five variable selection methods to filter out the important variables for the collected datasets and used the integrated variable selection method (IVSM) to find the key variables.
(2): Used four rule-based classifiers to classify air quality and generate classification rules, in which we found the top three pollutants (PM_2.5, PM₁₀, and O₃) from the generated rules.
(3): Deleted collinear variables and added lag periods of variables by the autoregressive distributed lag (ARDL) test.
(4): Forecast concentrations of PM_2.5, PM₁₀, and O₃ by using four intelligent time-series forecast methods based on IVSM-selected variables, ARDL-selected variables, and full variables.
(5): Gave appropriate explanations to provide the results to stakeholders for reference and enact countermeasures for dynamic environmental factors.

The remaining sections of the paper are arranged as follows. Section 2 reviews the literature on air pollution, variable selection, and machine learning techniques. The study’s concept, proposed research procedure, and computation steps are introduced in Section 3. Section 4 introduces the experimental environment, datasets, and experimental results. Finally, Section 5 summarizes the conclusions and provides recommendations.

2. Related Works

This section introduces air pollution, variable selection, and machine learning techniques.

2.1. Air Pollution

WHO [10] has warned that air pollution is the largest single environmental health risk in the world. According to WHO statistics [6], 4.2 million people died prematurely in 2016 from long-term exposure to environmental (outdoor) air pollution. In densely populated urban areas, transportation and industry are often the main causes of local environmental air pollution. Philinis and Seinfeld [15] divided pollution sources into primary and secondary pollutants: primary pollutants are directly discharged into the air through combustion, emissions, dust droplets, and so forth, and secondary pollutants are formed by photochemical and condensation reactions between chemical molecules in the air. Among them, nitrogen dioxide (NO₂), ozone (O₃), and suspended particulate matter (PM) are the main causes of air pollution.

The AQI is usually used to determine the safety of outdoor activities, especially for individuals [16]; it is a nonlinear, dimensionless index that quantitatively describes air quality conditions. The larger the AQI value, the higher the class level, which means the air pollution situation is more serious and the health hazard to the human body is greater. Among the Environmental Protection Agency (EPA) Criteria Air Pollutants [17], six pollutants (carbon monoxide, lead, nitrogen dioxide, ozone, PM of different size fractions, and sulfur dioxide) are common in outdoor air and can harm human health and the environment. The computational results of the AQI largely depend on the individual air quality index of the corresponding region and pollutant concentration index table. The computation of AQI [18] based on individual pollutants is given in Equation (1):

AQI_{p} = \frac{AQI_{Hi} - AQI_{Lo}}{BP_{Hi} - BP_{Lo}} (C_{p} - BP_{Lo}) + AQI_{Lo}

(1)

where AQI_p is the AQI of individual pollutant p, C_p is the input concentration for pollutant p, BP_Hi is the concentration breakpoint that is greater than or equal to C_p, BP_Lo is the concentration breakpoint that is less than or equal to C_p, AQI_Hi is the AQI value corresponding to BP_Hi, and AQI_Lo is the AQI value corresponding to BP_Lo.

Finally, the AQI takes the maximal AQI value of individual pollutant p, as shown in Equation (2):

AQI = \max \{ AQI_{1}, AQI_{2}, AQI_{3}, \dots, AQI_{n} \}

(2)
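Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `sub_index` and `overall_aqi` are hypothetical, and the breakpoint table holds placeholder values chosen for readability, not the official TEPA breakpoints.

```python
# Hypothetical breakpoint table: (BP_Lo, BP_Hi, AQI_Lo, AQI_Hi) per pollutant.
# The numbers are placeholders for illustration, NOT official TEPA breakpoints.
BREAKPOINTS = {
    "PM2.5": [(0.0, 15.0, 0, 50), (15.0, 35.0, 50, 100),
              (35.0, 55.0, 100, 150), (55.0, 150.0, 150, 200)],
    "O3": [(0.0, 50.0, 0, 50), (50.0, 70.0, 50, 100),
           (70.0, 85.0, 100, 150), (85.0, 105.0, 150, 200)],
}


def sub_index(pollutant, conc):
    """Equation (1): interpolate linearly between the bracketing breakpoints."""
    for bp_lo, bp_hi, aqi_lo, aqi_hi in BREAKPOINTS[pollutant]:
        if bp_lo <= conc <= bp_hi:
            slope = (aqi_hi - aqi_lo) / (bp_hi - bp_lo)
            return slope * (conc - bp_lo) + aqi_lo
    raise ValueError(f"{pollutant} concentration {conc} is outside the table")


def overall_aqi(concentrations):
    """Equation (2): the overall AQI is the largest individual sub-index."""
    return max(sub_index(p, c) for p, c in concentrations.items())


record = {"PM2.5": 25.0, "O3": 40.0}
aqi = overall_aqi(record)
```

With these placeholder breakpoints, the PM2.5 sub-index dominates the record above, which mirrors how a single pollutant determines the reported AQI.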

The Taiwan Environmental Protection Agency (TEPA) began classifying pollutant concentrations and AQI categories in December 2016 [19,20]. Climatic factors such as rainfall, atmospheric temperature, and wind speed are important to consider with regard to their effect on air pollution concentrations. In addition, intense precipitation enhances the effect of wet deposition, which helps to remove the source of air pollution [21,22]. Atmospheric temperature can also have a considerable effect; high temperature enhances airflow, which helps disperse pollutants into the air [23,24].

In terms of machine learning and feature selection on air quality, Šimić et al. [12] applied machine learning methods to estimate mass concentrations of traffic-related pollutants, and they applied five machine learning methods to calculate feature importance after training the model. Sethi and Mittal [25] used linear regression to find the most relevant features that affect pollution, and they applied machine learning algorithms to forecast air quality. Chen et al. [26] applied self-organizing maps to cluster different cities' data to find the same trends in air pollutants, and then they used ReliefF feature selection to extract the climate factors that are helpful for AQI prediction. Kumar and Pande [27] used correlation-based feature selection to select the key features and applied five machine learning algorithms to predict air quality. These studies applied many existing machine learning methods to forecast air quality, but they did not integrate the features selected by different feature selection methods to obtain the key features. Therefore, this study proposes an integrated variable selection method (IVSM) to synthesize the important features: a variable is treated as a key variable when at least four of the variable selection methods select it.

2.2. Variable Selection

The main aim of variable selection is to reduce the size of the variable subsets and find the key variables to improve classification accuracy, reduce model complexity, and reduce the processing costs of high-dimensional data [14,28]. Based on these advantages, variable selection is often applied to machine learning. Five variable selection methods used in the study are described as follows:

(1): Correlation-based feature selection (CFS)

CFS is a variable selection method that measures the correlation between each variable and its class [29]. CFS mainly uses a heuristic search to filter the best candidate subset, and the concept of heuristics comes from the theory proposed in [30].

(2): Correlation

Correlation evaluates the importance of attributes by measuring the degree of correlation between attributes and categories. Past studies used Pearson correlation coefficients for attribute selection [31].

(3): Information gain (IG)

IG is a variable evaluation method that is based on entropy. It is used in variable selection and is defined as the amount of information provided by each variable for the class. A higher IG value means more information can be provided [32]. IG is often used for high-dimensional data, and a higher value means better discriminative power for decision-making. IG is a good way to determine correlations between variables and classes [33].

(4): Gain ratio (GR)

The GR is an extension of IG first proposed in [34]. IG is biased toward variables that take many different values, which can lead to errors; GR was introduced to correct this bias [35]. GR is the ratio of IG to entropy(a), where a is a variable.

(5): ReliefF

The Relief algorithm calculates a score for each variable, ranks the variables, and selects the highest-scoring ones, which is simple and effective. Kononenko [36] proposed the ReliefF algorithm to overcome a limitation of the Relief algorithm, namely that Relief can only be applied to binary (two-class) data [37]. ReliefF is usually used to select variables in data preprocessing, and it is one of the most successful preprocessing algorithms [38].
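The entropy-based scores in items (3) and (4) above can be illustrated with a short sketch. It assumes the variable has already been discretized into categories (continuous pollutant readings would need binning first), and the function names are hypothetical rather than taken from any particular toolkit.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def information_gain(values, labels):
    """IG = H(class) - H(class | variable)."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(values, labels):
    """GR = IG / entropy of the variable itself (0 if the variable is constant)."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


labels = ["D", "D", "A", "A"]
level = ["high", "high", "low", "low"]  # an informative two-valued variable
record_id = [1, 2, 3, 4]                # a many-valued, ID-like variable
```

Both variables reach the same IG on this toy data, but the ID-like variable receives a lower GR, which is the bias correction that motivates GR.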

2.3. Autoregressive Distributed Lag (ARDL) Model

A time series is a series of data points sorted by time. In the traditional view, a time series involves a single variable, and the goal is usually to forecast its future values. One of the simplest ARIMA-type models [39] uses a linear model to forecast the current value from the previous value; this is the first-order autoregressive model, or AR(1). ARIMA combines three components: autoregressive (AR), integrated (I), and moving average (MA).

The distributed-lag model is a time-series model used in statistics and econometrics [40], in which the dependent variable is affected by the explanatory variables at multiple lag periods. The ARDL model [41] is an infinite lag model that is both flexible and parsimonious, in which the dependent variable is affected by the lag periods of the explanatory variables and also depends on its own lag periods. The ARDL model with p lags of the dependent variable and q lags of the explanatory variable is represented as ARDL(p, q). The multiple ARDL model can be described by Equation (3):

y_{t} = c + b_{1} y_{t-1} + b_{2} y_{t-2} + \dots + b_{p} y_{t-p} + a_{1,0} x_{1}(t) + a_{1,1} x_{1}(t-1) + \dots + a_{1,q_{1}} x_{1}(t-q_{1}) + a_{2,0} x_{2}(t) + a_{2,1} x_{2}(t-1) + \dots + a_{2,q_{2}} x_{2}(t-q_{2}) + \dots + a_{m,0} x_{m}(t) + a_{m,1} x_{m}(t-1) + \dots + a_{m,q_{m}} x_{m}(t-q_{m}) + e_{t}

(3)

where the regression has m independent variables, c is a constant, t represents time, a_{m,q_m} x_m(t − q_m) denotes the q_m-th lag of the m-th attribute, x_m(t − q_m), multiplied by the coefficient a_{m,q_m}, and b_k y_{t−k} represents the k-th lag of the dependent variable, y_{t−k}, multiplied by the coefficient b_k (k = 1, ..., p). The ARDL model requires that the independent and dependent variables be stationary and that e_t be a white-noise error at time t.
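To make Equation (3) concrete, the sketch below builds the lagged design matrix for a single explanatory variable and fits an ARDL(1, 1) model by ordinary least squares. It is a pure-Python illustration under stated assumptions: the helper names `lagged_design` and `ols` are hypothetical, the data are synthetic and noise-free, and a statistics package would normally be used in practice.

```python
import random


def lagged_design(y, x, p, q):
    """Rows [1, y[t-1..t-p], x[t], x[t-1..t-q]] with targets y[t] (one regressor)."""
    rows, targets = [], []
    for t in range(max(p, q), len(y)):
        row = [1.0]
        row += [y[t - k] for k in range(1, p + 1)]   # lags of the dependent variable
        row += [x[t - j] for j in range(0, q + 1)]   # current and lagged regressor
        rows.append(row)
        targets.append(y[t])
    return rows, targets


def ols(rows, targets):
    """Solve the normal equations (X'X) beta = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))  # partial pivoting
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Synthetic series generated from known coefficients (illustrative values only).
rng = random.Random(0)
x = [rng.random() for _ in range(80)]
y = [0.0]
for t in range(1, len(x)):
    y.append(1.0 + 0.5 * y[t - 1] + 2.0 * x[t] - 1.0 * x[t - 1])

rows, targets = lagged_design(y, x, p=1, q=1)
coef = ols(rows, targets)  # order: c, b_1, a_{1,0}, a_{1,1}
```

Because the synthetic series contains no noise, the fitted coefficients should match the generating values up to rounding; real pollutant series would also need the stationarity check mentioned above.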

2.4. Machine Learning Techniques

This section introduces six machine learning techniques for classification and forecasting: decision tree (C4.5), random tree, random forest, extra trees, support vector regression (SVR), and multilayer perceptron regression (MLPR).

(1): Decision tree (DT)

The C4.5 decision tree (DT) was developed in [42] and is an extension of Quinlan's ID3 algorithm. C4.5 is frequently used in data mining; it can be used to analyze data and make predictions [42]. The main advantage of DT is that it provides a meaningful way to represent acquired knowledge; hence, classification rules can be easily extracted [35]. DT has been successfully applied in many fields, such as making seating charts, evaluating power systems, and predicting hard drive failure [43].

(2): Random tree (RT)

DT is easy to conceptualize but usually has high variance in terms of accuracy. To overcome this limitation, many variants of a single decision tree can be generated based on different subsets of the same training set in randomization-based ensemble methods [44]. Breiman [44] noted that the RT algorithm could handle classification and regression problems. Usually, a pure RT is used, or RT is merged with a random forest in machine learning [45]. Many RT-based algorithms are used in natural gas modeling [46], brain tumor detection [47], and eye pupil localization [48].

(3): Random forest (RF)

Breiman [44] first introduced RF and used a procedure similar to classification and regression trees. RF is an ensemble learning method that can be used for classification, regression, and other tasks. It applies random node optimization and bagging to build a forest of unrelated trees [44]. To classify a new instance, each tree in the RF casts a vote, and the class with the most votes is used as the outcome [49]. The reported advantages of RF are fast training and prediction, resistance to overtraining, and robustness to noise [50]. RF has many applications, such as target detection, target tracking, and language and semantic analysis [51].

(4): Extra tree (ET)

An extremely randomized tree, or ET, was proposed in [52]. ET is an extension of RF. There are two main differences between ET and RF [53]: (1) RF uses bootstrap replicas (bagging), sampling the training data with replacement, whereas ET uses all of the original training data; and (2) RF chooses the best cut point of a variable within a random subset, while ET selects cut points completely at random. The advantages of ET are its low variance and computational efficiency [52].

(5): Support vector regression (SVR)

A support vector machine (SVM) is a supervised algorithm for classification that was first proposed in [54]; it is used to find a hyperplane in high-dimensional space for classifying data points. SVR is an extension of the original SVM that can handle continuous prediction problems. The SVM finds a plane that can discriminate the data, while SVR seeks a plane that can accurately forecast the data. The advantages of SVR are that it is robust to outliers, the decision model can be easily updated, and it has excellent generalization capability with high forecast accuracy.

(6): Multilayer perceptron regression (MLPR)

Multilayer perceptron (MLP), the simplest form of a feed-forward neural network and binary linear classifier, consists of input, hidden, and output layers [55]. The advantages of MLP are that it is a nonlinear learning model that can be processed in parallel and has good fault tolerance. MLP neurons can freely perform classification or regression based on their activation functions. In deep learning, MLP is a feed-forward artificial neural network with high performance in the random scheme, fitness approximation, and regression analysis [56].

3. Proposed Method

Climate abnormalities have caused more air pollution, and serious air pollution problems have increased human health concerns. To find important pollutants in the current study, we used CFS, correlation, IG, GR, and ReliefF to select the important variables, and we synthesized the results of these five selection methods based on the IVSM; that is, a variable was taken as a key variable when at least four of these methods selected it. Next, we applied four classifiers (DT, RT, RF, and ET) to classify air quality and generate rules for determining important pollutants. The choice of classifiers was based on the following: (1) the chosen classifiers must be rule-based; DT is a commonly used baseline method, and ET is an extension of RF; (2) rule-based ensemble classifiers generally perform better than the individual classifiers they are constructed from and overcome the limitations of the individual classifiers [57]; hence, the selected classifiers must have been used in the literature and shown excellent performance; and (3) the computational cost of a rule-based classifier is low compared with that of neural networks and deep learning.

In forecasting the important pollutants, we first deleted the collinear variables and tested the lag periods of the independent and dependent variables. Next, we applied four intelligent time series models to forecast the concentrations of O₃, PM_2.5, and PM₁₀. We used intelligent time series MLPR, RF, ET, and SVR models to forecast the concentrations of important pollutants for the following reasons: (1) RF is an ensemble learning method with excellent performance, and its reported advantages are fast training and prediction, resistance to overtraining, and robustness to noise [50]; (2) ET is an extension of RF with low variance and computational efficiency [52]; (3) SVR can handle nonlinear data and provide proficient forecast models, and it is robust to outliers; and (4) MLP is capable of learning nonlinear models and of learning models in real time.

As mentioned above, in this study, we propose an intelligent time series model based on variable selection and autoregressive distributed lag to forecast significant pollutants, and we collected two datasets from traffic and industrial monitoring stations set up by the TEPA. In addition, we collected weather data, including atmospheric temperature (TEMP), rainfall, relative humidity (RH), and wind speed (WS_HR) as research variables. From the forecast results, we can further understand the impact of environmental factors on the concentrations of pollutants and provide countermeasures.

3.1. Proposed Computational Procedure

To make the proposed method easy to follow, we present its computational procedure in six steps: data collection, preprocessing, variable selection, classification and rule generation, ARDL test, and forecasting-model construction, as shown in Figure 1. Each step is described in detail below.

Step 1. Data collection

The TEPA air quality monitoring network divides its monitoring stations into six categories [20]. Because some categories are not significant indicators, in this study, we selected the data of industrial and traffic monitoring stations as research datasets. Based on the literature [15,17,18], we selected 19 air pollution quality-related variables as the original air pollution data. In this step, two types of data from 2019 were collected: pollution source data from the TEPA traffic and industrial monitoring stations, including SO₂, CO, O₃, PM₁₀, PM_2.5, NO_x, NO, NO₂, THC, NCHN, and CH₄, and weather data from the TEPA database, including TEMP, rainfall, RH, and WS_HR. Finally, there were 30 variables with 8760 records, including pollution sources, weather, time, season, and related pollution variable data.

Step 2. Preprocessing

This step contains three sub-steps:

(1): Integrate pollutant and weather data into a single dataset.

Based on the same time period, we concatenated pollutant and weather data into a single dataset for each station, which gave the traffic and industrial datasets. After data concatenation, the two datasets had the same structure, with 30 variables (not including AQI) and 8760 records from 2019.

(2): Impute missing data.

Because the monitoring stations occasionally encounter machine failure or operator negligence in data inspections, the hourly air quality records contain a few gaps. Hence, we used an 8-h moving average (MA) to impute (replace) the missing data of the integrated traffic and industrial datasets.

(3): Calculate AQI and set AQI classes.

First, in this step, we compute the AQI of each individual pollutant p by Equation (1). According to Equation (2), we take the maximal individual-pollutant AQI as the AQI of each record. Next, according to [19,20], we set four AQI classes: class A, good (AQI ≤ 50); class B, moderate (50 < AQI ≤ 100); class C, unhealthy for sensitive groups (100 < AQI ≤ 150); and class D, unhealthy (AQI > 150).
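Sub-steps (2) and (3) can be sketched as follows. Two assumptions are made here and are not stated in the text: the 8-h MA is read as the mean of up to eight preceding hourly values (a trailing window), and the boundary values 50, 100, and 150 are assigned to the lower class. The function names are hypothetical.

```python
def impute_moving_average(series, window=8):
    """Fill None entries with the mean of up to `window` preceding values.

    Assumption: a trailing window is used; earlier imputed values count as
    observations. A gap at the very start stays None (nothing to average).
    """
    filled = []
    for value in series:
        if value is None:
            recent = [v for v in filled[-window:] if v is not None]
            value = sum(recent) / len(recent) if recent else None
        filled.append(value)
    return filled


def aqi_class(aqi):
    """Map an AQI value to class A-D (boundary values go to the lower class)."""
    if aqi <= 50:
        return "A"  # good
    if aqi <= 100:
        return "B"  # moderate
    if aqi <= 150:
        return "C"  # unhealthy for sensitive groups
    return "D"      # unhealthy


hourly_pm25 = [10.0, 12.0, None, 14.0]
cleaned = impute_moving_average(hourly_pm25)
```

A centered window or interpolation would be equally plausible readings of "8-h MA"; the trailing window is used here only because it needs no future values.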

Step 3. Select variables.

Variable selection not only reduces the input data to a manageable size for processing and analysis but also provides a trade-off point for the number of important variables considered when building a model. In this study, we used five variable selection methods: CFS, correlation, IG, GR, and ReliefF. Among them, CFS only generates a set of important variables, while the other four generate important variables together with variable weights. Variables whose weight values are lower than 0.01 are only weakly related to the class; hence, we did not select them. We propose an IVSM to synthesize the results of the five variable selection methods; that is, a variable is taken as a key variable when at least four of the variable selection methods select it. The IVSM results are shown in Section 4.
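The voting rule described in this step can be sketched as below. The selector outputs and weights are invented for illustration and are not results from the study; `filter_by_weight` and `ivsm` are hypothetical helper names.

```python
from collections import Counter


def filter_by_weight(weights, threshold=0.01):
    """Keep variables whose weight is not lower than the threshold."""
    return {name for name, w in weights.items() if w >= threshold}


def ivsm(selections, min_votes=4):
    """Return variables chosen by at least `min_votes` selection methods."""
    votes = Counter(name for chosen in selections.values() for name in set(chosen))
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical outputs of the five selection methods (illustrative only).
selections = {
    "CFS": {"PM2.5", "O3", "PM10"},
    "Correlation": filter_by_weight(
        {"PM2.5": 0.62, "O3": 0.35, "PM10": 0.20, "RH": 0.004}),
    "IG": {"PM2.5", "O3", "PM10", "NO2"},
    "GR": {"PM2.5", "O3", "NO2"},
    "ReliefF": {"PM2.5", "PM10", "TEMP"},
}
key_variables = ivsm(selections)
```

In this toy example, RH is dropped by the weight threshold, and NO2 and TEMP fall short of four votes, so only three variables survive.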

Step 4. Carry out classification and rule generation.

Classification is a supervised learning method that builds a model based on data and a class variable. A classification model allows us to understand the data characteristics of each class and can be used to identify the class of new data. We used DT, RT, RF, and ET classifiers with tenfold cross-validation to produce the best model of air quality. After variable selection, we calculated the AQI of the two monitoring stations and coded them as class A, B, C, or D; we used DT, RT, RF, and ET classifiers to evaluate their performance and generate the rules of the best model for the two datasets. Then we summarized the important rules of the best model to find the important sources of pollution that affect air quality.
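As a rough illustration of how tenfold cross-validation partitions the 8760 hourly records, the sketch below generates train and test index lists. It assumes contiguous, unshuffled folds; whether the original experiments shuffled or stratified the records is not stated, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(8760, k=10))
```

Each record appears in exactly one test fold, so every classifier is evaluated once on every record; averaging the per-fold metrics gives the cross-validated score.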

Step 5. Perform ARDL test.

ARDL is a least-squares regression that uses the lag periods of the dependent variable and the lag periods of many explanatory variables as a regression model. From step 4, we found that the top three pollution sources from the generated rules with the most impact on air pollution are PM_2.5, PM₁₀, and O₃. Therefore, we used the ARDL model to test the lag periods of PM_2.5, PM₁₀, and O₃ regression models, and their independent variables were IVSM-selected variables.

Step 6. Construct the forecasting model.

Based on the IVSM-selected variables and ARDL test, in this step, we used four forecasting methods (MLPR, RF, ET, and SVR) to forecast the three most serious pollution sources (PM_2.5, PM₁₀, and O₃), and we used the root mean square error (RMSE) and mean absolute error (MAE) to evaluate their performance. An accurate forecast allows relevant measures to be taken early and helps further understand the impact of environmental factors on the concentrations of pollutants. The results can offer related organizations and the public the opportunity to respond early and make suggestions in advance.

3.2. Evaluation Metrics

Evaluation is a standard way to measure model performance. In the current study, we used accuracy, the area under the receiver operating characteristic curve (AUC), precision, recall, and F1 to evaluate classification performance. A confusion matrix was used to calculate these metrics, which is shown in Table 1. Next, the five metrics [58] are introduced.

1.: Accuracy: Accuracy is the most commonly used metric for classification performance [59] because it is easy to compute, has low complexity, and is easy to understand. The computational equation for accuracy is as follows:

Accuracy = \frac{tp + tn}{tp + fn + fp + tn} \times 100

(4)

2.: AUC: AUC is the area under the receiver operating characteristic curve. Following [60], classification performance can be judged by the AUC; an excellent classifier has an AUC near 1.0.
3.: Precision: This criterion is also called positive predictive value and is calculated as

Precision = tp/(tp + fp)

(5)

4.: Recall (sensitivity): This measures the proportion of correctly identified positives and is also called the true positive rate; it is calculated by Equation (6):

Recall = tp/(tp + fn)

(6)

5.: F1-score: This metric is the harmonic mean of precision and recall and is calculated by Equation (7). F1 is usually more useful than accuracy, especially for class-imbalanced data.

F1 = 2 × Precision × Recall/(Precision + Recall)

(7)
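Equations (4) to (7) translate directly into code. The sketch below assumes a binary confusion matrix (for the four AQI classes, each class would be scored one-vs-rest, which is an assumption rather than something the text specifies), and the counts are made up for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy (in percent), precision, recall, and F1 from confusion counts.

    No guard against zero denominators is included; callers should ensure
    that tp + fp > 0 and tp + fn > 0.
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn) * 100   # Equation (4)
    precision = tp / (tp + fp)                          # Equation (5)
    recall = tp / (tp + fn)                             # Equation (6)
    f1 = 2 * precision * recall / (precision + recall)  # Equation (7)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=30)
```

Note that accuracy is scaled to a percentage as in Equation (4), while the other three metrics stay in the range 0 to 1.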

In the forecast evaluation, we used RMSE and MAE to evaluate forecast performance; their computational formulas are given in Equations (8) and (9):

RMSE = \sqrt{\sum_{i = 1}^{n} \frac{{(y_{i} - f_{i})}^{2}}{n}}

(8)

MAE = \sum_{i = 1}^{n} | y_{i} - f_{i} | / n

(9)

where y_{i} denotes the actual value at time i, f_{i} is the forecast value at time i, and n is the number of forecast data points.
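A minimal sketch of Equations (8) and (9), using made-up actual and forecast values:

```python
from math import sqrt


def rmse(actual, forecast):
    """Root mean square error, Equation (8)."""
    n = len(actual)
    return sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast)) / n)


def mae(actual, forecast):
    """Mean absolute error, Equation (9)."""
    n = len(actual)
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / n


# Illustrative values only.
actual = [1.0, 2.0, 3.0]
forecast = [1.0, 2.0, 5.0]
errors = (rmse(actual, forecast), mae(actual, forecast))
```

Because RMSE squares each error before averaging, the single large miss in this example raises RMSE more than MAE, which is why the two metrics are usually reported together.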

4. Experiment and Comparison

We employed two sets of air quality data from industrial and traffic monitoring stations to verify the proposed method. This section describes the experimental environment and parameter setting, the determination of significant pollutant sources, forecasting, and evaluation, followed by a discussion.

4.1. Experimental Environment and Parameter Setting

In the current study, we collected two types of data: data from TEPA traffic (Fengshan) and industrial (Mailiao) monitoring stations and weather data. After preprocessing, each dataset had 31 variables with 8760 records, as shown in Table 2. Next, we computed the AQI of individual pollutant p by Equation (1) for the two datasets and took the maximal AQI of individual pollutant p as the AQI of each record by Equation (2). Based on [19,20], we assigned four AQI classes to the two collected datasets.

To verify the proposed method, we used five intelligent algorithms to experiment and compare the results of the two collected datasets, including four classifiers (DT, RT, RF, and ET) and four forecasting techniques (RT, RF, ET, and SVR). The experimental environment was Python 2.7 on an Intel i7-4710MQ 2.5 GHz CPU running the Windows 10 operating system. The parameter settings of the five intelligent algorithms are shown in Table 3.

4.2. Finding Significant Pollutants

To find important pollutants, we used IVSM, air quality classification, and rule generation to determine the important pollutants. The experimental processes and results were as follows:

(A): Select key variables by IVSM.

We applied five variable selection methods (CFS, correlation, IG, GR, and ReliefF) to screen the important variables and used IVSM to integrate the results of the five methods (a variable was taken as a key variable when at least four of the methods selected it). The IVSM results are shown in Table 4.

(B): Classify air quality.

After selecting the key variables, we used four classifiers to classify air quality on the full-variable and selected-variable datasets. Each of the two collected datasets underwent tenfold cross-validation, and the reported results are averages over 100 repeats. The classification results in terms of accuracy, AUC, recall, precision, and F1 are shown in Table 5. DT had the best results on the five metrics for the two datasets.
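For reference, four of the five metrics follow directly from the confusion-matrix counts in the notation of Table 1 (tp, fp, fn, tn). The sketch below is a generic illustration with a hypothetical function name and made-up counts. AUC is omitted because it needs ranked scores rather than counts, and how the metrics are averaged over the four AQI classes is not specified here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```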

(C): Generate rules.

In the classification of air quality, DT had the best result among the five metrics. Therefore, we used DT to generate air quality rules for the two collected datasets, and tree diagrams of the Fengshan and Mailiao datasets (selected variables) are shown in Figure 2 and Figure 3.

(D): Determine the significant pollutants.

Based on the generated rules, the important pollutants affecting air quality are PM_2.5 and O₃ at the Fengshan traffic monitoring station (Figure 2) and PM_2.5 and PM₁₀ at the Mailiao industrial monitoring station (Figure 3). At the traffic monitoring station, O₃ and PM_2.5 are mainly derived from secondary pollutants.

4.3. Forecast and Evaluation

Figure 2 and Figure 3 show that the important pollutants in the Fengshan dataset (traffic monitoring station) are PM_2.5 and O₃, and those in the Mailiao dataset (industrial monitoring station) are PM_2.5 and PM₁₀. The AQI is calculated from the monitored concentrations of O₃, PM_2.5, PM₁₀, CO, SO₂, and NO₂ in the air on a given day. Therefore, we selected the three main pollutants, O₃, PM_2.5, and PM₁₀, and forecast their concentrations in the two collected datasets. Before forecasting, we had to delete the collinear variables and test the lag periods of the independent and dependent variables; then, we could forecast the concentrations of O₃, PM_2.5, and PM₁₀ by using the intelligent time series models MLPR, RF, ET, and SVR. The three processes and results were as follows:

(A): Collinearity diagnosis

We used the variance inflation factor (VIF) to diagnose collinearity among the independent variables [62], with VIF_i = 1/(1 − R_i²), where R_i² is the coefficient of determination obtained by regressing independent variable x_i on the other independent variables. The collinearity tests of the O₃, PM_2.5, and PM₁₀ regression models are shown in Table 6. According to [62], a VIF greater than 10 indicates high collinearity. Therefore, we deleted the independent variables with VIF > 10 in Table 6 before forecasting the concentrations of O₃, PM_2.5, and PM₁₀.
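To make the VIF computation concrete, the following is a minimal pure-Python sketch, assuming each independent variable is regressed by ordinary least squares (with an intercept) on all the others. The function names and the toy data are hypothetical and not from the paper; in practice a statistics package would normally be used instead.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, xs):
    """R^2 of an OLS regression of y on the columns in xs, with intercept."""
    n = len(y)
    design = [[1.0] + [col[t] for col in xs] for t in range(n)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(design[t][i] * y[t] for t in range(n)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Return {name: VIF}, where VIF_i = 1 / (1 - R_i^2)."""
    result = {}
    for name, values in columns.items():
        others = [v for k, v in columns.items() if k != name]
        result[name] = 1.0 / (1.0 - r_squared(values, others))
    return result


# Toy data: b is roughly 2 * a (collinear pair), c is unrelated.
toy_columns = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2],
    "c": [5, 3, 8, 1, 9, 2, 7, 4],
}
toy_vif = vif(toy_columns)
to_drop = [name for name, value in toy_vif.items() if value > 10]
```

In this toy example the nearly collinear pair `a` and `b` exceed the VIF > 10 threshold, while `c` does not.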

(B): ARDL test of variable lag periods

ARDL models are often used to analyze dynamic relationships in time series data in a single-equation framework. The autoregressive part allows lag periods of the dependent variable, and the distributed lag part allows the current and lag periods of the explanatory variables. We used ARDL to test the optimal number of lags based on a p-value of ≤0.05 (statistically significant) and the Akaike information criterion (AIC) for the O₃, PM_2.5, and PM₁₀ forecast models. The smaller the AIC value, the better the forecasting ability [41]. The co-integration of nonstationary variables is equivalent to an error correction (EC) process and can be tested by the ARDL/EC test in the Stata package [63]. We used Stata to run the ARDL/EC model, and the results of the O₃, PM_2.5, and PM₁₀ forecasting models show AIC < −8760, p < 0.0001, and F > 25,794. The test results indicate that the time series model has stationarity and satisfies the ARDL conditions. Therefore, we only list the significant lag periods of the dependent and independent variables (including the variable itself) in Table 7.
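Once the significant lags are known, they can be turned into model inputs by shifting each series. The sketch below illustrates this with a hypothetical helper and made-up values; it follows the notation of Table 7, where 0 denotes the current value and k the k-th lag. It does not reproduce the Stata ARDL/EC estimation itself.

```python
def build_lagged_rows(series, target, lag_spec):
    """Build (X, y, column_names) from time-ordered series and a lag specification.

    series   : dict mapping a variable name to a list of values (equal lengths)
    target   : name of the dependent variable
    lag_spec : dict mapping a variable name to the list of lags to include
    """
    if 0 in lag_spec.get(target, []):
        raise ValueError("lag 0 of the target cannot be used as a predictor")
    pairs = [(var, lag) for var, lags in lag_spec.items() for lag in lags]
    max_lag = max(lag for _, lag in pairs)
    names = [f"{var}(t-{lag})" for var, lag in pairs]
    n = len(series[target])
    # Rows start at max_lag so that every lagged value exists.
    X = [[series[var][t - lag] for var, lag in pairs] for t in range(max_lag, n)]
    y = [series[target][t] for t in range(max_lag, n)]
    return X, y, names


# Made-up hourly values, for illustration only.
toy_series = {
    "PM10": [10, 12, 11, 15, 14, 13],
    "RH": [70, 72, 71, 69, 68, 75],
}
X, y, names = build_lagged_rows(toy_series, "PM10", {"PM10": [1, 2], "RH": [0, 1]})
```

The first `max_lag` records are dropped because their lagged values are unavailable, so six records with a maximum lag of 2 yield four rows.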

(C): Forecasting of concentrations of main pollutants

After removing the collinear independent variables and conducting the ARDL test of variable lag periods, we applied the intelligent time series MLPR, RF, ET, and SVR methods to forecast the concentrations of O₃, PM_2.5, and PM₁₀ based on the IVSM-selected variables, the ARDL lag variables, and the full variables. Because air quality data are time-series observations, we maintained the ordering of the data when partitioning each dataset into 90% training data and 10% testing data. The forecast results of the two collected datasets, shown in Table 8, indicate that SVR is the best time series forecast model in the RMSE metric for the Fengshan (traffic monitoring station) and Mailiao (industrial monitoring station) datasets. Among the different variable sets, the ARDL variables with lag periods give better forecast results in the RMSE and MAE metrics for the Fengshan and Mailiao datasets.
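The order-preserving partition and the two error metrics can be sketched as below. The helper names are hypothetical, and the sketch assumes that the first 90% of records form the training set and the last 10% the testing set, which is one natural reading of maintaining the data ordering.

```python
import math


def chronological_split(rows, train_percent=90):
    """Split time-ordered rows without shuffling: earliest part trains, latest part tests."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# One year of hourly records has 8760 rows; indices stand in for the records here.
train, test = chronological_split(list(range(8760)))
```

With 8760 hourly records this gives 7884 training rows and 876 testing rows. Because RMSE squares the errors, it penalizes occasional large misses more heavily than MAE does.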

4.4. Discussion

According to [13], the most important factors of air pollutants are their lag period, other pollutants, temperature, and wind speed. However, as shown in Table 7, we found that the key factors of air pollutants are mostly lag periods: the pollutant's own lags and the lags of other pollutants, temperature, relative humidity, and wind speed. Here, we discuss our experimental results and provide some findings.

(A): Finding air pollutants

We experimented with two monitoring station datasets, traffic and industrial, and discuss their results and differences in the following sections.

(1): Traffic air pollutants: Rapid economic development and urbanization have led to a rapid increase in vehicle ownership and usage, which has caused traffic-related air pollution problems. Vehicle emissions greatly impact CO, HC, THC, NO_x, and PM, and these pollutants pose a serious threat to the environment and people’s health [64,65]. From the traffic monitoring station (Fengshan), as shown in Table 4, we found that the pollutants O₃, PM_2.5, PM₁₀, CO, HC, THC, SO₂, NO₂, and NO_x impact air quality, and these pollutants were covered in [64,65].
(2): Industrial air pollutants: The mean AQI includes O₃, PM₁₀, PM_2.5, NO₂, SO₂, and CO concentrations [19]. In addition to these six pollutants used as air quality indicators, industry releases other pollutants; NO_x, SO₂, PMs, CO, and CO₂ are the most commonly released substances [66]. From the industrial monitoring station (Table 4), we found that PM_2.5, PM₁₀, O₃, SO₂, CO, NO_x, and NO₂ impact air quality, and these pollutants were listed in [19,66]. In the PM₁₀ forecast of the Mailiao dataset (Table 8), AR(p) = 11.55 indicates the best performance, showing that the industrial air pollutant PM₁₀ is not related to climatic factors or other pollutants but only to its own lag periods. Further, Figure 3 shows that PM₁₀ depends on the day (weekday or weekend). This experimental result suggests that PM₁₀ is caused by the operation of factories in industrial areas.
(3): Differences between traffic and industrial pollutants: In the current study, the main difference between the traffic and industrial pollutants is HC and THC, which are produced by the incomplete combustion of substances in vehicles (mobile pollution sources). That is, the traffic monitoring station records two more pollutants (HC and THC) than the industrial monitoring station.

(B): Interaction of pollutants and related variables

As reported in [67], air pollution presents obvious seasonal and regional characteristics, and the concentration of most air pollutants is affected by weather conditions, including wind speed, precipitation, RH, atmospheric pressure, and temperature. In the decision trees, the top nodes for high pollution are PM_2.5 at the traffic monitoring station (Figure 2) and PM_2.5 and PM₁₀ at the industrial monitoring station (Figure 3). PM_2.5 and PM₁₀ are therefore the main pollutants in the present study. We used the selected key variables, namely air temperature, wind speed, RH, month, and season (Table 4), to explore the interactions between the main pollutants (PM_2.5 and PM₁₀) and the related variables. The following are descriptions of the pattern analysis:

(1): From the patterns of PM_2.5 and PM₁₀ versus six key variables in the traffic dataset, we note the following: (a) there are lower PM_2.5 and PM₁₀ levels at 00:00–03:00, and the air quality is unhealthy at other times because Fengshan is a nightlife district; (b) there are lower PM_2.5 levels in June–July, and lower PM₁₀ levels in May–July (the lower PM_2.5 level occurs in the third season, and the lower PM₁₀ level occurs in the second season); and (c) the weather variables show that PM_2.5 and PM₁₀ are negatively correlated with wind speed, RH, and air temperature, indicating that these three variables can reduce the concentration of air pollutants.
(2): Based on the patterns of PM_2.5 and PM₁₀ versus the six key variables in the industrial dataset, we find the following: (a) peak PM_2.5 and PM₁₀ levels occur during work hours (08:00–17:00) because Mailiao is a high-pollutant district with a naphtha cracking plant; (b) the lowest PM_2.5 levels occur in July, and the lowest PM₁₀ levels occur in June (the lower PM_2.5 and PM₁₀ levels occur in the second season); and (c) the weather variables also show that PM_2.5 and PM₁₀ levels are negatively correlated with wind speed, RH, and air temperature, indicating that these three variables can reduce the concentrations of air pollutants.
(3): Every year in Taiwan, the northeast monsoon carries dust and haze from China, reducing the influence of the Pacific subtropical high pressure and the vertical diffusion capacity of the atmosphere [68]. The severely polluted seasons are winter and spring, and high temperatures can reduce the concentration of air pollutants in the summer; these results are the same as those in [23,24]. In summer, vertical convection in the atmosphere is enhanced and its vertical diffusion capacity is better, which reduces the amount of pollutants. Therefore, better air quality in Taiwan occurs in the summer season, which is consistent with pattern analysis points (1) and (2).

(C): Forecast improvement

Based on [39], when there is collinearity, the uncertainty associated with a single regression coefficient will be large because the coefficient is difficult to estimate. Therefore, the statistical test of the regression coefficient is unreliable. In addition, it is impossible to make an accurate statement about the contribution of each individual predictor variable to the forecast. Hence, in the current study, we deleted the collinear independent variables. The results show that deleting the collinear independent variables and adding the lag periods selected by the ARDL test result in better forecast performance in the RMSE and MAE metrics for the Fengshan and Mailiao datasets, as shown in Table 8. From the descriptive statistics of the main pollutants in Table 9, PM₁₀ fluctuates widely (it has the larger standard deviation). Therefore, PM₁₀ has a larger RMSE and MAE, as shown in Table 8.

5. Conclusions

In the current study, we propose an intelligent time series model based on variable selection and autoregressive distributed lag to forecast the concentrations of the top three pollutants. After selecting the variables, RF has better classification metrics (accuracy, AUC, recall, precision, and F1). From the generated rules and the important pollutants reported in [19,20], we find that the important pollutants affecting air quality are PM_2.5, PM₁₀, and O₃. In forecast preprocessing, we deleted the collinear independent variables and added lag periods through an ARDL test, and the intelligent time series SVR had better forecasting performance (RMSE and MAE). Based on the experimental results and discussions, the contributions of this study are as follows:

(1): Previous research [25,26,27] did not integrate the features selected by different feature selection methods to obtain the key features; hence, we synthesized the key features using the proposed integrated variable selection method. For researchers, this offers a novel way to improve integrated variable selection methods.
(2): The classification rules were generated by DT, which had the best results, and they identify the top three pollutants (PM_2.5, PM₁₀, and O₃). We suggest applying different algorithms to find the important air pollutants in future work.
(3): We forecast the top three pollutants (PM_2.5, PM₁₀, and O₃) after selecting variables by IVSM, deleting collinear variables, and obtaining the lag periods of the dependent and independent variables by the ARDL test; these three variable screening methods can improve forecast performance. Therefore, a combined feature selection method is an important process for air quality prediction.
(4): The advantage of ARDL-selected variables is that ARDL needs to run only once for all variables to find the lag periods of all variables, whereas ACF and PACF need to test lag periods 31 times because this study has 31 variables.

In future work, we can increase the number of monitoring station locations to compare pollutants in different environmental conditions. In addition, we can apply deep-learning-based classifiers and forecast methods to verify air quality issues.

Author Contributions

Conceptualization, C.-H.C. and M.-C.T.; methodology, C.-H.C. and M.-C.T.; software, M.-C.T.; validation, C.-H.C. and M.-C.T.; formal analysis, M.-C.T.; investigation, M.-C.T.; resources, C.-H.C.; data curation, C.-H.C.; writing—original draft preparation, C.-H.C.; writing—review and editing, M.-C.T.; visualization, M.-C.T.; supervision, C.-H.C. and M.-C.T.; project administration, M.-C.T.; funding acquisition, C.-H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

This study does not involve human participants or animal studies.

Data Availability Statement

The collected data are open data from the TEPA traffic and industrial monitoring stations (https://www.epa.gov/report-environment/outdoor-air-quality).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. International Energy Agency (IEA). Global Energy & CO₂ Status Report, The LATEST Trends in Energy and Emissions in 2018, Flagship Report. March 2019. Available online: https://www.iea.org/reports/global-energy-co2-status-report-2019/emissions (accessed on 19 February 2021).
2. TAQI. Taiwan Air Quality Annual Report. 2018. Available online: https://www.epa.gov.tw/DisplayFile.aspx?FileID=9FDF33456FA1DB1F (accessed on 24 February 2021).
3. Taiwan PM_2.5. Main Pollution Sources of PM_2.5 in Taiwan, Reported on 14 September 2018. Available online: https://www.fpg.com.tw/tw/issue/1/115 (accessed on 19 February 2021).
4. Leeuwen, F.X.R.V. A European perspective on hazardous air pollution. Toxicology 2002, 181, 355–359.
5. Nagel, G.; Stafoggia, M.; Pedersen, M.; Andersen, Z.J.; Galassi, C.; Munkenast, J.; Jaensch, A.; Sommar, J.; Forsberg, B.; Olsson, D.; et al. Air pollution and incidence of cancers of the stomach and the upper aerodigestive tract in the European Study of Cohorts for Air Pollution Effects (ESCAPE). Int. J. Cancer 2018, 143, 1632–1643.
6. WHO. Fact Sheet—Ambient Air Quality and Health. Updated May 2018. Available online: https://www.who.int/health-topics/air-pollution#tab=tab_1 (accessed on 19 February 2021).
7. Hoek, G.; Krishnan, R.M.; Beelen, R.; Peters, A.; Ostro, B.; Brunekreef, B.; Kaufman, J.D. Long-term air pollution exposure and cardio-respiratory mortality: A review. Environ. Health 2013, 12, 43.
8. Brook, R.D.; Newby, D.E.; Rajagopalan, S. Air Pollution and Cardiometabolic Disease: An Update and Call for Clinical Trials. Am. J. Hypertens. 2017, 31, 1–10.
9. Global Burden of Disease Study Risk Factors Collaborators. Global, regional, and national comparative risk assessment of 84 behavioural, environmental and occupational, and metabolic risks or clusters of risks for 195 countries and territories, 1990–2017: A systematic analysis for the Global Burden of Disease Study 2017. Lancet 2018, 392, 1923–1994.
10. WHO. 2018. Available online: https://www.who.int/news/item/29-10-2018-more-than-90-of-the-worlds-children-breathe-toxic-air-every-day (accessed on 19 February 2021).
11. Núñez-Alonso, D.; Pérez-Arribas, L.V.; Manzoor, S.; Caceres, J. Statistical Tools for Air Pollution Assessment: Multivariate and Spatial Analysis Studies in the Madrid Region. J. Anal. Methods Chem. 2019, 2019, 9753927.
12. Šimić, I.; Lovrić, M.; Godec, R.; Kröll, M.; Bešlić, I. Applying machine learning methods to better understand, model and estimate mass concentrations of traffic-related pollutants at a typical street canyon. Environ. Pollut. 2020, 263, 114587.
13. Akbal, Y.; Ünlü, K. A deep learning approach to model daily particular matter of Ankara: Key features and forecasting. Int. J. Environ. Sci. Technol. 2021, 19, 5911–5927.
14. Remeseiro, B.; Bolon-Canedo, V. A review of feature selection methods in medical applications. Comput. Biol. Med. 2019, 112, 103375.
15. Pilinis, C.; Seinfeld, J.H. Development and evaluation of an Eulerian photochemical gas-aerosol model. Atmos. Environ. 1988, 22, 1985–2001.
16. Grace, R.K.; Manju, S. A comprehensive review of wireless sensor networks based air pollution monitoring systems. Wirel. Pers. Commun. 2019, 108, 2499–2515.
17. EPA. Report on the Environment, Outdoor Air Quality. 2019. Available online: https://www.epa.gov/report-environment/outdoor-air-quality (accessed on 24 February 2021).
18. Heidarinejad, Z.; Kavosi, A.; Mousapour, H.; Daryabor, M.R.; Radfard, M.; Abdolshahi, A. Data on evaluation of AQI for different season in Kerman, Iran, 2015. Data Brief 2018, 20, 1917–1923.
19. Tan, X.; Han, L.; Zhang, X.; Zhou, W.; Li, W.; Qian, Y. A review of current air quality indexes and improvements under the multi-contaminant air pollution exposure. J. Environ. Manag. 2021, 279, 111681.
20. TEPA. 2021. Available online: https://airtw.epa.gov.tw/CHT/TaskMonitoring/Traffic/TrafficIntro.aspx (accessed on 19 February 2021).
21. Yao, X.; Chan, C.K.; Fang, M.; Cadle, S.; Chan, T.; Mulawa, P.; He, K.; Ye, B. The water-soluble ionic composition of PM2.5 in Shanghai and Beijing, China. Atmos. Environ. 2002, 36, 4223–4234.
22. Glavas, S.D.; Nikolakis, P.; Ambatzoglou, D.; Mihalopoulos, N. Factors affecting the seasonal variation of mass and ionic composition of PM2.5 at a central Mediterranean coastal site. Atmos. Environ. 2008, 42, 5365–5373.
23. Arnfield, A.J. Two decades of urban climate research: A review of turbulence, exchanges of energy and water, and the urban heat island. Int. J. Climatol. 2003, 23, 1–26.
24. Fallmann, J.; Forkel, R.; Emeis, S. Secondary effects of urban heat island mitigation measures on air quality. Atmos. Environ. 2016, 125, 199–211.
25. Sethi, J.K.; Mittal, M. A new feature selection method based on machine learning technique for air quality dataset. J. Stat. Manag. Syst. 2019, 22, 697–705.
26. Chen, B.; Zhu, G.; Ji, M.; Yu, Y.; Zhao, J.; Liu, W. Air Quality Prediction Based on Kohonen Clustering and ReliefF Feature Selection. Comput. Mater. Contin. 2020, 64, 1039–1049.
27. Kumar, K.; Pande, B.P. Air pollution prediction with machine learning: A case study of Indian cities. Int. J. Environ. Sci. Technol. 2022.
28. Cai, J.; Luo, J.; Wang, S.; Yang, S. Feature selection in machine learning: A new perspective. Neurocomputing 2018, 300, 70–79.
29. Hall, M.A. Correlation Based Feature Selection for Machine Learning. Ph.D. Thesis, University of Waikato, Hamilton, New Zealand, 1999.
30. Ghiselli, E.E. Theory of Psychological Measurement; McGraw Hill: New York, NY, USA, 1964.
31. Rodriguez-Lujan, I.; Huerta, R.; Elkan, C.; Cruz, C.S. Quadratic Programming Feature Selection. J. Mach. Learn. Res. 2010, 11, 1491–1516.
32. Lai, C.M.; Yeh, W.C.; Chang, C.Y. Gene selection using information gain and improved simplified swarm optimization. Neurocomputing 2016, 218, 331–338.
33. Jadhav, S.; He, H.; Jenkins, K. Information gain directed genetic algorithm wrapper feature selection for credit rating. Appl. Soft Comput. 2018, 69, 541–553.
34. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
35. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann Publishers: Burlington, ON, Canada, 2011.
36. Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 1994; pp. 171–182.
37. Kira, K.; Rendell, L.A. A Practical Approach to Feature Selection. Mach. Learn. Proc. 1992, 1992, 249–256.
38. Robnik-Šikonja, M.; Kononenko, I. Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69.
39. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2015.
40. Judge, G.G.; Griffiths, W.E.; Hill, R.C.; Lütkepohl, H.; Lee, T.-C. The Theory and Practice of Econometrics; John Wiley & Sons: New York, NY, USA, 1980.
41. Pesaran, H.; Shin, Y. An Autoregressive Distributed Lag Modeling Approach to Co-integration Analysis. In Econometrics and Economic Theory in the 20th Century: The Ragnar Frisch Centennial Symposium; Strom, S., Ed.; Cambridge University Press: Cambridge, UK, 1995; Volume 31.
42. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: San Francisco, CA, USA, 1993.
43. Guggari, S.; Kadappa, V.; Umadevi, V. Non-sequential partitioning approaches to decision tree classifier. Future Comput. Inform. J. 2018, 3, 275–285.
44. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
45. Mishra, A.K.; Ratha, B.K. Study of Random Tree and Random Forest Data Mining Algorithms for Microarray Data Analysis. Int. J. Adv. Electr. Comput. Eng. 2016, 3, 5–7.
46. Yarveicy, H.; Ghiasi, M.M. Modeling of gas hydrate phase equilibria: Extremely randomized trees and LSSVM approaches. J. Mol. Liq. 2017, 243, 533–541.
47. Pinto, A.; Pereira, S.; Rasteiro, D.; Silva, C.A. Hierarchical brain tumour segmentation using extremely randomized trees. Pattern Recognit. 2018, 82, 105–117.
48. Markuš, N.; Frljak, M.; Pandžić, I.S.; Ahlberg, J.; Forchheimer, R. Eye pupil localization with an ensemble of randomized trees. Pattern Recognit. 2014, 47, 578–587.
49. Shipway, N.J.; Barden, T.J.; Huthwaite, P.; Lowe, M.J.S. Automated defect detection for Fluorescent Penetrant Inspection using Random Forest. NDT E Int. 2018, 101, 113–123.
50. Khalilia, M.; Chakraborty, S.; Popescu, M. Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inform. Decis. Mak. 2011, 11, 51.
51. Hu, Z.; Wang, Y.; Zhang, X.; Zhang, M.; Yang, Y.; Liu, X.; Zheng, H.; Liang, D. Super-resolution of PET image based on dictionary learning and random forests. Nucl. Instrum. Methods Phys. Res. Sect. A Accel. Spectrometers Detect. Assoc. Equip. 2019, 927, 320–329.
52. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
53. John, V.; Liu, Z.; Guo, C.; Mita, S.; Kidono, K. Real-time Lane Estimation Using Deep Features and Extra Trees Regression. Image Video Technol. 2016, 9431, 721–733.
54. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
55. Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Am. J. Psychol. 1963, 76, 705–707.
56. Lee, S.-J.; Tseng, C.-H.; Lin, G.T.-R.; Yang, Y.; Yang, P.; Muhammad, K.; Pandey, H.M. A dimension-reduction based multilayer perception method for supporting the medical decision making. Pattern Recognit. Lett. 2020, 131, 15–22.
57. Fan, F.M.; Collischonn, W.; Meller, A.; Botelho, L.C.M. Ensemble streamflow forecasting experiments in a tropical basin: The São Francisco river case study. J. Hydrol. 2014, 519, 2906–2919.
58. Ballabio, D.; Grisoni, F.; Todeschini, R. Multivariate comparison of classification performance measures. Chemom. Intell. Lab. Syst. 2018, 174, 33–44.
59. Hossin, M.; Sulaiman, M.N. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1.
60. Melo, F. Area under the ROC Curve. In Encyclopedia of Systems Biology; Dubitzky, W., Wolkenhauer, O., Cho, K.-H., Yokota, H., Eds.; Springer: New York, NY, USA, 2013; pp. 38–39.
61. Chang, C.-C.; Lin, C.-J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27.
62. Kutner, M.H.; Nachtsheim, C.J.; Neter, J. Applied Linear Regression Models, 4th ed.; McGraw-Hill Irwin: New York, NY, USA, 2004.
63. Kripfganz, S.; Schneider, D.C. ARDL: Estimating Autoregressive Distributed Lag and Equilibrium Correction Models. In Proceedings of the 2018 London Stata Conference, London, UK, 6–7 September 2018.
64. Oduro, S.D.; Metia, S.; Duc, H.; Hong, G.; Ha, Q.P. Multivariate adaptive regression splines models for vehicular emission prediction. Vis. Eng. 2015, 3, 13.
65. Zhang, L.; Long, R.; Chen, H.; Geng, J. A review of China’s road traffic carbon emissions. J. Clean. Prod. 2019, 207, 569–581.
66. Eslami, S.; Sekhavatjou, M.S. Introducing an application method for industries air pollutants emission control planning by preparing environmental flow diagram maps. J. Clean. Prod. 2018, 178, 768–775.
67. Liu, Y.; Zhou, Y.; Lu, J. Exploring the relationship between air pollution and meteorological conditions in China under environmental governance. Sci. Rep. 2020, 10, 14518.
68. Griffith, S.M.; Huang, W.S.; Lin, C.C.; Chen, Y.C.; Chang, K.-E.; Lin, T.H.; Wang, S.H.; Lin, N.H. Long-range air pollution transport in East Asia during the first week of the COVID-19 lockdown in China. Sci. Total Environ. 2020, 741, 140214.

Figure 1. Computational procedure.

Figure 2. Tree of Fengshan dataset (selected variables) by DT.

Figure 3. Tree of Mailiao dataset (selected variables) by DT.

Table 1. Confusion matrix.

		Actual Situation
		True	False
Prediction	Positive	True positive (tp)	False positive (fp)
Prediction	Negative	False negative (fn)	True negative (tn)

Table 2. The number of records of each class in two datasets.

Dataset	Class A: Good	Class B: Moderate	Class C: Unhealthy for Sensitive Groups	Class D: Unhealthy
Fengshan (traffic)	1260	5023	1982	495
Mailiao (industry)	2450	3584	2182	544

Table 3. Parameter settings of the six intelligent algorithms.

Algorithm	Parameter	Reference
RF	bagsize = 100; iterations = 100	[44]
RT	minimal variance proportion = 0.001	[44]
ET	iterations = 10	[52]
DT	confidence factor: 0.25	[42]
SVR	kernel function: RBF; epsilon = 0.001; gamma = 1/n (n = #variables); C = 1.	[61]
MLPR	activation function: sigmoid; loss function: square error; learning rate = 0.001; hidden layer sizes = 200.	[56]

Table 4. Results of selected key variables by IVSM.

Dataset	Key Variable
Fengshan Traffic Monitoring Station	PM_2.5	PM₁₀	TEMP	RH	O₃-8 h	O₃-ontime	SO₂-24 h-ave
	O₃	CH₄	WS_HR	THC	month	PM_2.5-ave	NO₂-ontime
	NO₂	NO_x	CO-8 h-ave	day	highPul	PM₁₀-ave	season
Mailiao Industrial Monitoring Station	PM_2.5	PM₁₀	TEMP	RH	O₃-8 h	O₃-ontime	SO₂-24 h-ave
	O₃	season	WS_HR	CO	month	PM_2.5-ave	NO₂-ontime
	NO₂	NO_x	CO-8 h-ave	day	highPul	PM₁₀-ave

Note: the highPul denotes high pollution, and TEMP represents atmospheric temperature.

Table 5. Results of the four classifiers based on full and selected variables for the two datasets.

Dataset	Metrics	DT	RT	RF	ET
Fengshan (Full var.)	accuracy	99.98 (0.06)	96.56 (1.67)	99.82 (0.15)	99.65 (0.23)
	AUC	1.00 (0.00)	0.96 (0.02)	1.00 (0.00)	1.00 (0.00)
	recall	1.00 (0.00)	0.94 (0.03)	1.00 (0.00)	1.00 (0.00)
	precision	1.00 (0.00)	0.94 (0.03)	0.99 (0.01)	0.99 (0.01)
	F1	1.00 (0.00)	0.94 (0.03)	1.00 (0.00)	0.99 (0.00)
Fengshan (Selected var.)	accuracy	99.98 (0.06)	98.43 (1.03)	99.93 (0.09)	99.87 (0.13)
	AUC	1.00 (0.00)	0.98 (0.01)	1.00 (0.00)	1.00 (0.00)
	recall	1.00 (0.00)	0.97 (0.02)	1.00 (0.00)	1.00 (0.00)
	precision	1.00 (0.00)	0.97 (0.02)	1.00 (0.00)	1.00 (0.00)
	F1	1.00 (0.00)	0.97 (0.02)	1.00 (0.00)	1.00 (0.00)
Mailiao (Full var.)	accuracy	99.92 (0.10)	96.42 (1.66)	99.68 (0.19)	99.49 (0.26)
	AUC	1.00 (0.00)	0.98 (0.01)	1.00 (0.00)	1.00 (0.00)
	recall	1.00 (0.00)	0.98 (0.01)	1.00 (0.00)	1.00 (0.00)
	precision	1.00 (0.00)	0.98 (0.01)	1.00 (0.00)	1.00 (0.00)
	F1	1.00 (0.00)	0.98 (0.01)	1.00 (0.00)	1.00 (0.00)
Mailiao (Selected var.)	accuracy	99.92 (0.10)	98.41 (0.96)	99.81 (0.14)	99.76 (0.17)
	AUC	1.00 (0.00)	0.99 (0.01)	1.00 (0.00)	1.00 (0.00)
	recall	1.00 (0.00)	0.99 (0.01)	1.00 (0.00)	1.00 (0.00)
	precision	1.00 (0.00)	0.99 (0.01)	1.00 (0.00)	1.00 (0.00)
	F1	1.00 (0.00)	0.99 (0.01)	1.00 (0.00)	1.00 (0.00)

Note: bold denotes the best result for the four classifiers in terms of accuracy, AUC, recall, precision, and F1 metrics.

Table 6. Collinearity tests of O₃, PM_2.5, and PM₁₀ regression models for two datasets.

Fengshan Dataset						Mailiao Dataset
DV: O₃		DV: PM_2.5		DV: PM₁₀		DV: O₃		DV: PM_2.5		DV: PM₁₀
IV	VIF	IV	VIF	IV	VIF	IV	VIF	IV	VIF	IV	VIF
month	1.948	month	1.948	month	1.944	month	2.267	month	2.264	month	2.257
day	1.042	day	1.042	day	1.041	day	1.037	day	1.037	day	1.037
TEMP	2.771	TEMP	2.770	TEMP	2.736	TEMP	2.173	TEMP	2.173	TEMP	2.159
CH₄	6.852	CH₄	6.841	CH₄	6.819	CO	4.976	CO	4.457	CO	4.975
NO_x	11.436	NO_x	11.429	NO_x	11.424	NO_x	11.678	NO_x	11.597	NO_x	11.599
PM₁₀	8.757	PM₁₀	7.002	PM_2.5	6.952	PM₁₀	4.107	PM₁₀	3.473	PM_2.5	6.272
PM_2.5	8.693	RH	1.922	RH	1.927	PM_2.5	7.417	RH	1.951	RH	1.969
RH	1.927	THC	10.335	THC	10.345	RH	1.971	WS_HR	2.105	WS_HR	2.084
THC	10.356	WS_HR	1.547	WS_HR	1.549	WS_HR	2.111	O₃-8 h	3.290	O₃-8 h	3.254
WS_HR	1.549	O₃-8 h	2.579	O₃-8 h	2.632	O₃-8 h	3.307	O₃-ontime	3.783	O₃-ontime	3.687
O₃-8 h	2.643	O₃-ontime	2.954	O₃-ontime	3.009	O₃-ontime	3.790	PM_2.5-8 h	2.892	PM_2.5-8 h	7.406
O₃-ontime	3.019	PM_2.5-8 h	5.437	PM_2.5-8 h	10.809	PM_2.5-8 h	8.337	PM₁₀-8 h	5.063	PM₁₀-8 h	2.716
PM_2.5-8 h	12.486	PM₁₀-8 h	11.611	PM₁₀-8 h	5.856	PM₁₀-8 h	5.363	CO-8 h	4.076	CO-8 h	4.338
PM₁₀-8 h	13.224	CO-8 h	2.998	CO-8 h	3.028	CO-8 h	4.349	SO₂-24 h	1.496	SO₂-24 h	1.499
CO-8 h	3.028	SO₂-24 h	1.386	SO₂-24 h	1.381	SO₂-24 h	1.500	NO₂-ontime	12.033	NO₂-ontime	12.080
SO₂-24 h	1.388	NO₂-ontime	13.302	NO₂-ontime	13.338	NO₂-ontime	12.170	highPul	1.621	highPul	1.620
NO₂-ontime	13.382	highPul	2.448	highPul	2.445	highPul	1.621	season	1.699	season	1.700
highPul	2.448	season	1.761	season	1.763	season	1.700
season	1.763

Note: DV denotes a dependent variable, IV represents an independent variable, and the bold text denotes VIF > 10 with a collinearity problem.

Table 7. Results of ARDL test for variable lag periods of two datasets.

DV	Independent Variables and Lag Periods
Fengshan Dataset
O₃	O₃ (1, 2, 3, 4, 5)	TEMP (0, 1, 3, 4, 5)	NO₂ (0, 1, 2, 5)	NO_x (0, 1)
	PM₁₀ (0)	CH₄ (1)	THC (0)	WS_HR (0)
	O₃-8 h (0, 1, 2, 3, 4)	CO-8 h (0, 1)	highPul (2, 3)
PM_2.5	PM_2.5 (1, 2, 3, 4, 5)	TEMP (0)	O₃ (2, 4)	THC (2)
	PM₁₀ (0, 1, 2, 3, 4, 5)	O₃-8 h (2, 3)	PM_2.5-8 h (0, 1, 2, 3, 4, 5)	CO-8 h (1, 2, 3)
	PM₁₀-8 h (0, 1, 2, 5)
PM₁₀	PM₁₀ (1, 2, 3, 4, 5)	TEMP (3)	NO₂ (0)	O₃ (0)
	PM_2.5 (0, 1, 2, 3, 4, 5)	RH (0)	THC (3)	THC (4)
	WS_HR (2, 5)	highPul (2, 3)	PM_2.5 (0, 1, 2, 3, 5)	PM₁₀-8 h (0, 1, 2, 3, 4, 5)
	CO-8 h (3, 4, 5)	season (4)
Mailiao Dataset
O₃	O₃ (3, 4)	TEMP (1)	CO (1)	NO₂ (0, 1)
	O₃-8 h (0, 1)	O₃-ontime (0, 3, 4)	NO_x (0)	CO-8 h (5)
PM_2.5	PM_2.5 (1, 2, 3, 4, 5)	TEMP (1, 5)	CO (1)	NO_x (3)
	PM₁₀ (0)	WS_HR (0)	RH (0, 1, 2)	O₃-8 h (0, 1, 4, 5)
	PM_2.5-8 h (0, 1, 2, 4, 5)	PM₁₀-8 h (0, 1, 2, 4, 5)
PM₁₀	PM₁₀ (1, 2, 3, 4, 5)	NO_x (3)	PM_2.5 (0, 1, 2, 3, 4, 5)	WS_HR (0)
	RH (0, 1)	O₃-8 h (2)

Note: the numbers in parentheses are the selected lag periods of each variable; lag k denotes the value of the variable k periods earlier, and lag 0 denotes its current value.
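To make the lag notation of Table 7 concrete, the following Python sketch turns a lag specification into a table of regressors. The lag periods in the usage example are a subset of the Mailiao O₃ row (O₃ (3, 4), TEMP (1), NO₂ (0, 1)); the function name, the column labels, and all observation values are hypothetical and serve only as an illustration.

```python
def make_lag_table(series, lags, target):
    """Build lagged regressors from a lag specification.

    series: dict mapping variable name -> list of observations (equal lengths)
    lags:   dict mapping variable name -> iterable of lag periods (0 = current value)
    target: name of the dependent variable
    Returns (rows, y), where rows[i] is a dict of lagged regressors and
    y[i] is the target value at the same time step.
    """
    max_lag = max(lag for periods in lags.values() for lag in periods)
    n_obs = len(series[target])
    rows, y = [], []
    # Start at max_lag so that every requested lag is available.
    for t in range(max_lag, n_obs):
        rows.append({
            f"{name}(t-{lag})": series[name][t - lag]
            for name, periods in lags.items()
            for lag in periods
        })
        y.append(series[target][t])
    return rows, y


if __name__ == "__main__":
    # Hypothetical observations, for illustration only.
    series = {
        "O3": [30.0, 32.0, 28.0, 35.0, 40.0, 38.0, 33.0],
        "TEMP": [24.0, 25.0, 26.0, 25.5, 27.0, 28.0, 26.5],
        "NO2": [12.0, 14.0, 11.0, 15.0, 13.0, 16.0, 12.5],
    }
    lags = {"O3": (3, 4), "TEMP": (1,), "NO2": (0, 1)}
    rows, y = make_lag_table(series, lags, target="O3")
    for row, value in zip(rows, y):
        print(value, row)
```

The first max-lag observations are dropped because their lagged values do not exist, so the resulting table is shorter than the original series.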

Table 8. Forecast results of PM_2.5, PM₁₀, and O₃ for ARDL and IVSM-selected variables and AR(p).

Dataset	Metric	Target	MLPR	RF	ET	SVR	AR(p)
Fengshan (ARDL)	RMSE	PM_2.5	10.68	7.07	7.24	1.52	5.60
		PM₁₀	6.76	12.18	12.9	2.25	9.10
		O₃	11.14	4.54	5.09	2.29	6.90
Fengshan (IVSM)		PM_2.5	39.94	9.04	9.56	8.61	5.60
		PM₁₀	50.08	17.53	18.45	19.79	9.10
		O₃	0.42	1.58	1.90	0.10	6.90
Fengshan (Full)		PM_2.5	47.8	9.61	10.16	8.46	5.60
		PM₁₀	65.85	18.48	18.68	18.44	9.10
		O₃	0.37	2.30	2.98	0.06	6.90
Mailiao (ARDL)	RMSE	PM_2.5	6.16	3.73	3.88	0.91	4.88
		PM₁₀	70.21	22.4	25.23	21.41	22.38
		O₃	0.61	0.65	1.22	0.09	12.03
Mailiao (IVSM)		PM_2.5	43.73	5.16	5.61	5.09	4.88
		PM₁₀	99.82	23.19	30.43	27.03	22.38
		O₃	0.94	0.83	1.24	0.08	12.03
Mailiao (Full)		PM_2.5	44.2	5.07	5.88	5.04	4.88
		PM₁₀	88.17	22.82	25.63	26.75	22.38
		O₃	0.75	1.54	1.78	0.12	12.03
Fengshan (ARDL)	MAE	PM_2.5	2.03	5.06	5.3	1.14	4.23
		PM₁₀	2.67	8.06	8.78	1.65	6.32
		O₃	1.55	3.4	3.74	1.72	4.98
Fengshan (IVSM)		PM_2.5	6.67	39.13	42.09	38.71	4.23
		PM₁₀	15.66	36.24	38.38	39.14	6.32
		O₃	0.06	6.99	8.57	0.53	4.98
Fengshan (Full)		PM_2.5	7.91	58.08	45.4	37.78	4.23
		PM₁₀	19.01	54.86	40.58	40.05	6.32
		O₃	0.06	31.04	16.11	0.32	4.98
Mailiao (ARDL)	MAE	PM_2.5	0.67	2.56	2.79	0.68	3.23
		PM₁₀	14.72	14.23	15.79	12.72	11.55
		O₃	0.04	0.45	0.84	0.07	9.36
Mailiao (IVSM)		PM_2.5	4.26	32.9	35.87	33.75	3.23
		PM₁₀	22.96	50.59	61.12	53.58	11.55
		O₃	0.09	6.35	9.1	0.51	9.36
Mailiao (Full)		PM_2.5	4.35	67.56	41.78	35.84	3.23
		PM₁₀	24.32	109.07	52.33	54.61	11.55
		O₃	0.06	22.36	14.6	0.95	9.36

Note: bold numbers denote the best RMSE and MAE among the four forecasting models (MLPR, RF, ET, and SVR) for each of the three variable sets (ARDL, IVSM, and Full); AR(p) is an autoregression with p lags of the dependent variable.
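The two error measures reported in Table 8 can be computed as in the following Python sketch, which assumes the standard definitions of RMSE (square root of the mean squared error) and MAE (mean of the absolute errors). The function names and the sample values are illustrative and are not taken from the paper's experiments.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


if __name__ == "__main__":
    # Hypothetical concentrations and forecasts, for illustration only.
    observed = [31.0, 28.5, 35.2, 40.1]
    forecast = [30.0, 29.0, 33.0, 42.0]
    print(f"RMSE = {rmse(observed, forecast):.3f}")
    print(f"MAE  = {mae(observed, forecast):.3f}")
```

Because RMSE squares the errors before averaging, it penalizes large individual errors more heavily than MAE does.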

Table 9. Descriptive statistics of O₃, PM_2.5, and PM₁₀.

	Range	Mean	Standard Deviation
PM_2.5	120	32.52	15.63
PM₁₀	540.4	80.89	43.42
O₃	122	31.94	19.10

Note: Range = maximum value − minimum value.
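The three statistics in Table 9 can be obtained as in the short Python sketch below. It assumes the sample standard deviation (n − 1 in the denominator), since the source does not state which form was used; the function name and the sample values are hypothetical.

```python
from statistics import mean, stdev


def describe(values):
    """Range, mean, and sample standard deviation of a sequence of numbers."""
    return {
        "range": max(values) - min(values),  # maximum value minus minimum value
        "mean": mean(values),
        "std": stdev(values),  # assumption: sample standard deviation
    }


if __name__ == "__main__":
    # Hypothetical concentrations, for illustration only.
    sample = [12.0, 35.5, 28.0, 60.2, 41.3]
    for key, value in describe(sample).items():
        print(f"{key}: {value:.2f}")
```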

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
