1. Introduction
Air pollution is a global issue that threatens the public health and economic activities of the worldwide population [1,2,3]. Malaysia is no exception and has experienced public health issues and economic losses due to air pollution [4,5]. Research by Tajudin et al. [6] reported that two air pollutants, namely Nitrogen Dioxide (NO2) and Ozone (O3), have an immediate effect on hospital admissions related to cardiovascular disease in Kuala Lumpur. Meanwhile, Ab Manan et al. [7] stated that the haze episode in 2013 cost Malaysians approximately MYR 410 million, accumulated from medical expenses and income lost to medical leave. Thus, the air pollution problem must be appropriately addressed to minimize its health effects. One solution is to predict air quality in advance. Knowing the air quality in advance can help the local administration issue early warning alerts to residents so they can plan their activities accordingly.
Malaysia uses the Air Pollutant Index (API) to determine air quality. Malaysia, through APIMS (Air Pollutant Index Management System), has yet to develop a mechanism to predict API values in advance. There are, however, several apps that provide a forecasted air quality index (AQI) for Malaysian cities; one such app is Plume Labs: Air Quality Apps. This app uses real-time data from the Malaysia Department of Environment (MDOE) to predict future AQI, but its accuracy is questionable. A brief comparison between the actual AQI for the Kuala Lumpur region provided by IQAir and the values predicted by Plume Labs for 24 h (from 1 a.m., 28 August 2022 to 12 a.m., 29 August 2022) is plotted in Figure 1. The plots disagree, with large differences and an R² value of −0.2300. The low R² value indicates that the prediction made by Plume Labs has an accuracy issue.
Researchers around the world have proposed many air quality prediction methods [8,9,10,11,12]. Among them, a technique based on the Nonlinear Autoregressive Exogenous (NARX) Neural Network was found superior in many publications. A study by Gündoğdu [13] established that NARX outperforms Multilayer Perceptron (MLP) in the one-step-ahead prediction of Particulate Matter 10 (PM10) and Sulphur Dioxide (SO2) concentrations. The RMSE values for NARX prediction of PM10 and SO2 concentrations were 0.0191 and 0.0070, respectively, while MLP produced values of 0.0489 and 0.1121. Concurrently, NARX prediction of PM10 and SO2 produced R² values of 0.9773 and 0.9984, while MLP produced values of 0.8530 and 0.6048. In another study, a popular machine learning algorithm called the Support Vector Machine (SVM) was used to predict the monthly average PM10 concentration seven months in advance [14]. The prediction performance was compared to MLP, Autoregressive Integrated Moving-Average (ARIMA), and Vector Autoregressive Moving-Average (VARMA). The results showed that SVM performs better than the other methods in one-step-ahead and multi-step-ahead predictions. The one-step-ahead prediction performances of SVM, ARIMA, MLP, and VARMA measured by RMSE were 2.061, 2.283, 3.432, and 3.451, respectively. Meanwhile, for multi-step-ahead prediction, the RMSE of SVM was 1.990, followed by ARIMA (2.453), VARMA (3.121), and MLP (3.408).
A study employed NARX and SVM to predict the Air Quality Index (AQI) and concluded that NARX was better than SVM in one-step-ahead prediction [15]. The NARX gave an R² value of 0.9701, in contrast with SVM, which gave 0.8891. Another study compared the one-step-ahead prediction performance of NARX and SVM, amongst other methods, to predict PM2.5 concentrations [16]. They concluded that NARX has better prediction performance than SVM, with R² and RMSE values of 0.99 and 0.72, respectively, while SVM gave 0.70 and 5.75.
Despite the superiority of NARX over SVM reported in the latter two publications, Kumar et al. [17] proved that SVM outperformed NARX in hourly wind speed prediction. The prediction performance measured by Mean Squared Error (MSE) was 52.32 for SVM and 56.43 for NARX. Leong et al. [18] also achieved excellent API prediction using the SVM model. The research was conducted using the air quality data from 2009 to 2014 collected at eight monitoring stations in northern Malaysia. Prediction performance was measured in the R² value, and the SVM method achieved an R² of 0.9843 for one-step-ahead prediction. The superiority of NARX over other methods motivates this research to evaluate its performance in predicting the API in Malaysia’s industrial areas. Since the SVM method was also proven to have excellent prediction performance using the Malaysia API, it will be evaluated and compared to NARX.
At present, scholars are more interested in proposing new methods to predict air quality [19,20,21,22]. Often, studies use the one-step-ahead prediction performance to evaluate the superiority of the proposed methods. We believe the evaluation should not stop at only comparing the prediction accuracy but rather extend it as if the proposed methods will be implemented on-site. Issues that might affect the prediction performance from the perspective of actual on-site implementation, such as input normalization, input parameters, practical predictability limit, and robustness, should be evaluated.
This paper addresses these four on-site implementation issues by comparing the performance of two established predictors, the NARX and the SVM for regression (SVR). A careful analysis was designed and performed for each issue, providing valuable insight to researchers proposing new prediction methods. In addition, the outcomes of this study suggest how a multi-step-ahead API predictor for Malaysia API monitoring stations in industrial areas should be developed.
2. Materials and Methods
2.1. Study Area
Industrial activity is one of the major sources of air pollution [23,24]. Approximately 85% of air pollution in Malaysia comes from power plant emissions [25]. Accordingly, this research focuses on air quality in three renowned industrial areas in Malaysia: TTDI Jaya, Larkin, and Pasir Gudang (Figure 2).
These industrial areas are located near or surrounded by residential areas with a total population of more than 1.2 million. TTDI Jaya is in the Shah Alam district of Selangor. It is situated near Saujana Indah and the Hicom-Glenmarie industrial park, among many other industrial areas. Food, cosmetics, and machinery are among the products manufactured in this industrial area. Larkin and Pasir Gudang are in Johor Bahru, in the south of peninsular Malaysia. The Larkin industrial area houses factories for plastic and metal fabrication, food products, glass manufacturing, electronic components, and mechanical machines. Most of the companies operating in the Pasir Gudang industrial area are heavy industries, including shipbuilding, palm oil storage and distribution, transportation and logistics, petrochemicals, and construction.
2.2. Data Pre-Analysis and Treatment
The air quality data collected in 2018 and 2019 at these three industrial areas were provided by the Malaysia Department of Environment (MDOE). Each dataset contains hourly air quality parameters of Nitrogen Dioxide (NO2), Ozone (O3), Particulate Matter 2.5 (PM2.5), Particulate Matter 10 (PM10), Sulphur Dioxide (SO2), Carbon Monoxide (CO), and API. The hourly meteorological parameters, such as the ambient temperature (T), wind direction (WD), and wind speed (WS), were also provided in each dataset. A pre-analysis of the 2018 API parameter shows that the series does not exhibit seasonality at any of the three monitoring stations. The API values fluctuated randomly, mainly within the moderate level (50 to 100), with a maximum of 77 points and a minimum of 39 points. It can be concluded that the 2018 data represent the typical air quality at the three monitoring stations. Similar variations were observed in most parts of the 2019 data, except between September and November, when Malaysia was hit by severe regional and transboundary haze from Indonesia. During the haze episode, the API reached an unhealthy level (101 to 200) and a very unhealthy level (201 to 300) for several weeks at the three monitoring stations.
Some missing values and outliers (less than 3.5%) were found in the raw air quality data provided by the MDOE. For the purposes of developing an optimized predictor, the missing values and outliers were replaced by interpolated values using the Linear Interpolation Imputation method [26,27]. The Linear Interpolation Imputation method is explained by Equation (1), where f(x) is the interpolated value that replaces the missing value or outlier, and x is the point at which the interpolation is performed. Variables x0 and x1 are the known values before and after the missing value, respectively.
The outliers were determined by comparing each value with the median of the data. Values that are more than three Median Absolute Deviations (MADs) away from the median value were replaced [28]. The scaled MAD is defined by Equation (2), where xa is the average of the past values and xi are the past values for each time step in the dataset.
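The two treatment steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the deviation is taken around the median (the common scaled-MAD form), and the consistency constant 1.4826 is an assumption that is not stated in the text.

```python
from statistics import median

# Assumed consistency constant: scales the MAD so that it estimates the
# standard deviation of normally distributed data.
MAD_SCALE = 1.4826


def flag_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` scaled MADs from the median.

    Missing entries (None) are ignored when computing the median and the MAD.
    """
    known = [v for v in values if v is not None]
    center = median(known)
    scaled_mad = MAD_SCALE * median(abs(v - center) for v in known)
    return [
        i for i, v in enumerate(values)
        if v is not None and abs(v - center) > threshold * scaled_mad
    ]


def interpolate_gaps(values):
    """Fill None entries by linear interpolation between the nearest known neighbours.

    Leading or trailing gaps have only one known neighbour and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


def clean_series(values, threshold=3.0):
    """Replace outliers and missing values with linearly interpolated values."""
    outliers = set(flag_outliers(values, threshold))
    masked = [None if i in outliers else v for i, v in enumerate(values)]
    return interpolate_gaps(masked)
```

For an hourly series such as `[50, 52, None, 56, 300, 60, 58]`, the value 300 is flagged as an outlier, and both it and the missing entry are replaced by the midpoint of their neighbours.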
Table 1 presents the data range and the correlation between each air quality parameter and the API for the three monitoring stations. The PM10 and PM2.5 parameters show a clearly stronger correlation with the API than the other parameters at all three monitoring stations.
2.3. Multi-Step Ahead Predictor
Three common strategies can be adopted in machine learning to perform multi-step-ahead prediction: recursive, direct, and multiple outputs. The recursive strategy is the simplest and requires a single model with a single output. In the recursive approach, the predicted output at (t) is fed back as input to predict the output at (t + 1). Then the predicted output at (t + 1) is fed back as input to predict the output at (t + 2). The process continues until the desired step is achieved. The direct strategy requires n models to predict the outputs at (t + 1) to (t + n). Each model has a single output and is trained to predict the output a specific number of steps ahead. Hence, ten models will be developed if the system is to predict one to ten steps ahead. In many studies, the direct strategy produced more accurate multi-step-ahead predictions [29,30]. On the other hand, a single model with n outputs is utilized in the multiple-outputs strategy to predict the (t + 1) to (t + n) values.
This paper employed the direct strategy to obtain the multi-step ahead prediction. In this study, 24 optimized models were used to obtain the hourly 1- to 24-step-ahead predictions, equivalent to a day-ahead prediction.
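To make the direct strategy concrete, the sketch below builds one training set per prediction horizon and then applies one model per horizon to the same current inputs. It is only an illustration under simple assumptions: the function names are hypothetical, each input row holds the parameters available at hour t, and any regression model could be trained on each pair of inputs and targets.

```python
def make_direct_datasets(inputs, target, max_horizon=24):
    """Build one training set per horizon for the direct strategy.

    inputs: list of feature rows, one per hourly time step t
    target: list of API values aligned with `inputs`
    Returns {s: (X_s, y_s)}, where X_s[k] is inputs[k] and y_s[k] is target[k + s].
    """
    if len(inputs) != len(target):
        raise ValueError("inputs and target must have the same length")
    datasets = {}
    for s in range(1, max_horizon + 1):
        n_pairs = len(target) - s  # the last s rows have no target s steps ahead
        if n_pairs <= 0:
            break
        datasets[s] = (inputs[:n_pairs], target[s:])
    return datasets


def predict_day_ahead(models, current_inputs):
    """Apply each horizon-specific model to the same current inputs.

    models: {s: callable} mapping a horizon s to a trained single-output model.
    Returns the predictions ordered by horizon, from the shortest to the longest.
    """
    return [models[s](current_inputs) for s in sorted(models)]
```

Unlike the recursive strategy, no prediction is fed back as an input here, so an error made at one horizon does not propagate to the next; the cost is that every horizon needs its own trained model.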
2.3.1. The Nonlinear Autoregressive Exogenous (NARX) Neural Network Model
NARX is a dynamic neural network with recurrent input fed by the feedback connection encircling the network layers [31]. A two-layer feed-forward NARX network that consists of a hidden layer and an output layer was used in this research. A sigmoidal transfer function was used in the hidden layer, and a linear function was employed in the output layer. The NARX feedback connection was removed, making it a purely open-loop feed-forward network.
The inputs of the NARX model consist of the currently available air quality and meteorological parameters (NO2, O3, PM2.5, PM10, SO2, CO, API, T, WD, and WS), while the output is the predicted future API values. Two hidden neurons were used in the hidden layer, determined by analysis in a preliminary study [32]. The NARX model employed the Levenberg–Marquardt algorithm for training. A total of 24 NARX models were developed and trained to obtain 1- to 24-step-ahead prediction. Each unit in the 24 models was built from the s-step predictor depicted in Figure 3.
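The forward pass of one such open-loop predictor can be sketched as below, with ten inputs, two hidden neurons, and a linear output. This is an illustrative sketch only: the hyperbolic tangent is assumed as the sigmoidal transfer function, the weights are randomly initialized placeholders, the function names are hypothetical, and the Levenberg–Marquardt training step is not shown.

```python
import math
import random


def init_network(n_inputs=10, n_hidden=2, seed=0):
    """Create placeholder weights for a two-layer feed-forward network."""
    rng = random.Random(seed)
    return {
        "w_hidden": [
            [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
            for _ in range(n_hidden)
        ],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }


def forward(net, x):
    """Map the current inputs x to a single s-step-ahead API prediction.

    Hidden layer: tanh (an assumed sigmoidal transfer function).
    Output layer: linear combination of the hidden activations.
    """
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(net["w_hidden"], net["b_hidden"])
    ]
    return sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"]
```

Because the feedback connection is removed, the network only sees measured values at time t, which is what allows 24 independent copies to be trained for the 24 horizons.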
2.3.2. The Support Vector Regression (SVR) Model
The Support Vector Machine (SVM) is a supervised machine learning approach widely used to solve classification problems [33]. The SVM can also be used to solve regression problems, where it predicts continuous values, and is then usually referred to as Support Vector Regression (SVR). In SVR, a margin of tolerance known as epsilon is introduced, which is the error tolerated by the SVR [34]. As in the classification problem, a kernel function is applied in SVR to handle the dimensionality problem of nonlinear data. The well-tested kernel functions are Medium Gaussian, Coarse Gaussian, Fine Gaussian, Cubic, Quadratic, and Linear.
Figure 4 shows the SVR model used to perform the multi-step-ahead prediction. The SVR inputs were fed with the currently available air quality and meteorological parameters, and the output was set to the s-step-ahead API value. The C and epsilon parameters were kept at their default values during the training and testing stages. The default value of C is an estimate of the standard deviation of the response variable y (the real API) computed from its interquartile range, while the default value of epsilon is one-tenth of the C value. Twenty-four SVR models with the Linear kernel were employed using the direct approach to obtain the 24-step-ahead API prediction.
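The default parameter choice described above can be illustrated with a short sketch. The divisor 1.349 (the interquartile range of a normal distribution expressed in standard deviations) and the quartile method are assumptions, since the text only states that the standard deviation is estimated from the interquartile range; the function names are hypothetical. The second function shows the usual epsilon-insensitive loss, in which errors smaller than epsilon are not penalized.

```python
from statistics import quantiles


def default_svr_parameters(y):
    """Estimate default C and epsilon from the response variable y.

    C is an interquartile-range estimate of the standard deviation of y
    (assumed divisor 1.349), and epsilon is one-tenth of C.
    """
    q1, _, q3 = quantiles(y, n=4)
    c = (q3 - q1) / 1.349
    return c, c / 10.0


def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Return zero inside the epsilon tube, and the excess error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)
```

With this loss, a prediction of 61 for an actual API of 60 costs nothing when epsilon is 2, whereas a prediction of 65 is penalized by the 3 points that exceed the tolerance.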
2.4. Performance Indicator
RMSE and R² were used to assess the prediction performance of the NARX and SVM models. RMSE expresses the prediction error, that is, the difference between the predicted and the actual value of API. The R² value represents the ratio of the variation in the predicted API value that can be explained by the linear association between the actual and predicted API values to the total variation of the predicted API value. Equations (3) and (4) define the RMSE and R², respectively.
Based on the equations, Pt is the predicted API while P̄ is its mean, Tt is the actual value of API while T̄ is its mean, N is the number of data points used in the measurement, σP is the standard deviation of the predicted API, and σT is the standard deviation for the actual value of API.
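Both indicators can be computed as in the sketch below. It assumes that the R² of Equation (4) is the squared linear correlation between predicted and actual values, which is what the description in terms of means and standard deviations suggests; the function names are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual API values."""
    n = len(actual)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, actual)) / n)


def r_squared(predicted, actual):
    """Squared linear correlation between predicted and actual API values.

    Assumed form: (covariance / (std_predicted * std_actual)) ** 2,
    using population (divide-by-N) statistics throughout.
    """
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_t = sum(actual) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, actual)) / n
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in actual) / n)
    return (cov / (std_p * std_t)) ** 2
```

In this squared-correlation form the value always lies between 0 and 1 and measures only linear association, so a predictor with a constant offset can still score close to 1 while its RMSE is large. That is one reason to report the two indicators together.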
4. Conclusions
The present study developed two multi-step-ahead API predictors based on NARX and SVR using Malaysia air quality data collected at three renowned industrial areas. Both predictors were evaluated for their ability to perform multi-step-ahead API prediction using the air quality parameters NO2, O3, PM2.5, PM10, SO2, CO, and API and meteorological parameters T, WD, and WS. The analyses reveal that both predictors show comparable performance in multi-step API prediction, with the SVR slightly outperforming the NARX.
The SVR predictor can also perform multi-step prediction using the actual (non-normalized) data; hence, it is simpler to implement in actual applications. For uniformity, all air quality and meteorological parameters can be included as the predictor’s inputs, as removing some parameters did not affect prediction performance. This finding indicates that a uniform SVR predictor can be installed in all air quality monitoring stations in Malaysia’s industrial areas. Regarding robustness and the need for frequent retraining, SVR is also better than NARX, as it is more resilient to outliers and more stable. As Wang and Han [42] recommended, a predictor developed offline must be updated periodically to match the latest trends. However, based on the trends exhibited by the Malaysia API data, a yearly update is sufficient for SVR due to its resilience and stability. Based on the results, this study proposes that the SVR predictor could be applied in practice to enhance MDOE service quality by providing API prediction information in advance.
Going forward, the SVR predictor should be immune to missing or false data so that the API prediction is reliable and uninterrupted. Thus, future research should focus on finding a supporting mechanism that provides continuous and valid data in case such a problem happens on-site. In addition, adaptive machine learning could be explored and adopted to deal with outliers.