Article

Prediction of PM10 Concentration in Malaysia Using K-Means Clustering and LSTM Hybrid Model

by Noratiqah Mohd Ariff *, Mohd Aftar Abu Bakar and Han Ying Lim
Department of Mathematical Sciences, Faculty of Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Selangor, Malaysia
* Author to whom correspondence should be addressed.
Atmosphere 2023, 14(5), 853; https://doi.org/10.3390/atmos14050853
Submission received: 29 March 2023 / Revised: 19 April 2023 / Accepted: 26 April 2023 / Published: 11 May 2023
(This article belongs to the Special Issue Data Analysis of Atmospheric and Air Quality Process)

Abstract:
Following the rapid development of various industrial sectors, air pollution frequently occurs in every corner of the world. As a dominant pollutant in Malaysia, particulate matter PM10 can cause highly detrimental effects on human health. This study aims to predict the daily average concentration of PM10 based on data collected from 60 air quality monitoring stations in Malaysia. Building a forecasting model for each station is time-consuming and unrealistic; therefore, a hybrid model that combines the k-means clustering technique and the long short-term memory (LSTM) model is proposed to reduce the number of models and the overall model training time. Based on the training set, the stations were clustered using the k-means algorithm and an LSTM model was built for each cluster. Then, the prediction performance of the hybrid model was compared with that of the univariate LSTM model built independently for each station. The results show that the hybrid model has a prediction performance comparable to the univariate LSTM model, giving relative percentage differences (RPD) of less than or equal to 50% for at least two accuracy metrics at 43 stations. The hybrid model can also fit the actual data trend well with a much shorter training time. Hence, the hybrid model is more competitive and suitable for real applications to forecast air quality.

1. Introduction

In line with the rapid development of various industrial sectors, air pollution frequently occurs worldwide, including in Malaysia. According to the World Health Organization (WHO) [1], air pollution is defined as the contamination of indoor and outdoor environments by impurities that modify the natural features of the environment. Data collected by WHO reveal that most of the global population breathes highly contaminated air that exceeds WHO guideline limits. Air pollution can cause detrimental effects on human health, especially on the respiratory system, and has become one of the fundamental sources of morbidity and mortality [1].
In Malaysia, the air pollutant index (API) is based on six main air pollutants and serves as an indicator to deliver accurate and insightful information on the air quality status of any area to the public [2]. Rani et al. [3] analyzed the trend of the API in Malaysia from 2010 to 2015 based on various categories using XLSTAT. In October 2010, the concentration of particulate matter 10 μm or less in diameter, better known as PM10, was extremely high in some areas of Johor following the occurrence of forest fires in Indonesia, which led to high API values since the API reading is taken from the highest relative subindex of the monitored pollutants [3,4]. This suggests that such fine dust, often found in polluted air, contributes greatly to the variability of the API [3].
Particulate matter is not just the main air pollutant in the Southeast Asia region, but is also identified as the most severe urban pollutant around the globe [5,6]. For instance, most of the daily average PM10 concentrations at three monitoring stations in Buenos Aires from 2010 to 2018 exceeded the WHO guideline limit of 50 μg/m3 [7]. Some research findings highlight that particulate matter concentrations have certain correlations with weather conditions, seasons and monsoons [8,9,10].
Due to the increasing public awareness of the dangers of air pollution, numerous air quality-related studies have been performed using various statistical and deep learning models, including forecasting and clustering. Clustering is an exploratory data analysis technique that investigates the fundamental structure of data [11]. By adopting the clustering technique, the data are assigned into several distinct groups based on their degree of similarity before any further analysis or modeling can be performed. As the data within the cluster can be treated using the same analysis technique, it can save costs and computation time. There are several types of clustering methods, such as partitional clustering, hierarchical clustering and fuzzy clustering. Hierarchical clustering groups similar objects into clusters that eventually merge into a single cluster, whereas fuzzy clustering is a soft-clustering technique in which the objects can be clustered into more than one cluster. As a partitional clustering method, the k-means algorithm is one of the most common and popular techniques since it can be implemented easily [12]. It classifies data with closer centroid values into the same cluster such that the differences between the clusters are maximized. For instance, k-means clustering was used to analyze the significant changes in air quality in Southampton [13]. While Kim et al. [14] applied this algorithm to cluster monitoring stations in the United States based on different temporal patterns of PM2.5, Beaver and Palazoglu [15] adopted it to classify classes of ozone episodes in San Francisco.
Air quality time series clustering in Malaysia is often utilized to identify patterns between clusters and categorize areas into zones based on pollution level so that government policies can be executed accurately [16]. In this context, Suris et al. [17] clustered the PM10 data in Malaysia using dynamic time warping (DTW) as the dissimilarity measure. Adopting four clustering techniques, that is, k-means, partitioning around medoids (PAM), agglomerative hierarchical clustering (AHC) and fuzzy k-means (FKM), the results show that the clusters were formed mainly on the basis of the region and geographical location of the stations rather than the station category and local economic activities. A similar result was obtained by Rahman et al. [11], whereby the stations were classified into high, medium and low pollution regions using the AHC technique based on the daily average PM2.5 concentration.
As climatic and environmental issues concern society, air quality forecasting has become a focus among researchers, since an accurate prediction can reduce the effect of pollution on humans and the biosphere [18]. Therefore, various types of prediction models have been applied in previous studies. For instance, Aditya et al. [19] used the logistic regression and autoregression (AR) models to detect air quality and predict the concentration of PM2.5. A similar approach is shown in the research by Bhalgat et al. [18], which adopted AR and autoregressive integrated moving average (ARIMA) models to predict the concentration of sulfur dioxide (SO2). Meanwhile, Guo et al. [20] used a geographically and temporally weighted regression model to calibrate the spatiotemporal dynamic PM2.5 concentrations to manage haze pollution in China. The random forest method is also deemed capable of modelling various concentrations of air pollutants, such as PM2.5 and ozone [21,22]. In fact, random forest regression is believed to predict air pollutant concentrations more accurately than linear regression and decision trees [23].
In recent years, neural networks have been preferred by researchers over the abovementioned traditional models due to their ability to fit non-linear data with higher accuracy [10]. The long short-term memory (LSTM) model is a deep learning method modified based on the concept of the recurrent neural network (RNN). Given its strength in solving the shortcomings of the RNN model, such as poor performance on tasks that involve long-term dependencies and vanishing and exploding gradients, the LSTM is found to be suitable for predicting sequential data, including time series data. The outstanding performance of the LSTM model is observed through a lower root mean squared error (RMSE) in predicting the prices of gold [24] and Bitcoin [25], as well as influenza-like illnesses and respiratory diseases [26].
In terms of air quality prediction, the LSTM model also possesses great potential to give an accurate result [27]. The findings obtained by Bakar et al. [28] show that the multivariate LSTM model predicted the PM10 concentration at five selected monitoring stations most accurately with the lowest RMSE values, followed by the univariate LSTM model and the univariate ARIMA model. Aiming to increase prediction accuracy, hybrid models that involve a combination of techniques are gaining popularity in the research field. Zhang et al. [29] discovered that the combination of principal component analysis (PCA) and least squares support vector machine (LSSVM) can reduce the noise in meteorological data, hence giving more accurate predictions in API than the ARIMA model. The PCA–ANN model that uses only the significant parameters also seems competitive in giving a better prediction than the standalone artificial neural network (ANN) model [30].
A clustering-based LSTM model considers feature changes that are specific to each cluster, making it an ideal choice for improving prediction accuracy. Yulita et al. [31] utilized fuzzy clustering and bidirectional LSTM (Bi-LSTM) to obtain higher accuracy and precision in classifying sleep stages. Consistent with the findings of Liu et al. [32], who used AHC–LSTM for load prediction in dynamic spectrum allocation, Li et al. [33] found that type-2 fuzzy clustering-based LSTM can increase accuracy with a much shorter model training time in long-term traffic volume prediction compared to the LSTM, random forest, back propagation network (BPN) and deep neural network (DNN) models.
Besides the abovementioned combinations, k-means clustering is also one of the widely used techniques in hybrid models. Ao et al. [10] first clustered meteorological data according to seasons using the k-means algorithm, then combined the clustering results with the air pollutant concentrations to be input into the Bi-LSTM model. It was found that the proposed model outperforms the other models as it can overcome the continuous fluctuation in meteorological conditions. Using the k-means–LSTM model, Baca et al. [34] also obtained a better air quality prediction in Andahuaylas, Peru.
Air quality prediction is indeed important for society to take preliminary preparations and preventive measures against poor air conditions. To investigate the potential of the hybrid model in predicting the daily average PM10 concentration in Malaysia, this study proposes a clustering-based LSTM model and compares its performance with the univariate LSTM model without clustering. Being a state-of-the-art deep learning method, the LSTM model usually outperforms conventional forecasting models in prediction accuracy. However, it is too time-consuming and unrealistic to construct the model individually for each station, especially in real-life applications. If the model is trained on a few sampled stations and its findings are generalized to all stations, it might result in undesirably low accuracy at some stations outside the sample. Therefore, such a combination of techniques is deemed capable of increasing the prediction accuracy with much less computation time, thus proving to be more efficient than the classical forecasting technique.

2. Materials and Methods

2.1. Data Preprocessing

The data used in this study are the daily average PM10 concentrations monitored at 60 air quality monitoring stations in Malaysia from 5 July 2017 to 31 January 2019, provided by the Malaysian Department of Environment (DOE). The dataset, with a length of 576 days for each time series, was divided into a training set and a test set based on a ratio of 8:2 [18,26,35]. Data normalization was carried out in order to eliminate the effect of the wide range observed in the PM10 concentrations, to speed up the training process and to increase prediction accuracy [35]. The training data were scaled into the range [0, 1] using the min–max scaler as follows:
$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}},$$
where $x_{scaled}$ and $x$ refer to the scaled data and the original data, respectively, whereas $x_{min}$ and $x_{max}$ represent the minimum and maximum values of the data, respectively.
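For illustration, this preprocessing step can be sketched as below with pandas and scikit-learn; the file name, column layout and variable names are hypothetical rather than taken from the authors' code, and MinMaxScaler simply implements the formula above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical input: one column of daily average PM10 values per station, indexed by date.
pm10 = pd.read_csv("pm10_daily.csv", index_col="date", parse_dates=True)

# Chronological 8:2 split into training and test sets.
split = int(len(pm10) * 0.8)
train, test = pm10.iloc[:split], pm10.iloc[split:]

# Fit the min-max scaler on the training data and reuse it for the test data, i.e.
# x_scaled = (x - x_min) / (x_max - x_min), column by column.
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # shape (n_days, n_stations)
test_scaled = scaler.transform(test)
```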

2.2. Time Series K-Means Clustering

The k-means approach is a partitional clustering technique that decomposes the data into a set of disjoint clusters based on the nearest centroids.
Let $X = \{x_{ij} : i = 1, \ldots, I;\ j = 1, \ldots, J\}$ be a data matrix, where $x_{ij}$ represents the $j$-th variable observed for the $i$-th object. According to Kobylin and Lyashenko [36], the k-means algorithm usually adopts the Euclidean distance as the proximity measure:
$$d_{il} = \sqrt{\sum_{j=1}^{J} \left( x_{ij} - x_{lj} \right)^2}.$$
This distance measure has been proven competitive in terms of time series classification accuracy [37].
Additionally, the shape-based DTW distance can also be implemented to measure the proximity in time series clustering. Despite being a good similarity and dissimilarity measure [17], this approach typically consumes more computation time due to its dynamic and complicated calculations [38]. Since the time series data are of the same length, the Euclidean distance has been chosen as the proximity measure [39].
The procedure for time series k-means clustering is as follows:
(i) Initiate the k clusters based on randomly chosen cluster centroids;
(ii) Allocate each data point to the nearest cluster using the Euclidean distance;
(iii) Recompute the cluster centroids based on the current cluster members;
(iv) Repeat steps (ii) and (iii) until there are no changes in the cluster membership.
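Since the series are of equal length and the Euclidean distance is used, the procedure above reduces to ordinary k-means applied to a station-by-day matrix. A minimal sketch with scikit-learn is shown below; `train_scaled` is assumed to be the normalized training array from the preprocessing sketch, and the two-cluster setting anticipates the result reported later.

```python
import numpy as np
from sklearn.cluster import KMeans

# station_series: one row per station, one column per day, taken from the normalized
# training data (here the transpose of the hypothetical `train_scaled` array).
station_series = train_scaled.T

# With equal-length series and the Euclidean distance, time series k-means reduces to
# ordinary k-means on the rows of this matrix.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(station_series)

for k in np.unique(labels):
    print(f"Cluster {k + 1}: {np.sum(labels == k)} stations")
```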
The k-means algorithm classifies a time series into k clusters in such a way that the within-group sum of squares (WGSS) is minimized. According to Maharaj et al. [40], the objective function of the k-means clustering is as follows:
$$\min \sum_{k=1}^{K} \sum_{i=1}^{I} \sum_{j=1}^{J} u_{ik} \left( x_{ij} - \bar{x}_{kj} \right)^2,$$
where $u_{ik}$ is the degree of membership of the $i$-th object in the $k$-th cluster, taking a value in $\{0, 1\}$, and $\bar{x}_{kj}$ denotes the $j$-th component of the $k$-th cluster centroid. If $u_{ik} = 1$, the $i$-th object is in the $k$-th cluster; on the contrary, $u_{ik} = 0$ shows that the $i$-th object is not in the $k$-th cluster.
Choosing the optimal number of clusters, k, can be a challenging task. In this study, the optimal k is chosen based on internal indices, namely the WGSS visualized on the elbow plot and the silhouette index. For each time series, the error is defined as the distance to the nearest cluster centroid [41].
The k that gives the highest gradient and the sharpest elbow curve is chosen as the candidate before it is evaluated by the silhouette index, as shown below:
$$s = \frac{b - a}{\max(a, b)},$$
where a is the average distance within the cluster and b represents the average distance between the clusters. This index is a metric that evaluates the accuracy of a clustering technique based on scores between −1 and 1. A coefficient of 1 indicates that the clusters are well separated and clearly distinguished, whereas a score of −1 means that the clusters are not appropriately partitioned. If the silhouette index has a value of 0, it shows that the distance between the clusters is insignificant. Therefore, a higher index score indicates a better separation of the clusters [42,43].
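A minimal sketch of this selection step is given below, assuming the same hypothetical `station_series` matrix as in the clustering sketch; scikit-learn's `inertia_` attribute supplies the WGSS for the elbow plot and `silhouette_score` the silhouette index.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# WGSS (inertia) for k = 1, ..., 10, used to draw the elbow plot.
wgss = {k: KMeans(n_clusters=k, n_init=10, random_state=42)
              .fit(station_series).inertia_
        for k in range(1, 11)}

# Silhouette index for the candidate values of k read off the elbow plot.
candidates = [2, 3, 4]
silhouette = {k: silhouette_score(
                  station_series,
                  KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(station_series))
              for k in candidates}

# The candidate with the highest silhouette score gives the best-separated clusters.
best_k = max(silhouette, key=silhouette.get)
```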

2.3. LSTM Model

2.3.1. Introduction

An LSTM model is the extension of RNN and is capable of learning long-term dependency and storing the information for a long period. These characteristics of LSTM make it a state-of-the-art model, especially in time series prediction, which highly depends on the changing patterns of previous values.
Generally, the chain-like LSTM structure consists of three gates that control the flow of information in the memory cell, namely, the forget gate, input gate and output gate. In every cell, there are two types of non-linear activation functions, that is, the sigmoid function and the hyperbolic tangent (tanh) function. The other components of the LSTM cell include the cell state and hidden state. At each gate, there exist weights, $W$, and biases, $e$.
According to Colah [44], the key to LSTM is the cell state, which is the horizontal line running through the top of the diagram shown in Figure 1.
The cell state runs straight down the entire chain with a few minor linear interactions. Information can flow along the cell state under the control of three gates that are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer gives an output between 0 and 1 to indicate how much of each component should be let through. None of the information can flow through the gates when a value of 0 is output. On the other hand, a value of 1 indicates all the information can be let through.
The process in the LSTM cell begins at the forget gate, where the sigmoid layer determines what information needs to be removed from the cell state. Looking at the former hidden state $h_{t-1}$ and the input data $x_t$, it outputs a value between 0 and 1 for each number in the former cell state $C_{t-1}$. This process can be described by the following equation:
$$f_t = \sigma \left( W_{hf} h_{t-1} + W_{xf} x_t + e_f \right).$$
Next, the new information to be stored in the cell state will be determined in two steps. Firstly, the sigmoid layer at the input gate will determine which values are to be updated. Secondly, the tanh layer will produce a vector of new candidate values, $\tilde{C}_t$, that could be added to the cell state. These processes can be expressed as follows:
$$u_t = \sigma \left( W_{hu} h_{t-1} + W_{xu} x_t + e_u \right),$$
$$\tilde{C}_t = \tanh \left( W_{hc} h_{t-1} + W_{xc} x_t + e_c \right).$$
Then, a combination of these outputs will be used to update the former cell state $C_{t-1}$ into the new cell state $C_t$. The former cell state is multiplied by $f_t$ to discard the selected information before adding the product $u_t \cdot \tilde{C}_t$, which contains the new candidate values scaled by how much each cell state value should be updated. The process is described by the following equation:
$$C_t = f_t \cdot C_{t-1} + u_t \cdot \tilde{C}_t.$$
Finally, the output gate decides what information should be output based on the filtered cell state. Firstly, the former hidden state and the input data will be run through the sigmoid layer to decide which part is to be eliminated. Then, the cell state will be put through the tanh layer to generate the values between −1 and 1 before multiplying by the output from the sigmoid layer. Eventually, only the decided portion will be output. The following equation summarizes the processes that occur at the output gate:
$$o_t = \sigma \left( W_{ho} h_{t-1} + W_{xo} x_t + e_o \right),$$
$$h_t = y_t = o_t \cdot \tanh \left( C_t \right).$$
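To make the gate equations concrete, the following sketch implements a single forward step of an LSTM cell in NumPy; the parameter dictionary `p` and its keys are illustrative placeholders rather than trained weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM time step following the gate equations above; p is a dict holding
    the weight matrices W_h*, W_x* and bias vectors e_* for the forget (f),
    update (u), candidate (c) and output (o) components."""
    f_t = sigmoid(p["W_hf"] @ h_prev + p["W_xf"] @ x_t + p["e_f"])      # forget gate
    u_t = sigmoid(p["W_hu"] @ h_prev + p["W_xu"] @ x_t + p["e_u"])      # input gate
    C_tilde = np.tanh(p["W_hc"] @ h_prev + p["W_xc"] @ x_t + p["e_c"])  # candidate values
    C_t = f_t * C_prev + u_t * C_tilde                                  # new cell state
    o_t = sigmoid(p["W_ho"] @ h_prev + p["W_xo"] @ x_t + p["e_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                                            # hidden state (output y_t)
    return h_t, C_t
```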

2.3.2. Multivariate LSTM Model

As more than one feature is considered when constructing the hybrid model for each cluster, the LSTM model is said to be multivariate. In this study, the mean squared error (MSE) was adopted as the loss function.
Adaptive moment estimation (Adam) was employed to update the weights in the neural network based on the training data. The number of epochs was set as 100. Aiming to avoid overfitting, early stopping was employed to stop the training whenever there was no improvement in the model performance for 15 consecutive epochs [45].
The optimal values of the other hyperparameters, such as the dropout rate, number of hidden neurons, timestep, batch size and number of hidden layers, were determined using a manual tuning approach to obtain the best model performance at the training stage.
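A minimal sketch of how such a model could be assembled under these settings with Keras/TensorFlow is given below; the `make_windows` helper and the `cluster_train` array (the normalized training series of one cluster) are hypothetical names introduced for illustration, not the authors' code.

```python
import numpy as np
import tensorflow as tf

def make_windows(series, timestep=7):
    """Turn a (n_days, n_features) array into supervised samples: the previous
    `timestep` days are used to predict the next day for all features."""
    X, y = [], []
    for t in range(timestep, len(series)):
        X.append(series[t - timestep:t])
        y.append(series[t])
    return np.array(X), np.array(y)

# cluster_train: normalized training series of one cluster, shape (n_days, n_stations).
X_train, y_train = make_windows(cluster_train, timestep=7)
n_features = X_train.shape[2]

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(400, activation="tanh", recurrent_activation="sigmoid",
                         input_shape=(7, n_features)),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(n_features),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping: halt training after 15 epochs without improvement.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=15,
                                              restore_best_weights=True)
model.fit(X_train, y_train, epochs=100, batch_size=32,
          callbacks=[early_stop], verbose=0)
```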

2.3.3. Univariate LSTM Model

A univariate LSTM model is a model that is trained based on one feature only, that is, it only involves one time series. The model construction process is the same as in the multivariate LSTM model, except for the number of input features.
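Reusing the same settings, the 60 univariate models could be trained in a simple loop, as sketched below; `stations`, `train_scaled_by_station` and the `make_windows` helper from the previous sketch are again hypothetical names used only for illustration.

```python
univariate_models = {}
for station in stations:                           # station codes, e.g. "CA01R", ...
    series = train_scaled_by_station[station]      # normalized series, shape (n_days, 1)
    X, y = make_windows(series, timestep=7)
    m = tf.keras.Sequential([
        tf.keras.layers.LSTM(400, activation="tanh", recurrent_activation="sigmoid",
                             input_shape=(7, 1)),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(1),
    ])
    m.compile(optimizer="adam", loss="mse")
    m.fit(X, y, epochs=100, batch_size=32,
          callbacks=[tf.keras.callbacks.EarlyStopping(monitor="loss", patience=15)],
          verbose=0)
    univariate_models[station] = m
```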

2.4. Comparison of Model Prediction Performance

There are three accuracy metrics adopted as the prediction performance indicators for the constructed models in this study, namely RMSE, mean absolute error (MAE) and mean absolute percentage error (MAPE).
Then, the relative percentage difference (RPD) was calculated for each accuracy metric to compare the prediction performance between both models. Generally, the RPD is computed using the following formula:
$$RPD = \frac{\left| D_1 - D_2 \right|}{\frac{D_1 + D_2}{2}} \times 100\%,$$
where $D_1$ and $D_2$ are the values measured by the first and second methods, respectively, which in this case are the values obtained from the proposed hybrid model and the univariate LSTM model. The RPD is a common method to compare two experimental values when there is no theoretical value as a reference [46]. A good RPD value can be defined based on the type of experiment. In general, an acceptable RPD value ranges from 0% to 50% [47].
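This comparison can be written as the short sketch below, where `y_true`, `y_hybrid` and `y_univariate` are assumed to hold the actual and predicted PM10 values for one station; scikit-learn provides the three accuracy metrics (its MAPE is returned as a fraction and is rescaled to a percentage here).

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

def rpd(d1, d2):
    """Relative percentage difference between two measured values, in percent."""
    return abs(d1 - d2) / ((d1 + d2) / 2) * 100

def compare(y_true, y_hybrid, y_univariate):
    """RMSE, MAE and MAPE for both models, plus the RPD between them."""
    metric_fns = {
        "RMSE": lambda a, b: np.sqrt(mean_squared_error(a, b)),
        "MAE": mean_absolute_error,
        "MAPE": lambda a, b: mean_absolute_percentage_error(a, b) * 100,  # in percent
    }
    results = {}
    for name, fn in metric_fns.items():
        d1, d2 = fn(y_true, y_hybrid), fn(y_true, y_univariate)
        results[name] = {"hybrid": d1, "univariate": d2, "RPD (%)": rpd(d1, d2)}
    return results
```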

2.5. Framework

This study involves three main components, namely, the time series clustering phase, the modeling phase and a comparison of the model prediction performance, as summarized in Figure 2.
As the first step to constructing the proposed model, the air quality monitoring stations were grouped into k clusters by utilizing the time series k-means clustering approach based on the training set. Then, a multivariate LSTM model was trained for each cluster. Combined with the clustering results, the observed values in the test set were compared with the corresponding predicted values based on RMSE, MAE and MAPE.
After that, a univariate LSTM model was constructed independently for each station using the same hyperparameter settings as its corresponding hybrid model. Hence, a total of 60 univariate LSTM models were built. Similar to the proposed model, the prediction performance of each univariate model was measured based on the three accuracy metrics. Lastly, the prediction accuracy of both models was compared using the RPD.

3. Results and Discussion

3.1. Descriptive Analysis

The dataset was split into a training set and a test set by a ratio of 8:2, where the training set consists of data ranging from 5 July 2017 to 30 September 2018 and the test set comprises the last four months, that is, from 1 October 2018 to 31 January 2019.
Table 1 shows the minimum value, maximum value and quartiles for the whole dataset.

3.2. Time Series K-Means Clustering

Before the clustering and modeling phases were carried out, the training set was scaled into the range [0, 1] using min–max normalization. Then, the 60 monitoring stations were clustered based on the k-means algorithm. To identify the optimal number of clusters, the values of the WGSS were calculated and visualized in Figure 3 for $k = 1, 2, \ldots, 10$.
Using the elbow method, the optimal number of clusters was estimated to be $k = 2$, 3 or 4. To further validate the goodness of separation, the silhouette index was applied to these candidates. Table 2 shows the silhouette score for each candidate number of clusters.
Based on Table 2, $k = 2$ has the highest silhouette score, while $k = 4$ has the lowest. A higher index score indicates a better partitioning of the data; hence, $k = 2$ is taken as the optimum number of clusters.
The clustering results show that Cluster 1 consists of 19 stations, whereas Cluster 2 comprises 41 stations. Table 3 lists the cluster membership for the daily average PM10 concentration according to the stations.
Figure 4 shows the distribution of stations according to clusters.
It was found that most stations in Cluster 1 are in the more developed states along the west coast of Peninsular Malaysia, such as Selangor, Perak, Pulau Pinang and Kuala Lumpur. On the other hand, Cluster 2 is mainly made up of stations that are widely distributed in the less developed states around the east coast of Peninsular Malaysia and east Malaysia, including Terengganu, Kelantan, Sabah and Sarawak.
Moreover, the number of stations based on categories according to the clusters is shown in Figure 5.
Figure 5 demonstrates that most stations in Cluster 1 are located in suburban and urban areas in the Klang Valley, with only one station each falling in the rural and industrial categories. In addition, the majority of the stations in Cluster 2 are categorized as suburban, followed by rural, industrial and urban. On top of that, there are more stations located in suburban, rural and industrial areas in Cluster 2 as compared to Cluster 1, which has more urban stations.
After classifying the test set into the clusters, the minimum values, maximum values and quartiles according to the clusters are tabulated in Table 4.
Table 4 highlights that the range of the daily average PM10 concentration for the whole dataset in Cluster 2, that is, 231.45 μg/m3, is much higher than the range of 173.66 μg/m3 in Cluster 1. The station locations, which are mainly spread across neighboring states, might give rise to this situation in accordance with a similar level of haze pollution carried by the monsoon winds [8,9]. On the other hand, the median of the daily average concentration of PM10 of the whole dataset in Cluster 1 is higher than that of Cluster 2 by 9.08 μg/m3. Such a circumstance is believed to be closely related to the fact that most stations in Cluster 1 are in highly developed areas, including the Klang Valley and Pulau Pinang [11].
The time plots of the daily average concentration of PM10 for the training set and test set of the selected stations in each cluster are extracted and visualized in Figure 6 and Figure 7, respectively.
From Figure 6, it can be seen that the stations within each cluster have a similar and stable time series pattern across the time range, except for a few spikes observed during a certain period. The drastic increase in the concentration of PM10 for both clusters around August until mid-September 2018 seems to be closely associated with the transboundary haze that affected most areas of Malaysia at that point.
According to Yusof [48], unhealthy API readings were recorded in some states due to haze originating from North Sumatra and West Kalimantan at the time. The situation worsened and lasted until September as the southwest monsoon wind blew toward Peninsular Malaysia. Some states also experienced hot and dry weather with less rainfall, giving rise to an increase in daytime temperatures. Such weather caused wildfires in certain locations, for instance, the peatland fires in Klang, Selangor [49]. As a result, the air quality decreased at station CA21B in Klang, and the daily average concentration of PM10 increased to the maximum value of 180.23 μg/m3 in Cluster 1.
Referring to the time plots in Cluster 2, the highest daily average concentration of PM10 during the hazy period was recorded by station CA55Q, which is located in Permyjaya, Miri, Sarawak. This situation was deemed to be primarily driven by the forest fires at the nearby Industrial Training Institute, Permyjaya, which reduced the air quality in Miri and worsened the hazy conditions. According to Kawi [50], the API reading in Miri reached an unhealthy level of 130 in the morning on 19 August 2018. In conjunction with the nearly unhealthy API readings caused by the wildfire smoke from West Kalimantan, Indonesia, the PM10 concentration at other stations in Sarawak, such as Bintulu, Mukah, Sibu and Sarikei, also reported an increase during the hazy period.
Generally, the values in the test set are at a lower level compared to the training set, not exceeding 75 μg/m3 in either cluster, as shown in Figure 7. This leads to a small difference of 3.41 μg/m3 in the data range between the two clusters, based on Table 4.
In a nutshell, the time series k-means clustering has assigned the stations into two clusters with a size of 19 and 41 stations, respectively. This result forms the basis of the proposed model.

3.3. Construction of Hybrid Models

A multivariate LSTM model was trained based on the training set for each cluster. An optimum setting of the values of the hyperparameters was tuned manually to achieve the best model performance in the training phase. After a few trials, it was found that the models for both clusters perform well under the same hyperparameter settings as tabulated in Table 5.
By applying the settings above, the MSE and RMSE, as well as the computation time were computed to evaluate the fitness of the hybrid models to the training set, as shown in Table 6.
As depicted in Table 6, the RMSE values for both hybrid models are low in the training phase, indicating that the constructed models can learn the trend of the training set well. In terms of training time, both models required a similar duration, between 83 s and 85 s.

3.4. Construction of Univariate LSTM Models

Using the same hyperparameter settings as the corresponding hybrid models, as shown in Table 5, a univariate LSTM model was constructed independently for each station. The model performance and computation time are recorded in Table 7 to assess the degree of fit of each model to the training set.
Overall, the RMSE values for the univariate LSTM models during the training phase are comparatively higher than those of the hybrid models, indicating a poorer fit to the training set. Nevertheless, there are 38 stations with RMSE values lower than 0.1 in the training phase. In addition, each univariate model required about 74 s to 99 s to train.

3.5. Comparison of Prediction Performance between Hybrid Models and Univariate LSTM Models

The prediction performance was computed by comparing the predicted values and the actual test data based on three accuracy metrics, namely RMSE, MAE and MAPE. Then, the difference in prediction performance between the two models was measured based on RPD for each metric. If a model has a smaller value than another for at least two metrics, then it is said to have a better prediction performance. Moreover, a hybrid model is said to have comparable prediction accuracy to the univariate model if the RPD values are less than or equal to 50%. Table 8 displays the abovementioned values for all the stations; the smaller values of accuracy metrics and RPD values below or equal to 50% are listed in bold.
Based on Table 8, the hybrid model recorded a lower value for at least two accuracy metrics at two stations in Cluster 1, namely CA16W and CA17W. Despite having a better prediction performance for most stations, the univariate model does not significantly outperform the hybrid model based on the RPD values, since the RPD values exceed 50% for at least two accuracy metrics at only four stations, namely CA21B, CA22B, CA33J and CA34J. Hence, it can be concluded that the proposed model has a competitive prediction performance in Cluster 1.
On the other hand, it is highlighted that the proposed model is capable of giving a more accurate prediction for station CA02K based on much lower RMSE, MAE and MAPE values compared to the univariate model. Focusing on the RPD values, the prediction performance of the proposed model only varies significantly from the univariate model at 13 stations in Cluster 2.
There are 39 stations with an RPD less than or equal to 50% for RMSE. Among these stations, 12 of them have RPD values within 0–10%, 6 stations have RPD around 10–20%, 10 stations and 3 stations have a range of 20–30% and 30–40%, respectively, while the rest have RPD values within 40–50%. Meanwhile, most of the satisfactory RPD values based on MAE fall in the range of 0–10% (12 stations), followed by the range of 30–40% (9 stations), 10–20% and 20–30% (8 stations, respectively) and 40–50% (6 stations). Lastly, 47 stations have an RPD less than or equal to 50% for MAPE. It is observed that most of the RPD values based on MAPE fall in the range of 0–10% (18 stations), followed by 10–20% (12 stations), 20–30% (7 stations), 40–50% (6 stations) and 30–40% (4 stations). In short, the hybrid model can output a competitive prediction performance compared to the univariate model, as it records an acceptable range of RPD values based on all three metrics.
If the prediction performance of the hybrid model does not vary significantly from the univariate model based on the RPD for at least two accuracy metrics at a given station, then it can be concluded that the proposed model is suitable to forecast the PM10 concentration at that station. From Table 8, the hybrid model can potentially be adopted as the PM10 prediction model for 43 stations (71.67%), whereas the univariate LSTM model is more suitable for the stations in Johor, Terengganu and Sabah.
Figure 8 shows the actual and predicted values for selected stations from both clusters. Both models can fit the actual data trend well for stations CA10A (Cluster 1) and CA01R (Cluster 2). Plots from other stations were also investigated and similar results were observed.
To summarize, the prediction accuracy of the hybrid model does not deviate significantly from that of the univariate model, as the RPD values are within the 50% acceptable range at 43 stations, or 71.67% of the stations. This proves the capability of the hybrid model to predict the PM10 concentration at a similar accuracy level to the univariate model. Furthermore, the hybrid model can capture and fit the actual data trend quite well for most stations with a much shorter computation time than the univariate LSTM model. This is closely related to the fact that only one hybrid model is constructed for each cluster, whereas the univariate model is constructed individually for each station, leading to a total model training time of 4951.842 s for the 60 univariate models and just 168.237 s for the two hybrid models. Such a shorter computation time without any drawback in prediction performance or trend fitness makes the hybrid model the more suitable forecasting model.
Nevertheless, the occurrence of hazy conditions during certain periods in the training set, which affected the air quality of each location to different degrees, is one of the factors that leads to a better prediction accuracy of the univariate LSTM model for some stations. The PM10 concentration increases drastically during hazy days in conjunction with the high emissions of particulate matter and greenhouse gases, whereas on normal days PM10 remains at a low concentration, with aerosol particles released by mobile sources, including motor vehicles, and stationary sources, such as factories [6]. Because the hybrid model uses data from all the stations within the same cluster to predict the PM10 values, without accounting for the localized pollution level as the univariate model does, it may tend to overestimate PM10 for some stations that are less affected by the transboundary haze.
In addition, the concentration of PM10 is also influenced by meteorological factors, such as wind speed, temperature and relative humidity [6]. The concentration of particulates is found to be correlated with the temperature, wind speed, dew point and air pressure [6,19]. In accordance with this, Zhang et al. [51] found that there is a significant correlation between particulates and relative humidity during the winter season in Nanyang. Meanwhile, Pineda Rojas et al. [7] also revealed that high daily average PM10 concentrations are often recorded when the sky cover and relative humidity are low. Similar to the finding that the PM10 concentration is high during the southwest monsoon season [9], Yassen and Jahi [8] discovered that the TSP concentration in the Klang Valley is higher during that season than during the rainy season. Thus, it can be concluded that the different real-time meteorological conditions at each station influence the concentration of particulate matter and lead to a slightly lower prediction accuracy of the hybrid model for some stations.

4. Conclusions

In brief, this study proposed a novel hybrid model that combines both the k-means clustering technique and the state-of-the-art LSTM model in predicting the daily average PM10 concentration in Malaysia. Throughout the study, comparisons were made between the hybrid model and the univariate LSTM model in terms of prediction performance, trend fitting and computation time.
In this study, 60 air quality monitoring stations were divided into two distinct clusters by adopting the time series k-means clustering method. Cluster 1 consists of 19 stations that are mainly distributed in highly developed areas, such as Klang Valley and Pulau Pinang, such that most of them fall under the urban and suburban categories. On the other hand, Cluster 2 comprises 41 suburban and rural stations that are located mainly on the east coast of Peninsular Malaysia, Sabah and Sarawak. The within-cluster time series patterns are quite similar and relatively stable with a few unexpected spikes, especially during the transboundary hazy period.
The results show that the hybrid model can give a comparable prediction performance to the univariate LSTM model based on the RPD values for the three accuracy metrics. In terms of fitting the actual trend, the hybrid model can capture the patterns of the daily average PM10 concentration, although it gives a poorer result than the univariate model for some stations due to several factors, such as the hazy period in the training set, which affected the air quality at different levels, and the varying meteorological conditions at each location. In addition, the hybrid model significantly outperforms the univariate LSTM model with its much shorter training time, suggesting the capability of the proposed model to effectively increase prediction efficiency in real-life applications.
As for future research directions, it is suggested to consider other meteorological factors, especially wind speed, during the clustering phase to reduce their impacts on the PM10 concentration. Moreover, the hourly PM10 concentration also warrants further study so that the public can better plan their daily activities beforehand. In such a context, two-step k-means clustering could be implemented to better capture the variation in the PM10 concentration before constructing a forecasting model for each subclass of the main clusters. Lastly, a comparison between hybrid models that employ different forecasting methods, such as ARIMA, gated recurrent unit (GRU) and LSSVM models, could be carried out to identify which combination of techniques predicts the PM10 concentration best.

Author Contributions

Conceptualization, N.M.A. and H.Y.L.; methodology, N.M.A. and H.Y.L.; software, H.Y.L.; validation, N.M.A. and M.A.A.B.; formal analysis, H.Y.L.; investigation, H.Y.L.; resources, N.M.A. and M.A.A.B.; data curation, N.M.A. and H.Y.L.; writing—original draft preparation, H.Y.L.; writing—review and editing, N.M.A. and M.A.A.B.; visualization, H.Y.L.; supervision, N.M.A. and M.A.A.B.; project administration, N.M.A.; funding acquisition, M.A.A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Universiti Kebangsaan Malaysia with the grant number GP-K017073.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data were obtained from the Malaysian Department of Environment (DOE) and are available from DOE upon request.

Acknowledgments

The authors would like to express their utmost gratitude to the Malaysian Department of Environment (DOE) for providing the air quality data used in this study. In addition, the authors would also like to thank Universiti Kebangsaan Malaysia for the allocation of the research grant, GP-K017073.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. WHO. Air Pollution. Available online: https://www.who.int/health-topics/air-pollution (accessed on 15 May 2022).
  2. Kamaruddin, S.B. UKM Pakarunding Kaji Semula Cara Nilai Kualiti Udara. Available online: https://www.ukm.my/news/Latest_News/ukm-pakarunding-kajli-semula-cara-nilai-kualiti-udara/ (accessed on 15 May 2022).
  3. Rani, N.L.A.; Azid, A.; Khalit, S.I.; Juahir, H.; Samsuding, M.S. Air Pollution Index Trend Analysis in Malaysia, 2010–2015. Pol. J. Environ. Stud. 2018, 27, 801–807. [Google Scholar] [CrossRef]
  4. Malaysian Department of Environment (DOE). Pengiraan Indeks Pencemar Udara (IPU). Available online: http://apims.doe.gov.my/pdf/API_Calculation.pdf (accessed on 20 January 2023).
  5. Al Jallad, F.; Al Katheeri, E.; Al Omar, M. Concentrations of Particulate Matter and Their Relationships with Meteorological Variables. Sustain. Environ. Res. 2013, 23, 191–198. [Google Scholar]
  6. Chooi, Y.H.; Yong, E.L. The Influence of PM2.5 and PM10 on Air Pollution Index (API). In Proceedings of the Civil Engineering Research Work: Environmental Engineering, Hydraulics & Hydrology, UTM, Johor Bahru, Malaysia, 7–8 June 2016; pp. 132–143. [Google Scholar]
  7. Pineda Rojas, A.L.; Borge, R.; Mazzeo, N.A.; Saurral, R.I.; Matarazzo, B.N.; Cordero, J.M.; Kropff, E. High PM10 Concentrations in the City of Buenos Aires and Their Relationship with Meteorological Conditions. Atmos. Environ. 2020, 241, 117773. [Google Scholar] [CrossRef]
  8. Yassen, M.E.; Jahi, J.M. Investigation of Variations and Trends in TSP Concentrations in the Klang Valley Region, Malaysia. Malays. J. Environ. Manag. 2007, 8, 57–68. [Google Scholar]
  9. Rahman, S.R.A.; Ismail, S.N.S.; Raml, M.F.; Latif, M.T.; Abidin, E.Z.; Praveena, S.M. The Assessment of the Ambient Air Pollution Trend in Klang Valley, Malaysia. World Environ. 2015, 5, 1–11. [Google Scholar]
  10. Ao, D.; Cui, Z.; Gu, D. Hybrid Model of Air Quality Prediction Using K-Means Clustering and Deep Neural Network. In Proceedings of the 38th Chinese Control Conference, Guangzhou, China, 27–30 July 2019; pp. 8416–8421. [Google Scholar]
  11. Rahman, E.; Hamzah, F.M.; Latif, M.T.; Dominick, D. Assessment of PM2.5 Patterns in Malaysia Using the Clustering Method. Aerosol Air Qual. Res. 2022, 22, 210161. [Google Scholar] [CrossRef]
  12. Ariff, N.M.; Bakar, M.A.A.; Zamzuri, Z.H. Academic Preference Based on Students’ Personality Analysis through K-Means Clustering. Malays. J. Fund. Appl. Sci. 2020, 16, 328–333. [Google Scholar] [CrossRef]
  13. Shafi, J.; Waheed, A. K-Means Clustering Analysing Abrupt Changes in Air Quality. In Proceedings of the Fourth International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 5–7 November 2020; pp. 26–30. [Google Scholar]
  14. Kim, S.B.; Park, S.K.; Sattler, M.; Russell, A.G. Characterization of Spatially Homogeneous Regions Based on Temporal Patterns of Fine Particulate Matter in the Continental United States. J. Air Waste Manag. Assoc. 2008, 58, 965–975. [Google Scholar] [CrossRef] [PubMed]
  15. Beaver, S.; Palazoglu, A. A Cluster Aggregation Scheme for Ozone Episode Selection in the San Francisco, CA Bay Area. Atmos. Environ. 2006, 40, 713–725. [Google Scholar] [CrossRef]
  16. Aghabozorgi, S.; Shirkhorshidi, A.S.; Teh, Y.W.; Soltanian, H.; Herawan, T. Spatial and Temporal Clustering of Air Pollution in Malaysia: A Review. In Proceedings of the International Conference on Agriculture, Environment and Biological Sciences (ICFAE’14), Antalya, Turkey, 4–5 June 2014; pp. 67–72. [Google Scholar]
  17. Suris, F.N.A.; Bakar, M.A.A.; Ariff, N.M.; Mohd Nadzir, M.S.; Ibrahim, K. Malaysia PM10 Air Quality Time Series Clustering Based on Dynamic Time Warping. Atmosphere 2022, 13, 503. [Google Scholar] [CrossRef]
  18. Bhalgat, P.; Pitale, S.; Bhoite, S. Air Quality Prediction Using Machine Learning Algorithms. Int. J. Comput. Appl. Technol. Res. 2019, 8, 367–370. [Google Scholar] [CrossRef]
  19. Aditya, C.R.; Chandana, R.D.; Nayana, D.K.; Praveen, G.V. Detection and Prediction of Air Pollution Using Machine Learning Models. Int. J. Eng. Trends Technol. 2018, 59, 204–207. [Google Scholar]
  20. Guo, B.; Wang, X.; Pei, L.; Su, Y.; Zhang, D.; Wang, Y. Identifying the spatiotemporal dynamic of PM2.5 concentrations at multiple scales using geographically and temporally weighted regression model across China during 2015–2018. Sci. Total Environ. 2021, 751, 141765. [Google Scholar] [CrossRef] [PubMed]
  21. Guo, B.; Zhang, D.; Pei, L.; Su, Y.; Wang, X.; Bian, Y.; Zhang, D.; Yao, W.; Zhou, Z.; Guo, L. Estimating PM2.5 concentrations via random forest method using satellite, auxiliary, and ground-level station dataset at multiple temporal scales across China in 2017. Sci. Total Environ. 2021, 778, 146288. [Google Scholar] [CrossRef]
  22. Guo, B.; Wu, H.; Pei, L.; Zhu, X.; Zhang, D.; Wang, Y.; Luo, P. Study on the spatiotemporal dynamic of ground-level ozone concentrations on multiple scales across China during the blue sky protection campaign. Environ. Int. 2022, 170, 107606. [Google Scholar] [CrossRef]
  23. Sharma, R.; Shilimkar, G.; Pisal, S. Air Quality Prediction by Machine Learning. Int. J. Sci. Res. Sci. Technol. 2021, 8, 486–492. [Google Scholar] [CrossRef]
  24. Uh, B.H.; Majid, N. Comparison of ARIMA Model and Artificial Neural Network in Forecasting Gold Price. J. Qual. Meas. Anal. 2021, 17, 31–39. [Google Scholar]
  25. Chee, K.C.; Omar, N. Bitcoin Price Prediction Based on Sentiment of News Article and Market Data with LSTM Model. Asia-Pac. J. Inf. Technol. Multimed. 2020, 9, 1–16. [Google Scholar]
  26. Tsan, Y.T.; Chen, D.Y.; Liu, P.Y.; Kristiani, E.; Nguyen, K.L.P.; Yang, C.T. The Prediction of Influenza-Like Illness and Respiratory Disease Using LSTM and ARIMA. Int. J. Environ. Res. Public Health 2022, 19, 1858. [Google Scholar] [CrossRef]
  27. Khumaidi, A.; Raafi’udin, R.; Solihin, I.P. Pengujian Algoritma Long Short Term Memory untuk Predikasi Kualitas Udara dan Suhu Kota Bandung. J. Telematika 2020, 15, 13–18. [Google Scholar]
  28. Bakar, M.A.A.; Ariff, N.M.; Mohd Nadzir, M.S.; Ong, L.W.; Suris, F.N.A. Prediction of Multivariate Air Quality Time Series Data Using Long Short-Term Memory Network. Mal. J. Fund. Appl. Sci. 2022, 18, 52–59. [Google Scholar] [CrossRef]
  29. Zhang, Y.; Yang, M.; Yang, F.; Dong, N. A Multi-Step Prediction Method of Urban Air Quality Index Based on Meteorological Factors Analysis. In Proceedings of the International Conference on Environment, Renewable Energy and Green Engineering (EREGCE 2022), Online, China, 22–24 April 2022; p. 01010. [Google Scholar]
  30. Azid, A.; Juahir, H.; Toriman, M.E.; Kamarudin, M.K.A.; Saudi, A.S.M.; Hasnam, C.N.C.; Aziz, N.A.A.; Azaman, F.; Latif, M.T.; Zainuddin, S.F.M.; et al. Prediction of the Level of Air Pollution Using Principal Component Analysis and Artificial Neural Network Techniques: A Case Study in Malaysia. Water Air Soil Pollut. 2014, 225, 2063. [Google Scholar] [CrossRef]
  31. Yulita, I.N.; Fanany, M.I.; Arymurthy, A.M. Fuzzy Clustering and Bidirectional Long Short-Term Memory for Sleep Stages Classification. In Proceedings of the 2017 International Conference on Soft Computing, Intelligent System and Information Technology, Denpasar, Bali, Indonesia, 26–29 September 2017; pp. 11–16. [Google Scholar]
  32. Liu, L.; Jahromi, H.M.; Cai, L.; Kidston, D. Hierarchical Agglomerative Clustering and LSTM-Based Load Prediction for Dynamic Spectrum Allocation. In Proceedings of the 2021 IEEE 18th Annual Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 9–12 January 2021; pp. 1–6. [Google Scholar]
  33. Li, R.; Hu, Y.; Liang, Q. T2F-LSTM Method for Long-Term Traffic Volume Prediction. IEEE Trans. Fuzzy Syst. 2020, 28, 3256–3264. [Google Scholar] [CrossRef]
  34. Baca, H.A.H.; Valdivia, F.d.L.P.; Ibarra, M.J.; Cruz, M.A.; Baca, M.E.H. Air Quality Prediction Based on Long Short-Term Memory (LSTM) and Clustering K-Means in Andahuaylas, Peru. In Proceedings of the 2021 Future of Information and Communication Conference (FICC): Advances in Information and Communication, Vancouver, Canada, 29–30 April 2021; pp. 179–191. [Google Scholar]
  35. Chen, H.; Guan, M.; Li, H. Air Quality Prediction Based on Integrated Dual LSTM Model. IEEE Access 2021, 9, 93285–93297. [Google Scholar] [CrossRef]
  36. Kobylin, O.; Lyashenko, V. Time Series Clustering Based on the K-Means Algorithm. J. La Multiapp 2020, 1, 1–7. [Google Scholar] [CrossRef]
  37. Lkhagva, B.; Suzuki, Y.; Kawagoe, K. New Time Series Data Representation ESAX for Financial Applications. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW’06), Atlanta, GA, USA, 3–7 April 2006; pp. 17–22. [Google Scholar]
  38. Sardá-Espinosa, A. Time-Series Clustering in R Using the dtwclust Package. R. J. 2019, 11, 22–43. [Google Scholar] [CrossRef]
  39. Hautamaki, V.; Nykanen, P.; Franti, P. Time-Series Clustering by Approximate Prototypes. In Proceedings of the 19th International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4. [Google Scholar]
  40. Maharaj, E.A.; D’Urso, P.; Caiado, J. Time Series Clustering and Classification, 1st ed.; CRC Press: Boca Raton, FL, USA, 2019. [Google Scholar]
  41. Aghabozorgi, S.; Shirkhorshidi, A.S.; Teh, Y.W. Time-Series Clustering—A Decade Review. Inf. Syst. 2015, 53, 16–38. [Google Scholar] [CrossRef]
  42. Bhardwaj, A. Silhouette Coefficient. Available online: https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c (accessed on 31 May 2022).
  43. Denyse. Time Series Clustering—Deriving Trends and Archetypes from Sequential Data. Available online: https://towardsdatascience.com/time-series-clustering-deriving-trends-and-archetypes-from-sequential-data-bb87783312b4 (accessed on 31 May 2022).
  44. Colah. Understanding LSTM Networks. Available online: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (accessed on 31 May 2022).
  45. Vijay, U. Early Stopping to Avoid Overfitting in Neural Network—Keras. Available online: https://medium.com/zero-equals-false/early-stopping-to-avoid-overfitting-in-neural-network-keras-b68c96ed05d9 (accessed on 10 January 2023).
  46. NC State University Physics Department. Percent Error and Percent Difference. Available online: https://www.webassign.net/question_assets/ncsucalcphysmechl3/percent_error/manual.html (accessed on 10 January 2023).
  47. Northern Territory Department of Lands, Planning and the Environment (DLPE). Appendix D—Data Quality Objectives, Quality Assurance, Quality Control. Available online: https://ntepa.nt.gov.au/__data/assets/pdf_file/0003/286149/Edith-River-Investigation-Report (accessed on 10 January 2023).
  48. Yusof, N.A.M. Jerebu Akibat Kebakaran di Sumatera dan Kalimantan. Available online: https://www.bharian.com.my/berita/nasional/2018/08/463184/jerebu-akibat-kebakaran-di-sumatera-dan-kalimantan (accessed on 10 January 2023).
  49. Nufael, A. Malaysia Alami Jerebu Akibat Pembakaran Terbuka di Kalimantan. Available online: https://www.benarnews.org/malay/berita/my-jerebu-180817-08172018183152.html (accessed on 10 January 2023).
  50. Kawi, M.R. IPU Sarawak Naik, Miri Catat Bacaan Tidak Sihat. Available online: https://www.bharian.com.my/berita/wilayah/2018/08/463688/ipu-sarawak-naik-miri-catat-bacaan-tidak-sihat (accessed on 10 January 2023).
  51. Zhang, M.; Chen, S.; Zhang, X.; Guo, S.; Wang, Y.; Zhao, F.; Chen, J.; Qi, P.; Lu, F.; Chen, M. Characters of Particulate Matter and Their Relationship with Meteorological Factors during Winter Nanyang 2021–2022. Atmosphere 2023, 14, 137. [Google Scholar] [CrossRef]
Figure 1. LSTM cell structure.
Figure 2. Flow chart of the framework.
Figure 3. Elbow plot.
Figure 4. Distribution of air quality monitoring stations according to clusters.
Figure 5. Bar chart for the number of stations based on categories according to clusters.
Figure 6. Time plots of daily average concentration of PM10 for training set of selected stations in each cluster: (a) Cluster 1; (b) Cluster 2.
Figure 7. Time plots of daily average concentration of PM10 for test set of selected stations in each cluster: (a) Cluster 1; (b) Cluster 2.
Figure 8. Actual and predicted values for selected stations from each cluster: (a) CA10A (Cluster 1); (b) CA01R (Cluster 2).
Table 1. Minimum value, maximum value and quartiles for the whole dataset (μg/m3).
Minimum | 1st Quartile | Median | 3rd Quartile | Maximum
4.37 | 16.29 | 21.87 | 29.38 | 235.72
Table 2. Silhouette scores for each number of clusters.
k | 2 | 3 | 4
Silhouette Score | 0.2628 | 0.1742 | 0.1532
Table 3. Cluster membership for daily average PM10 concentration according to stations.
Station | Station Location | Station Category | Longitude | Latitude | Cluster
CA01R | Kangar, Perlis | Suburban | 100.2111 | 6.429922 | 2
CA02K | Langkawi, Kedah | Suburban | 99.85846 | 6.331539 | 2
CA03K | Alor Setar, Kedah | Suburban | 100.3468 | 6.137244 | 2
CA04K | Sungai Petani, Kedah | Suburban | 100.4678 | 5.629631 | 2
CA05K | Kulim Hi-Tech, Kedah | Industry | 100.5903 | 5.424147 | 2
CA06P | Seberang Jaya, Pulau Pinang | Urban | 100.4039 | 5.39817 | 1
CA07P | Seberang Perai, Pulau Pinang | Suburban | 100.4435 | 5.329358 | 1
CA09P | Balik Pulau, Pulau Pinang | Suburban | 100.2147 | 5.337598 | 2
CA10A | Taiping, Perak | Suburban | 100.6791 | 4.89885 | 1
CA11A | Tasek Ipoh, Perak | Urban | 101.1167 | 4.629444 | 1
CA12A | Pegoh Ipoh, Perak | Suburban | 101.0802 | 4.553336 | 1
CA13A | Seri Manjung, Perak | Rural | 100.6634 | 4.200344 | 1
CA14A | Tanjung Malim, Perak | Suburban | 101.5245 | 3.687758 | 2
CA15W | Batu Muda, Kuala Lumpur | Suburban | 101.6822 | 3.212439 | 1
CA16W | Cheras, Kuala Lumpur | Urban | 101.7179 | 3.106236 | 1
CA17W | Putrajaya | Suburban | 101.6901 | 2.914816 | 1
CA18B | Kuala Selangor, Selangor | Rural | 101.2562 | 3.321308 | 2
CA19B | Petaling Jaya, Selangor | Suburban | 101.608 | 3.133169 | 1
CA20B | Shah Alam, Selangor | Urban | 101.5562 | 3.104717 | 1
CA21B | Klang, Selangor | Suburban | 101.4131 | 3.014889 | 1
CA22B | Banting, Selangor | Suburban | 101.6232 | 2.816689 | 1
CA23N | Nilai, Negeri Sembilan | Suburban | 101.8115 | 2.821692 | 1
CA24N | Seremban, Negeri Sembilan | Urban | 101.9685 | 2.723381 | 2
CA25N | Port Dickson, Negeri Sembilan | Suburban | 101.8669 | 2.441383 | 2
CA26M | Alor Gajah, Melaka | Rural | 102.2246 | 2.370925 | 2
CA27M | Bukit Rambai, Melaka | Suburban | 102.1727 | 2.258519 | 1
CA28M | Bandaraya Melaka, Melaka | Urban | 102.2571 | 2.190936 | 2
CA29J | Segamat, Johor | Suburban | 102.8627 | 2.493914 | 2
CA31J | Batu Pahat, Johor | Suburban | 102.8666 | 1.919323 | 2
CA32J | Kluang, Johor | Rural | 103.3121 | 2.037882 | 2
CA33J | Larkin, Johor | Urban | 103.736 | 1.494625 | 1
CA34J | Pasir Gudang, Johor | Urban | 103.8935 | 1.470122 | 1
CA35J | Pengerang, Johor | Industry | 104.1496 | 1.389489 | 2
CA36J | Kota Tinggi, Johor | Suburban | 104.2253 | 1.564056 | 2
CA37C | Rompin, Pahang | Rural | 103.4192 | 2.926645 | 2
CA38C | Temerloh, Pahang | Suburban | 102.3764 | 3.471603 | 1
CA39C | Jerantut, Pahang | Suburban | 102.3666 | 3.94836 | 2
CA40C | Indera Mahkota, Kuantan, Pahang | Suburban | 101.9197 | 3.276529 | 2
CA41C | Balok Baru, Kuantan, Pahang | Industry | 103.3622 | 3.951842 | 1
CA42T | Kemaman, Terengganu | Industry | 103.4258 | 4.262121 | 2
CA43T | Paka, Terengganu | Industry | 103.4348 | 4.598064 | 2
CA44T | Kuala Terengganu, Terengganu | Rural | 103.1204 | 5.308094 | 2
CA45T | Besut, Terengganu | Suburban | 102.5156 | 5.748449 | 2
CA46D | Tanah Merah, Kelantan | Suburban | 102.1345 | 5.811172 | 2
CA47D | Kota Bahru, Kelantan | Suburban | 102.2492 | 6.147431 | 2
CA48S | Tawau, Sabah | Suburban | 117.9359 | 4.249786 | 2
CA49S | Sandakan, Sabah | Suburban | 118.0911 | 5.864467 | 2
CA50S | Kota Kinabalu, Sabah | Suburban | 116.0433 | 5.89372 | 2
CA51S | Kimanis, Sabah | Industry | 115.8506 | 5.538225 | 2
CA54Q | Limbang, Sarawak | Rural | 115.0137 | 4.758891 | 2
CA55Q | Permyjaya, Miri, Sarawak | Rural | 114.0434 | 4.494791 | 2
CA56Q | Miri, Sarawak | Suburban | 114.0124 | 4.424679 | 2
CA57Q | Samalaju, Sarawak | Industry | 113.2952 | 3.537059 | 2
CA58Q | Bintulu, Sarawak | Suburban | 113.0411 | 3.177084 | 2
CA59Q | Mukah, Sarawak | Rural | 112.0197 | 2.883238 | 2
CA61Q | Sibu, Sarawak | Suburban | 111.8319 | 2.314408 | 2
CA62Q | Sarikei, Sarawak | Rural | 111.5229 | 2.132809 | 2
CA63Q | Sri Aman, Sarawak | Rural | 111.4648 | 1.219656 | 2
CA64Q | Samarahan, Sarawak | Rural | 110.4915 | 1.454853 | 2
CA65Q | Kuching, Sarawak | Urban | 110.389 | 1.562229 | 2
Table 4. Minimum values, maximum values and quartiles according to clusters (μg/m³).
Element | Whole Dataset (Cluster 1) | Whole Dataset (Cluster 2) | Training Set (Cluster 1) | Training Set (Cluster 2) | Test Set (Cluster 1) | Test Set (Cluster 2)
Minimum | 6.57 | 4.37 | 6.57 | 4.37 | 7.32 | 5.92
1st Quartile | 22.38 | 14.83 | 23.13 | 15.32 | 20.35 | 13.58
Median | 28.25 | 19.17 | 29.45 | 20.06 | 24.90 | 16.75
3rd Quartile | 37.76 | 25.33 | 37.17 | 26.52 | 30.29 | 21.00
Maximum | 180.23 | 235.72 | 180.23 | 235.72 | 70.77 | 72.78
Table 5. Optimum hyperparameter settings according to clusters.
Hyperparameter | Setting
Hidden layer | 1
Hidden neuron | 400
Dropout rate | 0.1
Timestep | 7
Batch size | 32
Epochs | 100
Activation function | Tanh
Recurrent activation | Sigmoid
Loss function | MSE
Optimizer | Adam
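As a rough guide to how the settings in Table 5 translate into a network definition, the sketch below builds a single-layer LSTM in Keras with these hyperparameters; the univariate input (one PM10 value per day over a 7-day window) and the single-output dense layer are assumptions for illustration, not taken from the authors' code.

```python
# Minimal sketch (assumed Keras implementation) of an LSTM with the Table 5 settings.
import tensorflow as tf

timesteps, n_features = 7, 1  # assumed: 7-day window of daily PM10 values

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(timesteps, n_features)),
    tf.keras.layers.LSTM(400, activation="tanh", recurrent_activation="sigmoid"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1),  # assumed one-step-ahead output
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, epochs=100, batch_size=32)
```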
Table 6. Model performance of hybrid models and computation time in training phase.
Hybrid Model | MSE | RMSE | Computation Time (Seconds)
Cluster 1 | 0.0030 | 0.0551 | 83.165
Cluster 2 | 0.0023 | 0.0481 | 85.072
Table 7. Model performance of univariate LSTM models and computation time in training phase.
Station | MSE | RMSE | Computation Time (Seconds)
CA01R | 0.0044 | 0.0661 | 75.517
CA02K | 0.0116 | 0.1078 | 80.691
CA03K | 0.0059 | 0.0770 | 75.653
CA04K | 0.0068 | 0.0822 | 78.791
CA05K | 0.0054 | 0.0738 | 78.364
CA06P | 0.0045 | 0.0668 | 83.509
CA07P | 0.0056 | 0.0745 | 83.682
CA09P | 0.0038 | 0.0616 | 74.093
CA10A | 0.0084 | 0.0914 | 88.693
CA11A | 0.0092 | 0.0958 | 91.114
CA12A | 0.0056 | 0.0748 | 96.784
CA13A | 0.0071 | 0.0841 | 84.607
CA14A | 0.0061 | 0.0781 | 76.651
CA15W | 0.0079 | 0.0890 | 84.997
CA16W | 0.0115 | 0.1073 | 83.640
CA17W | 0.0083 | 0.0910 | 77.432
CA18B | 0.0057 | 0.0752 | 74.904
CA19B | 0.0143 | 0.1194 | 87.481
CA20B | 0.0138 | 0.1176 | 84.607
CA21B | 0.0060 | 0.0777 | 92.034
CA22B | 0.0042 | 0.0648 | 86.087
CA23N | 0.0103 | 0.1013 | 85.621
CA24N | 0.0121 | 0.1100 | 77.689
CA25N | 0.0160 | 0.1266 | 80.627
CA26M | 0.0138 | 0.1172 | 76.856
CA27M | 0.0109 | 0.1045 | 89.117
CA28M | 0.0098 | 0.0991 | 86.062
CA29J | 0.0104 | 0.1019 | 80.175
CA31J | 0.0161 | 0.1271 | 81.089
CA32J | 0.0172 | 0.1312 | 79.806
CA33J | 0.0159 | 0.1260 | 87.609
CA34J | 0.0154 | 0.1240 | 90.253
CA35J | 0.0109 | 0.1045 | 80.129
CA36J | 0.0154 | 0.1240 | 80.523
CA37C | 0.0071 | 0.0843 | 78.975
CA38C | 0.0097 | 0.0986 | 99.205
CA39C | 0.0100 | 0.1001 | 78.328
CA40C | 0.0070 | 0.0837 | 77.969
CA41C | 0.0075 | 0.0867 | 92.146
CA42T | 0.0066 | 0.0812 | 87.247
CA43T | 0.0124 | 0.1114 | 78.129
CA44T | 0.0029 | 0.0534 | 78.864
CA45T | 0.0083 | 0.0910 | 83.894
CA46D | 0.0105 | 0.1027 | 78.864
CA47D | 0.0124 | 0.1111 | 82.339
CA48S | 0.0159 | 0.1262 | 83.257
CA49S | 0.0129 | 0.1137 | 77.018
CA50S | 0.0053 | 0.0730 | 81.258
CA51S | 0.0081 | 0.0903 | 84.210
CA54Q | 0.0063 | 0.0792 | 81.136
CA55Q | 0.0019 | 0.0434 | 83.684
CA56Q | 0.0089 | 0.0945 | 78.898
CA57Q | 0.0093 | 0.0963 | 80.553
CA58Q | 0.0083 | 0.0910 | 81.694
CA59Q | 0.0045 | 0.0667 | 80.901
CA61Q | 0.0053 | 0.0728 | 76.724
CA62Q | 0.0067 | 0.0816 | 84.564
CA63Q | 0.0056 | 0.0750 | 82.353
CA64Q | 0.0053 | 0.0725 | 81.946
CA65Q | 0.0063 | 0.0796 | 82.799
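For context, a back-of-the-envelope comparison derived from Tables 6 and 7 (not a figure reported in the text): the two hybrid models take roughly 168 s to train in total, whereas training a separate univariate LSTM for each of the 60 stations at roughly 75–100 s each amounts to something on the order of 5000 s. The per-station average used below is an approximation, so the univariate total is an order-of-magnitude estimate only.

```python
# Rough training-time totals derived from Tables 6 and 7; ~82 s per station is an
# approximate average, so the univariate total is only an order-of-magnitude estimate.
hybrid_total = 83.165 + 85.072          # seconds (Cluster 1 + Cluster 2 hybrid models)
univariate_total_est = 60 * 82          # ~60 stations x ~82 s per univariate model
print(hybrid_total, univariate_total_est)  # ~168 s vs. ~4920 s
```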
Table 8. Comparison of prediction performance between hybrid models and univariate LSTM models.
Station | Cluster | RMSE Hybrid | RMSE Univariate | RPD (%) | MAE Hybrid | MAE Univariate | RPD (%) | MAPE Hybrid | MAPE Univariate | RPD (%)
CA01R | 2 | 4.5198 | 4.3701 | 3.37 | 3.4995 | 3.3726 | 3.69 | 21.5784 | 21.3973 | 0.84
CA02K | 2 | 4.5505 | 4.9538 | 8.49 | 3.2950 | 4.1793 | 23.66 | 18.1676 | 29.0792 | 46.19
CA03K | 2 | 4.9896 | 4.8972 | 1.87 | 3.6391 | 3.5968 | 1.17 | 23.0330 | 23.2614 | 0.99
CA04K | 2 | 6.0298 | 5.3184 | 12.54 | 4.5432 | 3.8992 | 15.26 | 19.9591 | 20.2092 | 1.25
CA05K | 2 | 5.0714 | 4.1359 | 20.32 | 3.8023 | 3.2244 | 16.45 | 17.0136 | 15.3374 | 10.36
CA06P | 1 | 4.9699 | 4.6897 | 5.80 | 3.9496 | 3.5980 | 9.32 | 14.3867 | 13.3902 | 7.18
CA07P | 1 | 5.3541 | 5.4275 | 1.36 | 4.3668 | 4.2177 | 3.47 | 18.5370 | 18.0430 | 2.70
CA09P | 2 | 4.7474 | 4.2784 | 10.39 | 3.8791 | 3.3996 | 13.17 | 23.3050 | 21.9606 | 5.94
CA10A | 1 | 7.9905 | 7.6230 | 4.71 | 6.4603 | 6.4188 | 0.65 | 26.4474 | 28.4973 | 7.46
CA11A | 1 | 6.0224 | 5.0272 | 18.01 | 4.9381 | 4.0166 | 20.58 | 26.0518 | 21.6465 | 18.47
CA12A | 1 | 5.4008 | 5.1911 | 3.96 | 4.1150 | 4.1098 | 0.13 | 15.1381 | 15.0629 | 0.50
CA13A | 1 | 5.7971 | 5.4431 | 6.30 | 4.3236 | 4.0655 | 6.15 | 22.2513 | 21.1520 | 5.07
CA14A | 2 | 3.0089 | 2.0047 | 40.06 | 2.1790 | 1.5337 | 34.76 | 19.3104 | 14.4543 | 28.76
CA15W | 1 | 9.0777 | 5.5340 | 48.51 | 6.2347 | 4.2351 | 38.20 | 21.7435 | 18.5700 | 15.74
CA16W | 1 | 5.1493 | 5.1184 | 0.60 | 3.9703 | 4.0972 | 3.15 | 16.1850 | 17.2971 | 6.64
CA17W | 1 | 6.2387 | 6.3347 | 1.53 | 4.7604 | 4.7163 | 0.93 | 18.1534 | 19.0525 | 4.83
CA18B | 2 | 6.7243 | 4.8416 | 32.56 | 5.3556 | 3.7750 | 34.62 | 26.6917 | 19.6442 | 30.42
CA19B | 1 | 6.9502 | 6.4325 | 7.74 | 5.3705 | 5.1882 | 3.45 | 16.9225 | 18.8457 | 10.75
CA20B | 1 | 11.3570 | 7.2624 | 43.98 | 8.7543 | 5.6374 | 43.32 | 24.8483 | 18.9975 | 26.69
CA21B | 1 | 15.9908 | 8.9914 | 56.04 | 13.7508 | 7.8400 | 54.75 | 57.6036 | 33.6261 | 52.57
CA22B | 1 | 10.5627 | 5.7734 | 58.63 | 8.8409 | 4.6535 | 62.06 | 36.0264 | 19.0299 | 61.74
CA23N | 1 | 12.5397 | 7.7725 | 46.94 | 9.2090 | 6.2618 | 38.10 | 23.7006 | 19.0542 | 21.73
CA24N | 2 | 8.4372 | 5.0106 | 50.96 | 6.0306 | 4.0254 | 39.88 | 26.0021 | 21.2193 | 20.26
CA25N | 2 | 7.0380 | 4.2323 | 49.79 | 5.1451 | 3.3116 | 43.36 | 22.6067 | 17.5305 | 25.29
CA26M | 2 | 7.5355 | 4.5812 | 48.76 | 5.2215 | 3.6684 | 34.94 | 24.0427 | 21.2575 | 12.30
CA27M | 1 | 6.2810 | 4.7058 | 28.67 | 4.7925 | 3.8138 | 22.74 | 20.2316 | 17.6066 | 13.87
CA28M | 2 | 7.2656 | 5.8701 | 21.25 | 5.9451 | 4.9969 | 17.33 | 39.1783 | 35.0929 | 11.00
CA29J | 2 | 7.0781 | 3.6154 | 64.76 | 5.1203 | 2.9254 | 54.56 | 25.1293 | 18.1998 | 31.99
CA31J | 2 | 8.6128 | 4.6720 | 59.33 | 6.0471 | 3.8098 | 45.40 | 25.4477 | 21.0194 | 19.06
CA32J | 2 | 9.6117 | 4.7057 | 68.53 | 7.6127 | 3.6751 | 69.77 | 34.5874 | 21.6189 | 46.15
CA33J | 1 | 14.0436 | 6.2091 | 77.37 | 11.4893 | 4.7969 | 82.18 | 36.9123 | 17.9316 | 69.22
CA34J | 1 | 13.2536 | 6.1375 | 73.40 | 10.4014 | 4.6397 | 76.61 | 35.8422 | 19.7283 | 57.99
CA35J | 2 | 7.4453 | 4.2220 | 55.25 | 5.3145 | 3.4483 | 42.59 | 25.3247 | 19.6083 | 25.44
CA36J | 2 | 10.0448 | 3.6776 | 92.80 | 8.2307 | 2.7321 | 100.31 | 41.9039 | 15.6868 | 91.05
CA37C | 2 | 7.9762 | 4.1406 | 63.31 | 5.7866 | 3.2104 | 57.27 | 26.7943 | 17.4523 | 42.23
CA38C | 1 | 6.6168 | 5.6409 | 15.92 | 4.5975 | 4.3858 | 4.71 | 19.0795 | 20.4252 | 6.81
CA39C | 2 | 5.9645 | 3.9025 | 41.79 | 3.9081 | 2.9436 | 28.15 | 22.9331 | 20.4443 | 11.47
CA40C | 2 | 5.3055 | 3.9530 | 29.22 | 3.7472 | 3.0742 | 19.73 | 23.6457 | 21.7060 | 8.55
CA41C | 1 | 6.9434 | 5.6108 | 21.23 | 5.7482 | 4.3989 | 26.59 | 27.0432 | 22.8814 | 16.67
CA42T | 2 | 5.2559 | 4.4650 | 16.27 | 4.0192 | 3.4877 | 14.16 | 21.5831 | 21.0374 | 2.56
CA43T | 2 | 7.3848 | 4.0608 | 58.08 | 5.8110 | 2.9586 | 65.05 | 30.1447 | 19.8797 | 41.04
CA44T | 2 | 19.8661 | 7.4887 | 90.50 | 17.1922 | 6.1621 | 94.46 | 106.2063 | 41.5364 | 87.54
CA45T | 2 | 6.6239 | 4.8986 | 29.95 | 5.0560 | 3.6732 | 31.68 | 29.2778 | 24.7883 | 16.61
CA46D | 2 | 10.2958 | 6.9140 | 39.30 | 7.7256 | 5.4100 | 35.26 | 31.8290 | 30.6160 | 3.88
CA47D | 2 | 7.7550 | 6.6434 | 15.44 | 5.7597 | 5.1561 | 11.06 | 27.9536 | 29.5172 | 5.44
CA48S | 2 | 4.0429 | 2.1662 | 60.45 | 3.4659 | 1.6946 | 68.65 | 25.9122 | 15.2541 | 51.78
CA49S | 2 | 5.5137 | 2.5615 | 73.12 | 4.7596 | 1.8774 | 86.85 | 27.3653 | 11.7014 | 80.19
CA50S | 2 | 9.1110 | 7.0216 | 25.90 | 6.1918 | 5.0411 | 20.49 | 25.0603 | 22.1106 | 12.51
CA51S | 2 | 3.8643 | 2.1653 | 56.36 | 3.2355 | 1.6343 | 65.76 | 22.3544 | 12.4954 | 56.58
CA54Q | 2 | 4.4997 | 3.4582 | 26.17 | 3.4441 | 2.6571 | 25.80 | 22.9064 | 18.6653 | 20.40
CA55Q | 2 | 23.3102 | 2.4208 | 162.37 | 20.2460 | 1.8668 | 166.23 | 136.0466 | 11.8319 | 168.00
CA56Q | 2 | 7.7953 | 4.1198 | 61.69 | 6.6007 | 3.4160 | 63.59 | 29.3824 | 17.4739 | 50.83
CA57Q | 2 | 5.5144 | 5.0829 | 8.14 | 3.7241 | 3.4618 | 7.30 | 19.9275 | 19.6182 | 1.56
CA58Q | 2 | 8.3022 | 6.7478 | 20.66 | 6.4231 | 5.3905 | 17.48 | 31.0169 | 29.7875 | 4.04
CA59Q | 2 | 9.3398 | 4.6195 | 67.63 | 7.4647 | 3.9870 | 60.74 | 46.9930 | 25.8482 | 58.06
CA61Q | 2 | 6.5227 | 3.7648 | 53.62 | 5.1163 | 3.0902 | 49.38 | 29.0637 | 18.1786 | 46.08
CA62Q | 2 | 5.2195 | 3.1495 | 49.47 | 4.2292 | 2.5453 | 49.71 | 29.8811 | 18.0850 | 49.19
CA63Q | 2 | 4.6250 | 3.1070 | 39.26 | 3.6524 | 2.4379 | 39.88 | 23.6166 | 16.5186 | 35.37
CA64Q | 2 | 5.3058 | 2.6717 | 66.04 | 4.1988 | 2.1914 | 62.83 | 28.9101 | 15.0593 | 63.00
CA65Q | 2 | 6.2138 | 4.7521 | 26.66 | 4.6336 | 3.4518 | 29.23 | 26.0402 | 18.6347 | 33.15
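The RPD columns in Table 8 are consistent with a relative percentage difference computed against the mean of the two models' scores. The small helper below uses this assumed definition (checked here only against a few tabulated rows, not taken from the authors' code) and reproduces, for example, the RMSE entry of station CA01R.

```python
# Assumed definition of the relative percentage difference (RPD) used in Table 8:
# absolute difference between the two models' scores, relative to their mean, in percent.
def rpd(hybrid: float, univariate: float) -> float:
    return abs(hybrid - univariate) / ((hybrid + univariate) / 2) * 100

print(round(rpd(4.5198, 4.3701), 2))  # 3.37, matching the RMSE RPD of station CA01R
```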
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
