1. Introduction
Air pollution is one of the biggest environmental concerns around the world [
1]. Air pollution levels are strongly influenced by meteorological factors, with PM
2.5 being the most dangerous pollutant [
2]. Fine particulate matter has a negative effect on health and harms the body when we are exposed to it in the short and long term [
3]. In 2013, the WHO’s International Agency for Research on Cancer (IARC) identified particulate matter as a potential cause of lung cancer. Exposure to PM
2.5 can lead to respiratory illnesses such as chronic obstructive pulmonary disease (COPD) [
4], acute lower respiratory tract infections, myocardial infarctions, and lung cancer [
5,
6], as well as other acute and chronic respiratory illnesses, heart disease, and strokes in both urban and rural regions. The levels of PM
2.5 and other pollutants emitted into the atmosphere have started to interfere with normal biological functions [
7]. By emphasizing the significance of precise PM
2.5 prediction in health risk assessment and air pollution control, the study of PM
2.5 offers a scientific foundation for comprehending and preventing air pollution in different areas [
8,
9]. The major sources of PM
2.5 include biomass burning, industrial activities, and motor vehicles [
10,
11].
Urbanization and industry have been linked to air pollution; as a result, a lot of studies have been performed in cities in developed countries [
12,
13,
14]. Monitoring zones are commonly classified as rural, suburban, urban, or industrial. Air pollutant concentrations, such as PM concentrations, have been found to be higher in urban areas than in rural areas of the same region [
15]. Concentrations of particulate matter are influenced by weather and climate factors, since the movement and dispersion of air pollutants in the surrounding atmosphere are affected by wind patterns, vertical mixing, and precipitation. These processes lead to alterations in particulate matter such as PM
2.5 concentrations over space and time [
16]. Rapid urban development and economic growth have accelerated the growth of the automobile industry and the number of vehicles in cities. In recent years, environmental pollution, including PM
2.5 from motor vehicles has worsened [
17,
18]. Furthermore, it has been observed that PM
2.5 exhibits more favorable characteristics in high-income countries that have completed the stages of rapid urbanization and economic growth. Nevertheless, low- and middle-income nations are either experiencing, or are on the verge of, fast-paced urbanization and economic development, which poses a potential threat regarding the concentration of particulate matter [
18].
Spatial autocorrelation is used to identify the correlations in data across location and time. Moran’s I evaluates the dispersion, randomness, or clustering of data patterns. Moran’s I is frequently used in geography and geographic information science (GIS) to assess how closely clustered certain map features are. A study by Zhang et al. [
19] performed a spatial autocorrelation analysis using Moran’s I and found that the air quality in 19 Chinese cities fluctuated over the past six years, with winter pollution being more severe. The spatial autocorrelation of these changes showed similar trends across different regions, indicating an expansion trend in air pollution levels. In another study, global and local Moran’s I were used to measure the spatial autocorrelation of the air quality index (AQI) values from thirteen cities in China. The study revealed that the air quality in Northern and Southern Jiangsu is better than in the central region, with Southern Jiangsu experiencing worse air quality in certain seasons and months [
20]. A study in Poland by Danek et al. [
21], using a geostatistical approach (standardized geographically weighted regression (GWR), Moran’s I, and Getis-Ord Gi) on data from the COVID-19 lockdown period, found that topography, meteorological variables, and PM concentrations are related. The results revealed a correlation between meteorological factors and pollution, with higher PM
10 values linked to lower temperatures and higher relative humidity. Before estimating via kriging, it is important to fit semi-variogram models (exponential, spherical, stable, and Gaussian) to check which model best describes the spatial structure of the data. The model with the smallest residuals should be selected when modelling a variogram. A study by Tong et al. [
22] described how the exponential model was the most suitable for monitoring Wuhan’s air quality in terms of mean error, while the Gaussian model was the best one to reflect the AQI in Wuhan, of which the northern region was inferior to the southern one. Another study by Aziz et al. [
23] demonstrated that the exponential model, which was used to determine the ideal number and location of rain gauge stations in Malaysia, was the best semi-variogram model.
A family of statistical methods known as “geostatistics” was created to assess and predict the values of a variable that is distributed in space, for example in the air or at ground level [
24]. Kriging is a geostatistical method for estimating values at unsampled locations that considers the degree of variation and the distance between known data points. A few researchers have compared kriging methods in their research. A study by Gia Pham et al. [
25] assessed the most effective environmental variables to estimate soil parameters with regression kriging (RK) and compared the results of the ordinary kriging (OK) and RK methods. A previous study by Belkhiri et al. [
26] improved the spatial interpolation of the groundwater quality index using geostatistical interpolation techniques like co-kriging (CK) and OK.
In this study, the changes in PM2.5 concentrations between 2019 and 2020 in two zones, the central zone and the south zone, containing 21 monitoring stations (11 stations in the central zone and 10 in the south zone), are identified using the kriging method, which allows concentrations to be estimated in areas with limited access. The objective of this study is to investigate the temporal and spatial changes in the PM2.5 concentrations in the central and south zones of Peninsular Malaysia in the years 2019 and 2020. The spatial analysis of PM2.5 and the air pollutant index (API) was conducted using Moran’s I and variogram modelling, followed by interpolation with ordinary kriging, universal kriging, and simple kriging.
2. Materials and Methods
2.1. Study Area
This study focuses on the concentrations of particulate matter with a diameter size below 2.5 μm (PM
2.5). The study area in Peninsular Malaysia has been divided into two zones: the central zone and the south zone (
Figure 1).
The central zone, situated in the heart of Peninsular Malaysia, encompasses the states of Selangor, Kuala Lumpur, Putrajaya, and Negeri Sembilan. This region is commonly referred to as the Kuala Lumpur mega-metropolitan environment and encompasses the Klang Valley. The Klang Valley region is widely recognized as the most developed area in Malaysia, with the highest population in the country. Kuala Lumpur is the capital city of Malaysia and holds the status of a federal territory. Putrajaya, which is also within this region, functions as the administrative capital of Malaysia. The central zone is predominantly characterized by urban and semi-urban environments, primarily due to population expansion, increased industrial and economic activity, and its role as the administrative center for the federal government. The central zone is additionally impacted by industrial activity and transportation emissions resulting from a significant concentration of motor vehicles, particularly within Kuala Lumpur city center.
The south zone consists of two states, Melaka and Johor. This area is dominated by urban and industrial activities, especially in the southern region of Johor next to Singapore. Most of the urban environments and townships are located on the west coast, and the rural areas are located toward the middle of the peninsula. Cities such as Melaka Town and Johor Bahru are usually influenced by heavy traffic and industrial activities. In addition to local emissions, the air quality in the central and south zones is influenced by transboundary emissions, especially from Sumatra, Indonesia, during the southwest monsoon.
The average annual temperature for Peninsular Malaysia is about 26 °C, with little variation between months. Both zones have two distinct seasons: the northeast monsoon (November to March) and the southwest monsoon (May to September). The northeast monsoon brings heavy rainfall and thunderstorms to the zone, while the southwest monsoon brings drier and cooler air. The average annual rainfall is about 2000 mm, with most of it falling during the northeast monsoon.
2.2. Data Collection and Analysis
The concentrations of PM
2.5 were collected at 11 continuous air quality monitoring stations (CAQMSs) in the central zone and 10 CAQMSs in the south zone of Peninsular Malaysia.
Table S1 (Supplementary Material) shows the description of the monitoring stations in Kuala Lumpur, Putrajaya, Selangor, Negeri Sembilan (central zone), and Melaka and Johor (south zone) as additional information.
Figure 1 shows the location of each monitoring station in the central and south zones of Peninsular Malaysia. All the CAQMS are operated by Pakar Scieno TW Sdn Bhd on behalf of the Malaysian Department of Environment. A Thermo Scientific tapered element oscillating microbalance (TEOM) 1405-DF (USA) was used to determine PM
2.5 concentrations. All calibration procedures and quality assurance/quality control (QA/QC) of the data were conducted by Pakar Scieno TW Sdn Bhd (Shah Alam, Malaysia) before the data were submitted to the Malaysian Department of Environment [
27,
28]. In this study, data analysis was conducted using the “geoR” and “gstat” packages in R version 4.1.2, while SAGA GIS 7.8.2 and QGIS version 3.26.2 were used to produce the raster images and maps of the study area.
2.3. Classification Breakpoint of PM2.5 Concentrations in the Air Pollutant Index (API)
The breakpoint of PM
2.5 for API calculation is based on the concentration suggested by the Malaysian Department of Environment, which, in turn, is based on the breakpoint concentrations adopted by the United States Environmental Protection Agency (US EPA). There are a few categories of breakpoints of PM
2.5 concentrations for 24 h averages that will be referred to in this study: Good (0–12.0 µg/m
3); Moderate (12.1–35.4 µg/m
3); Unhealthy for Sensitive Groups (35.5–55.4 µg/m
3); Unhealthy (55.5–150.4 µg/m
3); and Very Unhealthy (150.5–250.4 µg/m
3), based on the US EPA recommendations [
29]. Based on the Malaysian Ambient Air Quality Standard, the 24 h average standard for particulate matter with a size of less than 2.5 microns (PM
2.5) is 35 µg/m
3, whereas for one year, it is 15 µg/m
3.
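The breakpoint categories listed above can be applied programmatically. The following is a minimal sketch, assuming that a 24 h mean is rounded to one decimal place before lookup (the rounding rule and the function name `classify_pm25` are illustrative assumptions, not part of the study):

```python
from bisect import bisect_left

# Upper bounds (µg/m³, 24 h average) of the categories quoted in Section 2.3.
UPPER_BOUNDS = [12.0, 35.4, 55.4, 150.4, 250.4]
CATEGORIES = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
]


def classify_pm25(concentration):
    """Return the breakpoint category for a 24 h mean PM2.5 concentration."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # Assumption: round to one decimal place, matching the precision
    # of the published breakpoints.
    value = round(concentration, 1)
    index = bisect_left(UPPER_BOUNDS, value)
    if index == len(CATEGORIES):
        # Values above 250.4 µg/m³ fall outside the categories listed here.
        return "Above listed range"
    return CATEGORIES[index]
```

For example, `classify_pm25(20.0)` falls in the Moderate band (12.1–35.4 µg/m³).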
2.4. Spatial Autocorrelation
In this study, PM2.5 concentration is the main parameter evaluated, based on data from different stations. Moran’s I was used to quantify the spatial autocorrelation based on the location and value of the parameter.
2.4.1. Moran’s I Method
Patrick Alfred Pierce Moran developed the Moran’s I statistic to measure spatial autocorrelation, which is essentially the correlation of a signal between nearby spatial locations. Moran’s I is written as (1):

$$I = \frac{N}{W} \, \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \quad (1)$$

where $N$ is the number of spatial units indexed by $i$ and $j$, $x$ is the variable of interest, $\bar{x}$ is the mean of $x$, $w_{ij}$ is an element of the spatial weights matrix, whose diagonal contains zeros (i.e., $w_{ii} = 0$), and $W$ is the sum of all $w_{ij}$.
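As an illustration of how the global Moran’s I can be evaluated from station coordinates and concentrations, a minimal NumPy sketch is given below. The row-standardised inverse-distance weighting and the function names are assumptions made for the example; the weighting scheme used in the study is not stated in this section.

```python
import numpy as np


def inverse_distance_weights(coords):
    """Row-standardised inverse-distance weights with a zero diagonal (w_ii = 0)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    weights = np.zeros_like(dist)
    positive = dist > 0  # excludes the diagonal (and any coincident points)
    weights[positive] = 1.0 / dist[positive]
    return weights / weights.sum(axis=1, keepdims=True)


def morans_i(values, weights):
    """Global Moran's I: (N / W) * sum_ij w_ij z_i z_j / sum_i z_i^2, z = x - mean(x)."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()
    return (x.size / w.sum()) * (z @ w @ z) / (z @ z)
```

Values of I close to +1 indicate clustering of similar values, values close to -1 indicate a dispersed (checkerboard-like) pattern, and values near -1/(N - 1) are consistent with spatial randomness.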
Moran’s I employs the subsequent null and alternative hypotheses:
H0. The data are randomly dispersed.
HA. The data are not randomly dispersed (they are clustered in noticeable patterns).
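The $z$-score for the statistic is calculated as

$$z_I = \frac{I - E[I]}{\sqrt{V[I]}} \quad (2)$$

where $E[I]$ and $V[I]$ are

$$E[I] = -\frac{1}{N - 1} \quad (3)$$

$$V[I] = E[I^2] - E[I]^2 \quad (4)$$

where $E[I]$ is the expected index value, and $V[I]$ is the variance of the index.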
The method uses hypothesis testing to determine whether the specified pattern is clustered, dispersed, or random, given a set of associated features and attributes. To assess the significance of Moran’s I index, the tool calculates its value along with the $z$-score and $p$-value. If the $p$-value for Moran’s I is less than the specified level of significance (i.e., $\alpha = 0.05$) and the $z$-score is positive, then the null hypothesis is rejected: the dataset exhibits a more spatially clustered distribution of high and/or low values than would be expected from random underlying spatial processes. If the $p$-value for Moran’s I is less than the specified level of significance (i.e., $\alpha = 0.05$) and the $z$-score is negative, then the null hypothesis is also rejected: the dataset’s spatial distribution of high and low values is more dispersed than would be expected if the underlying spatial processes were random [30,31].
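The decision rule above can be illustrated with a short sketch. Note that this example swaps in a permutation approach: instead of the analytical expectation and variance, the null distribution of Moran’s I is approximated by randomly reshuffling the values over the locations. This is a common substitute, not necessarily the procedure used in the study; the function names, the number of permutations, and the seed are assumptions.

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for values observed at locations with weights w_ij."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()
    return (x.size / w.sum()) * (z @ w @ z) / (z @ z)


def permutation_test(values, weights, n_permutations=999, seed=0):
    """Return (I, z-score, pseudo p-value) under the null of spatial randomness."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    observed = morans_i(x, weights)
    # Reshuffling values over locations simulates "the data are randomly dispersed".
    simulated = np.array(
        [morans_i(rng.permutation(x), weights) for _ in range(n_permutations)]
    )
    expected = simulated.mean()
    z_score = (observed - expected) / simulated.std(ddof=1)
    # Two-sided pseudo p-value: share of simulations at least as extreme as observed.
    n_extreme = np.sum(np.abs(simulated - expected) >= abs(observed - expected))
    p_value = (n_extreme + 1) / (n_permutations + 1)
    return observed, z_score, p_value
```

With a significance level of 0.05, a small p-value combined with a positive z-score points to clustering, and a small p-value combined with a negative z-score points to dispersion.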
2.4.2. Variogram
The variogram is an important input in kriging interpolation. In order to describe the geographical variation in the pollutant concentrations, the geostatistical procedure known as kriging defines a correlation (semi-variogram) among the sample locations [
32]. This is a statistical measure that quantifies the spatial correlation between two points. An exploratory data analysis tool that plots half of the mean squared difference between paired observations against their separation distance is called an experimental variogram. The variogram shows how comparable the values are between close measurements. The experimental variogram can be fitted with a variogram model. For kriging, the model coefficients are necessary [
33]. The theoretical variogram is that of the random process $Z(x)$ that, we believe, is responsible for the actual realization on the ground [34]. The variogram characterizes the spatial variability of the regional variable between two places. Equation (5) is used to calculate the semi-variogram $\gamma(h)$ from data pairs that are separated by the distance $h$:

$$\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \left[ Z(x_i) - Z(x_i + h) \right]^2 \quad (5)$$

where $h$ denotes the lag distance between two observation locations, $Z(x_i)$ is the value of the regional variable of interest at the observation location $x_i$, $Z(x_i + h)$ is the value of the regional variable of interest at the location $x_i + h$, and $N(h)$ is the number of data pairs of observation locations separated by $h$. In practice, $h$ is represented by a distance interval, since there is little chance that the distance between sampled pairs will be exactly $h$.
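The experimental semi-variogram described above can be sketched as follows. The code groups station pairs into distance intervals of equal width and, for each interval, returns half of the mean squared difference between paired values. The function name, the equal-width binning, and the use of the mean pair distance as the representative lag are assumptions made for illustration.

```python
import numpy as np


def experimental_semivariogram(coords, values, lag_width, n_lags):
    """Return (mean lag, semi-variance, pair count) for each non-empty lag bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Each unordered pair of stations is used exactly once.
    i, j = np.triu_indices(len(values), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2
    # Bin b collects pairs with b * lag_width <= distance < (b + 1) * lag_width.
    bins = np.floor(dist / lag_width).astype(int)
    lags, gammas, counts = [], [], []
    for b in range(n_lags):
        mask = bins == b
        n_pairs = int(mask.sum())
        if n_pairs == 0:
            continue  # no pairs fall in this distance interval
        lags.append(dist[mask].mean())
        gammas.append(sq_diff[mask].sum() / (2.0 * n_pairs))
        counts.append(n_pairs)
    return np.array(lags), np.array(gammas), np.array(counts)
```

With only 10 or 11 stations per zone, the number of pairs per bin is small, so the choice of `lag_width` involves a trade-off between resolution and stability of the estimates.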
A fitted semi-variogram has three components: the nugget, the range, and the sill. The sill, which is made up of the nugget and the partial sill, is the height at which the semi-variogram levels off. The range is the distance at which the semi-variogram reaches the sill. The measurement errors or micro-scale variations are represented by the nugget effect [23]. The semi-variogram model, the number $N$ of PM2.5 observations, and their spatial positions all affect the prediction variance. Consequently, selecting the right semi-variogram model is crucial for obtaining the best estimation variance. The semi-variogram obtained from the experimental data at the observation sites is compared to a theoretical semi-variogram model $\gamma(h)$. The exponential, Gaussian, and stable semi-variogram models were selected for model fitting and are written as [35]:
Exponential model:

$$\gamma(h) = b + C_0 \left( 1 - e^{-3h/r} \right) \quad (6)$$

Gaussian model:

$$\gamma(h) = b + C_0 \left( 1 - e^{-3h^2/r^2} \right) \quad (7)$$

where $h$ specifies the lag of separating distances for which the dependent variable shall be calculated and must be a positive real number; $r$ is the effective range, i.e., the lag at which 95% of the sill is reached, which is needed because an exponential function only approaches the sill asymptotically; $C_0$ represents the sill of the variogram, where it will flatten out; and $b$ represents the nugget of the variogram.

Stable model:

$$\gamma(h) = b + C_0 \left( 1 - e^{-3(h/r)^s} \right) \quad (8)$$

where $h$ specifies the lag of separating distances for which the dependent variable shall be calculated, and $r$ is the effective range, i.e., the lag at which 95% of the sill is reached. This is needed because the exponential term of the stable model only approaches the sill asymptotically. $C_0$ is the sill of the variogram, where it will flatten out, and $s$ is the shape parameter. For $s \le 2$, the model will be shaped more like an exponential or spherical model; for $s > 2$, it will be shaped most like a Gaussian function. $b$ is the nugget of the variogram.
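To make the three theoretical models concrete, the sketch below implements them under one stated assumption: each model is scaled with a factor of 3 in the exponent so that about 95% of the sill contribution is reached at the effective range r. The parameter names (`r`, `c0`, `b`, `s`) follow the definitions in the text, but the exact parameterisation used in the study is an assumption, as is the helper that scores a model against experimental points.

```python
import numpy as np


def exponential(h, r, c0, b=0.0):
    """Exponential model: nugget b plus c0 * (1 - exp(-3 h / r))."""
    h = np.asarray(h, dtype=float)
    return b + c0 * (1.0 - np.exp(-3.0 * h / r))


def gaussian(h, r, c0, b=0.0):
    """Gaussian model: nugget b plus c0 * (1 - exp(-3 h^2 / r^2))."""
    h = np.asarray(h, dtype=float)
    return b + c0 * (1.0 - np.exp(-3.0 * (h / r) ** 2))


def stable(h, r, c0, s, b=0.0):
    """Stable model with shape parameter s (s = 1 and s = 2 recover the two models above)."""
    h = np.asarray(h, dtype=float)
    return b + c0 * (1.0 - np.exp(-3.0 * (h / r) ** s))


def sum_of_squared_residuals(model, lags, gammas, **params):
    """Squared misfit between a model and experimental semi-variogram points."""
    residuals = np.asarray(gammas, dtype=float) - model(lags, **params)
    return float(np.sum(residuals ** 2))
```

Candidate models can then be ranked by `sum_of_squared_residuals`, in line with the idea that the model with the smallest residuals is preferred.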
2.5. Spatial Interpolation
Kriging is a family of geostatistical interpolation methods and is an alternative to deterministic techniques such as inverse-distance-weighted (IDW) and spline interpolation. Kriging is a multi-step procedure that involves exploratory statistical analysis of the data, variogram modelling, surface creation, and, where applicable, exploration of a variance surface. The kriging interpolation technique is the best linear unbiased estimator (BLUE): it uses the variogram to represent the spatial autocorrelation between pairs of points in space and minimizes the variance of the prediction error. When the form of the semi-variogram is known, it is possible to estimate the concentrations of the variables at any unsampled location using kriging techniques [
35]. Three techniques are applied in this study: ordinary kriging (OK), simple kriging (SK), and universal kriging (UK).
2.5.1. Ordinary Kriging
The basic geostatistical approach for modelling the geographical distribution of a random variable is known as ordinary kriging (OK). The OK method is an optimal spatial interpolation method in which the value of the random variable $Z$ at an unsampled location $x_0$ is determined as a linear combination of the known values at all the sampled locations, as follows:

$$\hat{Z}(x_0) = \sum_{i=1}^{M} \lambda_i Z(x_i) \quad (9)$$

where $x_i$ is a sample location between Station 1 and Station 21, $x_0$ is the unsampled location, $\hat{Z}(x_0)$ is the unknown value of the random variable (PM2.5) to be determined at the unsampled location, $Z(x_i)$ denotes the known value of the random variable (PM2.5) at the sample location $x_i$, $M$ is the total number of known values of the random variable $Z$ at the sampled locations, and $\lambda_i$ is the kriging weight assigned to the random variable $Z(x)$ at the sampled location $x_i$, which is used to determine $\hat{Z}(x_0)$ [36]. The weight $\lambda_i$ indicates the contribution of each observation and is chosen to ensure that $\hat{Z}(x_0)$ is unbiased; that is, the mathematical expectation of the difference between the predicted value $\hat{Z}(x_0)$ and the observed value $Z(x_0)$, i.e., the average estimation error or residual, is zero. The goal is to minimize the variance of the difference between the predicted value $\hat{Z}(x_0)$ and the observed value $Z(x_0)$ [37]. Ordinary kriging can use either semi-variograms or covariances, can apply transformations and remove trends, and allows for measurement error. Fitri et al. [38] explain that ordinary kriging has a good level of accuracy when estimating the concentrations of PM2.5 in Surabaya.
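The weighted linear combination used in ordinary kriging can be illustrated with a small NumPy sketch. It solves the standard OK system in semi-variogram form, with a Lagrange multiplier enforcing that the weights sum to one. The exponential variogram with a factor-3 effective-range scaling, the parameter values, and the function names are assumptions for the example; the study itself used R packages.

```python
import numpy as np


def exponential_variogram(h, r, c0, b):
    """Exponential semi-variogram with nugget b; gamma(0) is set to 0 by definition."""
    h = np.asarray(h, dtype=float)
    gamma = b + c0 * (1.0 - np.exp(-3.0 * h / r))
    return np.where(h > 0, gamma, 0.0)


def ordinary_kriging(coords, values, target, variogram):
    """Return (estimate, kriging variance, weights) at one unsampled location."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    m = len(values)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Kriging matrix: semi-variances between stations, bordered by the
    # unbiasedness constraint (weights sum to one).
    a = np.ones((m + 1, m + 1))
    a[:m, :m] = variogram(dist)
    a[m, m] = 0.0
    rhs = np.ones(m + 1)
    rhs[:m] = variogram(np.linalg.norm(coords - target, axis=1))
    solution = np.linalg.solve(a, rhs)
    weights, lagrange = solution[:m], solution[m]
    estimate = weights @ values
    variance = weights @ rhs[:m] + lagrange
    return estimate, variance, weights
```

A possible call is `ordinary_kriging(coords, values, (x0, y0), lambda h: exponential_variogram(h, r=2.0, c0=1.0, b=0.1))`, where `coords` and `values` hold the station locations and PM2.5 means. At a station location the estimate reproduces the observed value and the kriging variance is zero.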
2.5.2. Simple Kriging
Simple kriging (SK) creates an estimate by adjusting a known mean. In the SK equation, the mean value of the stationary random variable, $m$, is presumed to be constant and known throughout the research zones. Under this global mean assumption, the SK estimator must be unbiased and have a minimum variance of the estimation error. The SK equation is written as (10):

$$\hat{Z}_{SK}(x_0) = m + \sum_{i=1}^{M} \lambda_i \left[ Z(x_i) - m \right] \quad (10)$$

where $Z(x_i)$ is the random variable (PM2.5) at the location $x_i$, the $x_i$ are the $M$ data locations, $m(x_i) = E[Z(x_i)]$ is the location-dependent expected value of the random variable $Z(x_i)$, which in SK equals the constant $m$, $\hat{Z}_{SK}(x_0)$ is the linear regression estimator, $\lambda_i$ is the weight, and $m$ is the mean [35].
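Simple kriging differs from ordinary kriging in that the mean is treated as known, so no constraint on the weights is needed. The following sketch, written in covariance form, illustrates this under assumed choices: an exponential covariance without nugget, a user-supplied mean, and hypothetical function names.

```python
import numpy as np


def exponential_covariance(h, r, c0):
    """Covariance C(h) = c0 * exp(-3 h / r), i.e. sill minus exponential variogram."""
    h = np.asarray(h, dtype=float)
    return c0 * np.exp(-3.0 * h / r)


def simple_kriging(coords, values, target, covariance, mean):
    """Return (estimate, kriging variance, weights) for a known constant mean."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    c_matrix = covariance(dist)
    c_target = covariance(np.linalg.norm(coords - target, axis=1))
    # No Lagrange multiplier: the weights are not forced to sum to one.
    weights = np.linalg.solve(c_matrix, c_target)
    # The estimate is the known mean plus weighted residuals from that mean.
    estimate = mean + weights @ (values - mean)
    variance = float(covariance(0.0)) - weights @ c_target
    return estimate, variance, weights
```

Far from all stations the weights shrink toward zero and the estimate falls back to the assumed mean, which is why SK is sensitive to how well that mean is known.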
2.5.3. Universal Kriging
Universal kriging (UK) is kriging with a trend and is similar to OK. UK addresses the situation in which the local mean varies within the research region. Even though the local mean $m(x)$ is unknown, much as in OK, UK models it as a linear combination of coordinate functions. UK can therefore handle a nonstationary mean in which the expected value of $Z(x)$ is a linear or higher-order deterministic function of the spatial coordinates of the data points. The random function $Z(x)$ is a combination of a trend component with deterministic variation, $m(x)$, and a residual component, $\varepsilon(x)$. UK is described as

$$Z(x) = m(x) + \varepsilon(x), \quad m(x) = \sum_{k=0}^{K} a_k f_k(x), \quad \hat{Z}_{UK}(x_0) = \sum_{i=1}^{M} \lambda_i Z(x_i) \quad (11)$$

where $\lambda_i$ is the kriging weight, $a_k$ is the fixed, unknown coefficient of the coordinate function $f_k(x)$, $Z(x)$ is the random function, $m(x)$ is a trend component with deterministic variation, and $\varepsilon(x)$ is a residual component [35].
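A compact way to see how universal kriging handles a trend is to extend the ordinary kriging system with one constraint per coordinate function. The sketch below assumes a first-order (linear) trend with coordinate functions 1, x, and y and an exponential semi-variogram; both choices and the function names are illustrative assumptions rather than the configuration used in the study.

```python
import numpy as np


def exponential_variogram(h, r, c0, b):
    """Exponential semi-variogram with nugget b; gamma(0) is set to 0 by definition."""
    h = np.asarray(h, dtype=float)
    gamma = b + c0 * (1.0 - np.exp(-3.0 * h / r))
    return np.where(h > 0, gamma, 0.0)


def universal_kriging(coords, values, target, variogram):
    """Return (estimate, weights) assuming a linear trend a0 + a1 * x + a2 * y."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    m = len(values)
    # Coordinate functions f_k evaluated at the stations: 1, x, y.
    f = np.column_stack([np.ones(m), coords])
    p = f.shape[1]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    a = np.zeros((m + p, m + p))
    a[:m, :m] = variogram(dist)
    a[:m, m:] = f
    a[m:, :m] = f.T
    # Right-hand side: semi-variances to the target, then f_k at the target.
    d0 = np.linalg.norm(coords - target, axis=1)
    rhs = np.concatenate([variogram(d0), [1.0], target])
    solution = np.linalg.solve(a, rhs)
    weights = solution[:m]
    return weights @ values, weights
```

Because the constraints force the weights to reproduce each coordinate function at the target, data that follow an exact linear trend are recovered exactly, whatever the variogram parameters.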
2.6. Performance Indicator
The spatial variability needed for the kriging approach can be explained using the theoretical semi-variogram model [
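37]. The mean squared error (MSE), root-mean-square error (RMSE), and normalized root-mean-square error (NRMSE) were used to quantify the estimation error [39]. The MSE and RMSE are given by

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 \quad (12)$$

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} \quad (13)$$

and the NRMSE is the RMSE normalized with respect to the observed values, where $y_i$ is the actual value of PM2.5 at the location $i$, $\hat{y}_i$ is the predicted value at location $i$, and $n$ is the number of observations.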
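The performance indicators can be computed with a few lines of code. In the sketch below, the NRMSE is expressed as a percentage of the mean observed value; this normalization is an assumption, since the normalizing quantity is not recoverable from the text, and other conventions (for example, the range of the observations) are also in use.

```python
import math


def performance_indicators(observed, predicted):
    """Return (MSE, RMSE, NRMSE in %) for paired observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mse = sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n
    rmse = math.sqrt(mse)
    # Assumption: normalize by the mean of the observations, as a percentage.
    nrmse = 100.0 * rmse / (sum(observed) / n)
    return mse, rmse, nrmse
```

In a kriging context the predictions would typically come from cross-validation, where each station is withheld in turn and predicted from the remaining stations.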
4. Discussion
For spatial autocorrelation, Moran’s I method was used to assess the relationship between PM2.5 values at nearby locations. Then, for kriging, a semi-variogram was needed to describe how the differences in PM2.5 concentrations between locations vary with distance. The three semi-variogram models were compared, and the best model was selected and used in kriging. OK, SK, and UK were then compared to identify the method with the best performance for the data provided. From the map, the central zone (11 stations) showed slightly higher concentrations compared to the south zone. The central zone consists of suburban, urban, and industrial areas with high populations and high economic activities. The Klang (S7), Shah Alam (S8), and Cheras (S5) monitoring stations had high PM2.5 concentrations compared to other stations in 2019. The south zone also consists of urban, suburban, and industrial areas.
Figure 3 represents a map of spatiotemporal changes in PM
2.5 concentrations in the central zone using universal kriging. In the year 2019, there was an orange and yellow area on the map because many activities occurred, causing a high mean concentration of PM
2.5. Petaling Jaya (S5), Shah Alam (S6), Klang (S7), and Nilai (S9) are stations located in urban, suburban, and industrial areas with high socioeconomic activities that release PM
2.5. In the year 2020, the concentration of PM
2.5 was less than 10 to 20 µg/m
3, in the moderate categories, which was lower compared to the previous year because minimum activity occurred due to the COVID-19 outbreak. Most non-essential activities were required to close to prevent the spread of the virus and to keep people at home. This result indicates that due to the stay-at-home order, the concentrations of PM
2.5 were reduced compared to before the COVID-19 era. Fuel combustion decreased along with the reduction in traffic volume brought on by COVID-19. As a result, the concentrations of smaller particles (PM
2.5) kept dropping. However, not all industries were shut down, including those that dealt with food and those that produced items like masks and sanitizers that were relevant to COVID-19 [
44]. The Klang station shows an increased concentration of PM
2.5 to 20–30 µg/m
3, but still in the moderate categories. Klang is where Port Klang is located, which is the 12th busiest transshipment port in the world. Despite COVID-19, the port operated normally. Therefore, the contribution of these industries and associated dust still led to a rise in the concentration of particulate matter (PM) [
45].
The spatiotemporal maps provide, through interpolation, predictions at locations that cannot be reached. Based on the comparison of several methods, universal kriging provided the best performance according to the performance indicators. For the maps in
Figure 3 and
Figure 4, the concentrations of PM
2.5 are in the good and moderate ranges in most areas. In 2019, there was a yellow area of 20–30 µg/m
3 PM
2.5 and a green area of 10–20 µg/m
3 PM
2.5; therefore, the scale is between good and moderate, while, in 2020, all parts were green because the concentrations of PM
2.5 were 10–20 µg/m
3, which is between the good and moderate categories. The maps show temporal changes between the years 2019 and 2020. In the year 2019, the concentrations of PM
2.5 were more concentrated in certain areas compared to the year 2020. There were socioeconomic activities with heavy transportation, manufacturing, and combustion occurring in the area during that time [
46]. In the year 2020, most activities were shut down due to the COVID-19 outbreak causing people to stay at home. The government implemented a movement control order (MCO) to prevent activities that may have caused the spread of the virus. Some activities still occurred, but at a minimum level [
47,
48,
49]. From the map, we can see that the highest concentrations of PM2.5 in 2020 are lower than those in 2019.
5. Conclusions
Moran’s I was used to assess the spatial autocorrelation of PM2.5 concentrations. The results show that the concentrations of PM2.5 in 2019 were not randomly distributed, because economic activities occurred regularly in certain areas and haze episodes occurred at that time. In 2020, based on hypothesis testing, the concentrations were randomly distributed, because the movement control order allowed only a few industrial and transportation operations in certain areas during certain periods of the COVID-19 era. The semi-variogram models were compared for 2019 and 2020 using performance indicators, and the findings showed that different models should be chosen for different years. In the central zone, the Gaussian model was the best model in 2019 and the stable model in 2020; in the south zone, the stable model gave the best performance indicators in both 2019 and 2020. Three kriging methods were used: simple kriging, ordinary kriging, and universal kriging. Based on the performance indicators, universal kriging showed better performance than the other kriging methods in both years, 2019 (MSE = 13.9549, RMSE = 3.7356, NRMSE = 18.9385) and 2020 (MSE = 15.1398, RMSE = 3.8909, NRMSE = 20.1616).
This study is limited by the small number of monitoring stations and the limited data, so some interpolated values may not be accurate. More monitoring stations would give more data to interpolate across a specific area. In the future, more air quality data could be added, including meteorological data, and more kriging methods could be used to improve the results. The study should also be expanded to the whole of Peninsular Malaysia and other regions.