Article

Comparison of Models for Missing Data Imputation in PM-2.5 Measurement Data

Ju-Yong Lee, Seung-Hee Han, Jin-Goo Kang, Chae-Yeon Lee, Jeong-Beom Lee, Hyeun-Soo Kim, Hui-Young Yun and Dae-Ryun Choi
1 Department of Environmental and Engineering, Graduate School of Anyang University, Anyang 14028, Republic of Korea
2 Division of Ocean & Atmosphere Sciences, Korea Polar Research Institute, Songdo, Incheon 21990, Republic of Korea
3 Department of Environmental and Energy Engineering, Anyang University, Anyang 14028, Republic of Korea
* Author to whom correspondence should be addressed.
Atmosphere 2025, 16(4), 438; https://doi.org/10.3390/atmos16040438
Submission received: 18 February 2025 / Revised: 31 March 2025 / Accepted: 7 April 2025 / Published: 9 April 2025
(This article belongs to the Special Issue New Insights in Air Quality Assessment: Forecasting and Monitoring)

Abstract

The accurate monitoring and analysis of PM-2.5 are critical for improving air quality and formulating public health policies. However, environmental data often contain missing values due to equipment failures, data collection errors, or extreme weather conditions, which can hinder reliable analysis and predictions. This study evaluates the performance of various missing data imputation methods for PM-2.5 data in Seoul, Korea, using scenarios with artificially generated missing values during high- and low-concentration periods. The methods compared include FFILL, KNN, MICE, SARIMAX, DNN, and LSTM. The results indicate that KNN consistently achieved stable and balanced performance across different temporal intervals, with RMSEs of 5.65, 9.14, and 9.71 μg/m3 for the 6 h, 12 h, and 24 h intervals, respectively. FFILL demonstrated superior performance for short intervals (RMSE of 4.76 μg/m3 for 6 h) but showed significant limitations as intervals lengthened. SARIMAX performed well in long-term scenarios, with an RMSE of 9.37 μg/m3 for 24 h intervals, but at the cost of higher computational complexity. Conversely, deep learning models such as DNN and LSTM underperformed, highlighting the need for further optimization for time-series data. This study highlights the practicality of KNN as the most effective method for addressing missing PM-2.5 data in mid- to long-term applications due to its simplicity and efficiency. These findings provide valuable insights into the selection of imputation methods for environmental data analysis, contributing to the enhancement of data reliability and the development of effective air quality management policies.

1. Introduction

Air pollution has emerged as one of the most critical environmental issues in modern society, significantly impacting human health and ecosystems. In particular, fine particulate matter (PM-2.5) consists of airborne particles with a diameter of 2.5 μm or less, which can penetrate deep into the human respiratory system and cause various health problems [1]. Long-term exposure to PM-2.5 has been linked to respiratory diseases, cardiovascular diseases, and an increased incidence of lung cancer, making it a major contributor to premature deaths and the global burden of disease [2]. Even short-term exposure can trigger acute health effects, such as asthma attacks and breathing difficulties, underscoring the importance of PM-2.5 monitoring and analysis as a critical public health research issue.
Environmental measurement data, such as PM-2.5, often contain missing values due to equipment malfunctions, extreme weather conditions, and errors in the data collection process [3]. Missing values can reduce the accuracy and reliability of data analysis models, which is particularly crucial in the analysis and prediction of time series data [4]. Ignoring or improperly handling missing values leads to distorted analytical results. Therefore, the development of effective techniques for handling missing environmental measurement data is essential for air quality analysis.
Various methodologies have been proposed to handle missing values. Traditional statistical methods, such as mean imputation, linear interpolation, and regression imputation, are simple and intuitive to implement; however, they have limitations in capturing complex patterns and nonlinear characteristics [5,6]. In contrast, machine-learning-based approaches, such as K-Nearest Neighbors (KNN) and Multivariate Imputation by Chained Equations (MICE), achieve higher accuracy by considering the structural characteristics of the data [7,8]. These techniques enhance the reliability of missing data imputation by preserving the distribution and patterns of the dataset.
Recently, deep-learning-based models have been effectively utilized for handling missing data. Deep neural networks (DNN) can learn complex nonlinear relationships within the data to predict missing values [9], while long short-term memory (LSTM) networks further enhance prediction accuracy by considering the temporal dependencies in time-series data [10]. However, these advanced techniques require additional effort in data preprocessing, model design, and computational complexity, which are considered key limitations.
This study investigates missing data imputation using air quality measurement data from Seoul, South Korea. The city’s air pollution is intensifying due to a combination of industrial activities, increasing traffic, construction, and transboundary pollution sources [11,12]. Seasonal factors also play a significant role, with foreign pollutants contributing to a sharp rise in PM-2.5 concentrations during winter, while domestic pollution sources have a greater impact in summer [13]. Given the diverse causes and conditions of air pollution in Seoul, it is considered an ideal case study region for evaluating the effectiveness of missing data imputation techniques.
This study focuses on PM-2.5 data in the Seoul area, where missing values are artificially introduced to compare the performance of various imputation methods, including deep neural networks (DNN), long short-term memory (LSTM), Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX), K-Nearest Neighbors (KNN), Multiple Imputation by Chained Equations (MICE), and forward fill (FFILL). Through this analysis, we aim to assess the strengths and weaknesses of each method and propose an optimal approach for handling missing environmental data. Furthermore, by quantitatively evaluating the performance variations of each method under different missing data rates and patterns, we seek to enhance their applicability in real-world air quality data analysis. This study systematically verifies missing data imputation techniques to improve data completeness and reliability, ultimately contributing to evidence-based policymaking and environmental management.

2. Materials and Methods

2.1. Study Area and Data Selection

Seoul is a city with high PM-2.5 concentrations and severe air pollution, making it a suitable region for research. Various factors, including industrial activities, traffic volume, heating, and construction activities, collectively contribute to high levels of particulate matter in the air. In particular, pollutants emitted from industrial activities and the use of fossil fuels are major causes of air pollution, playing a significant role in the deterioration of Seoul’s air quality [11,12]. Additionally, the combination of domestic and transboundary pollution sources exacerbates PM-2.5 concentrations depending on the season. During winter, pollutants originating from foreign sources are transported into Seoul, causing a sharp increase in fine particulate matter concentrations, whereas in summer, domestic pollution sources tend to have a more significant impact [11].
The input data consisted of air quality data, meteorological data, and model prediction data for Seoul from 2019 to 2023. As shown in Figure 1, the input dataset was structured as follows: air pollutant concentrations for six species (NO2, CO, SO2, O3, PM-10, and PM-2.5) were obtained from AirKorea (https://www.airkorea.or.kr/web/, accessed on 4 November 2024), while meteorological variables, including pressure, temperature, dew point temperature, relative humidity, and wind speed, were sourced from the ASOS (Automated Synoptic Observing System) dataset provided by the Korea Meteorological Administration’s Open Data Portal. Additionally, predictions from the CMAQ (Community Multiscale Air Quality) chemical transport model were used as simulated forecast data.
The air quality monitoring station data used to construct the Seoul input dataset are summarized in Figure 2. Since the number of ASOS stations is smaller than that of air quality monitoring stations, a matching process was required. The matching was performed using a distance function to pair each air quality monitoring station with its nearest ASOS station. All air quality monitoring stations in Seoul were successfully matched with a meteorological station in Seoul, and the ASOS code for this station is 108.
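For reference, this nearest-station matching can be reproduced with a short Python script. The sketch below pairs each air quality monitoring station with its closest ASOS station by great-circle distance; the column names (lat, lon, code) are illustrative assumptions rather than the actual field names of the AirKorea or ASOS files.

import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance (km) between one point and an array of points
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2.0 * 6371.0 * np.arcsin(np.sqrt(a))

def match_nearest_asos(aqms: pd.DataFrame, asos: pd.DataFrame) -> pd.DataFrame:
    # For every air quality station, find the ASOS station with the smallest distance
    rows = []
    for _, aq in aqms.iterrows():
        d = haversine_km(aq["lat"], aq["lon"], asos["lat"].values, asos["lon"].values)
        nearest = asos.iloc[int(np.argmin(d))]
        rows.append({"aqms_code": aq["code"], "asos_code": nearest["code"],
                     "distance_km": float(np.min(d))})
    return pd.DataFrame(rows)

Applied to Seoul, this procedure assigns all air quality monitoring stations to the single Seoul ASOS station (code 108), consistent with the matching result described above.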
Averaging the data from the 25 air quality monitoring stations in Seoul produced a PM-2.5 series with no missing values (a 0% missing rate). Therefore, by artificially introducing missing values for specific periods in this complete dataset and reconstructing those periods with various models, it became possible to evaluate their performance and identify the most effective imputation model.
Moreover, actual missing data are more likely to occur continuously rather than in isolated one-hour gaps. Hence, this study evaluated the imputation performance under scenarios where continuous missing periods of 6 h, 12 h, and 24 h were introduced.
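As an illustration, continuous gaps of this kind can be injected into the complete hourly series with a small helper such as the one below; pm25 is assumed to be an hourly pandas Series with a DatetimeIndex, and the date shown corresponds to Case A-1 in Table 1.

import numpy as np
import pandas as pd

def introduce_gap(series: pd.Series, start: str, hours: int) -> pd.Series:
    # Mask a continuous block of `hours` hourly values beginning at `start`
    gapped = series.copy()
    gap_index = pd.date_range(start=start, periods=hours, freq="h")
    gapped.loc[gapped.index.intersection(gap_index)] = np.nan
    return gapped

# Example: a 24 h artificial gap on 22 November 2023 (Case A-1)
# pm25_gapped = introduce_gap(pm25, "2023-11-22 00:00", 24)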

2.1.1. Monthly and Seasonal Characteristics of PM-2.5 in Seoul

The monthly average temperature in South Korea exhibits distinct seasonal and monthly characteristics and patterns. Peak temperatures are consistently observed during the summer months (June to August), while the lowest temperatures occur in winter (December to February), reflecting a typical seasonal cycle. Similarly, the monthly average concentration of PM-2.5 demonstrates noticeable seasonal variations. PM-2.5 concentrations tend to rise during December and January, whereas relatively lower levels are observed from July to October. This seasonal trend is shown in Figure 3b, which illustrates that the Korean Peninsula possesses atmospheric characteristics marked by both seasonality and periodicity. Seoul shows similar temporal and seasonal patterns to the entire Korean Peninsula. Therefore, the missing data imputation method validated for the Seoul area is expected to be applicable to the Korean Peninsula as a whole and is likely to exhibit similar model performance in other regions with comparable monthly and seasonal patterns.

2.1.2. Outlier Detection for PM-2.5 Measurements

Measurement data are subject to outliers caused by factors such as measurement equipment failures and power outages, and these outliers can significantly impact the accuracy of missing data imputation. To ensure the reliability of the PM-2.5 data, this study applied outlier detection and filtering techniques.
First, outlier detection was conducted using the interquartile range (IQR) method. Subsequently, a correlation analysis among air pollutants was employed to filter and refine the detected outliers. The IQR method defines outliers based on the first quartile (Q1) and the third quartile (Q3), using the following formula [14]:
$\mathrm{Upper\ Bound} = Q_3 + 1.5 \cdot \mathrm{IQR}$
The IQR is defined as the difference between Q3 and Q1. Values above the upper bound are flagged as statistical outliers and are illustrated in Figure 4a. However, these statistically flagged values are not necessarily true outliers. To identify true outliers, a correlation analysis among major air pollutants was conducted for the flagged values. The correlations between PM-2.5 and other key pollutants, such as PM-10, NO2, CO, SO2, and O3, are shown in Figure 4b. This analysis revealed distinct patterns of high correlation with these pollutants, except for O3. We concluded that the flagged values are not true outliers because, when PM-2.5 concentrations are high, the concentrations of the precursors that contribute to PM-2.5 formation tend to be high as well. Additionally, air quality in Seoul is significantly influenced by transboundary pollutants originating from countries such as China and North Korea [15]. These transboundary inflows lead to sharply elevated PM-2.5 concentrations during the winter and spring seasons, providing further evidence that some values identified as outliers by the IQR method are not true outliers.
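A minimal Python sketch of this two-step screening is given below. The correlation check is simplified here to a quantile rule (flagged PM-2.5 hours accompanied by elevated precursor levels are retained as genuine pollution episodes), and the column names are assumptions rather than the study's actual variable names.

import pandas as pd

def iqr_outlier_flags(s: pd.Series, k: float = 1.5) -> pd.Series:
    # Statistical outliers: values above Q3 + k * IQR
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return s > q3 + k * (q3 - q1)

def confirm_true_outliers(df: pd.DataFrame, flags: pd.Series,
                          precursors=("PM10", "NO2", "CO", "SO2")) -> pd.Series:
    # Drop flags that coincide with elevated precursor concentrations,
    # since those hours reflect real pollution episodes rather than sensor errors
    elevated = pd.Series(False, index=df.index)
    for p in precursors:
        elevated |= df[p] > df[p].quantile(0.75)
    return flags & ~elevated

# statistical = iqr_outlier_flags(df["PM25"])
# true_outliers = confirm_true_outliers(df, statistical)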

2.2. Missing Data Imputation Periods

Missing values in the PM-2.5 data were artificially generated to evaluate the performance of the various imputation methods. The artificially created missing periods were categorized into high-concentration periods (daily average ≥ 35 μg/m3) and low-concentration periods (daily average < 35 μg/m3). Table 1 presents cases of the high- and low-concentration periods along with their causes.
The six cases were selected from the high- and low-concentration events reported in AirKorea's (www.airkorea.or.kr) air quality forecast analyses. The National Institute of Environmental Research (NIER) in South Korea provides air quality forecast analysis data to AirKorea four times a day. Although these cases are based on forecast analysis data and the expert opinions of operational forecasters, this study validated the selected dates through cross-validation against the observed pollutant concentrations and meteorological patterns at the time. As a result, these cases were deemed appropriate as representative scenarios. Case A represents high-concentration scenarios, while Case B represents low-concentration scenarios.
For Case A, the scenarios include transboundary inflow, stagnation after transboundary inflow, and stagnation. For Case B, the scenarios include rain, snow, and atmospheric circulation. In Case A-1, a high concentration of PM-2.5 occurred due to the southeastward movement of transboundary PM-2.5 inflow that originated during the early morning hours. Case A-2 was characterized by high concentrations following transboundary inflow and atmospheric stagnation, which developed after rainfall in the afternoon of the 19th, transitioning from a low-concentration phase. In Case A-3, PM-2.5 concentrations remained above 50 µg/m3 for 93 consecutive hours, caused by the combined effects of residual pollution from the previous day and ongoing domestic emissions.
Conversely, Cases B-1 and B-2 displayed consistently clean air conditions due to efficient atmospheric circulation and precipitation. In Case B-3, clean conditions were sustained as a result of vigorous atmospheric dispersion.

2.3. Missing Data Imputation Models

In this study, various statistical methods, machine learning models, and artificial intelligence models were applied to effectively impute missing values in PM-2.5 data for the Seoul region. The methods used include FFILL, KNN, MICE, SARIMAX, DNN, and LSTM. The imputation process was conducted using input data from 2019 to 2023 to address missing values. Among the six methods presented in this study, all models except FFILL utilized the following input features: air quality measurements (PM-2.5, PM-10, NO2, CO, SO2, and O3), meteorological variables (Pa, Ta, Td, RH, and WS), and 72 h forecast data for PM-2.5 from the CMAQ model.

2.3.1. FFILL (Forward Fill)

FFILL is a simple method for imputing missing values in time-series data by replacing a missing value with the most recent observed value before it. This method assumes data continuity, making it easy to implement and computationally efficient. However, it does not account for trends or seasonality in the data, making it less effective for time-series with significant fluctuations [13].
The imputation is performed by replacing the missing PM-2.5 concentration yt at time t with the observed value yt−1 from the previous time step.
$y_t = y_{t-1}$
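In pandas, forward filling is a one-line operation; the toy series below shows a 3 h gap being filled with the last observed value.

import numpy as np
import pandas as pd

# Minimal illustration: an hourly PM-2.5 series with a 3 h gap
idx = pd.date_range("2023-11-22 00:00", periods=6, freq="h")
pm25 = pd.Series([12.0, 14.0, np.nan, np.nan, np.nan, 18.0], index=idx)

pm25_ffill = pm25.ffill()  # every missing y_t is replaced by the previous observed value
print(pm25_ffill)          # the three missing hours are all filled with 14.0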

2.3.2. KNN (K-Nearest Neighbors)

The KNN imputation method replaces missing values by identifying K neighboring samples with similar characteristics and using their values for imputation, as illustrated in Figure 5. This method can capture data patterns and distributions while considering nonlinear relationships [16].
For application, the distance between the sample x containing the missing value and other samples xi is calculated. Typically, the Euclidean distance is used, which is defined as follows:
$d(x, x_i) = \sqrt{\sum_{j=1}^{n}\left(x_j - x_{i,j}\right)^2}$
Here, n represents the number of features and xj is the j-th feature value of sample x. The K nearest neighbors are selected based on the calculated distances, and the missing PM-2.5 value y is imputed by averaging the PM-2.5 values yi of the selected neighbors.
In this study, K was set to 11. Additionally, as the number of features n increases, Euclidean distances tend to become more similar, which affects the effectiveness of neighbor selection. To address this issue, out of 83 available features, the 10 most correlated features were selected for the KNN imputation process.
$y = \frac{1}{k}\sum_{i=1}^{k} y_i$
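A sketch of this procedure using scikit-learn's KNNImputer is shown below. The target column name PM25 is an assumption, while the feature subset is chosen by absolute correlation with the target and K is set to 11, following the description above.

import pandas as pd
from sklearn.impute import KNNImputer

def knn_impute_pm25(df: pd.DataFrame, target: str = "PM25",
                    n_features: int = 10, k: int = 11) -> pd.Series:
    # Select the 10 features most correlated with PM-2.5, then impute with K = 11
    corr = df.corr(numeric_only=True)[target].abs().drop(target)
    features = corr.nlargest(n_features).index.tolist()
    cols = features + [target]
    imputer = KNNImputer(n_neighbors=k, weights="uniform")  # mean of the k nearest neighbours
    filled = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols, index=df.index)
    return filled[target]

# pm25_knn = knn_impute_pm25(df)  # df: hourly feature matrix containing the artificial gaps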

2.3.3. MICE (Multiple Imputation by Chained Equations)

MICE is an iterative imputation method that replaces missing values in a variable by leveraging regression relationships with other variables, as illustrated in Figure 6. By utilizing correlations among multiple variables, MICE improves data completeness and maintains consistency within the dataset [17].
The application process begins by assigning initial imputation values to each variable, such as the median or mean. A regression model is then built, where the variable Y containing missing values is treated as the dependent variable, and other complete variables X1, X2, …, Xₚ are used as independent variables. The regression equation is as follows:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$
The regression model is used to predict and replace missing values iteratively. This process is performed sequentially for all variables containing missing values. To enhance the stability of the imputed values, the entire process is repeated multiple times. In this study, the imputation process was repeated 10 times to ensure stability and consistency.
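A compact sketch using scikit-learn's IterativeImputer, a MICE-style chained-equation imputer that returns one completed dataset per run, is given below; the estimator choice and column names are assumptions, while the 10 iterations follow the text above.

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

def mice_impute(df: pd.DataFrame, n_iterations: int = 10) -> pd.DataFrame:
    # Each incomplete column is regressed on the others and refilled,
    # and the whole cycle is repeated n_iterations times (10 in this study)
    imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=n_iterations,
                               initial_strategy="median", random_state=0)
    return pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

# df_complete = mice_impute(df)      # df: numeric feature matrix with gaps
# pm25_mice = df_complete["PM25"]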

2.3.4. SARIMAX (Seasonal Autoregressive Integrated Moving Average with Exogenous Variables)

SARIMAX is an integrated model that considers the autoregressive properties, seasonality, and exogenous variables of time-series data. Since PM-2.5 concentrations exhibit temporal dependencies and seasonal patterns while being influenced by external factors such as meteorological conditions, the SARIMAX model is employed to predict missing values [18].
The SARIMAX model is expressed as follows:
$\Phi_P(L^s)\,\varphi_p(L)\,(1 - L)^d\,(1 - L^s)^D\, y_t = \Theta_Q(L^s)\,\theta_q(L)\,\varepsilon_t + \gamma X_t$
Here, L represents the lag operator, while φp(L) and ΦP(L^s) denote the non-seasonal and seasonal autoregressive polynomials, respectively. Similarly, θq(L) and ΘQ(L^s) represent the non-seasonal and seasonal moving average polynomials, respectively. The parameters d and D indicate the non-seasonal and seasonal differencing orders, while s is the seasonal period (e.g., s = 12 for monthly data).
The dependent variable yt represents PM-2.5 concentration, while Xt represents exogenous variables such as meteorological factors, with γ being the coefficient vector of the exogenous variables. εt represents the error term.
The SARIMAX model used in this study follows the ARIMA framework, originally proposed by the Box–Jenkins methodology. SARIMAX improves overall accuracy by minimizing error values when input and output datasets are correlated and is particularly effective for time-series data with periodic patterns.
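A sketch of gap filling with statsmodels' SARIMAX is shown below. The (p, d, q)(P, D, Q, s) orders are placeholders rather than the values tuned in this study, and an hourly series with a regular DatetimeIndex is assumed.

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def sarimax_fill(pm25: pd.Series, exog: pd.DataFrame, gap: pd.DatetimeIndex,
                 order=(1, 1, 1), seasonal_order=(1, 0, 1, 24)) -> pd.Series:
    # Fit on the hours before the gap, then predict the missing hours using exogenous inputs
    train_end = gap[0] - pd.Timedelta(hours=1)
    train = pm25.loc[:train_end]
    model = SARIMAX(train, exog=exog.loc[train.index],
                    order=order, seasonal_order=seasonal_order,
                    enforce_stationarity=False, enforce_invertibility=False)
    result = model.fit(disp=False)
    prediction = result.predict(start=gap[0], end=gap[-1], exog=exog.loc[gap])
    filled = pm25.copy()
    filled.loc[gap] = prediction.values
    return filled

# gap = pd.date_range("2023-11-22 00:00", periods=24, freq="h")
# pm25_sarimax = sarimax_fill(pm25_gapped, meteo_exog, gap)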

2.3.5. DNN (Deep Neural Network)

DNN is an artificial neural network with multiple hidden layers, making it effective in learning complex nonlinear relationships. It utilizes various input variables to predict missing values and demonstrates strong performance even with large datasets [19].
The model architecture, illustrated in Figure 7, consists of an input layer, three hidden layers, and an output layer. For implementation, the Rectified Linear Unit (ReLU) function was used as the activation function in the hidden layers. The model was trained using the mean squared error (MSE) as the loss function, which is defined as follows:
$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$
Here, ŷi represents the predicted value from the model, while yi is the actual value. The Adam algorithm was used as the optimizer for training. The input variables were the same as the common set described in Section 2.3.
To prevent overfitting, early stopping was applied during model training.
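A minimal Keras sketch of such a network is given below. The hidden-layer sizes, epoch count, and early-stopping patience are illustrative assumptions, while the ReLU activations, MSE loss, Adam optimizer, and early stopping follow the description above.

from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_features: int) -> keras.Model:
    # Input layer, three ReLU hidden layers, and a single linear output (PM-2.5)
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# X_scaled, y: min-max-scaled feature matrix and PM-2.5 target for complete hours
# model = build_dnn(X_scaled.shape[1])
# early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model.fit(X_scaled, y, validation_split=0.1, epochs=200, callbacks=[early_stop])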

2.3.6. LSTM (Long Short-Term Memory)

LSTM is a type of recurrent neural network (RNN) that effectively learns long-term dependencies in time-series data. By utilizing a gated structure, it adjusts the importance of previous information and mitigates the vanishing gradient problem [20].
The basic structure of the model consists of an input gate (Equation (8)), forget gate (Equation (9)), cell state update and cell state (Equations (10) and (11)), and output gate and output (Equations (12) and (13)), as illustrated in Figure 8. The equations are as follows:
$i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right)$
$f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right)$
$\hat{c}_t = \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right)$
$C_t = f_t \odot C_{t-1} + i_t \odot \hat{c}_t$
$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right)$
$h_t = o_t \odot \tanh\left(C_t\right)$
Here, xt represents the current input vector, ht−1 is the hidden state from the previous time step, and Ct−1 is the previous cell state. W and U denote the weight matrices, while b is the bias vector. σ represents the sigmoid activation function, and ⊙ denotes element-wise multiplication.
For implementation, the input data were structured as a sequence using the previous 72 h of data to predict the next PM-2.5 concentration. The model was built with three stacked LSTM layers, using MSE (mean squared error) as the loss function and Adam as the optimizer.
For artificial intelligence models, all input variables were normalized using min–max scaling as part of the data preprocessing step. Additionally, similar to other models, data from 2019 to 2023 were used as the training data. In this process, the validation set ratio was set to 0.1.
To ensure proper training, the 72 h period preceding each missing value was excluded from the training process. Instead, these 72 h segments were used as test data for model evaluation.
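A sketch of the sequence construction and the stacked network is given below. The unit counts and epoch count are assumptions, while the 72 h window, three LSTM layers, MSE loss, Adam optimizer, and 0.1 validation split follow the description above.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def make_sequences(features: np.ndarray, target: np.ndarray, window: int = 72):
    # Build (previous 72 h of features) -> (next-hour PM-2.5) training pairs
    X, y = [], []
    for t in range(window, len(target)):
        X.append(features[t - window:t])
        y.append(target[t])
    return np.asarray(X), np.asarray(y)

def build_lstm(window: int, n_features: int) -> keras.Model:
    # Three stacked LSTM layers followed by a single linear output
    model = keras.Sequential([
        layers.Input(shape=(window, n_features)),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(32),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# X_seq, y_seq = make_sequences(scaled_features, scaled_pm25, window=72)
# model = build_lstm(72, scaled_features.shape[1])
# model.fit(X_seq, y_seq, validation_split=0.1, epochs=100)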

2.4. Model Evaluation Metrics

The performance of the models was evaluated using the root mean square error (RMSE), mean absolute error (MAE), correlation coefficient (r), and coefficient of determination (R2). The predictive performance of each model was compared using these evaluation metrics, and the results were analyzed separately for high-concentration and low-concentration periods to assess model efficiency under different conditions.
The formulas for the evaluation metrics used in this study are as follows:
$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$
$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$
$r = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}}$
$R^2 = \left[\frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}}\right]^2$
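The four metrics can be computed directly with NumPy, as in the short helper below; here R2 is taken as the square of the Pearson correlation, matching the equation above.

import numpy as np

def evaluation_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    # RMSE and MAE are in the units of the data (μg/m3); r and R2 are dimensionless
    error = y_pred - y_true
    rmse = float(np.sqrt(np.mean(error ** 2)))
    mae = float(np.mean(np.abs(error)))
    r = float(np.corrcoef(y_true, y_pred)[0, 1])
    return {"RMSE": rmse, "MAE": mae, "r": r, "R2": r ** 2}

# evaluation_metrics(observed_pm25, imputed_pm25)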

3. Results

3.1. Statistical Evaluation

3.1.1. Statistical Analysis: Six-Hour Missing Data Imputation

Analysis of the 6 h missing data imputation results in Table 2 showed that FFILL achieved the best performance, with an RMSE of 4.76, an MAE of 3.20, and an r of 0.97, confirming that interpolation methods using previous time-step data are effective for short time intervals.
The machine learning model KNN performed second best, with an RMSE of 5.65, MAE of 3.70, and r of 0.98, followed by the statistical model SARIMAX, which showed an RMSE of 6.09, MAE of 5.05, and r of 0.96.
However, the deep learning models DNN and LSTM demonstrated relatively lower performance, with an RMSE of 11.22 and 8.78, an MAE of 9.01 and 7.01, and an r of 0.94 and 0.95, respectively.
For short-term missing data imputation, simple interpolation and statistical or machine-learning-based approaches proved to be more effective than complex deep learning models. Although FFILL and KNN are simple methods, they effectively maintain temporal continuity. In contrast, the deep learning models DNN and LSTM exhibited the worst performance among all evaluated models. This limitation is likely due to insufficient hyperparameter tuning and the inability of these models to accurately capture the dynamics under extreme variations in concentration. To overcome these limitations, alternative architectures, such as Transformer-based or bidirectional models, could be implemented in place of the DNN and LSTM models employed in this study.

3.1.2. Statistical Analysis: Twelve-Hour Missing Data Imputation

For the 12 h missing data imputation in Table 3, the performance of FFILL dropped significantly, with the RMSE increasing from 4.76 to 19.79, the MAE increasing from 3.20 to 9.45, and r decreasing from 0.97 to 0.66.
In contrast, KNN and MICE exhibited the best performance, with RMSEs of 9.14 and 8.38, MAEs of 5.70 and 6.01, and r values of 0.97 and 0.97, respectively. SARIMAX also maintained relatively good performance, achieving an RMSE of 10.24, an MAE of 7.31, and an r of 0.95.
However, DNN and LSTM continued to show relatively lower performance, with an RMSE of 17.03 and 16.77, an MAE of 10.67 and 11.97, and an r of 0.76 and 0.78, respectively. These results indicate that deep learning models still have limited predictive capability for the missing periods.

3.1.3. Statistical Analysis: Twenty-Four-Hour Missing Data Imputation

For the 24 h missing data imputation in Table 4, the performance of FFILL continued to decline, with RMSE increasing from 19.79 to 21.58 and MAE increasing from 9.45 to 13.12, while r slightly improved from 0.66 to 0.72.
SARIMAX achieved the best performance, with an RMSE of 9.37, an MAE of 6.85, and an r of 0.97. KNN and MICE also demonstrated stable results, with an RMSE of 9.71 and 9.63, an MAE of 6.24 and 6.99, and an r of 0.96 and 0.97, respectively.
However, DNN and LSTM maintained relatively lower performance even for longer missing periods, with an RMSE of 15.92 and 15.22, an MAE of 10.64 and 11.48, and an r of 0.84 and 0.86, respectively.
In summary, the analysis of the 12 h and 24 h missing intervals indicates that deep learning models were less effective in imputing missing data compared to statistical and machine learning-based methods. The factors that contributed to the limited performance of deep learning models in short-term imputation, such as inadequate hyperparameter tuning and instability under extreme concentration conditions, also appeared in longer imputation intervals, resulting in similarly limited effectiveness.
KNN, MICE, and SARIMAX were among the best-performing models for both the 12 h and 24 h missing intervals. However, the SARIMAX model requires a parameter optimization process (for the orders p, d, q, P, D, Q, and s) each time it is executed, which takes approximately 30 min, making it less suitable for real-time operations. In contrast, KNN and MICE can produce results within one minute, suggesting that they are more appropriate choices for real-time implementation.

3.2. Time-Series Evaluation

3.2.1. Time Series Analysis: Six-Hour Missing Data Imputation

For the 6 h missing data imputation in Figure 9a, FFILL exhibited trends similar to the observed values (OBS), confirming its effectiveness in short-term missing value restoration. Due to its low computational complexity and minimal deviation from observed values, FFILL can be useful for real-time forecasting or short-term missing data recovery.
However, compared to SARIMAX, KNN, and MICE, the simplified processing of FFILL fails to capture sudden fluctuations at specific moments (A-1 and A-2). SARIMAX, by learning time-series characteristics, provides more stable results than FFILL, while KNN and MICE also demonstrate relatively low errors compared to the observed values.
On the other hand, DNN and LSTM show larger deviations from the observed values, despite the short missing interval. This suggests that deep learning models have limitations in capturing fine patterns in short intervals or may not effectively reflect the complexity of the data.

3.2.2. Time Series Analysis: Twelve-Hour Missing Data Imputation

For the 12 h missing data imputation in Figure 9b, FFILL shows a noticeable decline in performance. This result demonstrates the limitations of simply copying previous values as the time interval increases.
KNN and MICE maintain relatively good agreement with the measurements at this interval, exhibiting a high correlation with the observed values. SARIMAX, which learns seasonal and temporal patterns, provides the most reliable results for this interval, showing the highest correlation with the observed values.
On the other hand, DNN and LSTM perform better than FFILL but still show larger deviations from the observed values compared to KNN, MICE, and SARIMAX.

3.2.3. Time Series Analysis: Twenty-Four-Hour Missing Data Imputation

For the 24 h missing data imputation in Figure 9c, FFILL’s performance drops significantly, reflecting its limitations for long-term forecasting. It shows the largest deviation from observed values, confirming that FFILL is unsuitable for longer missing periods.
KNN and MICE continue to produce reliable results, but SARIMAX performs the best at this interval. By effectively capturing repetitive and seasonal time-series patterns, SARIMAX minimizes deviations from the observed values and demonstrates strong performance in long-term forecasting.
DNN and LSTM continue to show relatively poor performance at this time interval. Although many studies have demonstrated that DNN- and LSTM-based models perform well in predicting PM-2.5 forecasts [21], their inferior performance in this study is likely due to the limited number of cases evaluated, particularly those involving extreme high and low concentrations. Since deep learning models require a large and diverse dataset to generalize effectively, the restricted number of such cases in this study hindered their ability to learn complex patterns. It is expected that as the number of extreme concentration cases increases, the performance of AI models will improve accordingly.

4. Discussion

The results indicate that this study offers an effective methodology for estimating missing environmental data, which could be particularly valuable in regions with limited or incomplete atmospheric monitoring records.
It was confirmed that detecting and filtering outliers prior to handling missing values is essential for enhancing the reliability of the original dataset. In this study, outlier detection was performed using the interquartile range (IQR) method, followed by a correlation analysis involving key air pollutants such as PM10, NO2, CO, SO2, and O3. This step allowed for the verification of whether the statistically identified outliers corresponded to actual environmental anomalies. The analysis showed a strong correlation between PM-2.5 and other air pollutants within the outlier range. This indicates that during periods of high PM-2.5 concentrations, precursor pollutants also exhibited high concentrations. Therefore, the statistically identified outlier values are not considered anomalies, but rather reflect actual environmental conditions.
The study also revealed that the effectiveness of imputation models varied significantly with the duration of the missing data intervals. For short durations (e.g., 6 h intervals), the forward fill (FFILL) method—which simply propagates the most recent observation—outperformed others in terms of the RMSE, MAE, and correlation coefficient (r). This suggests that for short durations, simple temporal extrapolation can adequately reflect PM-2.5 concentration trends. However, FFILL performance declined markedly for intervals exceeding 12 h, as it failed to capture evolving atmospheric patterns.
For mid- to long-term missing durations (12 h or longer), the KNN, MICE, and SARIMAX models demonstrated relatively superior performance. The KNN method effectively captured data patterns by referencing neighboring values in a multidimensional feature space, offering greater accuracy than simple extension techniques. Similarly, MICE showed reliable results by iteratively applying multivariate regression equations that accounted for inter-variable correlations. Notably, SARIMAX achieved the lowest RMSE and among the highest correlation coefficients for the 24 h missing intervals, owing to its ability to incorporate both seasonality and exogenous variables. However, this model required considerable computational time during training, limiting its practical applicability in near-real-time cases.
In contrast, artificial-intelligence-based models such as DNN and LSTM, despite their advantages in learning complex, multidimensional, and nonlinear patterns, exhibited relatively lower prediction accuracy under the evaluation settings of this study, which focused on high- and low-concentration periods. This outcome is likely due to the absence of thorough hyperparameter tuning during model development. It is expected that with more refined and comprehensive hyperparameter optimization, the performance of these models could be significantly improved in future applications.

5. Limitations and Future Work

Future research should focus on enhancing the performance of deep learning models through more sophisticated hyperparameter optimization. The adoption of advanced architectures, such as bidirectional LSTM and transformer-based models, is also expected to improve predictive accuracy.
In addition, further studies are needed to explore the applicability of the models in this study to near-real-time situations. While this study primarily focused on imputing missing values in historical data, the current model structures can be adapted for near-real-time cases. Therefore, performance validation and optimization for near-real-time implementation are essential.
Moreover, the methodology proposed in this study is expected to show similar model performance in regions with comparable temporal patterns and seasonal characteristics. However, in regions with substantially different environmental characteristics, the applicability and reliability of the models are expected to be limited. Therefore, additional experiments and validations under varied climatic and atmospheric conditions are necessary to support the development of more generalized models.

6. Conclusions

This study conducted a comprehensive evaluation of missing data imputation techniques using PM-2.5 concentration data from the Seoul area. The analysis considered a range of approaches, including statistical methods, machine learning models, and artificial intelligence-based techniques.
Outliers in the dataset were detected using the IQR method along with an analysis of the correlations among air pollutants. Based on actual high- and low-concentration cases in Seoul, randomly placed consecutive missing-data periods of 6, 12, and 24 h were generated and subsequently imputed using six models: FFILL, KNN, MICE, SARIMAX, DNN, and LSTM.
For short missing intervals (e.g., 6 h), the FFILL method demonstrated superior performance. However, as the duration of consecutive missing data increased from 6 to 24 h, the K-Nearest Neighbors (KNN), Multiple Imputation by Chained Equations (MICE), and Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX) methods yielded more reliable results. In contrast, the AI-based models underperformed relative to the other approaches; although they exhibited potential for capturing complex patterns, their performance could be significantly enhanced through comprehensive hyperparameter tuning in future studies.

Author Contributions

Conceptualization, J.-Y.L., J.-G.K., J.-B.L., H.-S.K., H.-Y.Y. and D.-R.C.; Methodology, J.-Y.L., J.-G.K., J.-B.L., H.-S.K. and D.-R.C.; Software, J.-Y.L., S.-H.H. and C.-Y.L.; Resources, C.-Y.L. and H.-Y.Y.; Writing—original draft, J.-Y.L.; Writing—review and editing, D.-R.C.; Supervision, D.-R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Particulate Matter Management Specialized Graduate Program through the Korea Environmental Industry & Technology Institute (KEITI) funded by the Ministry of Environment (MOE).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to institutional and privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Brook, R.D.; Rajagopalan, S.; Pope, C.A., III; Brook, J.R.; Bhatnagar, A.; Diez-Roux, A.V.; Holguin, F.; Hong, Y.; Luepker, R.V.; Mittleman, M.A.; et al. Particulate matter air pollution and cardiovascular disease: An update to the scientific statement from the American Heart Association. Circulation 2010, 121, 2331–2378. [Google Scholar] [CrossRef] [PubMed]
  2. World Health Organization. Review of Evidence on Health Aspects of Air Pollution–REVIHAAP Project; World Health Organization Regional Office for Europe: Copenhagen, Denmark, 2013. [Google Scholar]
  3. Schafer, J.L.; Graham, J.W. Missing data: Our view of the state of the art. Psychol. Methods 2002, 7, 147–177. [Google Scholar] [CrossRef] [PubMed]
  4. Graham, J.W. Missing data analysis: Making it work in the real world. Annu. Rev. Psychol. 2009, 60, 549–576. [Google Scholar] [CrossRef] [PubMed]
  5. Donders, A.R.T.; van der Heijden, G.J.; Stijnen, T.; Moons, K.G. Review: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 2006, 59, 1087–1091. [Google Scholar] [CrossRef]
  6. Troyanskaya, O.G.; Cantor, M.; Sherlock, G.; Brown, P.O.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17, 520–525. [Google Scholar] [CrossRef] [PubMed]
  7. Choi, H. Imputation Method Based on a Voting Manner for Missing Data; Department of Industrial Engineering, College of Engineering: Seoul, Republic of Korea, 2019. [Google Scholar]
  8. Ghahramani, Z.; Jordan, M.I. Supervised Learning From Incomplete Data Via an EM Approach. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1994; pp. 120–127. [Google Scholar]
  9. Che, Z.; Purushotham, S.; Cho, K.; Sontag, D.; Liu, Y. Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 2018, 8, 6085. [Google Scholar] [CrossRef]
  10. Han, D.-C.; Lee, D.W.; Jung, D.-Y. A Study on the Traffic Volume Correction and Prediction Using SARIMA Algorithm. J. Korea Inst. Intell. Transp. Syst. 2021, 20, 1–13. [Google Scholar] [CrossRef]
  11. Lee, D.; Choi, J.-Y.; Myoung, J.; Kim, O.; Park, J.; Shin, H.-J.; Ban, S.-J.; Park, H.-J.; Nam, K.-P. Analysis of a severe PM2.5 episode in the Seoul Metropolitan area in South Korea from 27 February to 7 March 2019: Focused on estimation of domestic and foreign contribution. Atmosphere 2019, 10, 756. [Google Scholar] [CrossRef]
  12. Park, S.-H.; Ko, D.-W. Investigating the effects of the built environment on PM2.5 and PM10: A case study of Seoul Metropolitan city, South Korea. Sustainability 2018, 10, 4552. [Google Scholar] [CrossRef]
  13. Odoi, B.; Pels, W.A.; Gyamfi, E.H. Efficiency of imputation techniques in univariate time series. Int. J. Sci. Environ. Technol. 2019, 8, 430–453. [Google Scholar]
  14. Magar, V.; Ruikar, D.; Bhoite, S.; Mente, R. Innovative inter quartile range-based outlier detection and removal technique for teaching staff performance feedback analysis. J. Eng. Educ. Transform. 2024, 37, 176–184. [Google Scholar] [CrossRef]
  15. Kim, J.-H.; Choi, D.-R.; Koo, Y.-S.; Lee, J.-B.; Park, H.-J. Analysis of Domestic and Foreign Contributions using DDM in CMAQ during Particulate Matter Episode Period of February 2014 in Seoul. J. Korean Soc. Atmos. Environ. 2016, 32, 82–99. [Google Scholar] [CrossRef]
  16. Ahn, H.; Sun, K.; Kim, K.P. Comparison of missing data imputation methods in time series forecasting. Comput. Mater. Contin. 2022, 70, 767–779. [Google Scholar] [CrossRef]
  17. Fang, C.; Wang, C. Time series data imputation: A survey on deep learning approaches. arXiv 2020, arXiv:2011.11347. [Google Scholar]
  18. Afrifa-Yamoah, E.; Mueller, U.A.; Taylor, S.M.; Fisher, A.J. Missing data imputation of high-resolution temporal climate time series data. Meteorol. Appl. 2020, 27, e1873. [Google Scholar] [CrossRef]
  19. Phan, T.T.H. Machine learning for univariate time series imputation. In Proceedings of the 2020 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), Hanoi, Vietnam, 8–9 October 2020. [Google Scholar]
  20. Wang, J.; Du, W.; Yang, Y.; Qian, L.; Cao, W.; Zhang, K.; Wang, W.; Liang, Y.; Wen, Q. Deep learning for multivariate time series imputation: A survey. arXiv 2024, arXiv:2402.04059. [Google Scholar]
  21. Koo, Y.-S.; Choi, Y.; Ho, C. Air Quality Forecasting Using Big Data and Machine Learning Algorithms. Asia-Pac. J. Atmos. Sci. 2023, 59, 529–530. [Google Scholar] [CrossRef]
Figure 1. Data sources for raw data and variables used for missing value imputation of PM-2.5, as shown in the red box.
Figure 2. Locations of AQMS and ASOS stations in Seoul.
Figure 3. Seasonal time series trends of temperature and PM-2.5 concentrations.
Figure 4. PM-2.5 outlier detection and filtering process in Seoul.
Figure 5. K-Nearest Neighbors (KNN) method: the red circle corresponds to K = 3 and the yellow circle to K = 5; green points represent known data samples, and the yellow star indicates the target point to be imputed.
Figure 6. Structure of MICE (Multiple Imputation by Chained Equations) process.
Figure 7. Architecture of the deep neural network (DNN) with multiple hidden layers.
Figure 8. Internal structure of an LSTM cell: forget, input, and output gates.
Figure 9. Time series comparison of missing data imputation methods for 6 h, 12 h, and 24 h intervals: (a) 6 h interval comparison, where the black dashed line represents observed data (OBS), the blue line represents FFILL, the green line represents KNN, the orange line represents MICE, the red line represents DNN, the purple line represents LSTM, and the brown line represents SARIMAX; (b) 12 h interval comparison, with the same color and model associations; (c) 24 h interval comparison, also consistent with the prior panels.
Table 1. Cases of missing data imputation for evaluation.

CASE   DATE                Remarks (source: https://www.airkorea.or.kr)
A-1    22 November 2023    Transboundary inflow
A-2    20 February 2019    Stagnation after transboundary inflow
A-3    11 February 2021    Stagnation
B-1    29 August 2022      Rain
B-2    14 January 2023     Snow
B-3    17 January 2022     Atmospheric circulation
Table 2. Performance metrics for 6 h missing data imputation methods.

Method    RMSE (μg/m3)   MAE (μg/m3)   r      R2
FFILL     4.76           3.20          0.97   0.94
KNN       5.65           3.70          0.98   0.96
MICE      6.97           4.97          0.93   0.86
DNN       11.22          9.01          0.94   0.89
LSTM      8.78           7.01          0.95   0.89
SARIMAX   6.09           5.05          0.96   0.93
Table 3. Performance metrics for 12 h missing data imputation methods.

Method    RMSE (μg/m3)   MAE (μg/m3)   r      R2
FFILL     19.79          9.45          0.66   0.43
KNN       9.14           5.70          0.97   0.93
MICE      8.38           6.01          0.97   0.93
DNN       17.03          10.67         0.76   0.58
LSTM      16.77          11.97         0.78   0.60
SARIMAX   10.24          7.31          0.95   0.91
Table 4. Performance metrics for 24 h missing data imputation methods.

Method    RMSE (μg/m3)   MAE (μg/m3)   r      R2
FFILL     21.58          13.12         0.72   0.51
KNN       9.71           6.24          0.96   0.92
MICE      9.63           6.99          0.97   0.94
DNN       15.92          10.64         0.84   0.70
LSTM      15.22          11.48         0.86   0.75
SARIMAX   9.37           6.85          0.97   0.94
