2. Materials and Proposed Methodology
Figure 1 provides an overview of the research workflow, encompassing the entire process from data collection to results generation. The study begins with the collection of diverse datasets, including PM2.5 data, satellite-based data, ground-based data, as well as temporal and spatial parameters. These data undergo a comprehensive preparation phase, which includes noise reduction using the Savitzky–Golay filter, missing data recovery through spline interpolation, and spatial interpolation using the inverse distance weighting (IDW) method. Finally, the data are aggregated for input into predictive models. The predictive modeling phase incorporates traditional ML algorithms (e.g., SVM and RF), DL models (e.g., LSTM, RNN, and DNN), and the proposed OA-LSTM model, which integrates evolutionary optimization to enhance predictive accuracy. The results are evaluated using validation metrics such as RMSE and R², and outputs include PM2.5 distribution maps, the identification of the best-performing model, and a sensitivity analysis. This workflow outlines the structure of the research, and detailed explanations of each data type, preprocessing step, and predictive model are provided in the following sections for clarity and depth.
2.1. Study Area
Tehran, the capital of Iran, is located in the northern part of the country, at the southern slopes of the Alborz Mountains. Its geographical position gives it a strategic advantage, serving as a central hub for economic, political, and cultural activities. The city’s strategic importance extends to its role as the administrative and economic center of Iran, hosting numerous government institutions, businesses, and industries.
Figure 2 illustrates the study area, including Tehran’s location in Iran, its province boundaries, and a detailed map of Tehran city, showcasing pollution and meteorological stations along with elevation data. Tehran is the most populous city in Iran, with a population exceeding 8.5 million within the city and more than 15 million in its metropolitan area. Rapid urbanization over the past few decades has transformed Tehran into a sprawling metropolis. The city’s development has been characterized by a significant expansion of infrastructure, housing projects, and commercial centers. However, this rapid growth has also led to challenges, including inadequate urban planning, congestion, and strain on public services. Tehran’s dense population and urban sprawl have amplified environmental and social issues, particularly air pollution and traffic congestion.
Tehran’s climate is classified as semi-arid, with hot summers and cold winters. The city experiences significant variations in temperature due to its location between the Alborz Mountains and the central desert. Summers are typically dry, with temperatures reaching over 35 °C, while winters are cold, with occasional snowfall. Tehran’s air quality is heavily influenced by its geographical features, as the surrounding mountains trap pollutants, particularly during temperature inversions in winter. The lack of sufficient rainfall and high levels of dust exacerbate the city’s air pollution problems. Air pollution is one of the most critical environmental challenges faced by Tehran. The city consistently ranks among the most polluted urban areas in the world, with PM2.5 and PM10 levels often exceeding international safety thresholds. The geographical and climatic conditions, such as the city’s location in a basin surrounded by mountains, worsen the accumulation of pollutants. Air pollution in Tehran has serious health implications, contributing to respiratory diseases, cardiovascular problems, and premature deaths.
Industrial activities and traffic congestion are the primary sources of air pollution in Tehran. The city hosts numerous factories and industrial units, many of which rely on outdated technologies and emit significant amounts of pollutants. Additionally, Tehran’s traffic is notorious for its intensity, with millions of vehicles on the road daily. A significant proportion of these vehicles are older models with poor emission standards, further exacerbating air quality issues. The combination of industrial emissions and vehicular pollution contributes to the high levels of PM2.5 and other harmful pollutants in the city’s atmosphere. Given Tehran’s severe air pollution challenges, advanced methodologies are required to monitor and predict pollution levels effectively. The integration of RS and spatial–temporal data, combined with novel DL approaches, offers a promising solution. Such methods can provide accurate predictions of PM2.5 concentrations, helping policymakers and urban planners implement targeted measures to reduce pollution and improve public health. Tehran’s unique geographical, climatic, and urban characteristics make it an ideal case study for developing and testing innovative approaches to air quality management.
2.2. Dataset and Data Preparation
The selection of variables in this study was carefully guided by a comprehensive review of previous research and the practical availability of data in Tehran. The parameters chosen reflect the multifaceted influences on PM2.5 concentration, encompassing pollutant data, RS-derived variables, meteorological parameters, and spatial features. These parameters are integral to understanding the dynamics of PM2.5 pollution and were selected based on their demonstrated significance in prior studies and their applicability to Tehran’s unique environmental and urban conditions. Meteorological variables, such as temperature, wind speed, wind direction, atmospheric pressure, precipitation, water vapor, and humidity, significantly influence PM2.5 dynamics by shaping atmospheric conditions that determine pollutant dispersion, transport, and formation. Temperature affects chemical reaction rates and atmospheric mixing. Wind speed and direction control the horizontal and vertical movement of pollutants, influencing their spatial distribution. Atmospheric pressure plays a role in stabilizing or destabilizing atmospheric layers, which impacts the behavior of particulate matter. Precipitation removes particulate matter from the atmosphere, while water vapor and humidity influence the formation and growth of particles through condensation and hygroscopic processes.
Spatial features, such as elevation and NDVI, are critical due to their influence on pollutant distribution and retention. Elevation impacts airflow patterns and the occurrence of temperature inversions, which can trap pollutants near the ground. NDVI indicates vegetation cover, providing insights into areas where green spaces may influence the interaction of PM2.5 with the surrounding environment. RS-derived parameters, including AOD and LST, capture aerosol distributions and thermal conditions, providing broader spatial coverage and complementing localized ground data. AOD serves as a proxy for PM2.5 by monitoring aerosol levels, while LST adds insights into surface temperature variations that affect pollutant transport and chemical transformations. Temporal factors, such as the day of the week and seasonal variations, influence PM2.5 through changing urban activity patterns and climatic conditions. Traffic volumes, a major PM2.5 source, vary with daily and weekly schedules, while seasonal phenomena like dust storms and temperature inversions drive fluctuations in pollution levels. These variables are essential for capturing the dynamic temporal trends of air quality in urban environments.
In this paper, ground-level air quality data for PM2.5 concentration in Tehran were obtained from the Tehran Municipality Air Quality Control Company. Data collected daily from 19 selected monitoring stations over a three-year period (2014–2016) provide localized insights into pollution levels. This selection was based on the availability of consistent and reliable data within the study’s temporal scope. Some stations were excluded due to insufficient data: three stations were established after 2016 and lacked samples for the target period, while one station had only a single recorded PM2.5 sample during 2014–2016. Additionally, certain stations did not provide daily measurements, further limiting their utility. By focusing on these 19 stations, we ensured the integrity and consistency of the dataset, enabling a robust spatial and temporal analysis. The number of monitoring stations plays a crucial role in air quality studies, directly impacting spatial and temporal resolution. A higher density of stations improves the accuracy of pollution distribution mapping, capturing localized sources such as traffic and industrial emissions, especially in cities with significant spatial variability like Tehran. Temporally, more stations enhance data reliability by reducing the impact of gaps or inconsistencies from individual stations. However, logistical, financial, and technical constraints often limit the number of stations, with placement influenced by accessibility, population density, and governmental priorities, leading to uneven coverage. To address these limitations, advanced modeling techniques (such as DLs) and supplementary data sources, such as satellite imagery, are essential for ensuring comprehensive and reliable air quality predictions.
In addition, meteorological data, including maximum temperature (Max Temp), minimum temperature (Min Temp), wind speed (WS), wind direction (WD), atmospheric pressure (P), and humidity (H), were sourced from the Tehran Meteorological Research Center. These data, collected daily from five stations, play a crucial role in understanding the relationship between weather conditions and PM2.5 concentrations. Satellite data extracted through Google Earth Engine provide a powerful tool for analyzing RS data. The satellite parameters used in this study include AOD, elevation (Ele), NDVI, LST, total precipitation (PE), and water vapor (WV). Each of these parameters was extracted through custom coding in Google Earth Engine, enabling a robust spatial–temporal analysis of PM2.5 levels. These satellite-derived metrics complement ground-based data by offering broader spatial coverage and additional atmospheric insights.
Figure 3 presents the histograms and descriptive statistics of the variables used in the study, providing a comprehensive overview of their distribution across the dataset (S = 8106). Each histogram shows the frequency of observed values, highlighting variations in data distribution. The summary statistics (minimum, maximum, mean, and standard deviation) within each plot provide additional insights into the range and variability of these parameters. This visualization aids in understanding the data’s characteristics, essential for predictive modeling and analysis.
Table 2 provides a detailed overview of the variables included in the case study, outlining their units and spatial and temporal resolutions. The dataset integrates diverse variables, such as PM2.5 concentrations measured at stations with daily resolution, and RS-derived parameters like AOD (1 km, hourly). Vegetation indices such as NDVI are included with a 250 m spatial resolution and a 16-day temporal resolution, while meteorological variables like precipitation, water vapor, temperature, wind speed, and pressure are collected at station-level resolutions on a daily basis. This comprehensive dataset enables a robust spatial–temporal analysis of PM2.5 dynamics.
Data preparation and refinement are critical steps before inputting ground-based and satellite data into predictive models, as they directly influence the accuracy and reliability of the results. Ground-based data, such as PM2.5 concentrations and meteorological measurements, often contain missing or inconsistent values due to sensor errors or station downtime, which must be addressed through imputation or filtering techniques. Similarly, satellite data, such as AOD and LST, can suffer from cloud contamination, spatial inconsistencies, or temporal gaps, necessitating preprocessing steps like resampling, interpolation, and noise removal. Proper data refinement ensures that the input dataset is complete, consistent, and representative of real-world conditions, reducing noise and biases that could degrade model performance. Additionally, harmonizing the spatial and temporal resolutions of ground and satellite data is essential for aligning the datasets and enabling effective spatial–temporal analysis. These steps collectively enhance the model’s ability to learn patterns and make accurate predictions.
In this study, the training and testing datasets were created based on the availability of PM2.5 data. Meteorological data were aligned with the specific days for which PM2.5 concentrations were available. Since meteorological data were not consistently available for all days of the year, missing data were imputed using the spline model. However, spline functions alone cannot fit the data accurately due to the presence of noise and irregularities, which hinder proper curve fitting. To address this, a Savitzky–Golay filter was employed for noise reduction prior to data fitting. The Savitzky–Golay filter is a powerful method for data smoothing and noise reduction, introduced by Savitzky and Golay in 1964. This filter is based on fitting low-degree polynomials to small subsets of data and is particularly suitable for preserving key signal features, such as peaks and valleys. While many traditional filters tend to remove details, the Savitzky–Golay filter is designed to smooth data without destroying these details. This filter performs a piecewise polynomial fitting to the data. In each small subset, the existing data are modeled with a polynomial (typically second- or third-degree). These polynomials are adjusted to fit the existing data well, and their value at the central point of the window is used. This process is repeated for each data point to smooth the output signal.
For our analysis, the selection of the Savitzky–Golay filter parameters was guided by the need to balance noise reduction with the preservation of important signal features such as peaks and valleys. The window size and polynomial order were chosen based on the characteristics of the data and the level of noise observed during preliminary analysis. The window size defines the number of data points considered for fitting the polynomial in each segment. A larger window size results in smoother output but can oversmooth the data and remove critical details. Conversely, a smaller window size retains finer details but may leave some noise in the signal. Through trial and error, supported by visual inspections and quantitative error metrics, a window size of 7 was selected. This provided a good balance between smoothing and maintaining the fidelity of the underlying signal patterns. The polynomial order determines the complexity of the polynomial used to fit the data. Higher-order polynomials allow the filter to model more intricate variations but may also capture noise, leading to overfitting. Lower-order polynomials, on the other hand, simplify the signal and are more robust to noise but may fail to capture subtle trends. After testing various options, a polynomial order of 2 was chosen. This degree effectively captured the overall trends without overfitting the noise. The Savitzky–Golay filter was applied to the PM2.5 time-series data to eliminate noise and abrupt fluctuations, as shown in Figure 4. The filter successfully removed irregularities without altering the inherent structure of the data. Unlike meteorological data, which required the use of the spline model for gap-filling, the PM2.5 data used in this study were directly obtained from the Tehran Municipality Air Quality Control Company and did not require further interpolation.
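As a concrete illustration, this smoothing step can be reproduced with SciPy’s savgol_filter using the window size and polynomial order reported above; the series values and variable names below are illustrative, not the actual station data.

```python
# Minimal sketch: smoothing a daily PM2.5 series with the Savitzky-Golay filter
# (window size 7, polynomial order 2, as selected in the text). The values in
# `pm25_daily` are illustrative only.
import numpy as np
from scipy.signal import savgol_filter

pm25_daily = np.array([34.0, 36.5, 90.0, 38.2, 41.7, 39.9, 37.4,
                       44.1, 47.8, 46.0, 52.3, 49.5, 45.2, 43.8])

# Fit a 2nd-degree polynomial inside each 7-point window and evaluate it at the
# window centre; this suppresses spikes while preserving peaks and valleys.
pm25_smooth = savgol_filter(pm25_daily, window_length=7, polyorder=2)
```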
Spline interpolation is a powerful technique used to impute missing values in time-series data by fitting smooth and continuous curves to the available data. The method works by dividing the dataset into smaller segments and fitting piecewise polynomials of low degrees to each segment. These polynomials are adjusted to ensure continuity in both the value and derivatives across segment boundaries, preserving the natural flow and structure of the data. This makes spline interpolation particularly suitable for time-series data, where maintaining temporal trends and patterns is critical. The implementation of spline interpolation begins with identifying the valid data points in the time series, excluding the missing values. Using these valid points, a spline curve is constructed to approximate the missing values. The selection of knots (control points) is a crucial aspect of the process, as too few knots may result in oversmoothing and a loss of detail, while too many knots can lead to overfitting and an unnecessarily complex curve. The balance is achieved by iterative testing, using visual inspections and error metrics to ensure the imputed values align with the overall data trends.
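A minimal sketch of this gap-filling step is shown below, assuming a hypothetical AOD-like series with NaNs marking the missing days and using SciPy’s cubic spline; the study’s exact spline configuration may differ.

```python
# Sketch of gap-filling with a cubic spline fitted to the valid samples only.
# The series values are illustrative assumptions.
import numpy as np
from scipy.interpolate import CubicSpline

aod = np.array([0.31, 0.28, np.nan, np.nan, 0.42, 0.39, np.nan, 0.35, 0.33])
t = np.arange(aod.size)

valid = ~np.isnan(aod)
spline = CubicSpline(t[valid], aod[valid])   # piecewise cubic through valid points

aod_filled = aod.copy()
aod_filled[~valid] = spline(t[~valid])       # evaluate the spline at missing days
```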
Figure 5 illustrates the time series of AOD, comparing raw and refined datasets. Figure 5a represents the raw data, which contain missing values and noise due to inconsistencies and gaps in the original measurements. Figure 5b shows the refined data, where noise has been reduced using the Savitzky–Golay filter, and missing values have been imputed using a spline interpolation method. This preprocessing ensures a smoother and more complete dataset for analysis.
Figure 6 presents the time series of refined ground-based meteorological data. To prepare the data for algorithm implementation, it is essential to have complete meteorological parameters for all air pollution monitoring stations. Since meteorological measurements are not available for all locations, spatial interpolation is required to transfer information from meteorological stations to air pollution monitoring stations. In this paper, IDW interpolation was performed using ArcGIS 10.4.1 software to spatially align the meteorological data with the air quality monitoring stations, ensuring a consistent and structured dataset for modeling and analysis.
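Although the study performed this step in ArcGIS 10.4.1, the underlying IDW estimate can be sketched in a few lines of NumPy; the coordinates, values, and power parameter below are illustrative assumptions.

```python
# Minimal IDW sketch: transferring a meteorological value from the five weather
# stations to one air quality station. All coordinates and values are made up.
import numpy as np

def idw(xy_known, values, xy_target, power=2.0):
    """Inverse-distance-weighted estimate at xy_target."""
    d = np.linalg.norm(xy_known - xy_target, axis=1)
    if np.any(d == 0):                      # target coincides with a station
        return values[np.argmin(d)]
    w = 1.0 / d**power
    return np.sum(w * values) / np.sum(w)

stations_xy = np.array([[531.2, 3950.1], [528.7, 3947.5], [535.4, 3952.8],
                        [526.9, 3953.3], [533.0, 3945.6]])   # km, UTM-like
max_temp = np.array([28.4, 29.1, 27.6, 26.9, 29.8])          # daily max temp, degC

aq_station_xy = np.array([530.0, 3949.0])
print(idw(stations_xy, max_temp, aq_station_xy))
```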
2.3. Proposed OA
The OA, introduced by Kaveh et al. [40] in 2023, is a novel meta-heuristic optimization approach inspired by the natural growth patterns and processes found in orchards. The OA simulates the behavior of trees growing, competing for resources, and optimizing their positions in an environment to maximize their access to sunlight, water, and nutrients. This nature-inspired approach makes the OA particularly effective for complex, high-dimensional optimization problems, as it focuses on the adaptive exploration and exploitation of the search space. In the OA, the optimization process begins with an initial population of trees, each representing a candidate solution in the search space. These trees grow and adjust their positions iteratively, aiming to improve their fitness values, which measure the quality of each candidate solution. The algorithm incorporates several operators (such as growth, screening, grafting, pruning, and elitism) to mimic the natural selection and growth process within an orchard. These operators enable the OA to refine the population of trees over time, enhancing the quality of solutions as the algorithm progresses [40]. The formulation of the OA is represented through Equations (1)–(7):
where $T_j^{new}$ is the new solution generated through the growth operator; $T_j$ is the current position of the candidate solution; $\lambda$ is the growth factor; $r$ is a random variable introducing variability in the direction of growth; $F_j$ is the total objective function of the $j$-th candidate; $f_j$ is the objective function value of the $j$-th candidate; $\bar{f}_j$ is the normalized objective function value of the $j$-th candidate; $g_j$ is the growth rate of solution $j$; $\bar{g}_j$ is the normalized growth rate of solution $j$; $\alpha$ and $\beta$ are weighting factors balancing the contributions of fitness and growth; $GR_j$ is the growth rate of each solution; $n_y$ is the total number of growth years before the screening, $k$ is the number of growth years before screening for which the growth rate is considered, and $w_k$ is the weight given to those years; $T_{graft}$ is the new solution generated through grafting; $T_{strong}$ is the position of the stronger candidate; $T_{medium}$ is the position of the medium-quality candidate; $\gamma$ is a blending coefficient determining the contribution of each parent; $T_{rand}$ is the new randomly generated solution; and $[LB, UB]$ are the bounds of the search space.
Equation (1) represents the growth operator, simulating the initial growth phase of trees. Each candidate solution adjusts its position based on a small perturbation defined by the growth factor and a random direction, which allows local exploitation to identify better solutions in the vicinity. Equations (2)–(5) define the screening operator, which evaluates and ranks candidate solutions based on their fitness and growth rate. Candidates with higher values are considered stronger and are retained for further iterations, while weaker candidates are flagged for replacement or modification. Equation (6) models the grafting operator, where a new solution is generated by blending two parent candidates: a strong candidate and a medium-quality candidate. Equation (7) describes the replacement operator, where weak candidates are replaced by new random solutions within the defined bounds. This introduces fresh diversity into the population, preventing stagnation in local optima.
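For illustration, the growth, grafting, and replacement operators described above can be sketched as follows for a real-valued minimization problem; the parameter names and values (growth factor, blending coefficient, bounds, population size) are illustrative assumptions rather than the paper’s exact formulation.

```python
# Compact sketch of three OA operators (growth, grafting, replacement) on
# real-valued candidate solutions. Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
lb, ub, dim = -5.0, 5.0, 10          # search-space bounds and dimensionality

def growth(tree, growth_factor=0.1):
    """Local exploitation: small random perturbation around the current tree (Equation (1))."""
    direction = rng.uniform(-1.0, 1.0, size=tree.shape)
    return np.clip(tree + growth_factor * direction, lb, ub)

def grafting(strong, medium, blend=None):
    """Blend a strong and a medium-quality candidate (Equation (6))."""
    blend = rng.random() if blend is None else blend
    return blend * strong + (1.0 - blend) * medium

def replacement():
    """Replace a weak candidate with a fresh random solution within the bounds (Equation (7))."""
    return rng.uniform(lb, ub, size=dim)

population = [rng.uniform(lb, ub, size=dim) for _ in range(6)]
grown = [growth(t) for t in population]
child = grafting(grown[0], grown[1])
fresh = replacement()
```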
The standard OA, while effective in balancing exploration and exploitation through its nature-inspired operators, faces certain limitations that can affect its performance in complex, high-dimensional optimization problems. These challenges arise primarily from its reliance on a fixed set of operators and its inability to dynamically adapt to diverse problem landscapes, which can lead to premature convergence or inefficient exploration. One of the main weaknesses of the standard OA is its potential to stagnate in local optima. Although operators like grafting and replacement introduce diversity, they may not be sufficient to escape from local optima in highly rugged or deceptive fitness landscapes. This limitation is particularly pronounced in problems with a large number of local minima, where the algorithm’s exploration mechanisms might fail to effectively cover the entire search space. Another notable issue is the algorithm’s lack of targeted refinement for individual candidate solutions. The standard OA focuses on improving solutions as a whole, often neglecting the fine-tuning of specific components (genes) within each solution. This can lead to suboptimal performance, especially in cases where only a subset of the solution’s parameters requires adjustment. Additionally, the current operators may not fully utilize the potential of high-quality solutions, as they primarily focus on combining or replacing entire solutions rather than selectively enhancing specific attributes. These limitations highlight the need for introducing more adaptive and granular operators, such as the cutting operator, which can address the weaknesses by targeting individual genes for improvement. This not only enhances the algorithm’s ability to escape local optima but also enables a more focused exploitation of strong candidate solutions, leading to better convergence and overall performance.
In horticulture, cutting is a widely used propagation technique where a part of the parent plant is cut and cultivated independently to develop roots and grow into a new plant. This method is particularly effective for plants with high rooting potential. Inspired by this, cutting can be introduced as an operator in the OA, where a portion of a strong candidate solution is retained, and the rest is regenerated randomly. This operator allows the algorithm to leverage the strengths of high-quality solutions while introducing diversity by replacing weaker components. The cutting operator involves selecting a strong candidate solution (based on fitness) and dividing it into two parts: a portion of the strong solution is retained as a “cutting” to preserve its high-quality characteristics; the remaining portion is regenerated randomly to explore new areas in the search space. This hybridization of exploitation (using the strong solution) and exploration (introducing randomness) enhances the algorithm’s ability to refine its solutions effectively. The cutting operator can be formulated as Equation (8):
$$T_{cut}^{(d)} = \begin{cases} T_{strong}^{(d)}, & d \in D_{keep} \\ r^{(d)}, & \text{otherwise} \end{cases} \qquad (8)$$

where $T_{cut}^{(d)}$ is the $d$-th gene of the new solution after cutting; $T_{strong}^{(d)}$ is the $d$-th gene of the strong candidate solution; $D_{keep}$ is a subset of indices corresponding to the retained portion of the strong solution; and $r^{(d)}$ is a randomly generated value within the bounds of the search space.
The cutting operator works by first selecting a strong candidate solution $T_{strong}$ based on its fitness value. The solution’s genes are then divided into two parts: a portion of the genes, determined by $D_{keep}$, is retained directly from $T_{strong}$, while the remaining genes are replaced with random values to introduce diversity. Finally, the retained and randomized genes are combined to form a new candidate solution $T_{cut}$. This operator enhances exploitation by preserving high-quality features from strong solutions, while also improving exploration by introducing randomized genes, preventing premature convergence and enabling the algorithm to search new areas of the solution space effectively. By incorporating the cutting operator, the OA gains an additional mechanism to refine solutions, enabling it to converge more effectively while maintaining diversity in the search process. Algorithm 1 presents the pseudo-code of the proposed OA.
Algorithm 1 Pseudo-code of the proposed OA

Begin OA
  %% Parameter setting
  Initialize population size, number of strong/medium/weak trees, α, β, Iteration
  %% Create population
  for n = 1 to population size do
    Create orchard (population)
  end
  %% Main loop
  for i = 1 to Maximum iteration do
    %% Elitism
    Sort population
    Save elite population
    %% Growth
    for j = 1 to population size do
      Apply the growth operator (Equation (1))
    end
    %% Screening
    Save previous populations to calculate growth rate
    for j = 1 to population size do
      Compute the required values based on Equation (3)
      Compute the required values based on Equation (5)
      Compute the total objective function based on Equation (2)
      Divide the seedlings into three groups: strong, medium, and weak
    end
    %% Grafting
    for j = 1 to population size do
      Apply the grafting operator (Equation (6))
    end
    %% Pruning
    for p = 1 to population size do
      Apply the pruning operator
    end
    %% Cutting
    for j = 1 to population size do
      Apply the cutting operator (Equation (8))
    end
    Sort the population
    Show the best solution
  end
End OA
Figure 7 illustrates the application of the cutting operator within the proposed OA. In this example, a strong candidate solution with an RMSE value of 0.5 µg/m³ is selected for improvement. The retained indices are defined so that the corresponding genes are kept directly from the strong solution, while the remaining genes are replaced with random values generated within the search space bounds. The resulting new solution demonstrates an improved performance, as indicated by its lower RMSE value of 0.2 µg/m³. This example highlights how the cutting operator preserves high-quality components while introducing variability to enhance exploration and improve overall solution quality.
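A compact sketch of the cutting operator in this spirit is given below; the retained fraction and bounds are illustrative assumptions rather than the paper’s exact settings.

```python
# Sketch of the cutting operator: keep a random subset of genes from a strong
# candidate and regenerate the rest uniformly inside the bounds.
import numpy as np

rng = np.random.default_rng(1)
lb, ub = -5.0, 5.0

def cutting(strong, keep_fraction=0.5):
    dim = strong.size
    keep = rng.choice(dim, size=int(keep_fraction * dim), replace=False)
    new = rng.uniform(lb, ub, size=dim)     # exploration: random genes
    new[keep] = strong[keep]                # exploitation: retained "cutting"
    return new

strong_tree = rng.uniform(lb, ub, size=10)
offspring = cutting(strong_tree)
```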
2.4. Proposed OA-LSTM
LSTM networks, introduced by Hochreiter and Schmidhuber in 1997, are a type of RNN designed specifically to handle long-term dependencies in sequential data [34]. LSTMs were developed in response to the limitations of traditional RNNs, which struggle to retain information over extended sequences due to the vanishing gradient problem. This phenomenon results in the exponential decay of gradients during backpropagation, making it challenging for RNNs to learn relationships over long time steps. As a result, standard RNNs often fail in tasks requiring long-term memory, such as language modeling, time-series forecasting, and speech recognition. LSTMs are widely used in applications requiring understanding dependencies across different time steps. Their design, featuring a memory cell and gated structures, allows them to selectively retain relevant information over time, which enables better performance in tasks involving complex sequential patterns. This capability has made LSTMs particularly popular in fields such as time-series analysis, machine translation, and speech-to-text systems, where the ability to capture long-term dependencies significantly enhances model effectiveness. In a time-series analysis, sequences of data points are often influenced by previous values, making it essential to retain relevant historical information across extended periods. Traditional neural networks and simple RNNs struggle with this requirement due to the vanishing gradient problem, which limits their capacity to remember long-term dependencies. LSTMs, with their gated memory mechanisms, allow models to selectively retain or forget information at each time step, facilitating more accurate predictions by capturing essential temporal patterns [35].
LSTM, like a traditional RNN, is structured as a sequential chain where each cell passes information to the next.
Figure 8 illustrates this chain-like structure. The LSTM network architecture incorporates a cell state $C_t$, which acts as a memory unit capable of storing long-term information, along with a hidden state $h_t$ that reflects the short-term memory for each time step. In Figure 8, each LSTM cell receives an input $x_t$, along with the previous cell state $C_{t-1}$ and hidden state $h_{t-1}$, then outputs an updated cell state $C_t$ and hidden state $h_t$ to the next cell. This sequential structure helps the LSTM retain relevant information across multiple time steps, enabling it to capture complex patterns in sequential data [36].
Information flow in LSTM is regulated by three key gates: the forget gate, input gate, and output gate. These gates are controlled by trainable weights and biases, allowing the network to retain essential information, discard irrelevant data, and update the cell and hidden states appropriately. The forget gate is responsible for deciding which information from the previous cell state is no longer relevant and should be “forgotten”. This gate enables the LSTM to filter out unnecessary information, ensuring that the cell state remains focused only on the essential data as it progresses through the sequence. By dynamically choosing what to forget, the LSTM prevents irrelevant or outdated information from cluttering the memory, which is particularly useful in long sequences where early inputs may lose significance over time. The input gate determines what new information should be added to the cell state. It evaluates the importance of the current input in the context of the sequence and selectively incorporates it into the memory. This mechanism allows the LSTM to update its memory in a controlled manner, only adding relevant new information that complements the existing context. The input gate, therefore, plays a key role in refining the memory by carefully integrating new data with past knowledge, enhancing the LSTM’s ability to capture meaningful patterns [37].
The output gate manages what part of the cell state should be exposed as the hidden state, which serves as the output of the LSTM cell for the current time step. This gate decides how much of the memory should be made available to subsequent cells or layers, balancing the cell’s internal state with the need to communicate relevant information. By controlling the output, the LSTM effectively shares only the essential information with the next layer, enabling better learning in deep architectures and sequential processing tasks. In each time step, the cell state and hidden state are updated based on the operations of these gates. The cell state acts as a long-term memory reservoir, retaining crucial information across multiple time steps, while the hidden state functions as a short-term memory that changes at each step to reflect the immediate context. The combination of these two states allows the LSTM to remember relevant information from the past while dynamically adjusting to new inputs, making it adept at handling complex, long-term dependencies in sequential data. Equations (9)–(15) represent the internal computations of an LSTM cell, detailing how information flows through the forget, input, and output gates to update the cell state and hidden state at each time step [22].
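For reference, a standard formulation of these gate computations, written in the common LSTM notation (the paper’s Equations (9)–(15) may group the terms slightly differently, e.g., listing the modulated candidate $i_t \odot \tilde{C}_t$ as a separate expression), is:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right), \\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right), \\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right), \\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right), \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \\
h_t &= o_t \odot \tanh(C_t),
\end{aligned}
```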
where $f_t$ represents the forget gate; $\tilde{C}_t$ represents the candidate generation; $i_t$ denotes the input gate; $o_t$ represents the output gate; $i_t \odot \tilde{C}_t$ is the modulated candidate state; $C_t$ is the updated cell state that combines retained information from $C_{t-1}$ and new information from $\tilde{C}_t$; $h_t$ is the updated hidden state; $W_f$, $W_i$, $W_C$, and $W_o$ are weight matrices corresponding to the previous hidden state $h_{t-1}$ and current input $x_t$ for each gate; $b_f$, $b_i$, $b_C$, and $b_o$ are bias terms associated with each gate; $\sigma$ is the sigmoid activation function, outputting values between 0 and 1; $\tanh$ is the hyperbolic tangent function, outputting values between −1 and 1; and $\odot$ is the element-wise multiplication operator, which modulates the interaction between the gates and states.
LSTM networks, despite their strengths, face notable challenges in terms of hyper-parameter optimization, which significantly affects their performance and generalization capabilities. Key hyper-parameters in LSTM models include the weights and biases within the fully connected layers, the number of layers and neurons, the learning rate, and the dropout rate. Optimizing these parameters is critical because each plays a specific role in controlling the behavior of the network. For instance, weights and biases directly influence how input data is transformed as it flows through the network, impacting how well the LSTM captures patterns and dependencies. The learning rate controls the step size in the optimization process; if set too high, it can lead to divergence, while a low learning rate may cause slow convergence and increase the training time.
One of the main challenges with optimizing LSTM networks is the use of gradient-based learning algorithms, such as SGD and Adam. While effective, these methods can be prone to local minima and saddle points, especially in high-dimensional spaces like those encountered in deep LSTM architectures. Additionally, gradient-based methods often struggle to adapt dynamically to the complex landscape of non-convex loss surfaces, which are common in neural networks. The reliance on the gradient descent may also result in issues like vanishing or exploding gradients, further complicating the learning process for LSTM networks, particularly when the model depth or sequence length increases. The optimization of the layer depth and neuron count is another crucial aspect of LSTM configuration. Selecting the right number of layers and neurons is vital to balance model complexity with computational efficiency. An excessive number of layers or neurons can lead to overfitting, where the model performs well on training data but poorly on unseen data. Conversely, too few neurons or layers may hinder the model’s ability to capture important patterns, reducing its accuracy on complex sequences. This trade-off highlights the need for effective hyper-parameter tuning to achieve optimal network architecture for a specific task. Furthermore, LSTMs often require the careful tuning of additional parameters, such as the batch size and dropout rates, which influence how well the network generalizes to new data. Together, these hyper-parameters must be finely adjusted to enable the network to learn effectively, avoid overfitting, and maintain computational efficiency.
Given the complexity of LSTM optimization, meta-heuristic algorithms have shown promise in effectively navigating the high-dimensional search space of hyper-parameters. Meta-heuristic approaches, such as genetic algorithms (GAs), particle swarm optimization (PSO), and ant colony optimization (ACO), have proven effective for LSTM training, offering a robust alternative to traditional gradient-based methods. These algorithms can escape local optima and better handle the non-convex optimization landscape of deep networks. In this study, we propose using a novel meta-heuristic called the improved OA to train the LSTM network and optimize its hyperparameters. By applying the OA within the LSTM network, the model can better navigate the high-dimensional parameter space, avoiding issues such as local minima that commonly affect gradient-based optimization methods. This is especially useful in LSTM networks, where the non-convexity of the loss surface can hinder standard optimization techniques like SGD or Adam.
Figure 9 illustrates the proposed architecture, referred to as OA-LSTM, where the standard LSTM network is enhanced using the OA as an optimizer. This approach integrates the OA into the LSTM structure to optimize key hyper-parameters, such as weights and biases, throughout the learning process. The optimizer module, depicted in the figure, dynamically updates these parameters by minimizing errors in each time step, ultimately leading to an improved performance across sequential data tasks. One of the key advantages of using the OA in the LSTM architecture is its ability to balance exploration and exploitation. The OA uses an adaptive approach, allowing the optimizer to explore new parameter spaces when necessary while focusing on fine-tuning existing solutions to improve convergence. This flexibility helps the LSTM network achieve a more robust and globally optimized set of parameters, ultimately improving accuracy and generalization in tasks such as time-series forecasting, speech recognition, and natural language processing. Furthermore, the OA provides enhanced stability in the optimization process, reducing the likelihood of issues such as vanishing or exploding gradients. By using a global search strategy, the OA can dynamically adjust parameters to maintain stable learning, even as the LSTM model depth and sequence length increase. This stability is crucial for training deeper LSTM architectures that capture more complex temporal dependencies without encountering the computational challenges typical of gradient-based optimizers.
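To make the coupling concrete, the sketch below shows how a single OA candidate (a flat vector of weights and biases) could be evaluated on a small Keras LSTM by computing a validation RMSE as its fitness; the model size, data shapes, and helper names are illustrative assumptions, not the exact OA-LSTM configuration.

```python
# Simplified sketch: evaluating one OA candidate on a Keras LSTM via validation
# RMSE. Model size, data shapes, and helper names are illustrative assumptions.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20, 12)),   # 20 time steps, 12 input features
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1),
])

def set_flat_weights(model, flat):
    """Unpack a flat candidate vector into the model's weight tensors."""
    shapes = [w.shape for w in model.get_weights()]
    sizes = [int(np.prod(s)) for s in shapes]
    parts = np.split(flat, np.cumsum(sizes)[:-1])
    model.set_weights([p.reshape(s) for p, s in zip(parts, shapes)])

def fitness(flat, x_val, y_val):
    """RMSE of a candidate on held-out data (lower is better for the OA)."""
    set_flat_weights(model, flat)
    pred = model.predict(x_val, verbose=0).ravel()
    return float(np.sqrt(np.mean((pred - y_val) ** 2)))

n_params = sum(int(np.prod(w.shape)) for w in model.get_weights())
x_val = np.random.rand(32, 20, 12).astype("float32")
y_val = np.random.rand(32).astype("float32")
candidate = np.random.uniform(-0.5, 0.5, size=n_params)   # one OA seedling
print(fitness(candidate, x_val, y_val))
```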
Figure 10 demonstrates the implementation of the OA operators for optimizing the weights and biases of an LSTM network. The process begins with the initial population, where each candidate solution contains a set of values representing potential weights and biases, alongside an RMSE value indicating the quality of the solution. The growth phase follows, where the algorithm applies the growth operator to explore the local neighborhood of each seedling by slightly perturbing their values. This results in an updated set of solutions with improved RMSE values, such as 0.5, 0.6, and 0.4 µg/m³, demonstrating the effectiveness of local search in improving candidate solutions. Next, the screening phase categorizes the grown seedlings into strong, medium, and weak categories based on their RMSE values. In the final phase, specific operators such as grafting, cutting, and pruning are applied to refine the solutions further. The grafting operator combines features from strong and medium seedlings to generate a new solution with a significantly lower RMSE value (e.g., 0.1 µg/m³). Similarly, the cutting operator retains certain parts of a medium seedling and randomizes others, producing a solution with an RMSE of 0.2 µg/m³. The pruning operator adjusts specific genes in weak seedlings, resulting in an improved RMSE of 0.3 µg/m³.
3. Experimental Results
To forecast PM2.5 air pollution levels, several advanced algorithms were implemented and evaluated to identify the most effective model for this task. These algorithms include OA-LSTM, LSTM, RNN, DNN, SVM, and RF, and their performance was compared comprehensively. The selection of algorithms for this study was based on their distinct capabilities and relevance to air pollution forecasting. Advanced DL models such as LSTM, RNN, and DNN were chosen for their ability to handle complex, non-linear relationships and sequential data, both of which are critical for PM2.5 prediction. LSTM, in particular, excels at capturing long-term dependencies in time series, making it highly effective for modeling the temporal variations of air pollution levels. RNN and DNN complement this by providing alternative approaches to sequential and non-linear modeling, enabling a comprehensive exploration of DL techniques for this problem. In addition to DL models, traditional ML algorithms such as SVM and RF were included as benchmarks due to their simplicity, robustness, and established success in environmental data modeling. SVM is particularly effective in handling high-dimensional data and non-linear relationships, while RF’s ensemble approach provides resilience against noise and complex feature interactions. Comparing these models with the proposed OA-LSTM ensures a thorough evaluation, highlighting its strengths and validating its performance against both classical and advanced techniques. This diverse selection underscores the robustness of OA-LSTM, showcasing its ability to outperform a wide range of predictive methodologies in capturing the intricate dynamics of PM2.5 concentrations.
All implementations were conducted in the Python programming environment, leveraging libraries such as TensorFlow, Keras, Scikit-learn, and NumPy for model development, optimization, and evaluation. The dataset used for training and testing comprises meteorological data, topographical features, PM2.5 concentrations, and satellite-based parameters such as AOD. In the process of model development and evaluation, proper validation plays a crucial role in determining the model’s ability to predict unseen data. In this study, the dataset was split into 70% training and 30% testing. The splitting was designed to account for both temporal and spatial aspects of the data, ensuring that the evaluation process was robust and reflective of real-world scenarios. For the temporal aspect, data from the years 2014 and 2015 were used for training, while data from 2016 were allocated for testing. This chronological split ensures that the model is evaluated on future data that was not available during training, closely mimicking real-world forecasting scenarios. By preserving the temporal sequence, we avoided data leakage and ensured that the model’s performance was evaluated on genuinely unseen data. Additionally, the distributions of key variables, such as PM2.5 concentrations and meteorological conditions, were analyzed across the training and testing datasets to confirm representativeness and balance.
From a spatial perspective, special care was taken to prevent data leakage by ensuring that individual monitoring stations were exclusively assigned to either the training or testing datasets. For instance, data from two monitoring stations were reserved entirely for the testing set, while the remaining stations were used for training. This strategy ensures that the model is tested on spatially distinct data, representing stations it has never seen during training. This setup mirrors real-world conditions where a trained model may encounter data from new or previously unmonitored locations. The spatial splitting strategy has several advantages. By excluding overlap between training and testing stations, we avoid inflating the model’s performance through exposure to similar patterns from the same location. This ensures that the model’s predictions are evaluated on entirely new patterns, enhancing its generalization ability. While this approach is more challenging and may lead to slightly lower accuracy on the testing set, it provides a more realistic assessment of the model’s performance in scenarios where it encounters new and unseen spatial data.
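A simplified sketch of this combined temporal and spatial split is shown below; the file name, column names, and held-out station identifiers are hypothetical.

```python
# Sketch of the combined temporal + spatial split: 2014-2015 for training,
# 2016 for testing, plus two stations held out entirely for testing.
# File name, column names, and station IDs are hypothetical.
import pandas as pd

df = pd.read_csv("tehran_pm25_dataset.csv", parse_dates=["date"])

test_stations = ["ST07", "ST15"]              # stations never seen in training

is_test_year = df["date"].dt.year == 2016
is_test_station = df["station_id"].isin(test_stations)

train = df[~is_test_year & ~is_test_station]  # 2014-2015, training stations only
test = df[is_test_year | is_test_station]     # 2016 data plus held-out stations
```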
To evaluate the performance of the implemented models, several key metrics were employed, as defined in Equations (16)–(20). These metrics include RMSE, R², the standard deviation of the prediction errors, the convergence trend, and computational complexity metrics (such as runtime). Each metric addresses a specific aspect of model performance. RMSE quantifies the average magnitude of prediction errors, with lower values indicating more precise predictions. R² evaluates the proportion of variance in the observed data explained by the model, providing insight into the goodness of fit. The standard deviation of the prediction errors measures the variability in prediction errors, highlighting the model’s consistency. Metrics related to convergence trends assess the stability and efficiency of optimization, while runtime directly reflects computational demands. The combination of these metrics ensures a holistic analysis of model performance, covering accuracy, stability, and computational efficiency.
where $O_i$ is the observed parameter; $P_i$ is the predicted (calculated) parameter; $\bar{O}$ is the mean of the observed parameter; $\bar{P}$ is the mean of the predicted parameter; $\sigma_O$ is the standard deviation of the observed parameter; $\sigma_P$ is the standard deviation of the predicted parameter; $e_i$ is the prediction error for each data point; $\bar{e}$ is the mean of the prediction errors; and $n$ is the number of observations.
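For reference, the core metrics can be computed as in the sketch below; R² is taken here as the squared Pearson correlation, consistent with the use of the means and standard deviations of the observed and predicted series, although the exact forms are those of Equations (16)–(20). The example arrays are illustrative.

```python
# Sketch of the evaluation metrics: RMSE, R-squared (squared Pearson
# correlation), and the standard deviation of the prediction errors.
import numpy as np

def evaluate(y_obs, y_pred):
    err = y_pred - y_obs
    rmse = np.sqrt(np.mean(err ** 2))              # average error magnitude
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2     # goodness of fit
    sde = np.std(err)                              # spread of errors (consistency)
    return {"RMSE": rmse, "R2": r2, "SDE": sde}

y_obs = np.array([22.0, 35.5, 48.1, 30.2, 60.7])
y_pred = np.array([24.1, 33.0, 50.3, 29.5, 58.2])
print(evaluate(y_obs, y_pred))
```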
Proper parameter calibration is crucial to the success of ML models, as these parameters directly influence model performance. Without optimal tuning, a model may underfit, overfit, or fail to learn the desired patterns in data effectively. This is especially true in a complex problem like PM2.5 prediction, where non-linear interactions and temporal dependencies play a significant role. Hence, setting the right parameters ensures the models generalize well to unseen data while maintaining computational efficiency. To determine the optimal parameters for each model, we employed the trial-and-error method, which is a systematic and iterative approach. This process involves testing various combinations of parameters, evaluating model performance for each combination, and gradually narrowing down the parameter ranges based on the results. Metrics such as RMSE and R² were used to assess the performance of each parameter setting, allowing us to identify configurations that minimize prediction errors and maximize consistency. Given the complexity of the task, a wide range of parameter values was tested for each model. For instance, learning rates, dropout rates, and batch sizes were varied across several orders of magnitude for DL models, while regularization parameters and kernel coefficients were adjusted for traditional ML models. However, due to the sheer volume of experiments, only the optimized parameter values are presented in Table 3 for brevity. It is worth noting that extensive experimentation was conducted to ensure these values represent the best configurations for each algorithm.
For the OA-LSTM model, critical parameters such as a learning rate of 0.07, a batch size of 64, and a sequence length of 20 were selected. The number of hidden layers and neurons per layer were set to 8 and 64, respectively, ensuring the model could capture complex temporal dynamics without overfitting. Activation functions like Tanh and Sigmoid were chosen for their ability to handle non-linear relationships effectively, while the optimizer was calibrated using the novel OA for robust gradient updates. The DNN model was configured with a learning rate of 0.06 and six hidden layers, each containing 36 neurons. The Adam optimizer was employed for its adaptive learning rate properties, while a batch size of 32 ensured efficient training. Similarly, the RNN model utilized a learning rate of 0.08, ten hidden layers, and 64 neurons per layer, with Tanh and Sigmoid activation functions for capturing sequential patterns. The sequence length for RNN was set to 10 to balance the temporal depth and computational efficiency. Traditional ML models like SVM and RF were also fine-tuned. The SVM model featured a regularization parameter of 1, a gamma value of 0.003, and an RBF kernel, chosen for its ability to model non-linear data effectively. The RF model was calibrated with 300 estimators, a maximum depth of 12, and a minimum sample split of 4, ensuring robustness against overfitting while maintaining computational efficiency.
Table 4 presents the performance metrics of various models for predicting PM2.5 pollution, focusing on the test dataset. The models are compared based on RMSE, R², and the standard deviation of the prediction errors, providing a comprehensive view of their accuracy and stability. Our proposed method, OA-LSTM, significantly outperforms all other models. It achieves the lowest RMSE (3.01 µg/m³), indicating a much higher predictive accuracy compared to traditional LSTM (9.53 µg/m³), RNN (9.84 µg/m³), and other ML algorithms like DNN, RF, and SVM, which exhibit larger errors. Additionally, the R² value of our model (0.88) demonstrates the strongest correlation between the predicted and actual values, reflecting a better model fit. The OA-LSTM model also shows excellent stability, with an error standard deviation of just 0.05, much lower than the other models, which range from 1.24 to 4.83. These results highlight the robustness and efficiency of the proposed OA-LSTM model in handling spatial–temporal data for air pollution prediction.
Figure 11 illustrates scatter plots of the measured versus estimated PM2.5 concentrations for six proposed models. These plots visually represent the models’ performance in predicting PM2.5 values, with the x-axis showing the measured concentrations and the y-axis displaying the predicted values. The color bar on the right of the plots represents the frequency of data points, with darker shades (blue) indicating fewer data points and brighter shades (yellow to red) highlighting areas of higher concentration. The density of the points near the diagonal regression line is an indicator of the accuracy and consistency of the predictions. In the OA-LSTM model, the results demonstrate a clear alignment between the predicted and measured values. The regression equation indicates that the model nearly perfectly follows the trend of the actual data. The clustering of points around the regression line, particularly in the high-frequency areas highlighted in yellow and red, suggests that the OA-LSTM model is not only accurate but also consistent across different ranges of PM2.5 concentrations. The superior performance of the OA-LSTM model can be attributed to the integration of the OA, which enhances the LSTM’s ability to capture spatial–temporal features in the PM2.5 data.
The scatter plots for the LSTM and RNN models reveal a moderate alignment between the measured and estimated values, as evidenced by their regression fits. The RMSE values of 9.53 µg/m³ (LSTM) and 9.84 µg/m³ (RNN) indicate a reasonable performance, although the density of points along the regression line shows some underprediction at higher PM2.5 concentrations. These models perform well in capturing trends in the low to medium ranges but exhibit limitations in accurately modeling high pollution levels. In contrast, the DNN model shows a weaker correlation between the measured and predicted concentrations, with an RMSE of 10.41 µg/m³. The increased dispersion of points, particularly at higher concentrations, highlights the DNN model’s reduced ability to capture complex spatial–temporal patterns, leading to less reliable predictions of peak pollution events. The scatter plots for RF and SVM display a significantly lower alignment with the measured data. The widely scattered points around the regression line, particularly for SVM, indicate a poor predictive performance in both low and high concentration ranges.
Table 5 presents the computational complexity (runtime) of various algorithms, measured based on the time required to reach different RMSE thresholds. The OA-LSTM model significantly outperforms other methods in terms of computational efficiency. For instance, OA-LSTM reaches an RMSE of 20 µg/m³ in just 32 s, while other models such as LSTM, RNN, and DNN take considerably longer, at 106, 123, and 162 s, respectively. As the RMSE threshold becomes more stringent (e.g., RMSE < 10 µg/m³), OA-LSTM continues to demonstrate its advantage by achieving this goal in 173 s, whereas other models either take much longer (RNN: 634 s) or fail to reach that level. The table also highlights the scalability and efficiency of OA-LSTM as the only model capable of reaching an RMSE below 5 µg/m³, albeit with a significant increase in runtime (384 s). In contrast, none of the other models manage to reduce the RMSE below 5 µg/m³, suggesting a trade-off between complexity and performance. Overall, the results underline OA-LSTM’s ability to balance accuracy and computational cost, demonstrating a superior performance in both reducing prediction errors and minimizing computational demands.
Figure 12 illustrates the convergence trends of the proposed models in terms of their RMSE across different epochs. Among the models, OA-LSTM exhibits the fastest and most effective convergence, with a sharp decline in RMSE within the first 50 epochs, reaching an RMSE value below 5 µg/m³ quite early. This indicates that the OA-LSTM model not only converges more quickly but also maintains a high level of accuracy throughout the training process. In contrast, the other models show a slower convergence rate. For example, while the LSTM model also steadily reduces its RMSE, it does so more slowly, achieving an RMSE around 10 µg/m³ only after approximately 150 epochs. The RNN and DNN models perform similarly but with larger RMSE values, particularly in the earlier epochs. Both models begin with high RMSEs above 30 µg/m³ and gradually converge towards values around 10 µg/m³ after 200–300 epochs, reflecting less efficient learning compared to OA-LSTM. The RF and SVM models show the slowest convergence trends, with RMSEs above 20 µg/m³ even after 300 epochs, indicating that these models struggle to learn effectively and are less suited for this specific prediction task.
Figure 13 illustrates the spatial distribution of the observed PM2.5 concentrations for two distinct periods: (a) August 2016 (Wednesday, summer), and (b) December 2016 (Friday, winter). The spatial distribution of PM2.5 concentrations in August 2016 (Figure 13a) shows relatively lower levels of pollution compared to the winter map. The highest concentration, marked by darker blue areas, can be seen in the northern and central regions of the map, with values peaking around 58 µg/m³. The southern and eastern parts exhibit lower levels, ranging from 15 to 30 µg/m³. This pattern of pollution could be attributed to various factors such as the natural topography, dominant wind patterns, traffic, and possibly industrial activities concentrated in the northern areas. During summer, the overall levels of PM2.5 are expected to be lower due to better air dispersion caused by warmer temperatures and stronger winds, which help dilute pollutants. In contrast, the map for December 2016 (Figure 13b) presents a significantly higher level of pollution, especially in the northern and central regions, where PM2.5 concentrations reach up to 125 µg/m³. The eastern and southern regions also exhibit higher pollution compared to the summer map, with values ranging between 40 and 80 µg/m³. One possible reason for this stark difference is the phenomenon of temperature inversion, which commonly occurs during the winter. This meteorological condition traps pollutants close to the ground, particularly in urban areas, leading to higher concentrations of PM2.5 [25,26,27,28]. Additionally, December is a colder month, and higher emissions from residential heating and the reduced dispersion of pollutants likely contribute to the elevated pollution levels. The fact that December 23 was a Friday, a non-working day in many regions, might also affect the data, as reduced traffic could have slightly mitigated PM2.5 levels, but the impact of inversion and higher winter emissions likely outweighs this factor.
Figure 14 illustrates the spatial distribution of PM2.5 concentrations predicted by six different algorithms in August 2016 (Wednesday, summer). These predictions are compared against the ground truth map provided in Figure 13a, enabling a comprehensive evaluation of each model’s performance in replicating the observed spatial patterns. The OA-LSTM model (Figure 14a) demonstrates exceptional accuracy, closely aligning with the ground truth map by effectively capturing the peak concentrations in the northern and central regions while maintaining a consistent gradient of lower pollution levels toward the southern and eastern areas. This alignment highlights OA-LSTM’s robustness in modeling both high and low concentration zones. In contrast, LSTM (Figure 14b) and RNN (Figure 14c) also perform well but show slight underpredictions of maximum values in the northern regions, although their spatial trends remain consistent with the observed data. The DNN model (Figure 14d) provides predictions with reasonable spatial consistency but tends to slightly overestimate PM2.5 levels in the southern areas, deviating from the observed distribution. Traditional ML models like RF (Figure 14e) and SVM (Figure 14f) exhibit comparatively lower predictive accuracy, failing to replicate localized peaks in PM2.5 concentrations, particularly in the northern regions. The RF model produces more generalized patterns, while SVM shows greater deviation in both spatial structure and concentration levels, highlighting its limitations in capturing complex spatial relationships. This comparison is not only visual but is also supported by the quantitative evaluation metrics presented in Table 4, which provide a detailed numerical assessment of each model’s accuracy in predicting PM2.5 concentrations. The reliability of the preprocessing methods used, such as noise reduction and the handling of missing data, is comprehensively detailed in the manuscript to ensure that the comparisons are robust and trustworthy. By combining visual and numerical evaluations, this study ensures a holistic assessment of model performance, showcasing the capability of OA-LSTM to capture both spatial variability and high-concentration zones more accurately than other models.
For the sensitivity analysis, we propose a binary OA (BOA) for feature selection, designed to achieve two primary objectives: minimizing the number of input features that do not significantly impact PM2.5 prediction and simultaneously minimizing the prediction error. To ensure that critical features are not excluded, a penalty term is introduced and integrated into the objective function. This penalty accounts for the importance of each feature in predicting PM2.5 concentrations. By integrating the feature-reduction term and the penalty into a unified framework, the algorithm achieves a balanced trade-off between simplicity and accuracy. Features with minimal importance and low impact on RMSE are prioritized for exclusion, while highly significant features are retained. Each feature is encoded in binary form within a seedling, where a value of 1 indicates that the feature is included in the selected subset and 0 indicates its exclusion. This encoding allows the algorithm to explore and select the optimal combination of features. The objective function is defined in Equations (21) and (22):
$$F_i = \alpha \cdot \mathrm{RMSE}_i + (1-\alpha)\,\frac{S_i}{N} + P_i \tag{21}$$

$$P_i = \beta \sum_{j=1}^{N} I_j \left(1 - x_{i,j}\right) \tag{22}$$

where $\alpha$ is a weighting parameter, with a default value of 0.89; $N$ is the total number of features in the dataset; $S_i$ is the number of selected features in the $i$-th seedling; $\beta$ is the weight assigned to the additional penalty for excluding important features; $I_j$ represents the relative importance of the $j$-th feature, which can be derived using the Gini index from RF; and $x_{i,j}$ is the binary value indicating the inclusion ($x_{i,j}=1$) or exclusion ($x_{i,j}=0$) of the $j$-th feature in the selected subset. $\mathrm{RMSE}_i$ denotes the prediction error obtained with the feature subset encoded by the $i$-th seedling.
This objective function ensures that RMSE is minimized, enhancing the predictive accuracy of the model; that the number of features is reduced, simplifying the model and improving computational efficiency; and that critical features are retained through the penalty term, preventing the exclusion of important predictors of PM2.5 concentrations. The BOA starts by initializing a population in which each seedling represents a subset of features. Through iterative operations (elitism, growth, screening, grafting, pruning, and cutting), the algorithm evolves the population to minimize the objective function. The RMSE is computed using the predictive models (OA-LSTM, LSTM, RNN, DNN, RF, and SVM) to evaluate the accuracy of the selected features.
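As an illustration, the following Python sketch shows how the fitness of one binary seedling could be evaluated under Equations (21) and (22). It is a minimal sketch rather than the study’s implementation: the penalty weight beta, the random-forest settings, and the helper names (feature_importances, seedling_fitness, model_fn) are illustrative assumptions, while alpha = 0.89 follows the default stated above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def feature_importances(X_train, y_train):
    """Gini-based relative importances I_j, estimated with a random forest."""
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)
    return rf.feature_importances_            # non-negative, sums to 1


def seedling_fitness(mask, X_train, y_train, X_val, y_val,
                     importances, model_fn, alpha=0.89, beta=0.05):
    """Objective of Eqs. (21)-(22) for one binary seedling.

    mask        : 0/1 vector, 1 = feature kept in the subset
    importances : relative importance I_j of each feature
    model_fn    : callable (X, y) -> fitted predictor (e.g., an LSTM wrapper)
    alpha       : weighting parameter (0.89, as stated in the text)
    beta        : penalty weight for excluding important features (illustrative)
    """
    mask = np.asarray(mask)
    selected = np.flatnonzero(mask)
    if selected.size == 0:                    # an empty subset is invalid
        return np.inf
    model = model_fn(X_train[:, selected], y_train)
    pred = model.predict(X_val[:, selected])
    rmse = np.sqrt(mean_squared_error(y_val, pred))
    size_term = selected.size / mask.size     # S_i / N
    penalty = beta * np.sum(importances * (1 - mask))        # Eq. (22)
    return alpha * rmse + (1 - alpha) * size_term + penalty  # Eq. (21)
```

In this sketch, the BOA operators would repeatedly propose new masks and keep those with the lowest fitness, balancing prediction error, subset size, and the importance-weighted penalty.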
Table 6 provides a comprehensive analysis of the selected features and their importance in predicting PM2.5 concentrations across the six predictive models. Each row in the table represents a model, while the columns denote whether a feature is included (1) or excluded (0), alongside the total number of selected features and the resulting RMSE value.
Certain features, such as AOD, LST, NDVI, WS, and the day of the week, are consistently selected by all models. This universal inclusion underscores their critical importance in PM2.5 prediction. AOD relates directly to particulate matter and air quality, making it indispensable for any PM2.5 predictive model. Similarly, LST reflects surface-temperature variations, which influence atmospheric stability and pollutant dispersion, while NDVI captures vegetation cover, a vital factor in pollutant absorption and urban heat mitigation. Other features, such as Max T, Min T, and humidity (H), exhibit moderate importance; they are included in most models but not universally. Their occasional exclusion suggests that their importance is context-dependent, varying with the model’s structure or with specific interactions between features. For instance, the minimum temperature affects nocturnal cooling and atmospheric stratification, impacting pollutant dispersion, whereas humidity interacts with particulate matter by influencing its hygroscopic growth. Conversely, some features, namely P, PE, WV, and WD, are less frequently selected, reflecting their limited role in PM2.5 prediction for Tehran. For example, water vapor may have indirect effects on pollution levels, but these effects are not as pronounced as those of primary features such as AOD. The impact of wind direction may also be diminished in a densely urbanized setting like Tehran, where pollutant sources are spatially distributed and localized meteorological factors dominate. Among all models, OA-LSTM demonstrates the best performance, achieving the lowest RMSE (5.12 µg/m3) with only nine selected features. This indicates the capability of the BOA to identify an optimal subset of features that maximizes predictive accuracy while minimizing redundancy. Models such as SVM and RF, on the other hand, require more features and still yield higher RMSE values, indicating less efficient utilization of the selected features.
4. Discussion
Table 7 presents a comparative analysis of our study’s results with previous works conducted in the same study area, Tehran. Although all the listed studies focus on PM2.5 prediction, it is essential to note that the conditions of these studies are not entirely identical. Each study used different datasets collected over varying periods and employed distinct predictive models tailored to their respective datasets and objectives. Despite these differences, the comparison highlights the advancements achieved through our proposed methodology. In terms of performance, our model achieves the highest R2 value of 0.88, outperforming all other studies. The closest result is the 3DCNN-GRU model by Faraji et al. [27], with an R2 of 0.84, followed by XGBoost models, which achieved R2 values of 0.81 and 0.74 in the works of Zamani Joharestani et al. [28] and Bagheri [26], respectively. Traditional models such as RF, utilized by Nabavi et al. [24], demonstrate lower predictive performance with an R2 of 0.68. The superior performance of OA-LSTM can be attributed to several factors. First, the architectural design of OA-LSTM leverages advanced DL techniques tailored specifically for PM2.5 prediction. This architecture effectively captures both spatial and temporal dependencies in the data, offering a significant advantage over conventional ML models. Second, the integration and preprocessing of the data, including feature selection using the OA, have likely contributed to the enhanced performance by prioritizing the most relevant predictors and minimizing noise. Finally, the combination of an optimized model and rigorous data preparation underscores the robustness of our approach. This not only improves the accuracy of the predictions but also sets a new benchmark for PM2.5 modeling in Tehran, demonstrating the potential of advanced DL frameworks in addressing complex environmental challenges.
Table 8 provides a comprehensive evaluation of the RMSE values for the six predictive models across three distinct PM2.5 concentration ranges: low, moderate, and high. These ranges were defined based on the distribution of PM2.5 concentrations in the dataset, with low concentrations corresponding to values below 35 µg/m3, moderate concentrations between 35 and 75 µg/m3, and high concentrations above 75 µg/m3. This breakdown allows for a detailed analysis of each model’s performance under varying pollution levels, offering insights into their strengths and limitations. The OA-LSTM model consistently exhibits the lowest RMSE across all concentration ranges, with values of 2.65, 2.94, and 3.73 µg/m3 for low, moderate, and high concentrations, respectively. This highlights the model’s robust ability to predict PM2.5 concentrations with high accuracy, regardless of the pollution level. In contrast, traditional ML models like RF and SVM demonstrate significantly higher RMSE values, particularly in high-concentration scenarios, where the RMSE reaches 20.40 and 23.46 µg/m3, respectively. Among the DL models, the LSTM and RNN perform reasonably well, but their RMSE values remain higher than those of OA-LSTM, especially in the moderate- and high-concentration ranges. This analysis underscores the superior generalizability and precision of the OA-LSTM model, particularly in scenarios involving low and moderate pollution levels. However, at higher concentration ranges, all algorithms, including OA-LSTM, exhibit reduced accuracy. Nonetheless, OA-LSTM demonstrates remarkable consistency, as its RMSE values across the low, moderate, and high ranges are relatively close compared to those of the other models. This consistency highlights the robustness and reliability of OA-LSTM in capturing the intricate patterns of PM2.5 concentrations, even under challenging high-pollution scenarios.
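The range-wise evaluation can be reproduced with a few lines of Python. The sketch below, under the assumption that observed and predicted PM2.5 values are available as equal-length arrays, bins the observations at the 35 and 75 µg/m3 thresholds used above and computes the RMSE within each bin; the function name rmse_by_range is illustrative.

```python
import numpy as np
import pandas as pd


def rmse_by_range(y_true, y_pred,
                  bins=(-np.inf, 35.0, 75.0, np.inf),
                  labels=("low", "moderate", "high")):
    """RMSE per observed PM2.5 range (µg/m3): <35 low, 35-75 moderate, >75 high."""
    df = pd.DataFrame({"obs": np.asarray(y_true, float),
                       "pred": np.asarray(y_pred, float)})
    df["range"] = pd.cut(df["obs"], bins=bins, labels=labels)
    return df.groupby("range", observed=True).apply(
        lambda g: float(np.sqrt(np.mean((g["obs"] - g["pred"]) ** 2))))
```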
Extreme pollution events are short-term episodes characterized by PM2.5 concentrations exceeding standard thresholds, posing severe risks to public health and the environment. These events result from a combination of natural causes, such as dust storms, wildfires, and temperature inversions, and anthropogenic factors, including industrial emissions, heavy traffic, and increased fossil fuel usage. The irregular and nonlinear nature of these events, coupled with data scarcity and noise, makes their prediction highly challenging. The accurate modeling of such events is critical for timely mitigation strategies, as they lead to acute health issues, economic disruptions, and environmental degradation. LSTM networks are well suited for predicting extreme pollution events due to their ability to capture long-term dependencies and model nonlinear relationships. By leveraging memory cells and gating mechanisms, LSTMs can identify early indicators of these events, such as shifts in meteorological variables or pollutant levels. The integration of the OA further enhances LSTM performance by addressing challenges in optimization, such as avoiding local minima and improving model robustness against noisy and imbalanced datasets. This hybrid approach enables the model to generalize effectively, ensuring higher accuracy in predicting rare and complex pollution events.
To ensure the model’s robustness in handling extreme pollution events, effective noise reduction and anomaly management techniques were applied during the data preprocessing stage. Specifically, the Savitzky–Golay filter was utilized to smooth the PM2.5 time-series data, reducing irregularities and ensuring that critical signal features, such as sharp peaks and valleys, were preserved. This step was crucial for enabling the model to accurately capture the unique patterns associated with extreme pollution events. By removing noise without compromising the inherent structure of the data, the preprocessing pipeline enhanced the reliability of the input dataset and facilitated the model’s ability to identify and predict these rare occurrences. Additionally, the proposed OA-LSTM model was rigorously evaluated using data specifically associated with extreme pollution events, such as days with exceptionally high PM2.5 concentrations. The results demonstrated that the model maintained a comparable level of predictive accuracy during extreme pollution events, with error metrics that were consistent with those observed for lower PM2.5 concentration levels. This indicates the model’s ability to generalize effectively, providing accurate and actionable forecasts even in the presence of anomalies and extreme conditions.
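For readers who wish to replicate the smoothing step, the snippet below applies SciPy’s Savitzky–Golay filter to a synthetic daily PM2.5 series containing noise and one sharp pollution spike. The window length and polynomial order shown are illustrative choices, not the values used in this study.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic daily PM2.5 series (µg/m3) with noise and one extreme-event spike
rng = np.random.default_rng(0)
days = np.arange(365)
pm25 = 40 + 15 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 5, days.size)
pm25[350:355] += 60                                   # short extreme episode

# Savitzky-Golay smoothing: suppresses high-frequency noise while largely
# preserving sharp peaks and valleys; window/order values are illustrative.
pm25_smooth = savgol_filter(pm25, window_length=11, polyorder=3)
```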
One of the main challenges in this study was the limited number of ground-based air quality monitoring stations, which could reduce prediction accuracy, especially in areas with sparse or no data coverage. To address this, satellite data, such as AOD, were used as complementary sources, providing broad spatial coverage and filling data gaps in unmonitored regions. The model was first trained and validated using satellite data aligned with existing monitoring stations to establish the relationship between satellite-derived variables and PM2.5 concentrations. Once validated, the model used satellite data from unmonitored regions to estimate PM2.5 levels, enabling comprehensive spatial predictions. Additional spatial features, such as elevation and NDVI, were incorporated to enhance the prediction accuracy by providing environmental context and capturing factors influencing pollution distribution. To harmonize the temporal and spatial resolutions across datasets, specific preprocessing steps were implemented. PM2.5 and AOD data, initially recorded at hourly intervals, were aggregated to daily averages to align with daily meteorological data, ensuring a unified temporal resolution. For spatial consistency, resampling and interpolation techniques, such as IDW and Kriging, were applied to adjust satellite and ground-based datasets to matching spatial granularities. These harmonization steps created a cohesive dataset, facilitating accurate spatial–temporal analysis and enabling the model to deliver reliable air quality predictions across diverse regions.
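The harmonization described above can be sketched as follows: hourly records are resampled to daily means with pandas, and a simple inverse-distance-weighting routine estimates PM2.5 at an unmonitored location from nearby station values. The function name idw, the synthetic data, and the power parameter of 2 are illustrative assumptions; the study’s Kriging step is not reproduced here.

```python
import numpy as np
import pandas as pd

# Temporal harmonization: hourly PM2.5 records aggregated to daily means.
hourly = pd.DataFrame(
    {"pm25": np.random.default_rng(1).uniform(10, 90, 24 * 30)},
    index=pd.date_range("2016-08-01", periods=24 * 30,
                        freq=pd.Timedelta(hours=1)))
daily = hourly.resample("D").mean()


def idw(xy_stations, values, xy_target, power=2.0):
    """Inverse-distance-weighted PM2.5 estimate at one target location."""
    xy_stations = np.asarray(xy_stations, float)
    values = np.asarray(values, float)
    d = np.linalg.norm(xy_stations - np.asarray(xy_target, float), axis=1)
    if np.any(d == 0):                        # target coincides with a station
        return float(values[np.argmin(d)])
    w = d ** (-power)
    return float(np.sum(w * values) / np.sum(w))


# Example: estimate PM2.5 at an unmonitored grid cell from three stations.
estimate = idw([[0, 0], [5, 2], [1, 4]], [42.0, 55.0, 38.0], [2, 2])
```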
To address the challenge of missing data, we adopted a controlled scenario approach to evaluate the effectiveness of our interpolation methods. In this approach, a certain percentage of complete data was randomly removed, and the missing values were reconstructed using spline interpolation. This method allowed us to assess how accurately the interpolation technique could recover missing data and how it would impact the model’s predictive performance. As missing data are an unavoidable issue in real-world datasets, employing robust interpolation techniques is crucial. The accuracy of these techniques significantly depends on their implementation; therefore, careful calibration and testing of spline parameters were performed to ensure the best possible reconstruction.
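A minimal version of this controlled missing-data experiment is sketched below: a fraction of a complete series is masked at random, reconstructed with cubic-spline interpolation via pandas (which relies on SciPy), and the reconstruction error is reported. The masking fraction, spline order, and the helper name masked_spline_rmse are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def masked_spline_rmse(series, frac_missing=0.1, order=3, seed=0):
    """Randomly mask part of a complete series, rebuild it with spline
    interpolation, and return the RMSE of the reconstructed values."""
    rng = np.random.default_rng(seed)
    s = pd.Series(series, dtype=float)
    idx = rng.choice(s.index[1:-1], size=int(frac_missing * len(s)),
                     replace=False)           # keep endpoints for the spline
    masked = s.copy()
    masked[idx] = np.nan
    recovered = masked.interpolate(method="spline", order=order)
    return float(np.sqrt(np.mean((recovered[idx] - s[idx]) ** 2)))


# Example on a synthetic complete series
rng = np.random.default_rng(2)
complete = 45 + 10 * np.sin(np.arange(200) / 10) + rng.normal(0, 3, 200)
print(masked_spline_rmse(complete, frac_missing=0.1))
```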
Table 9 highlights the impact of our preprocessing strategies, showcasing the performance of the OA-LSTM model and the other comparative models under different scenarios. For the real dataset, the application of the Savitzky–Golay filter for noise reduction resulted in notable improvements. The RMSE of the OA-LSTM model decreased from 5.29 µg/m3 (without Savitzky–Golay) to 4.63 µg/m3 (with Savitzky–Golay), demonstrating the filter’s effectiveness in maintaining data integrity while eliminating noise. Similar trends were observed across all models, with OA-LSTM consistently outperforming the others.
For the interpolated datasets, spline interpolation proved effective, though RMSE values were slightly higher than those of the real dataset. When combined with Savitzky–Golay filtering, spline interpolation yielded reliable results, with the OA-LSTM model achieving an RMSE of 6.26 µg/m3. However, when spline interpolation was used without noise reduction, RMSE increased further to 7.53 µg/m3. This underscores the critical role of integrating noise reduction methods with interpolation to enhance data quality and improve predictive accuracy. These results emphasize the necessity of employing robust preprocessing strategies in scenarios involving missing data. The OA-LSTM model’s superior performance across all scenarios demonstrates its robustness and adaptability, even in challenging conditions. The controlled approach used in this study not only validates the effectiveness of spline interpolation but also highlights the importance of fine-tuning interpolation techniques to maximize accuracy in real-world applications.
The proposed OA-LSTM model holds significant practical potential for integration into existing air quality management systems. By leveraging its high predictive accuracy, the model can provide early warnings for extreme pollution events, enabling policymakers to implement timely mitigation measures such as traffic restrictions, industrial activity adjustments, or public health advisories. The model’s compatibility with real-time data streams from ground monitoring stations and satellite sources ensures its relevance in dynamic and rapidly changing urban environments. Furthermore, the model’s ability to generalize across regions with varying data densities makes it particularly valuable for areas with sparse monitoring infrastructure, where accurate predictions are crucial for resource allocation and public safety measures.
In addition, this framework can enhance existing systems by serving as a decision-support tool for urban planning and environmental policy development. For example, integrating the OA-LSTM model into mobile applications or online dashboards could provide actionable air quality forecasts directly to the public, increasing awareness and preparedness during high-risk periods. Similarly, policymakers could use model outputs to design more effective long-term strategies, such as identifying high-emission zones or optimizing the placement of new monitoring stations. Importantly, the model’s relatively low computational requirements enable its deployment on cloud-based or local infrastructures, making it accessible to a wide range of governmental and non-governmental organizations. By bridging the gap between predictive modeling and practical implementation, the proposed OA-LSTM model represents a critical advancement in data-driven air quality management.
5. Conclusions
Air pollution, particularly PM2.5, poses significant health and environmental challenges in urban areas. Tehran, as a highly populated and industrialized city, has been the focus of numerous studies aiming to model and predict PM2.5 concentrations. In this paper, we proposed a novel OA-LSTM model to address the dual challenges of feature selection and predictive accuracy. By analyzing meteorological, environmental, and spatial data collected from Tehran between 2014 and 2016, we aimed to optimize the balance between model simplicity and performance. Our results demonstrated that the proposed OA-LSTM model outperformed all other approaches, achieving the highest R2 value of 0.88, indicating its robustness and reliability in capturing the complex dynamics of PM2.5 concentrations. Key features such as AOD, LST, NDVI, WS, and the day of the week were consistently identified as the most significant predictors across all models. Furthermore, the BOA’s ability to reduce the number of features without sacrificing accuracy was evident, as the OA-LSTM model achieved the lowest RMSE of 5.12 µg/m3 with only nine selected features, outperforming models like RF and SVM, which required more features and delivered lower predictive performance.
The comparative analysis with previous studies further highlighted the advantages of our proposed framework. While earlier works utilizing models like RF, XGBoost, and 3DCNN-GRU achieved respectable R2 values, they were constrained by either less sophisticated feature selection techniques or limited temporal and spatial data. In contrast, our approach combined advanced architecture design, optimized data preprocessing, and robust feature selection to set a new benchmark for PM2.5 modeling in Tehran. These results underline the importance of integrating state-of-the-art methodologies for tackling complex environmental problems. In conclusion, this paper has successfully demonstrated the potential of combining advanced DL architectures with innovative optimization techniques like the OA to address air quality prediction challenges. The OA-LSTM model, equipped with optimized feature subsets, provides a scalable and efficient solution for urban air quality management.
Building upon the advancements of this study, several promising directions can enhance the robustness and applicability of the OA-LSTM framework. First, expanding the dataset to include additional years and real-time monitoring data will enable the model to capture temporal variations more comprehensively and detect emerging pollution trends. Integrating high-frequency, real-time data sources can also enhance the framework’s ability to provide near-instantaneous predictions, making it suitable for urban air quality management systems. This scalability ensures adaptability to evolving environmental conditions and supports real-time decision-making. Second, the OA-LSTM model demonstrates strong potential for application in various regions and pollutants, beyond the current focus on PM2.5. By incorporating region-specific data, such as meteorological variables, pollutant concentrations, and geographical features, the model can adapt to new environmental conditions. For cities or regions with sparse monitoring networks, integrating supplementary data sources like satellite imagery, low-cost sensors, and socioeconomic indicators (e.g., traffic density and industrial activities) can significantly improve spatial coverage and prediction accuracy. Tailored preprocessing techniques, such as normalization and feature selection, will align these inputs with the unique characteristics of new target regions, ensuring accurate and reliable predictions.
Third, applying transfer learning techniques could streamline the adaptation of the OA-LSTM framework to diverse urban and rural environments. Pretraining the model on comprehensive datasets from one region and fine-tuning it with minimal local data from another can reduce computational costs and enhance its accessibility for resource-constrained areas. Testing the framework across regions with varying environmental conditions and pollutant profiles will validate its generalizability, allowing researchers to refine the model further for global air quality prediction. Lastly, future work could explore ensemble approaches that combine OA-LSTM with complementary models like CNNs to improve predictive accuracy and resilience. Additionally, developing dynamic optimization strategies, such as real-time extensions of the orchard algorithm, can enable continuous feature selection and parameter tuning, ensuring the model remains responsive to new data inputs. Incorporating uncertainty quantification methods into predictions will provide policymakers with reliable tools for risk assessment and targeted interventions. These advancements will position the OA-LSTM framework as a scalable and adaptable tool capable of addressing global challenges in urban air quality management and environmental sustainability.