AirTrace-SA: Air Pollution Tracing for Source Attribution

Zhao, Wenchuan; Zhang, Qi; Shu, Ting; Du, Xia

doi:10.3390/info16070603


¹ Faculty of Data Science, City University of Macau, Macau SAR, China

² School of Artificial Intelligence, Shenzhen University, Shenzhen 518060, China

³ National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China

⁴ School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China

* Authors to whom correspondence should be addressed.

Information 2025, 16(7), 603; https://doi.org/10.3390/info16070603 (registering DOI)

Submission received: 9 May 2025 / Revised: 30 June 2025 / Accepted: 9 July 2025 / Published: 13 July 2025

(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)


Abstract

Air pollution source tracing is vital for effective pollution prevention and control, yet traditional methods often require large amounts of manual data, have limited cross-regional generalizability, and present challenges in capturing complex pollutant interactions. This study introduces AirTrace-SA (Air Pollution Tracing for Source Attribution), a novel hybrid deep learning model designed for the accurate identification and quantification of air pollution sources. AirTrace-SA comprises three main components: a hierarchical feature extractor (HFE) that extracts multi-scale features from chemical components, a source association bridge (SAB) that links chemical features to pollution sources through a multi-step decision mechanism, and a source contribution quantifier (SCQ) based on the TabNet regressor for the precise prediction of source contributions. Evaluated on real air quality datasets from five cities (Lanzhou, Luoyang, Haikou, Urumqi, and Hangzhou), AirTrace-SA achieves an average R² of 0.88 (ranging from 0.84 to 0.94 across 10-fold cross-validation), an average mean absolute error (MAE) of 0.60 (ranging from 0.46 to 0.78 across five cities), and an average root mean square error (RMSE) of 1.06 (ranging from 0.51 to 1.62 across ten pollution sources). The model outperforms baseline models such as 1D CNN and LightGBM in terms of stability, accuracy, and cross-city generalization. Feature importance analysis identifies the main contributions of source categories, further improving interpretability. By reducing the reliance on labor-intensive data collection and providing scalable, high-precision source tracing, AirTrace-SA offers a powerful tool for environmental management that supports targeted emission reduction strategies and sustainable development.

Keywords: hybrid model; multi-step decision; TabNet; air pollution; particulate matter; pollution source tracing

1. Introduction

Global industrialization and urbanization continue to advance, and, alongside economic growth, air pollution, particularly particulate matter (PM₂.₅ and PM₁₀), has become a critical global challenge; it causes four million deaths annually and has severe ecological impacts [1,2,3,4]. The problem is especially severe in China’s rapidly developing regions, such as the Beijing–Tianjin–Hebei region, the Yangtze River Delta region and the Pearl River Delta region, where pollutant concentrations frequently exceed national standards, resulting in hundreds of billions of dollars in annual economic losses [5,6,7]. Studies indicate that every 10 μg/m³ increase in PM₂.₅ concentrations correlates with 15–20% higher heart disease incidence and 18% increased chronic obstructive pulmonary disease (COPD) risk in elderly populations [8].

Air pollution source tracing is key to air pollution prevention and control, focusing on analyzing the concentrations of chemical components in air pollutants to accurately identify the pollution sources and their contributions [9]. Common pollutant sources include industrial emissions, traffic exhaust, coal combustion, urban dust, and secondary pollutants [10]. Once the contribution of each pollution source to PM₂.₅ has been determined and the causes of air pollution in the region are understood, targeted emission reduction strategies can be created [11].

The challenge of pollution source tracing mainly stems from the heterogeneity and dynamics of the data. Coastal cities contend with interactions between sea salt and industrial emissions, while inland cities face dust–coal combustion complexities [12,13]. Secondary pollutants formed through atmospheric reactions require integration with meteorological modeling [14]. Traditional methods have provided valuable insights for decades, though they face challenges in handling increasingly complex pollution patterns and nonlinear source interactions in modern urban environments [15]. Numerous AI approaches (e.g., Random Forest, XGBoost, and 1D CNN) have been established as powerful benchmarks for this task, demonstrating strong predictive capabilities. However, a key challenge remains in developing a unified framework that can simultaneously handle complex, nonlinear source interactions and provide clear interpretability without sacrificing cross-regional generalization [16].

In this context, this study proposes AirTrace-SA (Air Pollution Tracing for Source Attribution), a novel hybrid deep learning approach for accurate air pollution source identification and quantification. AirTrace-SA integrates three key components: a hierarchical feature extractor (HFE) that derives multi-scale representations from chemical components, a source association bridge (SAB) that establishes chemical-to-source mappings through a multi-step decision mechanism, and a source contribution quantifier (SCQ) that precisely predicts pollution source contributions using TabNet regression capabilities. In summary, this study makes the following contributions:

Innovative fusion of models. This study introduces an innovative architecture that synergistically combines hierarchical feature extraction with multi-step decision processing, creating a powerful framework specifically designed for pollution source tracing challenges.
Improving the accuracy of pollution source tracing. The model significantly improves source tracing accuracy through its Source Association Bridge, which employs sparse attention mechanisms and sequential processors to distinguish overlapping pollution signals.
Enhancing generalization capability. K-fold cross-validation and multi-city evaluation show that the model maintains stable performance under different geographic and climatic conditions and across different pollution types. This addresses the generalization problem that traditional models face in cross-regional applications and improves stability and reliability in complex environments.
Reducing analysis costs. By learning complex feature patterns from existing data, our method reduces reliance on extensive field surveys and on the manual analysis of samples, which lowers analysis costs and improves efficiency.

The rest of the paper is organized as follows: Section 2 reviews related research. Section 3 introduces the main methods used in the AirTrace-SA model. Section 4 presents experiments that demonstrate the effectiveness of the model when it is applied to air pollution source tracing. Section 5 provides a comprehensive discussion of the model’s limitations, global applicability, and temporal considerations. Finally, Section 6 summarizes the study and describes future research directions.

2. Related Work

Traditional air pollution source tracing methods, including source models and receptor models, have been successfully applied worldwide and continue to provide valuable insights. While each approach has specific requirements and constraints, they form complementary tools that can be selected or combined based on application needs.

Source models such as the Community Multiscale Air Quality Model (CMAQ) provide comprehensive regional air quality simulations by integrating chemical reactions and transport processes [17]. Zhao et al. successfully applied CMAQ in the Yangtze Delta region, accurately capturing SO₄²⁻ distributions while revealing the challenges of NH₃ estimation [18]. CMAQ’s strength lies in its process-based approach and scenario analysis capabilities, making it invaluable for policy evaluation. While its performance depends on emission inventory quality, CMAQ remains the gold standard for understanding atmospheric processes at regional scales. Receptor models such as Chemical Mass Balance (CMB) offer direct source apportionment through mass balance calculations [19]. Zhang et al. demonstrated CMB’s effectiveness in Tianjin, accurately identifying the contributions of coal combustion (15.2%) and dust (12.5%) to PM₂.₅ [20]. CMB’s mathematical transparency and regulatory acceptance make it particularly valuable when source profiles are well-established. For regions with evolving emission patterns, data-driven approaches can complement CMB by adapting to changing source characteristics.

Due to its powerful data processing abilities, machine learning has been increasingly emphasized in air pollution source tracing [21]. Tree-based methods have gained traction in source apportionment. Choi et al. applied Decision Trees in Seoul, achieving R² values of 0.65–0.73 with excellent interpretability for preliminary analysis [22,23]. Random Forest extends this approach through ensemble learning, with Du et al.’s SPST model reducing prediction errors below 2% [24,25]. These methods excel at capturing nonlinear relationships and handling missing data, although their performance can vary across different urban environments. AirTrace-SA offers a complementary approach by combining hierarchical feature extraction with attention-enhanced source association, addressing cross-regional generalization challenges differently. Advanced machine learning algorithms offer improved accuracy. The Support Vector Machine (SVM) maximizes classification boundaries, with Kaya et al. achieving R² values of 0.85–0.91 for PM₁₀ prediction in Punjab [26,27]. XGBoost’s gradient boosting approach provided superior results in Beijing compared to traditional statistics [28,29]. Both methods effectively handle nonlinear patterns, though they require careful parameter tuning. AirTrace-SA complements these approaches by automating feature extraction through deep learning while maintaining comparable computational efficiency.

Deep learning approaches have transformed pollution analysis by automatically discovering complex patterns in environmental data [30]. The one-dimensional convolutional neural network (1D CNN) efficiently processes temporal sequences, with Ragab et al. achieving an MAE of 2.036 for air pollution prediction through local feature extraction [31,32]. While its computational requirements are higher than those of traditional methods, this investment often yields superior accuracy for time-series analysis. Hao et al. proposed a Tracing-U-Net model that combines the CNN and U-Net architectures to capture the spatial features of pollutants through convolutional operations; its average prediction error on sparse datasets is less than 8%, demonstrating the data efficiency potential of deep learning [33]. The model particularly excels in capturing location-specific pollution patterns, though careful domain adaptation may be needed for cross-regional applications. Attention-based LSTM networks developed by Liu et al. demonstrate superior PM₂.₅ prediction by focusing on relevant features [34,35]; however, the model still faces the challenge of high data requirements, and its robustness may be affected when data are scarce. TabNet, proposed by Arik et al., represents a particularly promising development for pollution source analysis [36]. It employs sequential attention mechanisms for instance-wise feature selection, making it inherently interpretable, which is a crucial requirement for environmental applications. TabNet’s multi-step decision architecture progressively refines predictions while providing feature importance rankings, allowing researchers to understand which chemical components drive source identification. Its success on various tabular datasets demonstrates strong potential for environmental data. However, when applied to highly overlapping pollution signatures, single-architecture approaches may face limitations in disambiguating complex source mixtures. Research demonstrates that hybrid models combining multiple architectures often outperform standalone approaches [37]. AirTrace-SA leverages this principle by integrating hierarchical feature extraction, attention-based source association, and TabNet regression within a unified framework, addressing the limitations of individual methods while maintaining interpretability.

Recent advances have explored hybrid approaches, combining traditional models with machine learning techniques. Lee integrated machine learning with Bayesian Spatial Multivariate Receptor Modeling (BSMRM), enabling the spatial prediction of PM₂.₅ sources at unmonitored locations while maintaining the physical interpretability of receptor models [38]. Ma et al. combined WRF-Chem with deep learning bias correction, achieving RMSE reductions of 38.90–48.86% across multiple urban agglomerations, with the hybrid approach significantly outperforming individual methods [39]. Lee et al. developed a conditional surrogate, achieving R² > 0.95 for PM₂.₅ concentration prediction, but this high accuracy is for total PM₂.₅ mass rather than source-specific contributions [40]. These hybrid methods excel in their specific domains: BSMRM provides spatial coverage but requires extensive multi-site data; WRF-Chem hybrids improve concentration forecasts but need significant computational resources even with deep learning acceleration; CMAQ surrogates achieve rapid concentration predictions but do not perform source apportionment. AirTrace-SA addresses the distinct challenge of directly quantifying contributions from multiple individual pollution sources without requiring multi-site monitoring networks or computationally intensive atmospheric simulations, making it particularly suitable for regions with limited monitoring infrastructure or computational resources.

Building upon these insights, we introduce AirTrace-SA (Air Pollution Tracing for Source Attribution), a specialized hybrid architecture for pollution source tracing. The model features an HFE module for extracting multi-scale chemical patterns, an SAB component that employs iterative decision steps to map features to sources, and an SCQ unit that utilizes advanced tabular learning techniques for precise contribution estimation. Unlike traditional methods, AirTrace-SA does not rely on cumbersome emission inventories or predefined source fingerprints. Instead, it automatically extracts complex feature patterns from existing data, which significantly reduces the cost of manual data collection and improves adaptability. Compared with machine learning approaches that have difficulty capturing deep nonlinear interactions or maintaining generalization across heterogeneous datasets, AirTrace-SA models the complex relationships among pollution sources more effectively through a multi-step decision-making mechanism. In addition, compared to standalone deep learning models, which may have difficulty fully extracting features from noisy or highly overlapping pollutant data, the hybrid architecture of AirTrace-SA enhances the deep feature representation capability and improves robustness.

Overall, AirTrace-SA offers advantages in accuracy, generalizability, and interpretability, providing an innovative and efficient solution in the field of air pollution source tracing.

3. Method

This section presents a detailed explanation of the AirTrace-SA model for air pollution source tracing. AirTrace-SA integrates three key components in an end-to-end framework to transform chemical concentration data into precise pollution source contribution predictions.

Figure 1 shows the schematic diagram of the proposed method. First, the input chemical concentration data in tabular form are entered into the hierarchical feature extractor (HFE). The HFE processes these data through a series of shared layers, followed by step-dependent layers, each activated by ReLU activation functions [41]. These layers progressively extract multi-scale features from raw chemical data, capturing both general patterns and specific chemical markers relevant to pollution source identification. The output layer of HFE delivers a comprehensive feature representation that encodes the complex relationships between chemical components.

This feature representation is then passed to the source association bridge (SAB), which forms the core of AirTrace-SA’s innovative approach. The SAB employs a multi-step decision mechanism that iterates for $T$ steps to establish robust associations between chemical features and pollution sources. In each step, the process begins with a sparse attention mechanism that selectively focuses on the most relevant features for a particular source [42], followed by a sequential processor that enhances these selected features. This sequential processing allows the model to progressively refine its understanding of complex pollution signatures. The outputs from all steps are then combined through an elementwise sum operation, creating an output aggregation that captures multi-faceted relationships between chemical components and pollution sources.

Finally, the output aggregation is fed into the source contribution quantifier (SCQ), which employs a TabNet Regressor with its own multi-step processing capability [36]. Within the SCQ, each step involves feature selection to identify the most relevant aspects of the aggregated features, followed by feature processing to transform these selections into predictive insights. After multiple processing steps, the final output is compared with the true source contributions through the MSE loss function, which is minimized during training [43]. The SCQ ultimately generates precise quantitative predictions of the contribution of each pollution source.

This cascaded architecture enables AirTrace-SA to perform complex feature extraction, association building, and contribution quantification in a unified framework. The following subsections will elaborate on each component’s internal structure and functional mechanisms.

3.1. Hierarchical Feature Extractor (HFE)

The hierarchical feature extractor (HFE) is the initial stage of the model and is responsible for transforming the chemical concentrations in the raw tabular data into a more expressive feature representation. The design of HFE draws inspiration from TabNet’s hierarchical feature processing principles [36] but adopts a more streamlined architecture for our specific task. Instead of TabNet’s complex GLU-based layers with split mechanisms, our HFE employs shared and step-dependent layers followed by an output layer, achieving effective multilayer nonlinear transformation while reducing computational complexity [44].

The input data structure of our model directly reflects standard environmental monitoring practices. Each input sample represents a single PM₂.₅ measurement with concentrations of 17 chemical species, forming a concentration matrix $x \in \mathbb{R}^{n \times 17}$, where $n$ represents the number of samples and 17 represents chemical species. This matrix-based representation is consistent with traditional receptor modeling approaches such as CMB and PMF, in which chemical concentration data are analyzed in matrix form to determine source contributions [19].

Input features are first processed through two shared layers that maintain parameter sharing across all sample processing, helping to capture generalized relationships between features:

$$h_{\mathrm{shared}}^{(1)} = \mathrm{ReLU}\left(W_{\mathrm{shared}}^{(1)} \cdot x + b_{\mathrm{shared}}^{(1)}\right) \tag{1}$$

$$h_{\mathrm{shared}}^{(2)} = \mathrm{ReLU}\left(W_{\mathrm{shared}}^{(2)} \cdot h_{\mathrm{shared}}^{(1)} + b_{\mathrm{shared}}^{(2)}\right) \tag{2}$$

where $W_{\mathrm{shared}}^{(i)}$ and $b_{\mathrm{shared}}^{(i)}$ are the weight matrix and bias term of the $i$th shared layer, respectively, and $x$ is the original input feature. Next, the features are further processed through two step-dependent layers that provide more specialized feature transformation capabilities:

$$h_{\mathrm{step}}^{(1)} = \mathrm{ReLU}\left(W_{\mathrm{step}}^{(1)} \cdot h_{\mathrm{shared}}^{(2)} + b_{\mathrm{step}}^{(1)}\right) \tag{3}$$

$$h_{\mathrm{step}}^{(2)} = \mathrm{ReLU}\left(W_{\mathrm{step}}^{(2)} \cdot h_{\mathrm{step}}^{(1)} + b_{\mathrm{step}}^{(2)}\right) \tag{4}$$

where $W_{\mathrm{step}}^{(i)}$ and $b_{\mathrm{step}}^{(i)}$ are the parameters of the $i$th step-dependent layer. The final feature representation is generated by mapping the features to the specified dimension space through the output layer:

$$h_{\mathrm{output}} = W_{\mathrm{output}} \cdot h_{\mathrm{step}}^{(2)} + b_{\mathrm{output}} \tag{5}$$

This output $h_{\mathrm{output}}$ serves as an input to the subsequent multi-step decision-making mechanism, the dimensionality of which is controlled by the parameter $W_{\mathrm{output}}$ in the output layer and is usually set to a higher dimension to retain sufficient information.

The key role of the HFE is its ability to extract potential deep feature relationships from raw chemical composition data, converting simple concentration values into feature representations with rich semantic information. The hierarchical design enables the model to capture both common patterns across samples (via shared layers) and more specific, discriminative features (via step-dependent layers) that are essential for distinguishing between different pollution sources.

This conversion lays the foundation for the subsequent feature selection and transformation process, enabling the model to more accurately identify the contributions of different pollution sources.
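The HFE forward pass described above (two shared layers, two step-dependent layers, and a linear output layer) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the layer widths (`hidden_dim`, `out_dim`), the random initialization, and the helper names `init_layer` and `hfe_forward` are assumptions, and the code uses the batch convention `x @ W` (samples in rows) instead of the column form $W \cdot x$ used in the equations.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_layer(rng, in_dim, out_dim):
    """Return (W, b) for one dense layer: small random weights, zero bias."""
    W = rng.normal(0.0, 0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return W, b

def hfe_forward(x, shared, step, output):
    """Two shared layers, two step-dependent layers, then a linear output layer."""
    h = x
    for W, b in shared:          # shared layers with ReLU
        h = relu(h @ W + b)
    for W, b in step:            # step-dependent layers with ReLU
        h = relu(h @ W + b)
    W_out, b_out = output
    return h @ W_out + b_out     # output layer has no activation

rng = np.random.default_rng(0)
n_samples, n_species = 4, 17     # 17 chemical species per sample, as in the text
hidden_dim, out_dim = 32, 64     # hypothetical widths, not given in the text

x = rng.random((n_samples, n_species))   # toy concentration matrix
shared = [init_layer(rng, n_species, hidden_dim),
          init_layer(rng, hidden_dim, hidden_dim)]
step = [init_layer(rng, hidden_dim, hidden_dim),
        init_layer(rng, hidden_dim, hidden_dim)]
output = init_layer(rng, hidden_dim, out_dim)

h_output = hfe_forward(x, shared, step, output)
print(h_output.shape)   # (4, 64)
```

Choosing `out_dim` larger than the 17 input features mirrors the remark that the output dimension is usually set higher to retain sufficient information.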

3.2. Source Association Bridge (SAB)

The source association bridge (SAB) is the core innovative component of the AirTrace-SA model, responsible for establishing meaningful associations between chemical features and pollution sources through a multi-step decision mechanism. SAB receives the feature representations from HFE and transforms them into a form that enables accurate source contribution quantification.

The SAB employs a sequential multi-step processing approach, iterating for $T$ steps. Each step progressively refines the feature representations to better distinguish between different pollution sources. The key components of SAB include the sparse attention mechanism and sequential processor, which work together in each step to select and enhance relevant features.

3.2.1. Sparse Attention Mechanism

The sparse attention mechanism selects and weights features: it computes an attention score for each feature and applies a sparsification operation so that only the features relevant to the current task are kept. As shown in Figure 2, the attention mechanism generates attention scores from the input features, and the weights are made sparse through the Sparsemax activation function [42]. Finally, the features are multiplied by the sparse weights to form a sparse input feature matrix. This sparsity allows the model to efficiently select a small number of key features while ignoring less important ones.

Specifically, in each step the attention scores $a^{(t)}$ of the features are first generated by a linear transformation:

$$a^{(t)} = W_{\mathrm{attention}}^{(t)} \cdot X^{(t)} \tag{6}$$

where $W_{\mathrm{attention}}^{(t)}$ is the linear weight matrix at step $t$ and $X^{(t)}$ is the input feature matrix. After obtaining the attention score $a^{(t)}$, it is sparsified using the Sparsemax function:

$$\alpha^{(t)} = \mathrm{Sparsemax}\left(a^{(t)}\right) = \max\left(a^{(t)} - \tau, 0\right) \tag{7}$$

where $\tau$ is a threshold calculated via sorting and accumulating to ensure that the result is a sparse probability distribution. The Sparsemax function is characterized by its output being a sparse vector with most elements being zero and only a few elements having non-zero values, unlike Softmax which assigns non-zero probabilities to all elements [42]. The sparse attention weight $\alpha^{(t)}$ ensures sparse feature selection, i.e., only some of the key features are selected at each step. Applying $\alpha^{(t)}$ to the input features $X^{(t)}$ selects the most relevant features for the next step:

$$X_{\mathrm{selected}}^{(t)} = X^{(t)} \odot \alpha^{(t)} \tag{8}$$

where ⊙ denotes element-wise multiplication, so that only features with non-zero weights are retained and passed on to the subsequent sequential processor. This selective attention allows AirTrace-SA to focus on the chemical components that are most characteristic of specific pollution sources.
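To make the threshold τ concrete, the sketch below computes Sparsemax for a single score vector using the standard sort-and-accumulate procedure and then applies the resulting weights to a feature vector. The toy values of `x` and `a` are assumptions chosen for illustration; in the model the scores would come from the learned attention weights.

```python
import numpy as np

def sparsemax(a):
    """Sparsemax of a 1-D score vector: max(a - tau, 0) with entries summing to 1."""
    a = np.asarray(a, dtype=float)
    z = np.sort(a)[::-1]                  # scores in descending order
    cumsum = np.cumsum(z)                 # running sums of the sorted scores
    k = np.arange(1, a.size + 1)
    support = 1.0 + k * z > cumsum        # which sorted positions stay non-zero
    k_max = k[support][-1]                # number of non-zero outputs
    tau = (cumsum[k_max - 1] - 1.0) / k_max
    return np.maximum(a - tau, 0.0)

x = np.array([0.8, 0.3, 0.5])     # toy feature vector (assumed values)
a = np.array([0.5, 0.4, -1.0])    # toy attention scores (assumed values)

alpha = sparsemax(a)              # sparse attention weights
x_selected = x * alpha            # element-wise feature selection

print(alpha)        # approximately [0.55, 0.45, 0.0]
print(x_selected)   # third feature is zeroed out
```

For these scores, τ = −0.05, so the two largest scores receive weights of about 0.55 and 0.45 while the third weight is exactly zero. Softmax applied to the same scores would assign all three features a non-zero weight.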

3.2.2. Sequential Processor

The sequential processor follows the sparse attention mechanism in each step of the SAB; it is responsible for the non-linear transformation of selected features to extract more expressive high-level features. While the sequential processor shares conceptual roots with TabNet’s feature transformer [36] and DCN’s cross layers [45], its implementation follows a different approach. We eschew complex gating mechanisms and explicit feature interactions in favor of a sequential processing pipeline that alternates between shared and step-dependent transformations. This architectural decision reduces computational overheads while maintaining the essential capability of progressive feature refinement through multiple ReLU-activated layers. As shown in Figure 3, it consists of multiple fully connected layers and processes input features through the nonlinear activation function ReLU.

First, the input features are processed through shared layers that are shared across all decision steps to help the model capture global features:

$$h_{\mathrm{shared}}^{(t)} = \mathrm{ReLU}\left(W_{\mathrm{shared}} \cdot X_{\mathrm{selected}}^{(t)} + b_{\mathrm{shared}}\right) \tag{9}$$

where $W_{\mathrm{shared}}$ is the weight matrix of the shared layer, $b_{\mathrm{shared}}$ is the bias term, and $X_{\mathrm{selected}}^{(t)}$ is the weighted input features obtained from the sparse attention mechanism. The ReLU activation function introduces a nonlinear transformation defined as [38]:

$$\mathrm{ReLU}(x) = \max(0, x) \tag{10}$$

This nonlinear transformation helps the model to capture complex relationships between features, not just linear ones.

After processing in the shared layer, the features are passed to independent layers specific to the step, which help the model to perform specific nonlinear transformations for different feature choices at each step:

$$h_{\mathrm{step}}^{(t)} = \mathrm{ReLU}\left(W_{\mathrm{step}}^{(t)} \cdot h_{\mathrm{shared}}^{(t)} + b_{\mathrm{step}}^{(t)}\right) \tag{11}$$

where $W_{\mathrm{step}}^{(t)}$ and $b_{\mathrm{step}}^{(t)}$ are step-specific parameters that generate different feature representations for each decision step $t$. This design allows the model to learn different transformation functions at different steps to enhance its representation.

The feature representation $h_{\mathrm{step}}^{(t)}$ processed by the sequential processor will be used as the output of that step and also passed as an input to the next decision step, creating an iterative feature refinement process. This design enables the model to build and refine the feature representation incrementally, with each step building on the previous step to further enhance the features.
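The split between shared and step-specific parameters can be illustrated with a small NumPy class. This is a sketch under assumptions: the class name `SequentialProcessor`, the feature width `dim`, and the random initialization are hypothetical, and a single shared layer and a single step-specific layer are used per step, following the two equations in this subsection. Samples are stored in rows, so the code computes `x @ W` rather than $W \cdot x$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class SequentialProcessor:
    """One shared layer reused at every step, plus one layer per decision step."""

    def __init__(self, dim, n_steps, seed=0):
        rng = np.random.default_rng(seed)
        # A single set of shared parameters, reused by all decision steps.
        self.W_shared = rng.normal(0.0, 0.1, size=(dim, dim))
        self.b_shared = np.zeros(dim)
        # Separate parameters for each decision step t.
        self.W_step = [rng.normal(0.0, 0.1, size=(dim, dim)) for _ in range(n_steps)]
        self.b_step = [np.zeros(dim) for _ in range(n_steps)]

    def __call__(self, x_selected, t):
        h_shared = relu(x_selected @ self.W_shared + self.b_shared)
        return relu(h_shared @ self.W_step[t] + self.b_step[t])

proc = SequentialProcessor(dim=8, n_steps=7)
x_selected = np.random.default_rng(1).random((4, 8))   # toy attention-weighted features

h_step_0 = proc(x_selected, t=0)   # output of the first decision step
h_step_1 = proc(x_selected, t=1)   # same input, different step-specific weights
print(h_step_0.shape, h_step_1.shape)   # (4, 8) (4, 8)
```

Keeping the output width equal to the input width lets each step's output be fed directly into the next step, as the iterative refinement described above requires.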

3.2.3. Multi-Step Decision Mechanism and Feature Aggregation

The multi-step decision mechanism is the core organizational principle of the SAB, controlling how features flow through multiple steps of sparse attention and feature transformation. Rather than processing features in a single pass, AirTrace-SA employs a progressive refinement approach over $T$ steps.

At step t, the model uses the sparse attention mechanism and the sequential processor to generate the feature representation for that step:

$$h^{(t)} = \mathrm{SequentialProcessor}^{(t)}\left(\mathrm{SparseAttentionMechanism}^{(t)}\left(h^{(t-1)}\right)\right) \tag{12}$$

where $h^{(t-1)}$ is the feature representation of the previous step, which is first filtered by the sparse attention mechanism and then transformed by the sequential processor to generate the representation of the current step $h^{(t)}$. This sequential processing allows each step to build a more complex feature understanding based on the previous step.

A distinguishing characteristic of the SAB is that feature representations from all steps are accumulated rather than using only the final step’s output. The final feature representation $h_{\mathrm{final}}$ is the cumulative output of all the steps:

$$h_{\mathrm{final}} = \sum_{t=1}^{T} h^{(t)} \tag{13}$$

where $T$ is the total number of steps; to balance the model’s complexity and performance, our implementation uses $T = 7$. This aggregation mechanism allows AirTrace-SA to capture different feature relationships across multiple decision steps and generate a comprehensive feature representation.

This multi-step iterative design is well-suited for complex tasks such as air pollution source analysis because the relationships between pollution sources and chemical components are typically multi-layered and multi-faceted. For example, certain chemical components may collectively indicate one pollution source, while the same component in different combinations may indicate other sources. Through a multi-step decision-making mechanism, AirTrace-SA constructs a step-by-step understanding of these complex relationships and integrates information from different steps to form a comprehensive assessment of pollution source contributions.

Finally, the cumulative feature representation output from the SAB is passed to the source contribution quantifier for the final regression prediction.
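The multi-step loop and the summation over step outputs can be sketched for a single sample as follows. The sketch follows the order given in the step description (sparse attention first, then the sequential processor) and uses 7 steps as in the text. The parameter dictionary, its key names, the feature width `dim`, and the random weights are assumptions for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sparsemax(a):
    """Sparsemax of a 1-D score vector (sparse weights that sum to 1)."""
    z = np.sort(a)[::-1]
    cumsum = np.cumsum(z)
    k = np.arange(1, a.size + 1)
    k_max = k[1.0 + k * z > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max
    return np.maximum(a - tau, 0.0)

def sab_forward(h_output, params, n_steps=7):
    """Run the decision steps in sequence and sum every step's output."""
    h_prev = h_output
    h_final = np.zeros_like(h_output)
    for t in range(n_steps):
        a = params["W_att"][t] @ h_prev                      # attention scores
        x_sel = h_prev * sparsemax(a)                        # sparse feature selection
        h_sh = relu(params["W_shared"] @ x_sel + params["b_shared"])      # shared layer
        h_prev = relu(params["W_step"][t] @ h_sh + params["b_step"][t])   # step-specific layer
        h_final += h_prev                                    # accumulate all step outputs
    return h_final

rng = np.random.default_rng(0)
dim, n_steps = 16, 7   # dim is a hypothetical feature width
params = {
    "W_att": [rng.normal(0.0, 0.5, (dim, dim)) for _ in range(n_steps)],
    "W_shared": rng.normal(0.0, 0.5, (dim, dim)),
    "b_shared": np.zeros(dim),
    "W_step": [rng.normal(0.0, 0.5, (dim, dim)) for _ in range(n_steps)],
    "b_step": [np.zeros(dim) for _ in range(n_steps)],
}

h_output = rng.random(dim)                     # toy HFE output for one sample
h_final = sab_forward(h_output, params, n_steps=n_steps)
print(h_final.shape)   # (16,)
```

Because every step's output is added to `h_final`, information selected in early steps is not lost even if later steps attend to entirely different features.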

3.3. Source Contribution Quantifier (SCQ)

The source contribution quantifier (SCQ) is the final component of the AirTrace-SA model; it is responsible for transforming the feature representations from the SAB into accurate predictions of pollution source contributions. At the core of SCQ is a TabNet Regressor, which is specifically designed to process tabular data through its own multi-step architecture, complementing the multi-step processing of the SAB.

3.3.1. TabNet Regressor

The TabNet Regressor receives the aggregated feature representation $h_{\mathrm{final}}$ from the SAB and processes it through several sequential steps before generating the final prediction [36]. Unlike conventional regression models that process features in a single pass, TabNet employs a sequential feature processing approach that bears some similarities to decision trees [22], but with the representational power of neural networks.

The TabNet Regressor’s internal architecture consists of multiple processing steps (also set to $T = 7$ steps in our implementation). Each step includes two key operations:

First, feature selection identifies the most informative features for the current processing step:

$$m^{(t)} = \mathrm{FeatureSelection}^{(t)}\left(d^{(t-1)}, h_{\mathrm{final}}\right) \tag{14}$$

where $m^{(t)}$ is the feature mask at step $t$, $d^{(t-1)}$ is the decision state from the previous step, and $h_{\mathrm{final}}$ is the input feature representation from the SAB. The feature selection mechanism uses an attention-based approach that is similar to, but distinct from, the SAB’s sparse attention. It determines which features to focus on in the current step based on both the input and the processing history.

Next, feature processing transforms the selected features to extract relevant information:

$$d^{(t)} = \mathrm{FeatureProcessing}^{(t)}\left(m^{(t)} \odot h_{\mathrm{final}}\right) \tag{15}$$

where $d^{(t)}$ is the decision output at step $t$. This processed information is used both as an input to the next step’s feature selection and as part of the sequential decision process.

The sequential nature of TabNet’s processing allows it to focus on different aspects of the data at different steps. Early steps may identify major discriminative features, while later steps might focus on more subtle patterns that help distinguish between similar sources.

This sequential processing can be viewed as the following decision path:

\mathrm{DecisionPath} = d^{(1)} \to d^{(2)} \to \dots \to d^{(t)}

(16)

The information from this decision path is then used to make the final prediction. The feature selection at each step depends on what was selected and learned in previous steps, creating a form of progressive feature learning.
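The per-step recurrence in Equations (14)–(16) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function `run_decision_path`, the random weights, the softmax mask (standing in for TabNet's sparse attention), and the ReLU transform are all assumptions made for illustration only.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax; a stand-in for TabNet's sparse attention."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def run_decision_path(h_final, n_steps=7, seed=0):
    """Return the masks m^(t) and decision outputs d^(t) for every step."""
    rng = random.Random(seed)
    dim = len(h_final)
    d_prev = [0.0] * dim  # d^(0): no processing history yet
    masks, decisions = [], []
    for _ in range(n_steps):
        # Hypothetical per-step weights (random here, learned in practice).
        w_select = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
        w_process = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
        # Eq. (14): the mask depends on the previous state and the input.
        scores = [a + b for a, b in zip(matvec(w_select, d_prev), h_final)]
        mask = softmax(scores)
        # Eq. (15): transform the masked input (ReLU as a toy nonlinearity).
        masked = [m * h for m, h in zip(mask, h_final)]
        d_prev = [max(0.0, z) for z in matvec(w_process, masked)]
        masks.append(mask)
        decisions.append(d_prev)
    return masks, decisions

# Eq. (16): the decision path is the ordered list d^(1) ... d^(7).
masks, decisions = run_decision_path([0.5, -1.2, 0.3, 2.0])
```

Each mask sums to one, so a step redistributes a fixed attention budget over the input features, and the mask at step t changes with d^{(t-1)}, which is the progressive behaviour described above.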

3.3.2. Loss Function and Output Mapping

The final prediction is generated by mapping the processed features to the output space through a linear layer:

\hat{y} = W_{output} \cdot d^{(t)} + b_{output}

(17)

where \hat{y} is the predicted contribution of each pollution source, W_{output} is the weight matrix of the output layer, and b_{output} is the bias term.

In the regression task, the goal of the model is to minimize the error between the predicted value \hat{y} and the true value y [43]. SCQ uses the mean squared error (MSE) as the loss function to measure the gap between the predicted and actual source contributions [46]:

L = \frac{1}{n} \sum_{i = 1}^{n} {({\hat{y}}_{i} - y_{i})}^{2}

(18)

where n is the number of samples, {\hat{y}}_{i} is the predicted value for the i-th sample, and y_i is the true value. By minimizing this loss function through backpropagation, the model iteratively updates its parameters to improve prediction accuracy.
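Equations (17) and (18) can be checked by hand on toy numbers. The sketch below is illustrative only: the helper names `linear_output` and `mse` and all numeric values are assumptions, not quantities taken from the trained model.

```python
def linear_output(d_last, weights, bias):
    """Eq. (17): y_hat = W_output . d^(t) + b_output, one row per source."""
    return [sum(w * d for w, d in zip(row, d_last)) + b
            for row, b in zip(weights, bias)]

def mse(y_pred, y_true):
    """Eq. (18): mean of the squared prediction errors over n samples."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

# Toy final decision output d^(t) and a hypothetical two-source output layer.
d_last = [1.0, 2.0]
W_output = [[0.5, 0.25], [1.0, -1.0]]
b_output = [0.0, 0.5]
y_hat = linear_output(d_last, W_output, b_output)

# Toy loss over two samples: errors of 1 and -2 give (1 + 4) / 2.
loss = mse([2.0, 4.0], [1.0, 6.0])
```

Because the errors are squared before averaging, the second sample (error of 2) contributes four times as much to the loss as the first (error of 1).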

The SCQ’s multi-step processing complements the SAB’s multi-step mechanism in a synergistic manner. While the SAB focuses on building associations between chemical features and pollution sources through parallel processing and aggregation, the SCQ refines these associations through sequential processing to generate precise contribution predictions.

This dual multi-step approach (parallel in the SAB and sequential in the SCQ) enables AirTrace-SA to handle the complex, non-linear relationships between chemical components and pollution sources with high accuracy. The SCQ translates the rich feature representations from the SAB into interpretable, quantitative predictions of source contributions, completing the end-to-end pollution source tracing pipeline.

4. Experiment

In this section, experiments are presented to verify the effectiveness of the model when applied to air pollution source analysis. First, the sources and contents of the datasets used in this experiment are introduced. Next, AirTrace-SA is compared with 1D CNN, Decision Tree, Random Forest, XGBoost, LightGBM [47], and TabNet in a multi-dimensional manner using the R² [48], MAE [49], and RMSE [50] evaluation metrics. Then, the performance of the models is evaluated more intuitively by generating corresponding scatter plots of predicted and true values for all air pollution sources. Subsequently, feature importance analysis is performed on the raw features to quantify the extent to which each feature contributes to pollution source identification, generating a clear importance ranking. Finally, a comprehensive ablation study is conducted to validate the necessity and contribution of each component within the AirTrace-SA architecture.

The experiment is conducted on a computer configured with a 12th Gen Intel(R) Core(TM) i5-12400F processor (2.5 GHz), 32,768 MB RAM, and an NVIDIA GeForce RTX 4060 Ti graphics card.

4.1. Dataset

The dataset used in this study is from [25], which contains real air pollution data from five cities in China (Lanzhou, Luoyang, Haikou, Urumqi, and Hangzhou). These data consist of 1402 ambient samples with PM_2.5 chemical composition and their corresponding pollutant contributions, collected as daily measurements from December 2013 to November 2014.

The input data comprise 17 chemical constituents of the ambient samples, as shown in Table 1: SO_{4}^{2-}, NO_{3}^{-}, Cl^{-}, Na, Mg, Al, Ca, K, Si, Fe, Mn, Ti, Cu, Zn, Pb, OC, and EC. These chemical components serve as the input features for the AirTrace-SA model. The model takes the measured concentrations of these chemical species (in μg/m³) as inputs to predict pollution source contributions.

The data collection and chemical analysis protocols for all five cities followed the methodologies detailed in [51], which included stringent quality assurance/quality control (QA/QC) measures to ensure a high degree of data reliability and consistency [52]. To visually illustrate the distinct pollution characteristics that justify the selection of these diverse cities, Figure 4 presents the average mass concentrations and relative contributions of the chemical components of PM_2.5, highlighting the significant regional variations our model is designed to handle.

To effectively utilize this rich but varied dataset in the AirTrace-SA model, it was essential to address the issue of varying scales among different input variables. Before being fed into the model, the entire set of input features from all cities was normalized using the Z-score method. This procedure, implemented via the StandardScaler class from the scikit-learn library (Python 3.8.10), transforms each feature to have a mean of zero and a standard deviation of one. This standardization is crucial as it ensures that variables with larger numerical ranges do not disproportionately influence model training compared to variables with smaller ranges. This step helps ensure that our model’s performance is evaluated on a fair and comparable basis across all variables and cities, focusing on the underlying patterns rather than the raw data magnitudes. The Z-score transformation is defined as:

x^{'} = \frac{x - μ}{σ}

(19)

where x^{'} is the normalized value, x is the original value, μ is the mean of the feature, and σ is its standard deviation.
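The transformation in Equation (19) can be reproduced column by column with the standard library, as sketched below. This is not the StandardScaler call used in the study; the function `zscore_columns` and the toy rows are assumptions. Like StandardScaler, the sketch uses the population standard deviation and leaves a constant column at zero.

```python
from statistics import fmean, pstdev

def zscore_columns(rows):
    """Standardize each column (feature) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std (ddof = 0)
    return [
        # A constant column has std 0; map it to 0.0 instead of dividing.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Two toy features on very different scales (values are illustrative).
samples = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = zscore_columns(samples)
```

After scaling, the two columns become identical even though their raw magnitudes differ by a factor of 100, which is the effect the normalization is meant to achieve.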

As shown in Table 2, the pollution sources to which the chemical composition corresponds include urban dust, coal, sea salt, motor vehicle, metallurgical dust, secondary nitrate, secondary sulfuric, SOC, construction dust, and other. These 10 pollution source contributions constitute the target variables that the AirTrace-SA model predicts. For each PM_2.5 sample, the model outputs the percentage contribution (0–100%) of each pollution source, with all contributions constrained to sum to 100%.

In environmental science, the source apportionment of air pollution is inherently based on model estimations rather than direct measurements, as a direct quantitative measurement of the percentage contribution from each emission source to ambient PM_2.5 at receptor sites is not achievable with current analytical methods. Therefore, receptor models such as chemical mass balance (CMB) are widely accepted as the standard methodology for quantifying source contributions based on chemical composition analysis [53,54].

The source contributions in our dataset were calculated using the CMB model and CMB-Iteration method as described in [51]. Specifically, the CMB model employs the mass balance equation:

D_{(n \times l)} = G_{(n \times m)} * S_{(m \times l)}

(20)

where D_{(n \times l)} represents the concentration matrix of chemical species at the receptor (μg/m³), G_{(n \times m)} denotes the source profile matrix (μg/μg), and S_{(m \times l)} indicates the source contribution matrix (μg/m³), with n, m, and l representing the numbers of chemical species measured, pollution sources identified, and ambient samples collected, respectively.
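The dimensions in Equation (20) are easiest to see in the forward direction. The sketch below multiplies a hypothetical source profile matrix G (n = 3 species, m = 2 sources) by a hypothetical contribution matrix S (l = 2 samples) to obtain receptor concentrations D. All numbers are invented for illustration. An actual CMB analysis solves the inverse problem, estimating S from measured D and known G, which this sketch does not attempt.

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x l) matrix given as nested lists."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# G: mass fraction of each species in each source profile (ug/ug).
G = [[0.2, 0.0],
     [0.1, 0.5],
     [0.0, 0.3]]
# S: contribution of each source to each ambient sample (ug/m^3).
S = [[10.0, 20.0],
     [4.0, 2.0]]
# D: species concentrations at the receptor, shape n x l (ug/m^3).
D = matmul(G, S)
```

The second species receives mass from both sources in each sample, which is why its concentration alone cannot identify either source and several species must be balanced jointly.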

Subsequently, the contribution from SOC (secondary organic carbon) is determined using the CMB-Iteration approach [55]. Since SOC lacks a direct source profile for the CMB model, the mass balance equation between receptors and sources can be expressed as:

D_{(n \times l)} - SOC_{(n \times l)} = G_{(n \times m)} * S_{(m \times l)}

(21)

where D_{(n \times l)} denotes the original receptor matrix, SOC_{(n \times l)} represents the SOC component in the receptor, G_{(n \times m)} is the source fingerprint matrix, and S_{(m \times l)} indicates the source contribution matrix. The expression D - SOC corresponds to the primary OC concentration, which can be represented by a modified receptor matrix D_{(n \times l)}^{*} and formulated as:

D_{(n \times l)}^{*} = G_{(n \times m)} * S_{(m \times l)}

(22)

When we denote T_{org} (μg/m³) as the total organic carbon and substitute D_{(n \times l)}^{*}, the estimated primary organic carbon POC^{*} (μg/m³) is given by:

POC^{*} = T_{org} - SOC

(23)

The SOC concentration remains unknown and must be determined through the CMB-Iteration procedure. This iterative method enables the separation of primary and secondary organic carbon contributions, which is essential for comprehensive source apportionment analysis.

These methods are based on the effective variance weighted least squares solution implemented by the EPA (U.S. Environmental Protection Agency) CMB 8.2 [56,57], which is the standard approach for source apportionment studies. Their performance was validated through established metrics (i.e., R², χ², and % of PM mass apportioned), all meeting the EPA recommended targets [58].

While these contribution values are model-based estimates rather than direct measurements, they represent a widely accepted scientific approach for source apportionment. The use of CMB-derived source contributions as training data is appropriate for our study, as these values encapsulate the complex relationships between chemical compositions and source contributions that our AirTrace-SA model aims to learn.

The dataset used in this study includes five cities distributed across the southeastern coast to the northwestern interior, encompassing Lanzhou in Gansu Province, northwestern China, a typical inland industrial city characterized by significant winter heating demands and coal combustion patterns; Luoyang in Henan Province, central China, known for its heavy industrial base, particularly in metallurgical sectors; Haikou in Hainan Province, southern China, a coastal city with minimal industrial activity but notable marine aerosol influences; Urumqi in the Xinjiang Uygur Autonomous Region, northwestern China, situated near desert regions and subject to frequent dust storms; and Hangzhou in Zhejiang Province, eastern China, representing a rapidly developing metropolitan area with complex mixed pollution sources typical of modern urban environments.

To illustrate the distinct pollution characteristics across these cities, Figure 5 presents the categorical distribution of pollution sources. The ten individual sources were grouped into five major categories: natural sources (urban dust and sea salt), combustion sources (coal and motor vehicle), industrial sources (metallurgical dust and construction dust), secondary pollutants (secondary sulfuric, secondary nitrate, and SOC), and other unclassified sources. The distribution reveals clear city-specific patterns: Lanzhou shows balanced contributions across categories with relatively high secondary pollutants (28.0%); Luoyang exhibits the highest secondary pollutant proportion (44.9%) reflecting its industrial chemistry; Haikou demonstrates the highest natural source contribution (25.5%) due to marine influence and the highest “other” category (26.7%); Urumqi presents the highest combustion source proportion (31.5%) associated with an extreme continental climate; while Hangzhou shows the highest secondary pollutants after Luoyang (39.1%) but the lowest “other” category (5.2%), indicating well-characterized urban pollution. These diverse pollution profiles ensure that our model is tested against a comprehensive range of air quality scenarios, from marine-influenced to combustion-dominated and industrially complex environments.

To evaluate the generalization ability of the model, reduce the risk of overfitting, and improve the robustness of the performance estimates, we use K-fold cross-validation on the dataset. The underlying principle is to randomly divide the dataset into K equal-sized subsets; each time, K - 1 subsets are used as the training set and the remaining subset as the independent test set, and the procedure is repeated K times so that each subset is used as the test set exactly once [59]. The final model performance is calculated by averaging the results of the K evaluations:

Performance_{avg} = \frac{1}{K} \sum_{i = 1}^{K} Performance_{i}

(24)

where Performance_{i} is the performance index of the i-th fold. In this study, K = 10 was chosen because it provides sufficient training data (about 90% of the samples) and test data (about 10%) in each fold, while maintaining computational efficiency and ensuring the reliability of the results, especially in the case of a limited sample size [60].
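The splitting scheme and the averaging in Equation (24) can be sketched with the standard library alone. The generator `k_fold_indices`, the seed, and the helper `average_performance` are assumptions for illustration; the study's actual splitting code is not given in the text.

```python
import random

def k_fold_indices(n_samples, k=10, seed=42):
    """Yield (train, test) index lists so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def average_performance(scores):
    """Eq. (24): mean of the K per-fold performance values."""
    return sum(scores) / len(scores)

# 1402 samples, as in the dataset described above.
splits = list(k_fold_indices(1402, k=10))
fold_sizes = sorted(len(test) for _, test in splits)
```

Since 1402 is not divisible by 10, the folds hold 140 or 141 samples each, so "equal-sized" holds only approximately, and every training set contains roughly 90% of the data.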

4.2. R² Performance Evaluation

To evaluate the predictive ability of the AirTrace-SA model in air pollution source tracing, this study compares the R² results for each fold of the 10-fold cross-validation and the average R² results between this model and six other models.

The coefficient of determination, R², is a key indicator for assessing the goodness-of-fit of a regression model, which indicates the proportion of variance in the dependent variable explained by the model [48]. R² is calculated as:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - \hat{y}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(25)

where y_i denotes the true source contribution value, \hat{y}_{i} denotes the model predicted value, \bar{y} denotes the mean of the true values, and n is the sample size. The value of R² usually ranges from 0 to 1, and the closer the value is to 1, the better the predictive ability of the model.
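A direct transcription of Equation (25) makes the "usually" in the range statement concrete. The function name `r_squared` and the toy values are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """Eq. (25): one minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]
good = r_squared(y_true, [12.0, 18.0, 33.0, 39.0])      # close predictions
baseline = r_squared(y_true, [25.0, 25.0, 25.0, 25.0])  # always predict the mean
worse = r_squared(y_true, [40.0, 30.0, 20.0, 10.0])     # reversed ordering
```

Here the close predictions score 0.964, predicting the mean scores exactly 0, and the reversed predictions score -3. A model that does worse than the mean therefore yields a negative R², which is why the 0 to 1 range holds only for models at least as good as that baseline.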

As shown in Figure 6, the R² performance of the models in the 10-fold cross-validation differs markedly. AirTrace-SA’s R² values range from 0.84 to 0.94, with an average value of 0.88, which demonstrates excellent prediction accuracy and stability. LightGBM achieves the second-highest performance with a mean R² of 0.84 (0.79–0.88), though its stability is slightly lower than that of AirTrace-SA. XGBoost and Decision Tree both yield a mean R² of 0.82, with wider ranges (0.75–0.86 and 0.74–0.87, respectively), suggesting greater inter-fold fluctuations. TabNet has an average R² of 0.81 (range 0.77–0.87), which benefits from the attention mechanism; however, it is not as good as AirTrace-SA in terms of overall accuracy. Random Forest has an average R² of 0.78 (range 0.69–0.84), which is affected by the noise of the data and is not stable enough. The average R² of 1D CNN is only 0.76 (range 0.71–0.81), the weakest performance of the seven models on this non-sequential data.

The advantage of AirTrace-SA comes from its hybrid architecture that effectively captures nonlinear relationships in pollution source data. Specifically, in the subset with high data heterogeneity, its R² still reaches 0.87, while the scores of the Random Forest and XGBoost models are lower than 0.75, emphasizing its robustness. AirTrace-SA’s performance curves consistently maintain high positions with minimal fluctuation across all folds, confirming its lead in both accuracy and stability. This consistent performance across different data partitions demonstrates the model’s effective adaptation to data diversity and complexity, establishing a solid foundation for subsequent analysis.

4.3. Evaluation of Prediction Error

To further evaluate the prediction accuracy and generalization ability of the AirTrace-SA model, we conducted a cross-sectional comparison of prediction errors for the seven models across the five cities. From this cross-city perspective, the prediction ability of the models is evaluated using the mean absolute error (MAE) over the 10 pollution sources within each city; the lower the MAE value, the more accurate the model prediction.

The mean absolute error (MAE) measures the average absolute difference between the predicted and true values [49] and is calculated as:

MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - \hat{y}_{i}|

(26)

The MAE assigns the same weight to all errors. In pollution source analysis, the MAE directly reflects the prediction bias and helps to evaluate the prediction stability of the model among different pollution sources and different cities.
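Equation (26) translates into a few lines of Python. The function name `mae` and the toy vectors are assumptions. The second part of the sketch takes the five city-level MAE values that this section reports for AirTrace-SA and forms their unweighted mean; whether the reported overall average was computed this way is not stated in the text, so that step is also an assumption.

```python
def mae(y_true, y_pred):
    """Eq. (26): mean of the absolute prediction errors."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: absolute errors of 1, 2, and 0 average to 1.0.
toy_mae = mae([10.0, 20.0, 30.0], [11.0, 18.0, 30.0])

# City-level MAE values reported for AirTrace-SA in this section.
city_mae = {"Lanzhou": 0.61, "Luoyang": 0.62, "Haikou": 0.51,
            "Urumqi": 0.46, "Hangzhou": 0.78}
# Unweighted mean across cities (an assumed aggregation).
mean_city_mae = sum(city_mae.values()) / len(city_mae)
```

The unweighted mean comes to about 0.596, which rounds to the overall average of 0.60 quoted for AirTrace-SA.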

As shown in Figure 7, the line chart provides a clear comparison of the average MAE for seven models across five cities. When examining performance at the city level, the differences in model performance reveal unique environmental challenges across locations. In Lanzhou, AirTrace-SA leads with the lowest MAE of 0.61, significantly outperforming other models, while 1D CNN and TabNet have the highest MAE values of 1.03 and 1.01, respectively. In Luoyang, the differences in MAE are minimal, with values ranging from AirTrace-SA’s 0.62 to 1D CNN’s 0.80, suggesting more predictable pollution characteristics. Conversely, Haikou exhibits the greatest variation, with AirTrace-SA maintaining a low MAE of 0.51 while Random Forest reaches 2.89, indicating substantial performance gaps. In Urumqi, AirTrace-SA achieves an MAE of 0.46, whereas Random Forest reaches 2.32. In Hangzhou, AirTrace-SA leads with 0.78, while 1D CNN has the highest MAE of 1.80. These results show AirTrace-SA’s adaptability to diverse city-specific conditions, while other models struggle in certain locations.

Examining trends and variability provides further insight into model stability across the cities. Haikou and Urumqi display the greatest MAE variability, with Random Forest performing notably poorly at 2.89 in Haikou and 2.32 in Urumqi, potentially due to complex pollution dynamics in these locations. Luoyang, however, shows the least variability, with MAEs ranging from 0.62 to 0.80, suggesting stable prediction conditions that favor consistent performance. AirTrace-SA maintains low MAEs, ranging from 0.46 to 0.78, across all cities, demonstrating strong robustness, while Random Forest and 1D CNN exhibit significant fluctuations. Moderate performers such as TabNet, XGBoost, Decision Tree, and LightGBM show less variability but fail to match AirTrace-SA’s consistency, reinforcing its adaptability to diverse environmental challenges.

Overall, AirTrace-SA emerges as the standout model with an average MAE of 0.60 across all five cities, showcasing superior accuracy and stability for air pollution source tracing. Random Forest lags behind with the highest average MAE of 1.64, indicating substantial difficulties in maintaining prediction quality. The moderate-performing models (TabNet at 1.24, XGBoost at 1.38, Decision Tree at 1.30, and LightGBM at 1.33) occupy an intermediate range, with TabNet leading this group, while 1D CNN’s average MAE of 1.55 places it closer to Random Forest. AirTrace-SA’s consistent outperformance across varied urban contexts underscores its reliability, making it a highly effective tool for addressing the complexities of pollution source analysis. To reveal the error characteristics of AirTrace-SA for each city and pollution source more comprehensively, we plotted error distribution graphs, which show in detail the absolute error frequency distributions and MAE values of the 10 pollution sources in the five cities. These graphs help to analyze how concentrated the errors are and whether they follow a long-tailed distribution [61], as shown in Appendix A.

To observe the overall MAE performance of AirTrace-SA more intuitively, we produced a heat map [62] of the experimental results, which is illustrated in Figure 8. Its MAE is concentrated in the low value range for most of the source categories, with a predominantly lighter color distribution, indicating an overall low level of prediction error. This is especially true for the source categories of Haikou and Urumqi, where the heat map presents lighter color blocks, suggesting higher prediction accuracy in these cities. In contrast, some categories in Hangzhou (e.g., other) show darker color blocks in the heat map, reflecting a relative increase in their errors. This observation aligns with the patterns identified in the tabular results. In general, AirTrace-SA possesses excellent error control capability, which stems primarily from the source association bridge’s multi-step decision mechanism. The SAB’s iterative refinement of feature representations through sparse attention allows the model to selectively focus on the most relevant chemical markers for each pollution source. This targeted feature selection is particularly effective when dealing with the heterogeneity of pollution sources across different cities, enabling AirTrace-SA to maintain stable prediction performance even in complex urban environments with varying pollution profiles.

4.4. RMSE Performance Comparison

To evaluate the error performance of the models from a complementary angle and to exploit the sensitivity of the root mean square error (RMSE) to larger errors, this study compiled the RMSE of the seven models for the 10 pollution sources in the five cities. The results are grouped by pollution source, listing each model’s RMSE in each of the five cities together with its average RMSE.

The formula is:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{y}_{i})}^{2}}

(27)

The RMSE assigns higher penalty weights to large errors compared to MAE because the errors are squared, making RMSE particularly suitable for assessing model sensitivity to outliers [50]. In the following analysis, RMSE helps to identify significant deviations of the model for specific pollution sources and to explore inter-model differences as well as inter-city error features.
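The different treatment of large errors by RMSE and MAE can be seen on two toy error patterns with the same total absolute error. The helper names and values below are assumptions for illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, Eq. (26)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (27)."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

y_true = [10.0, 10.0, 10.0, 10.0]
uniform = [11.0, 11.0, 11.0, 11.0]   # four errors of 1
outlier = [10.0, 10.0, 10.0, 14.0]   # one error of 4, the rest exact

uniform_scores = (mae(y_true, uniform), rmse(y_true, uniform))
outlier_scores = (mae(y_true, outlier), rmse(y_true, outlier))
```

Both patterns have an MAE of 1.0, but the RMSE rises from 1.0 to 2.0 when the same total error is concentrated in a single sample. A gap between a model's RMSE and MAE for a given source is therefore a sign of occasional large misses rather than uniformly moderate errors.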

As shown in Figure 9, the line chart provides a comprehensive comparison of the average RMSE for seven models across ten pollution sources. AirTrace-SA excels in specific categories, achieving an RMSE of 0.57 for construction dust and 0.60 for sea salt, indicating its precision in handling these sources effectively within the dataset. In contrast, Random Forest struggles notably with secondary sulfuric at 3.04 and other at 3.74, suggesting vulnerability to sources with higher variability or mixed contributions. 1D CNN performs relatively well with motor vehicles at 1.64 but falters with secondary nitrate at 2.81, while TabNet shows a balanced approach with a low 1.49 for sea salt but a higher value of 2.00 for secondary sulfuric. These source-specific insights highlight how AirTrace-SA’s design may better address the predictability of certain pollution sources compared to the inconsistent performance of other models across individual categories.

The chart further illustrates the models’ sensitivity to outliers and variability across the pollution sources, a critical aspect given RMSE’s emphasis on penalizing large errors. Secondary sulfuric and other sources exhibit the highest RMSE peaks, with Random Forest reaching 3.04 and 3.74, respectively, indicating its susceptibility to significant deviations in these complex categories. AirTrace-SA maintains a robust response with values of 1.38 and 1.62, respectively, showcasing its ability to mitigate outlier impacts. Urban dust and construction dust also show moderate variability, where AirTrace-SA’s 1.54 and 0.57 outperform Random Forest’s 3.01 and 0.85, respectively. This pattern suggests that models such as TabNet (average RMSE of 1.76) and LightGBM (1.81) offer moderate resilience, but their fluctuations (e.g., TabNet’s 2.00 for secondary sulfuric) indicate less consistency than AirTrace-SA when facing outlier-heavy sources.

Finally, the analysis based on the above figure underscores AirTrace-SA’s superiority as a tool for air pollution source tracing, with an average RMSE of 1.06 across ten pollution sources reflecting its ability to handle a wide range of pollution types effectively. The pronounced errors of Random Forest, with an average RMSE of 2.21, and 1D CNN, with an average RMSE of 2.03, in challenging sources such as secondary sulfuric and other suggest these models may be less suitable for applications requiring high precision under variable conditions. Moderate performers, namely TabNet (1.76), XGBoost (1.91), Decision Tree (1.81), and LightGBM (1.81), provide viable alternatives but lack the consistent low-error profile of AirTrace-SA. This comprehensive evaluation emphasizes AirTrace-SA’s robustness and adaptability, making it well-suited for environments where accurate prediction across diverse pollution sources is crucial. Meanwhile, we verify the model’s error performance using the RMSE heat map of AirTrace-SA below, which intuitively summarizes its prediction ability in different cities and pollution sources. As shown in Figure 10, the RMSE distribution of AirTrace-SA for five cities and 10 pollution sources is shown in color shades, with the color from light yellow to dark red indicating the error from low to high, ranging from 0.28 to 2.83. The RMSE of AirTrace-SA for most pollution sources and cities falls in the light-colored range. From the city perspective, the model performs particularly well in Haikou and Urumqi, demonstrating its ability to control prediction errors there. From the perspective of pollution sources, the model performs well and is stable for sources such as sea salt, metallurgical dust, and construction dust, which show the lightest colors and error values generally lower than 0.60. Relatively high RMSE values are observed for categories such as secondary sulfuric and secondary nitrate, particularly in cities such as Luoyang, Lanzhou, and Hangzhou, as indicated by the darker color patterns. This suggests that the model’s performance can be further improved in handling pollutants with complex formation processes and sensitivity to diverse environmental factors. Overall, AirTrace-SA demonstrates superior performance compared to the other models, likely due to its hybrid architecture, which enables it to effectively capture essential data patterns.

4.5. Prediction and Truth Scatter Plot

To visually evaluate the prediction performance of the AirTrace-SA model in air pollution source analysis, Figure 11 presents scatter plots of the model’s predictions for the 10 pollution sources. Each plot shows the correspondence between the predicted contribution (y-axis in %) and the true contribution (x-axis in %); ideally, the data points should be tightly distributed around the identity line [63]. The R² value above each subplot reflects the model’s ability to account for data variability, with R² values closer to 1 indicating higher prediction accuracy. The following analysis focuses on the overall performance of the model on different pollution sources and potential prediction challenges.

The R² values of AirTrace-SA on the 10 pollution sources range from 0.796 to 0.949, indicating that the model’s predictive ability varies across pollution sources. Overall, the model performs best in predicting motor vehicles (R² = 0.949), other sources (R² = 0.945), and secondary nitrate (R² = 0.938), and the data points are closely distributed around the ideal line, which shows high consistency. The relatively stable chemical composition and contribution patterns of these sources, such as motor vehicle emissions, which are usually highly correlated with components such as NO_{x} and EC, may help the model to capture their features more accurately. In contrast, SOC has the lowest R² value (0.796), and the distribution of data points is more dispersed, with significant overestimation and underestimation. This may reflect the complexity of the SOC formation process, which involves a variety of atmospheric chemical reactions and environmental factors and increases the difficulty of prediction.

From the overall pattern of the scatter plot, the prediction accuracy of the model shows some contribution dependence. In the low contribution range (0–10%), the distribution of data points is generally more dispersed, and there are more deviations from the ideal line; for example, urban dust (R² = 0.828) and sea salt (R² = 0.855) show obvious overestimations or underestimations in the 0–5% range. As the contribution increases (10–20% and above), the data points tend to move towards the ideal line with a more concentrated distribution, and this trend is particularly noticeable for coal combustion (R² = 0.918) and secondary nitrate (R² = 0.938). This appears to suggest that high-contributing sources usually have more distinctive feature patterns for model identification, while the low-contributing regions may be affected by data noise or feature overlap.

In addition, the prediction of secondarily produced pollutants presents some challenges. Although the overall performance of secondary sulfuric (R² = 0.874) and secondary nitrate (R² = 0.938) is good, the former has more scattered points in the 10–20% range, which shows both overestimation and underestimation. This may be related to the fact that the secondary production process is affected by meteorological conditions (e.g., temperature and humidity) as well as precursor concentration. The difficulty in predicting SOC is particularly prominent, as the scatter plot shows large dispersion over the whole contribution range. The model also tends to underestimate SOC, especially in the high contribution range (8–12%), indicating its limitations in capturing the relevant features of organic compounds.

It should be noted that the distribution of true source contributions shows clustering at certain values (particularly 0% for absent sources), reflecting real-world conditions where many sources have zero or minimal contributions in specific samples. These “true values” are derived from CMB model calculations rather than direct measurements, as direct quantitative measurements of individual source contributions at receptor sites are not currently possible with the available monitoring techniques [53]. The apparent discretization results from CMB mass balance constraints and source-specific contribution patterns (e.g., consistently zero sea salt in inland cities). The vertical spread of predictions at low contribution levels represents the inherent uncertainty in distinguishing between absent and minimal sources. Despite this boundary condition challenge, the model maintains strong predictive performance for substantial contributions (>10%), which are most relevant for pollution control decisions.

It is particularly noteworthy that the other category achieves an exceptionally high $R^{2}$ value of 0.945 despite containing a variety of unclassified pollution sources and showing higher absolute errors across cities (MAE ranging from 0.42 to 1.69, RMSE from 0.61 to 2.83). This seemingly paradoxical result of high explanatory power alongside larger prediction errors occurs because the other category spans a wide contribution range (0–27%), creating a large natural variance in the data. When the total variation in the dependent variable is substantial, the model can maintain a high $R^{2}$ value even in the presence of larger absolute errors, as it effectively captures relative patterns and trends within this heterogeneous category. This indicates that AirTrace-SA explains a large portion of the variance, despite discrepancies in absolute predictions. Its multi-step decision mechanism enhances its ability to identify latent structures across diverse pollution sources, facilitating the recognition of broad relationships between chemical features and their contributions. Nonetheless, the elevated MAE and RMSE values underscore the persistent difficulty in producing accurate quantitative estimates, particularly in this complex and unclassified category. Overestimation in the high contribution range (20–27%) further reflects the challenges of modeling heterogeneous pollution sources.
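The interplay between $R^{2}$ and the absolute error metrics can be illustrated numerically. The sketch below uses the standard definitions of $R^{2}$, MAE, and RMSE and applies identical absolute errors to a wide-range and a narrow-range target. All data are hypothetical and chosen only to show the effect; `regression_metrics` is an illustrative helper, not part of AirTrace-SA.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute R^2, MAE, and RMSE from their standard definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": sqrt(ss_res / n),
    }


# Hypothetical data: the same +/-1.5 errors on two targets of different spread.
errors = [1.5, -1.5, 1.5, -1.5, 1.5, -1.5]
wide_true = [0.0, 5.0, 10.0, 15.0, 20.0, 27.0]   # spans 0-27%
narrow_true = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]     # spans 4-9%

wide = regression_metrics(wide_true, [t + e for t, e in zip(wide_true, errors)])
narrow = regression_metrics(narrow_true, [t + e for t, e in zip(narrow_true, errors)])
```

Both cases have the same MAE and RMSE (1.5), yet $R^{2}$ is high for the wide-range target and low for the narrow-range one, because the residual sum of squares is compared against a much larger total variance.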

A joint analysis of the scatter plots for the ten pollution sources indicates that AirTrace-SA exhibits strong predictive capability in air pollution source attribution. The plots highlight the model’s effectiveness in handling high-contribution and stable-source categories, while also revealing its limitations in low-contribution ranges and complex secondary pollutants. This trend may suggest that high-concentration pollutants possess more distinguishable features, making them easier for the model to identify. Overall, the scatter plot analysis underscores both the strengths and the constraints of AirTrace-SA, affirming its potential in source apportionment while pointing to areas requiring further refinement.

4.6. Feature Importance Analysis

We analyze feature importance using the TabNet regressor in AirTrace-SA to explore the mechanism behind its performance in the air pollution source tracing task across the five cities.

By accumulating the attention weights $\alpha_{i}^{(t)}$ over the decision steps, the model is able to generate global importance scores for each feature. These scores can explain the model’s decision-making process and help us understand which features play key roles in the regression task [64]. Its formula is:

$$\mathrm{Importance}(x_{i}) = \sum_{t=1}^{N} \alpha_{i}^{(t)} \tag{28}$$

These importance scores indicate the relative contribution of each feature across all decision-making steps, providing additional interpretability to this study. Feature importance analysis not only reveals the basis for modeling decisions but also provides a scientific explanation for air pollution causes and helps us understand which chemical components play a decisive role in identifying different pollution sources.
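The accumulation in Equation (28) can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name `global_importance`, the optional normalization to unit sum, and the toy attention weights are assumptions introduced here.

```python
def global_importance(step_masks, normalize=True):
    """Sum per-step attention weights alpha_i^(t) over decision steps t = 1..N.

    `step_masks` is a list of N lists, one per decision step, each holding
    one non-negative attention weight per feature.
    """
    n_features = len(step_masks[0])
    totals = [sum(mask[i] for mask in step_masks) for i in range(n_features)]
    if normalize:
        # Assumption: rescale so the scores sum to 1 for easier comparison.
        total = sum(totals)
        totals = [value / total for value in totals]
    return totals


# Hypothetical attention weights: 3 decision steps over 4 features.
step_masks = [
    [0.6, 0.4, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.7, 0.2, 0.0, 0.1],
]
raw = global_importance(step_masks, normalize=False)
scores = global_importance(step_masks)
```

With `normalize=False` the function returns the plain sum of Equation (28); the normalized variant only rescales it and does not change the ranking of features.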

As shown in Figure 12, $\mathrm{SO_4^{2-}}$ dominates the model with the highest importance value of 0.117, far exceeding the other indicators; $\mathrm{NO_3^-}$ ranks second with an importance value of 0.083; and $\mathrm{Cl^-}$ ranks third with an importance value of 0.073. These three indicators together constitute the dominant factors in the model decision. Na (0.064), Mg (0.059), Al (0.054), Ca (0.053), and K (0.051) form the second group, with importance values ranging from 0.050 to 0.065, which have a significant impact on the model prediction. The remaining elements between Si (0.050) and EC (0.049) form the base tier of feature importance with closer importance values, indicating that the model gave similar, although still not negligible, attention to these features.
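As a quick arithmetic check of these tiers, the reported values can be summed per group. The sketch below uses only the numbers quoted in the text. The assumption that the scores are normalized to sum to 1 over the 17 input species is ours and is not stated in the text, so the derived base-tier average is indicative only.

```python
# Importance values as quoted in the text (Figure 12).
reported = {
    "SO4^2-": 0.117, "NO3^-": 0.083, "Cl^-": 0.073,
    "Na": 0.064, "Mg": 0.059, "Al": 0.054, "Ca": 0.053, "K": 0.051,
}
N_FEATURES = 17  # number of chemical species used as model inputs

top_three = sum(sorted(reported.values(), reverse=True)[:3])
second_group = sum(reported[name] for name in ("Na", "Mg", "Al", "Ca", "K"))

# Assumption: importance scores are normalized to sum to 1 over all features.
remaining_share = 1.0 - top_three - second_group
remaining_mean = remaining_share / (N_FEATURES - len(reported))
```

Under that assumption, the top three species carry about 0.273 of the total importance, the second group about 0.281, and the nine remaining features average roughly 0.050 each, which is in line with the values quoted for Si and EC.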

The importance of $\mathrm{SO_4^{2-}}$ and $\mathrm{NO_3^-}$, the main secondary inorganic aerosol (SIA) components, is much higher than that of the other indicators, which shows the central position of secondary pollution processes in the analysis of air pollution sources. $\mathrm{SO_4^{2-}}$ mainly comes from the oxidation of $\mathrm{SO_2}$, which is an important marker of coal combustion and industrial emissions, while $\mathrm{NO_3^-}$ mainly comes from the oxidation of $\mathrm{NO_x}$, which is closely related to motor vehicle emissions and combustion processes. The high importance of these two indicators is consistent with the performance of AirTrace-SA in the secondary sulfate ($R^{2}$ = 0.874) and secondary nitrate ($R^{2}$ = 0.938) predictions. From a physical consistency perspective, these importance rankings align with atmospheric chemistry principles. The $\mathrm{SO_4^{2-}}$ dominance (0.117) reflects the oxidation pathway $\mathrm{SO_2} + \mathrm{OH} \rightarrow \mathrm{HSO_3} \rightarrow \mathrm{SO_4^{2-}}$, initiated by the hydroxyl radical, a fundamental process in urban atmospheres. Similarly, the high importance of $\mathrm{NO_3^-}$ (0.083) corresponds to the $\mathrm{NO_x}$-to-nitrate conversion through both gas-phase ($\mathrm{NO_2} + \mathrm{OH}$) and heterogeneous reactions [65].

$\mathrm{Cl^-}$ and Na, as typical markers of sea salt, rank third and fourth, respectively, indicating that the model correctly identifies the important impact of marine sources on air quality [66]. The coupled importance of these two elements ($\mathrm{Cl^-}$: 0.073, Na: 0.064) reflects their co-occurrence in marine aerosols, with their similar importance values demonstrating the model’s ability to recognize physically related species. This chemical association, learned without explicit constraints, validates the physical consistency of our approach. The high significance of these two markers explains the good performance of the model in sea salt ($R^{2}$ = 0.855) predictions, especially in data from coastal cities such as Haikou.

Moderate importance is assigned to elements with high crustal abundance such as Mg, Al, Ca and K, which are commonly associated with urban dust, construction dust and wind-sand dust sources. Their moderate importance reflects the balanced performance of the model in urban dust ($R^{2}$ = 0.828) and construction dust ($R^{2}$ = 0.861) predictions.

Although of relatively low individual importance, heavy metal elements such as Fe, Mn, Ti, Cu, Zn and Pb, as well as OC and EC, together provide the necessary information for the identification of sources such as industrial emissions, motor vehicle emissions and biomass combustion. This ability to integrate collective information explains the model’s excellent performance in the prediction of complex source categories such as motor vehicle ($R^{2}$ = 0.949). Meanwhile, the relatively low significance of OC and EC partly explains the relative weakness of the model in SOC prediction. It can be inferred that the current feature ensemble lacks sufficient organics-related data, which likely restricts the model’s ability to accurately represent this complex secondary process.

The clear hierarchical structure of feature importance distribution demonstrates AirTrace-SA’s effectiveness in distinguishing the diagnostic value of different features. This ability stems from the combined effect of the HFE extracting multi-scale patterns and the sparse attention mechanism in SAB prioritizing the most relevant chemical components for each pollution source. Additionally, the multi-step decision process integrates information across different processing stages, further enhancing the model’s ability to identify key chemical markers. These insights deepen our understanding of AirTrace-SA’s internal mechanisms and provide a scientific foundation for future environmental monitoring and model optimization.
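To make the idea of sparse attention concrete, the following sketch implements sparsemax, the sparsifying normalization used in TabNet. Whether SAB uses exactly this function is an assumption on our part; the sketch only illustrates how such a mechanism can assign exactly zero weight to less relevant features. The input scores are hypothetical.

```python
def sparsemax(z):
    """Project a score vector onto the probability simplex (sparsemax).

    Unlike softmax, the result can contain exact zeros, so low-scoring
    features receive no attention at all.
    """
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, z_j in enumerate(z_sorted, start=1):
        cumsum += z_j
        # The support condition holds for the first k sorted entries only;
        # the last j that satisfies it determines the threshold tau.
        if 1.0 + j * z_j > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(value - tau, 0.0) for value in z]


# Hypothetical attention scores for three features.
weights = sparsemax([1.0, 0.8, 0.1])
```

Here the third feature, whose score is clearly lower than the others, receives a weight of exactly zero while the remaining weights still sum to one.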

4.7. Ablation Study and Component Analysis

To validate the necessity of each component in AirTrace-SA and understand their individual contributions to source apportionment performance, we conducted a systematic ablation study. This analysis provides insights into how different architectural choices impact the model’s ability to capture complex pollution formation mechanisms.

As shown in Table 3, we evaluated four model variants through 10-fold cross-validation: (1) full AirTrace-SA with all components intact, (2) AirTrace-SA without the hierarchical feature extractor (w/o HFE), where raw chemical concentration data (17 species per sample) directly enter subsequent modules, (3) AirTrace-SA without the source association bridge (w/o SAB), bypassing the multi-step attention mechanism, and (4) Simple SCQ, replacing the TabNet regressor with linear regression to assess the importance of non-linear modeling.
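The evaluation protocol for each variant can be sketched as a generic k-fold loop that reports the mean and standard deviation of a metric across folds. This is a minimal, dependency-free sketch. The helper names (`k_fold_indices`, `cross_validate`), the placeholder mean-predictor model, and the toy data are assumptions for illustration and stand in for the actual AirTrace-SA variants.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, metric, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, return (mean, std)."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        scores.append(metric([y[i] for i in test_idx], preds))
    return mean(scores), stdev(scores)


def mae(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Placeholder model (hypothetical): always predicts the training mean.
def fit_mean(X_train, y_train):
    return mean(y_train)


def predict_mean(model, X_test):
    return [model] * len(X_test)


# Toy data only; each ablation variant would be plugged in via fit/predict.
X = [[float(i)] for i in range(50)]
y = [float(i % 5) for i in range(50)]
cv_mean, cv_std = cross_validate(X, y, fit_mean, predict_mean, mae)
```

Passing a different `fit`/`predict` pair per variant keeps the folds identical across variants (same `seed`), so differences in the reported mean and spread reflect the architecture rather than the data split.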

Removing the HFE module results in the $R^{2}$ dropping from 0.887 to 0.826 and the MAE increasing from 0.591% to 0.862%. This performance degradation demonstrates that learning hierarchical representations of chemical components is crucial for accurate source apportionment. The HFE captures multi-scale chemical relationships that reflect atmospheric transformation processes, for instance the correlation between primary emissions ($\mathrm{SO_2}$, $\mathrm{NO_x}$) and their secondary products ($\mathrm{SO_4^{2-}}$, $\mathrm{NO_3^-}$). Without this component, the model cannot effectively learn these transformation pathways, leading to reduced predictive accuracy, particularly for secondary pollutants.

The absence of SAB causes similar performance deterioration, with $R^{2}$ dropping to 0.828 and the RMSE increasing from 1.167% to 1.470%. Interestingly, the MAE for this variant (0.789%) is slightly lower than w/o HFE, suggesting that, while SAB is critical for capturing complex source patterns, it may introduce some prediction variance. The SAB’s multi-step attention mechanism enables the progressive refinement of source-receptor relationships, learning that Na-$\mathrm{Cl^-}$ combinations indicate marine sources while Al-Si-Ca-Mg clusters represent crustal emissions. This iterative process mirrors how atmospheric scientists identify pollution sources through the systematic analysis of chemical markers.

The most dramatic performance collapse occurs with Simple SCQ, where $R^{2}$ drops from 0.887 to 0.504 and the MAE increases from 0.591% to 1.776%. The RMSE of 2.592 ± 0.174% indicates severe prediction errors across all pollution sources. This stark contrast highlights that linear models cannot capture the non-linear interactions between chemical species and their sources. Complex air pollution processes (such as photochemical reactions, meteorological influences, and source mixing) require sophisticated modeling approaches. TabNet’s ability to perform instance-wise feature selection and multi-step decisions proves essential for maintaining both accuracy and physical constraints.

The ablation results reveal that AirTrace-SA’s architecture directly corresponds to physical processes in pollution source apportionment. The progression from raw chemical concentrations (input) through hierarchical feature learning (HFE), source pattern recognition (SAB), to quantitative apportionment (SCQ) mirrors the actual workflow of receptor modeling. Each component addresses specific challenges: HFE handles chemical transformations, SAB manages source–receptor complexity, and SCQ ensures physically plausible contributions. The relatively small performance gap between w/o HFE ($R^{2}$ = 0.826) and w/o SAB ($R^{2}$ = 0.828) compared to the large drop for Simple SCQ ($R^{2}$ = 0.504) emphasizes that, while feature extraction and association are important, the non-linear quantification mechanism is absolutely critical for accurate source apportionment.
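The size of each degradation can be made explicit with a short calculation on the $R^{2}$ values quoted above. The sketch below only performs this arithmetic; the variable names are illustrative.

```python
# R^2 values as reported in the ablation study.
FULL_R2 = 0.887
variant_r2 = {
    "w/o HFE": 0.826,
    "w/o SAB": 0.828,
    "Simple SCQ": 0.504,
}

# Absolute and relative R^2 drop of each variant with respect to the full model.
drops = {
    name: {
        "absolute": FULL_R2 - r2,
        "relative_pct": 100.0 * (FULL_R2 - r2) / FULL_R2,
    }
    for name, r2 in variant_r2.items()
}

for name, d in drops.items():
    print(f"{name}: -{d['absolute']:.3f} R^2 ({d['relative_pct']:.1f}% of full model)")
```

Removing HFE or SAB lowers $R^{2}$ by about 0.06 (roughly 7% of the full model's value), whereas replacing the TabNet regressor with linear regression lowers it by 0.383 (roughly 43%).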

These findings demonstrate that AirTrace-SA achieves superior performance not through unnecessary complexity but through principled design, where each component serves an essential, physically grounded purpose in the source apportionment process.

5. Discussion

5.1. Model Limitations and Future Directions

Despite AirTrace-SA’s strong performance, several technical limitations merit discussion. First, our model provides point estimates without quantifying prediction uncertainty. While our model achieves high average accuracy ($R^{2}$ = 0.88), it does not provide confidence intervals or uncertainty bounds for individual predictions. This limitation is particularly relevant for regulatory decision-making and risk assessment where confidence levels are essential for interpreting results. Second, AirTrace-SA requires the complete measurements of all 17 chemical species without robust missing data handling capabilities. This requirement may limit practical deployment at monitoring stations with partial analytical equipment or temporary instrument failures. Third, prediction accuracy decreases for low-contribution sources (<10%), as evidenced by increased scatter in Figure 11. The model struggles to distinguish between true absence and minimal presence, particularly affecting sporadic sources such as sea salt in inland cities. Fourth, the heterogeneous “other” category presents significant challenges, with the MAE varying from 0.42 to 1.69 across cities. This category aggregates unclassified sources, measurement uncertainties, and emerging pollutants, making accurate predictions inherently difficult.

Future improvements should focus on four directions: implementing uncertainty quantification through Bayesian deep learning approaches or ensemble methods to provide confidence intervals for predictions; developing robust imputation methods for missing chemical data using techniques such as matrix factorization or deep generative models; enhancing low-contribution source detection through cost-sensitive learning or specialized architectures for imbalanced data; and exploring semi-supervised learning approaches to better characterize the “other” category by leveraging unlabeled data. These enhancements would expand AirTrace-SA’s practical applicability while maintaining its computational efficiency and providing users with richer information for decision making.

5.2. Global Applicability and Data Requirements

AirTrace-SA achieves strong performance ($R^{2}$ = 0.88) using only single-station daily measurements of 17 chemical species, demonstrating that comprehensive source apportionment is feasible with relatively modest data inputs. This accessibility is particularly valuable for regions with limited monitoring infrastructure. Although our results confirm the sufficiency of this approach for routine applications, incorporating additional data dimensions could further enhance model capabilities. Multi-site networks would enable spatial source tracking and pollution transport analysis, potentially improving predictions for regional sources. Similarly, hourly resolution data could capture diurnal patterns, particularly benefiting traffic and industrial source discrimination.

Regarding global applicability, our evaluation across five environmentally diverse Chinese cities (MAE: 0.46–0.78) suggests promising adaptability. The consistent performance from coastal Haikou to arid Urumqi indicates robustness across varying pollution regimes. However, international application requires careful consideration. Regions with emission sources not represented in our training data, such as extensive biomass burning in tropical areas or specific industrial processes in other countries, would necessitate model validation and potential retraining. Establishing the model’s applicability boundaries through international datasets remains an important future direction.

5.3. Temporal Scale Considerations

AirTrace-SA uses only chemical composition for source apportionment, excluding temporal features from the December 2013–November 2014 dataset. This design enables broad applicability to individual PM_2.5 samples without requiring continuous time-series data.

Pollution sources exhibit temporal variations across multiple scales: traffic peaks during rush hours, heating emissions increase in winter, and photochemical reactions intensify in summer. Our model captures these patterns implicitly through chemical signatures. Winter samples with elevated sulfate levels naturally yield higher coal combustion predictions, while high nitrate and SOC indicate secondary formation regardless of the collection date. This chemical-based approach identifies sources through their compositional fingerprints rather than temporal labels.

Future versions could benefit from incorporating temporal information, particularly for sources showing strong diurnal or seasonal patterns. This enhancement would complement the chemical-based approach, potentially improving prediction accuracy while maintaining the model’s core strength in pattern recognition from compositional data.

6. Conclusions

In this study, we proposed AirTrace-SA, a novel hybrid deep learning framework for air pollution source tracing that integrates three synergistic components: a hierarchical feature extractor (HFE) for multi-scale pattern recognition, a source association bridge (SAB) for establishing chemical-to-source associations through sparse attention mechanisms, and a source contribution quantifier (SCQ) based on TabNet regression for precise contribution estimation. Comprehensive experiments on datasets from five Chinese cities demonstrate AirTrace-SA’s superior performance, achieving an average $R^{2}$ of 0.88 (ranging from 0.84 to 0.94 across 10-fold cross-validation), an average mean absolute error (MAE) of 0.60 (ranging from 0.46 to 0.78 across five cities), and an average root-mean-square error (RMSE) of 1.06 (ranging from 0.51 to 1.62 across ten pollution sources). Ablation studies validate the necessity of each component, with $R^{2}$ dropping to 0.826 without HFE, 0.828 without SAB, and dramatically to 0.504 when replacing TabNet with linear regression. The model particularly excels in predicting motor vehicle ($R^{2}$ = 0.949), secondary nitrate ($R^{2}$ = 0.938), and other ($R^{2}$ = 0.945) pollution sources. The feature importance analysis reveals $\mathrm{SO_4^{2-}}$, $\mathrm{NO_3^-}$, and $\mathrm{Cl^-}$ to be the most influential components, which aligns with established knowledge of pollution source characteristics.

AirTrace-SA provides a practical tool for automated source apportionment, offering a useful balance between accuracy, interpretability, and computational efficiency. While the current implementation demonstrates strong performance using only daily chemical composition data from single monitoring stations, future enhancements incorporating temporal features, meteorological variables, and international datasets could further expand its capabilities. Overall, this study contributes an effective method for understanding pollution sources and supporting data-driven air quality management strategies.

Author Contributions

Conceptualization, W.Z.; methodology, W.Z.; software, W.Z.; validation, T.S.; formal analysis, T.S., Q.Z. and X.D.; investigation, W.Z. and T.S.; resources, W.Z.; data curation, W.Z., T.S. and X.D.; writing—original draft preparation, W.Z.; writing—review and editing, Q.Z. and X.D.; visualization, W.Z.; supervision, Q.Z. and X.D.; project administration, Q.Z. and X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Xiamen Research Project for the Natural Science Foundation of Xiamen, China (3502Z202472028), and by the Xiamen Science and Technology Plan Project (3502Z20231042).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and analyzed during the current study are not publicly available due to their proprietary nature. However, the data are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1, Figure A2, Figure A3, Figure A4 and Figure A5 show the distributions of the absolute prediction errors for the 10 sources in the samples from Lanzhou, Luoyang, Haikou, Urumqi, and Hangzhou, respectively. The X-axis is the absolute error (unit: %), the Y-axis is the frequency, and the red dotted line represents the mean absolute error (MAE).
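The quantities plotted in these figures can be computed as follows. This is a minimal sketch under assumptions: the bin width of 0.5 percentage points, the function name `abs_error_histogram`, and the toy values are illustrative and are not taken from the paper.

```python
from statistics import mean


def abs_error_histogram(y_true, y_pred, bin_width=0.5):
    """Return (bin counts keyed by left bin edge, MAE) for absolute errors."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    counts = {}
    for error in errors:
        left_edge = int(error // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items())), mean(errors)


# Toy contributions in percent (hypothetical values).
y_true = [0.0, 5.0, 10.0, 20.0]
y_pred = [0.2, 5.6, 9.0, 22.1]
histogram, mae_value = abs_error_histogram(y_true, y_pred)
```

The dictionary corresponds to the bar heights (frequency per error bin) and `mae_value` to the position of the dotted MAE line.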

Figure A1. Error distribution for Lanzhou. Lanzhou shows highly peaked error distributions near zero for most pollution sources, with all MAE values below 0.9. This central city demonstrates balanced prediction accuracy across different sources, with slightly higher errors in secondary nitrate (0.842), likely due to regional atmospheric chemistry characteristics. The concentrated distributions indicate high prediction precision, and the overall prediction stability reflects AirTrace-SA’s adaptability to diverse urban environments and pollution profiles in northwestern regions.


Figure A2. Error distribution for Luoyang. Luoyang exhibits extremely concentrated error distributions with sharp peaks, indicating exceptional prediction precision. Construction dust (0.309) and metallurgical dust (0.297) predictions are particularly accurate, while secondary sulfate (1.044) shows a relatively high error. The pronounced peak patterns suggest AirTrace-SA performs optimally in areas with well-defined pollution characteristics. The steeper distribution peaks compared to other cities may relate to data quality and distinct pollution source patterns, leveraging the model’s hierarchical feature extraction capabilities.

Figure A3. Error distribution for Haikou. Haikou presents consistently low MAE values, especially for SOC (0.262) and urban dust (0.544). As a coastal city, its pollution source composition differs significantly from inland areas, yet AirTrace-SA maintains high prediction accuracy, demonstrating the Source Association Bridge’s effectiveness in identifying region-specific pollution patterns. The error distributions concentrate near zero, indicating high model precision in this coastal environment where pollution source compositions are potentially less complex, reducing prediction interference.


Figure A4. Error distribution for Urumqi. Urumqi achieves the lowest overall average MAE (0.46) among all five cities, with exceptional performance in metallurgical dust prediction (0.174). Despite being an inland city, its sea salt prediction accuracy remains impressive (0.342), showcasing AirTrace-SA’s strong generalization capability across geographically distinct regions. The error distributions are tightly concentrated near zero, demonstrating the model’s effectiveness in accurately identifying pollution sources in this inland city, possibly due to the HFE module’s efficient regional feature extraction.


Figure A5. Error distribution for Hangzhou. Hangzhou shows excellent performance in construction dust (0.388) and metallurgical dust (0.417); however, it exhibits higher errors in the “other” category (1.691) with notable long-tail distribution characteristics indicating the presence of outlier samples. As a developed urban center with complex pollution mixtures, Hangzhou presents greater prediction challenges, particularly for unclassified sources. The higher SOC error (0.702) compared to other cities likely relates to its intensive urban and industrial activities, creating more complex pollutant mixing patterns that challenge the model’s predictive capabilities.

References

Landrigan, P.J. Air pollution and health. Lancet Public Health 2017, 2, e4–e5. [Google Scholar] [CrossRef]
Polichetti, G.; Cocco, S.; Spinali, A.; Trimarco, V.; Nunziata, A. Effects of particulate matter (PM₁₀, PM_2.5 and PM₁) on the cardiovascular system. Toxicology 2009, 261, 1–8. [Google Scholar] [CrossRef]
Bernstein, J.A.; Alexis, N.; Barnes, C.; Bernstein, I.L.; Nel, A.; Peden, D.; Diaz-Sanchez, D.; Tarlo, S.M.; Williams, P.B. Health effects of air pollution. J. Allergy Clin. Immunol. 2004, 114, 1116–1123. [Google Scholar] [CrossRef]
Bell, J.N.B.; Power, S.A.; Jarraud, N.; Agrawal, M.; Davies, C. The effects of air pollution on urban ecosystems and agriculture. Int. J. Sustain. Dev. World Ecol. 2001, 18, 226–235. [Google Scholar] [CrossRef]
Li, T.; Li, Y.; An, D.; Han, Y.; Xu, S.; Lu, Z.; Crittenden, J. Mining of the association rules between industrialization level and air quality to inform high-quality development in China. J. Environ. Manag. 2019, 246, 564–574. [Google Scholar] [CrossRef] [PubMed]
Zhang, Y.L.; Cao, F. Fine particulate matter (PM_2.5) in China at a city level. Sci. Rep. 2015, 5, 14884. [Google Scholar] [CrossRef] [PubMed]
Niu, Y.; Chen, R.; Kan, H. Air pollution, disease burden, and health economic loss in China. Ambient Air Pollut. Health Impact China 2017, 233–242. [Google Scholar]
Amnuaylojaroen, T.; Parasin, N. Pathogenesis of PM_2.5-related disorders in different age groups: Children, adults, and the elderly. Epigenomes 2024, 8, 13. [Google Scholar] [CrossRef] [PubMed]
Hung, C.C.; Hsiao, H.E.; Lin, C.C.; Hsu, H.H. Air Pollution Source Tracing Framework: Leveraging Microsensors and Wind Analysis for Pollution Source Identification. In Proceedings of the International Conference on Technologies and Applications of Artificial Intelligence, Yunlin, Taiwan, 1–2 December 2023; Springer: Singapore, 2024; pp. 142–154. [Google Scholar]
Lelieveld, J.; Evans, J.S.; Fnais, M.; Giannadaki, D.; Pozzer, A. The contribution of outdoor air pollution sources to premature mortality on a global scale. Nature 2015, 525, 367–371. [Google Scholar] [CrossRef]
Cheremisinoff, N.P. Handbook of Air Pollution Prevention and Control; Elsevier: Amsterdam, The Netherlands, 2012. [Google Scholar]
Steyn, D.G. Air pollution in coastal cities. In Air Pollution Modeling and Its Application XI; Springer: Boston, MA, USA, 1996; pp. 505–518. [Google Scholar]
Mayer, H. Air pollution in cities. Atmos. Environ. 1999, 33, 4029–4037. [Google Scholar] [CrossRef]
Gao, H.; Chen, J.; Wang, B.; Tan, S.C.; Lee, C.M.; Yao, X.; Shi, J. A study of air pollution of city clusters. Atmos. Environ. 2011, 45, 3069–3077. [Google Scholar] [CrossRef]
Blanchard, C.L. Methods for attributing ambient air pollutants to emission sources. Annu. Rev. Energy Environ. 1999, 24, 329–365. [Google Scholar] [CrossRef]
Yadav, S.; Yadav, A.; Singh, A.; Goyal, G.; Sagwan, A.; Chhikara, S.K. Application of AI-based tools in air pollution study. In Artificial Intelligence for Air Quality Monitoring and Prediction; CRC Press: Boca Raton, FL, USA, 2015; pp. 112–136. [Google Scholar]
Byun, D.; Schere, K.L. Review of the governing equations, computational algorithms, and other components of the Models-3 Community Multiscale Air Quality (CMAQ) modeling system. Appl. Mech. Rev. 2006, 59, 51–77. [Google Scholar] [CrossRef]
Zhao, Y.; Yuan, M.; Huang, X.; Chen, F.; Zhang, J. Quantification and evaluation of atmospheric ammonia emissions with different methods: A case study for the Yangtze River Delta region, China. Atmos. Chem. Phys. 2020, 20, 4275–4294. [Google Scholar] [CrossRef]
Koo, T.W.; Hong, M.S.; Moon, S.H.; Kim, H.J. Pollutant Sources Contribution Analysis of PM_2.5 using The CMB Receptor Model. J. Korean Appl. Sci. Technol. 2019, 36, 866–875. [Google Scholar]
Zhang, G.; Ding, C.; Jiang, X.; Pan, G.; Wei, X.; Sun, Y. Chemical compositions and sources contribution of atmospheric particles at a typical steel industrial urban site. Sci. Rep. 2020, 10, 7654. [Google Scholar] [CrossRef]
Liu, X.; Lu, D.; Zhang, A.; Liu, Q.; Jiang, G. Data-driven machine learning in environmental pollution: Gains and problems. Environ. Sci. Technol. 2022, 56, 2124–2133. [Google Scholar] [CrossRef]
Ying, L.U. Decision tree methods: Applications for classification and prediction. Shanghai Arch. Psychiatry 2015, 27, 130. [Google Scholar]
Choi, Y.; Kang, B.; Kim, D. Utilizing Machine Learning-based Classification Models for Tracking Air Pollution Sources: A Case Study in Korea. Aerosol Air Qual. Res. 2024, 24, 230222. [Google Scholar] [CrossRef]
Rigatti, S.J. Random forest. J. Insur. Med. 2017, 47, 31–39. [Google Scholar] [CrossRef]
Du, X.; Zeng, F.; Shi, G.; Feng, Y. Smart pollution source tracing via gradient tree boosting regression. In Proceedings of the 2019 International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Taiyuan, China, 8–10 November 2019; pp. 341–344. [Google Scholar]
Jakkula, V. Tutorial on Support Vector Machine (SVM); School of EECS, Washington State University: Pullman, WA, USA, 2006; Volume 37, p. 3. [Google Scholar]
Kaya, K.; Gündüz Öğüdücü, Ş. Deep flexible sequential (DFS) model for air pollution forecasting. Sci. Rep. 2020, 10, 3346. [Google Scholar] [CrossRef] [PubMed]
Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Li, J.; An, X.; Li, Q.; Wang, C.; Yu, H.; Zhou, X.; Geng, Y. Application of XGBoost algorithm in the optimization of pollutant concentration. Atmos. Res. 2022, 276, 106238.
Ayturan, Y.A.; Ayturan, Z.C.; Altun, H.O. Air pollution modelling with deep learning: A review. Int. J. Environ. Pollut. Environ. Model. 2018, 1, 58–62.
Kiranyaz, S.; Avci, O.; Abdeljaber, O.; Ince, T.; Gabbouj, M.; Inman, D.J. 1D convolutional neural networks and applications: A survey. Mech. Syst. Signal Process. 2021, 151, 107398.
Ragab, M.G.; Abdulkadir, S.J.; Aziz, N.; Al-Tashi, Q.; Alyousifi, Y.; Alhussian, H.; Alqushaibi, A. A novel one-dimensional CNN with exponential adaptive gradients for air pollution index prediction. Sustainability 2020, 12, 10090.
Hao, Y.; Bi, C.; Yang, L.; Qiu, X.; Li, Y.; Yu, C. Tracing-U-Net: An Attention Based U-Net model for Air Pollution Source Tracing from Sparse Dataset. In Proceedings of the 2024 International Conference on Information Technology, Data Science, and Optimization, Xiamen, China, 18–20 October 2024; pp. 83–90.
Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232.
Liu, D.R.; Lee, S.J.; Huang, Y.; Chiu, C.J. Air pollution forecasting based on attention-based LSTM neural network and ensemble learning. Expert Syst. 2020, 37, e12511.
Arik, S.Ö.; Pfister, T. Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 6679–6687.
Gavrishchaka, V.; Yang, Z.; Miao, R.; Senyukova, O. Advantages of hybrid deep learning frameworks in applications with limited data. Int. J. Mach. Learn. Comput. 2018, 8, 549–558.
Lee, Y. Source Apportionment and Spatiotemporal Analysis of PM_2.5 Using Machine Learning and Receptor Models. Doctoral Dissertation, Seoul National University, Seoul, Republic of Korea, 2023.
Ma, X.; Liu, H.; Peng, Z. Improving WRF-Chem PM_2.5 predictions by combining data assimilation and deep-learning-based bias correction. Environ. Int. 2025, 195, 109199.
Lee, Y.; Park, J.; Kim, J.; Woo, J.H.; Lee, J.H. Rapid PM_2.5-Induced Health Impact Assessment: A Novel Approach Using Conditional U-Net CMAQ Surrogate Model. Atmosphere 2024, 15, 1186.
Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Statist. 2020, 48, 1916–1921.
Martins, A.; Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; PMLR: Brooklyn, NY, USA, 2016; pp. 1614–1623.
Christoffersen, P.; Jacobs, K. The importance of the loss function in option valuation. J. Financ. Econ. 2004, 72, 291–318.
Chen, J.; Liao, K.; Fang, Y.; Chen, D.; Wu, J. Tabcaps: A capsule neural network for tabular data classification with bow routing. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
Wang, R.; Fu, B.; Fu, G.; Wang, M. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, 14 August 2017; pp. 1–7.
Marmolin, H. Subjective MSE measures. IEEE Trans. Syst. Man Cybern. 1986, 16, 486–489.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30.
Nakagawa, S.; Schielzeth, H. A general and simple method for obtaining R² from generalized linear mixed-effects models. Methods Ecol. Evol. 2013, 4, 133–142.
Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
Hodson, T.O. Root mean square error (RMSE) or mean absolute error (MAE): When to use them or not. Geosci. Model Dev. 2022, 15, 5481–5487.
Liu, B.; Li, T.; Yang, J.; Wu, J.; Wang, J.; Gao, J.; Yang, H. Source apportionment and a novel approach of estimating regional contributions to ambient PM_2.5 in Haikou, China. Environ. Pollut. 2017, 223, 334–345.
Konieczka, P. The role of and the place of method validation in the quality assurance and quality control (QA/QC) system. Crit. Rev. Anal. Chem. 2007, 37, 173–190.
Watson, J.G.; Antony Chen, L.W.; Chow, J.C.; Doraiswamy, P.; Lowenthal, D.H. Source apportionment: Findings from the US supersites program. J. Air Waste Manage. Assoc. 2008, 58, 265–288.
Chow, J.C.; Watson, J.G.; Lowenthal, D.H.; Chen, L.W.A.; Zielinska, B.; Mazzoleni, L.R.; Magliano, K.L. Evaluation of organic markers for chemical mass balance source apportionment at the Fresno Supersite. Atmos. Chem. Phys. 2007, 7, 1741–1754.
Shi, G.L.; Tian, Y.Z.; Zhang, Y.F.; Ye, W.Y.; Li, X.; Tie, X.X.; Zhu, T. Estimation of the concentrations of primary and secondary organic carbon in ambient particulate matter: Application of the CMB-Iteration method. Atmos. Environ. 2011, 45, 5692–5698.
Coulter, C.T. EPA-CMB8.2 Users Manual. Available online: http://www.epa.gov/sites/default/files/2020-10/documents/epa-cmb82manual.pdf (accessed on 1 July 2025).
Christensen, W.F.; Gunst, R.F. Measurement error models in chemical mass balance analysis of air quality data. Atmos. Environ. 2004, 38, 733–744.
Watson, J.G.; Chow, J.C.; Fujita, E. Protocol for Applying and Validating the CMB Model for PM_2.5 and VOC; US Environmental Protection Agency: Washington, DC, USA, 2004.
Anguita, D.; Ghelardoni, L.; Ghio, A.; Oneto, L.; Ridella, S. The ‘K’ in K-fold Cross Validation. In Proceedings of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 25–27 April 2012; pp. 441–446.
Fushiki, T. Estimation of prediction error by using K-fold cross-validation. Stat. Comput. 2011, 21, 137–146.
Purdom, E.; Holmes, S.P. Error distribution for gene expression data. Stat. Appl. Genet. Mol. Biol. 2005, 4, 16.
Gu, Z. Complex heatmap visualization. iMeta 2022, 1, e43.
Sainani, K.L. The value of scatter plots. PM&R 2016, 8, 1213–1217.
Zien, A.; Krämer, N.; Sonnenburg, S.; Rätsch, G. The feature importance ranking measure. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2009; pp. 694–709.
Seinfeld, J.H.; Pandis, S.N. Atmospheric Chemistry and Physics: From Air Pollution to Climate Change; John Wiley & Sons: Hoboken, NJ, USA, 2016.
O’Dowd, C.D.; De Leeuw, G. Marine aerosol production: A review of the current knowledge. Philos. Trans. R. Soc. A 2007, 365, 1753–1774.

Figure 1. Schematic diagram of the AirTrace-SA model structure. The diagram illustrates the integrated three-component architecture consisting of the hierarchical feature extractor (HFE), source association bridge (SAB), and source contribution quantifier (SCQ). This end-to-end framework processes tabular chemical concentration data through progressive feature extraction, source association mapping, and precise contribution quantification.

Figure 2. Flowchart of the sparse attention mechanism. The process generates attention scores for features and applies Sparsemax activation to create a sparse weight distribution.
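
The Sparsemax step named in the Figure 2 caption can be illustrated with a minimal NumPy sketch of the projection described by Martins and Astudillo. The function name `sparsemax` and the example score vector are illustrative assumptions, not the authors' implementation; the sketch only shows how raw attention scores become a sparse weight distribution that sums to one.

```python
import numpy as np

def sparsemax(scores):
    """Project a 1-D score vector onto the probability simplex (sparsemax).

    Illustrative sketch only; not taken from the AirTrace-SA code base.
    """
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]            # scores in descending order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    # Support size: the largest k with 1 + k * z_(k) > sum of the top-k scores
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    # Threshold that makes the clipped scores sum to one
    tau = (cumsum[k_z - 1] - 1.0) / k_z
    return np.maximum(z - tau, 0.0)

# Hypothetical attention scores for three chemical features
weights = sparsemax([2.0, 1.0, 0.1])
print(weights)
```

Unlike softmax, which assigns every feature a strictly positive weight, this projection can set low-scoring features to exactly zero, which is what makes the resulting feature mask sparse.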

Figure 3. Flowchart of the sequential processor. The processor applies non-linear transformations to selected features through shared and step-dependent layers.

Figure 4. Comparison of the average PM_2.5 chemical composition across five cities. The bars show the absolute mass concentrations (μg/m³) of major species ($SO_{4}^{2-}$, $NO_{3}^{-}$, OC, EC, $Cl^{-}$, Na) and the sum of other mineral and trace elements (others).

Figure 5. Categorical distribution of PM_2.5 pollution sources across five Chinese cities. The ten pollution sources are aggregated into five major categories: natural sources (urban dust and sea salt), combustion sources (coal and motor vehicle), industrial sources (metallurgical dust and construction dust), secondary pollutants (secondary sulfuric, secondary nitrate, and SOC), and other unclassified sources. The stacked bars represent the percentage contribution of each category to the total PM_2.5 mass concentration (%).

Figure 6. Comparison of $R^{2}$ performance across 10-fold cross-validation for the seven different models. AirTrace-SA consistently maintains the highest position.

Figure 7. Cross-city performance analysis of seven models: mean absolute error ($MAE$) comparison demonstrating AirTrace-SA’s superior consistency across diverse urban environments.

Figure 8. Heatmap of $MAE$ for AirTrace-SA across five cities and pollution sources. The heatmap uses color gradients to visualize prediction errors, with lighter colors indicating lower $MAE$ values.

Figure 9. Source-specific $RMSE$ performance comparison of seven models: evaluating predictive accuracy across ten pollution source categories, with AirTrace-SA leading in overall robustness.

Figure 10. Heatmap of $RMSE$ for AirTrace-SA across five cities and pollution sources. The color-coded visualization shows the error distribution from light yellow (low $RMSE$) to dark red (high $RMSE$), revealing pattern-specific performance variations.

Figure 11. Scatter plot of predicted vs. actual source contributions for AirTrace-SA. Each blue point represents one sample: the x-coordinate is the true contribution percentage and the y-coordinate is the model’s predicted percentage. Points closer to the identity line ($y = x$) indicate more accurate predictions.

Figure 12. Feature importance distribution of the AirTrace-SA model. The chart ranks chemical components according to their importance in the model’s decision-making process.

Table 1. Example of daily PM_2.5 chemical component concentrations from a single ambient sample in Lanzhou.

Chemical Composition	Concentration (μg/m³)
${S O}_{4}^{2 -}$	6.54
$N O_{3}^{-}$	6.18
$C l^{-}$	3.64
Na	2.23
Mg	1.13
Al	2.40
Ca	4.36
K	1.11
Si	11.6
Fe	2.03
Mn	0.05
Ti	0.07
Cu	0.02
Zn	0.13
Pb	0.09
OC	8.40
EC	3.65

Table 2. Example of daily pollution source contributions from a single ambient PM_2.5 sample in Lanzhou.

Pollution Source	Contribution (%)
Urban dust	21.40
Construction dust	3.50
Coal	8.70
Metallurgical dust	5.30
Motor vehicle	16.10
Secondary sulfuric	13.10
Secondary nitrate	7.00
SOC	7.90
Sea salt	0.00
Other	17.00
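
As a worked illustration, the Table 2 sample can be grouped into the five major categories listed in the Figure 5 caption. The sketch below is only a reading aid: the dictionary names, the category labels, and the helper `aggregate_by_category` are hypothetical, and the values are copied from Table 2.

```python
# Source contributions (%) for the single Lanzhou sample in Table 2
contributions = {
    "Urban dust": 21.40,
    "Construction dust": 3.50,
    "Coal": 8.70,
    "Metallurgical dust": 5.30,
    "Motor vehicle": 16.10,
    "Secondary sulfuric": 13.10,
    "Secondary nitrate": 7.00,
    "SOC": 7.90,
    "Sea salt": 0.00,
    "Other": 17.00,
}

# Category grouping as described in the Figure 5 caption
categories = {
    "Natural": ["Urban dust", "Sea salt"],
    "Combustion": ["Coal", "Motor vehicle"],
    "Industrial": ["Metallurgical dust", "Construction dust"],
    "Secondary": ["Secondary sulfuric", "Secondary nitrate", "SOC"],
    "Other": ["Other"],
}

def aggregate_by_category(contrib, groups):
    """Sum per-source contributions (%) within each category (hypothetical helper)."""
    return {name: round(sum(contrib[s] for s in sources), 2)
            for name, sources in groups.items()}

aggregated = aggregate_by_category(contributions, categories)
total = round(sum(contributions.values()), 2)

for name, share in aggregated.items():
    print(f"{name}: {share:.2f}%")
print(f"Total: {total:.2f}%")
```

The total also serves as a quick sanity check that the ten source contributions of a sample account for the full PM_2.5 mass.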

Table 3. Component-wise ablation results of AirTrace-SA showing the impact of removing individual modules on pollution source apportionment performance.

	$R^{2}$	$M A E$ (%)	$R M S E$ (%)
Full AirTrace-SA	0.887 ± 0.024	0.591 ± 0.083	1.167 ± 0.171
w/o HFE	0.826 ± 0.038	0.862 ± 0.145	1.477 ± 0.182
w/o SAB	0.828 ± 0.042	0.789 ± 0.130	1.470 ± 0.186
Simple SCQ	0.504 ± 0.030	1.776 ± 0.105	2.592 ± 0.174
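
One way to read Table 3 is to express each ablated variant relative to the full model. The following sketch uses only the mean values from the table (the ± spreads are ignored); the variable names and the helper `degradation` are illustrative assumptions.

```python
# Mean metrics from Table 3: (R^2, MAE %, RMSE %)
ablation = {
    "Full AirTrace-SA": (0.887, 0.591, 1.167),
    "w/o HFE": (0.826, 0.862, 1.477),
    "w/o SAB": (0.828, 0.789, 1.470),
    "Simple SCQ": (0.504, 1.776, 2.592),
}

def degradation(variant, baseline):
    """Return the absolute R^2 drop and relative MAE/RMSE increases (%) vs. the baseline."""
    r2, mae, rmse = variant
    base_r2, base_mae, base_rmse = baseline
    return {
        "r2_drop": round(base_r2 - r2, 3),
        "mae_increase_pct": round(100.0 * (mae - base_mae) / base_mae, 1),
        "rmse_increase_pct": round(100.0 * (rmse - base_rmse) / base_rmse, 1),
    }

baseline = ablation["Full AirTrace-SA"]
deltas = {name: degradation(vals, baseline)
          for name, vals in ablation.items() if name != "Full AirTrace-SA"}

for name, d in deltas.items():
    print(name, d)
```

On these means, replacing the SCQ with the simple variant costs far more $R^{2}$ than removing either the HFE or the SAB, which is the ordering Table 3 shows directly.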

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
