1. Introduction
Global industrialization and urbanization continue to advance, and, alongside economic growth, air pollution, particularly particulate matter (PM2.5 and PM10), has become a critical global challenge; it causes four million deaths annually and has severe ecological impacts [1,2,3,4]. The problem is especially severe in China’s rapidly developing regions, such as the Beijing–Tianjin–Hebei region, the Yangtze River Delta region and the Pearl River Delta region, where pollutant concentrations frequently exceed national standards, resulting in hundreds of billions of dollars in annual economic losses [5,6,7]. Studies indicate that every 10 μg/m³ increase in PM2.5 concentration correlates with a 15–20% higher incidence of heart disease and an 18% higher risk of chronic obstructive pulmonary disease (COPD) in elderly populations [8].
Air pollution source tracing is key to air pollution prevention and control. It analyzes the concentrations of chemical components in air pollutants to identify pollution sources and quantify their contributions [9]. Common pollutant sources include industrial emissions, traffic exhaust, coal combustion, urban dust, and secondary pollutants [10]. Once the contributions of pollution sources to PM2.5 and the causes of air pollution in a region are understood, targeted emission reduction strategies can be designed [11].
The challenge of pollution source tracing mainly stems from the heterogeneity and dynamics of the data. Coastal cities contend with interactions between sea salt and industrial emissions, while inland cities face dust–coal combustion complexities [12,13]. Secondary pollutants formed through atmospheric reactions require integration with meteorological modeling [14]. Traditional methods have provided valuable insights for decades, though they face challenges in handling increasingly complex pollution patterns and nonlinear source interactions in modern urban environments [15]. Numerous AI approaches (e.g., Random Forest, XGBoost, and 1D CNN) have been established as powerful benchmarks for this task, demonstrating strong predictive capabilities. However, a key challenge remains in developing a unified framework that can simultaneously handle complex, nonlinear source interactions and provide clear interpretability without sacrificing cross-regional generalization [16].
In this context, this study proposes AirTrace-SA (Air Pollution Tracing for Source Attribution), a novel hybrid deep learning approach for accurate air pollution source identification and quantification. AirTrace-SA integrates three key components: a hierarchical feature extractor (HFE) that derives multi-scale representations from chemical components, a source association bridge (SAB) that establishes chemical-to-source mappings through a multi-step decision mechanism, and a source contribution quantifier (SCQ) that precisely predicts pollution source contributions using TabNet regression capabilities. In summary, this study makes the following contributions:
Innovative fusion of models. This study introduces an innovative architecture that synergistically combines hierarchical feature extraction with multi-step decision processing, creating a powerful framework specifically designed for pollution source tracing challenges.
Improving the accuracy of pollution source tracing. The model significantly improves source tracing accuracy through its Source Association Bridge, which employs sparse attention mechanisms and sequential processors to distinguish overlapping pollution signals.
Enhancing generalization capability. K-fold cross-validation and multi-city evaluation show that the model maintains stable performance under different geographic and climatic conditions and different pollution types, addressing the generalization problem that traditional models face in cross-regional applications.
Reducing analysis costs. By learning complex feature patterns from existing data, our method lowers analysis cost and improves efficiency, reducing reliance on extensive field surveys and manual sample analysis.
The rest of the study is organized as follows: Section 2 describes the current state of research and surveys related to this study. Next, Section 3 introduces the main methods used in the AirTrace-SA model. Then, in Section 4, experiments are conducted to demonstrate the effectiveness of the model when it is applied to air pollution source tracing. Section 5 provides a comprehensive discussion of the model’s limitations, global applicability, and temporal considerations. Finally, Section 6 summarizes the study and describes future research directions.
2. Related Work
Traditional air pollution source tracing methods, including source models and receptor models, have been successfully applied worldwide and continue to provide valuable insights. While each approach has specific requirements and constraints, they form complementary tools that can be selected or combined based on application needs.
Source models such as the Community Multiscale Air Quality Model (CMAQ) provide comprehensive regional air quality simulations by integrating chemical reactions and transport processes [
17]. Zhao et al. successfully applied CMAQ in the Yangtze Delta region, accurately capturing
distributions while revealing the challenges of NH
3 estimation [
18]. CMAQ’s strength lies in its process-based approach and scenario analysis capabilities, making it invaluable for policy evaluation. While its performance depends on emission inventory quality, CMAQ remains the gold standard for understanding atmospheric processes at regional scales. Receptor models such as Chemical Mass Balance (CMB) offer direct source apportionment through mass balance calculations [
19]. Zhang et al. demonstrated CMB’s effectiveness in Tianjin, accurately identifying the contributions of coal combustion (15.2%) and dust (12.5%) to PM2.5 [20]. CMB’s mathematical transparency and regulatory acceptance make it particularly valuable when source profiles are well-established. For regions with evolving emission patterns, data-driven approaches can complement CMB by adapting to changing source characteristics.
Due to its powerful data processing abilities, machine learning has been increasingly emphasized in air pollution source tracing [
21]. Tree-based methods have gained traction in source apportionment. Choi et al. applied Decision Trees in Seoul, achieving
values of 0.65–0.73 with excellent interpretability for preliminary analysis [
22,
23]. Random Forest extends this approach through ensemble learning, with Du et al.’s SPST model reducing prediction errors below 2% [
24,
25]. These methods excel at capturing nonlinear relationships and handling missing data, although their performance can vary across different urban environments. AirTrace-SA offers a complementary approach by combining hierarchical feature extraction with attention-enhanced source association, addressing cross-regional generalization challenges differently. Advanced machine learning algorithms offer improved accuracy. The Support Vector Machine (SVM) maximizes classification boundaries, with Kaya et al. achieving
values of 0.85–0.91 for PM
10 prediction in Punjab [
26,
27]. XGBoost’s gradient boosting approach provided superior results in Beijing compared to traditional statistics [
28,
29]. Both methods effectively handle nonlinear patterns, though they require careful parameter tuning. AirTrace-SA complements these approaches by automating feature extraction through deep learning while maintaining comparable computational efficiency.
Deep learning approaches have transformed pollution analysis by automatically discovering complex patterns in environmental data [
30]. The one-dimensional convolutional neural network (1D CNN) efficiently processes temporal sequences, with Ragab et al. achieving a
of 2.036 for air pollution prediction through local feature extraction [
31,
32]. While its computational requirements are higher than those of traditional methods, this investment often yields superior accuracy for time-series analysis. Hao et al. proposed a Tracing-U-Net model that combines the CNN and U-Net architectures to capture the spatial features of pollutants through convolutional operations; its average prediction error on sparse datasets is less than 8%, demonstrating the data efficiency potential of deep learning [33]. The model particularly excels in capturing location-specific pollution patterns, though careful domain adaptation may be needed for cross-regional applications. Attention-based LSTM networks developed by Liu et al. demonstrate superior PM2.5 prediction by focusing on relevant features [34,35], although such models still have high data requirements, and their robustness may suffer when data are scarce. TabNet, proposed by Arik et al., represents a particularly promising development for pollution source analysis [
36]. It employs sequential attention mechanisms for instance-wise feature selection, making it inherently interpretable—a crucial requirement for environmental applications. TabNet’s multi-step decision architecture progressively refines predictions while providing feature importance rankings, allowing researchers to understand which chemical components drive source identification. Its success on various tabular datasets demonstrates strong potential for environmental data. However, when applied to highly overlapping pollution signatures, single-architecture approaches may face limitations in disambiguating complex source mixtures. Research demonstrates that hybrid models combining multiple architectures often outperform standalone approaches [
37]. AirTrace-SA leverages this principle by integrating hierarchical feature extraction, attention-based source association, and TabNet regression within a unified framework, addressing the limitations of individual methods while maintaining interpretability.
Recent advances have explored hybrid approaches, combining traditional models with machine learning techniques. Lee integrated machine learning with Bayesian Spatial Multivariate Receptor Modeling (BSMRM), enabling the spatial prediction of PM
2.5 sources at unmonitored locations while maintaining the physical interpretability of receptor models [
38]. Ma et al. combined WRF-Chem with deep learning bias correction, achieving
reductions of 38.90–48.86% across multiple urban agglomerations, with the hybrid approach significantly outperforming individual methods [
39]. Lee et al. developed a conditional surrogate, achieving
> 0.95 for PM
2.5 concentration prediction, but this high accuracy is for total PM
2.5 mass rather than source-specific contributions [
40]. These hybrid methods excel in their specific domains: BSMRM provides spatial coverage but requires extensive multi-site data; WRF-Chem hybrids improve concentration forecasts but need significant computational resources even with deep learning acceleration; CMAQ surrogates achieve rapid concentration predictions but do not perform source apportionment. AirTrace-SA addresses the distinct challenge of directly quantifying contributions from multiple individual pollution sources without requiring multi-site monitoring networks or computationally intensive atmospheric simulations, making it particularly suitable for regions with limited monitoring infrastructure or computational resources.
Building upon these insights, we introduce AirTrace-SA (Air Pollution Tracing for Source Attribution), a specialized hybrid architecture for pollution source tracing. The model features an HFE module for extracting multi-scale chemical patterns, an SAB component that employs iterative decision steps to map features to sources, and an SCQ unit that utilizes advanced tabular learning techniques for precise contribution estimation. Unlike traditional methods, AirTrace-SA does not need to rely on cumbersome emission lists or predefined source fingerprints. Instead, it automatically extracts complex feature patterns from existing data, which significantly reduces the cost of manual data collection and improves adaptability. Compared with machine learning approaches that have difficulties in capturing deep nonlinear interactions or maintaining generalization across heterogeneous datasets, AirTrace-SA models the complex relationships among pollution sources more effectively through a multi-step decision-making mechanism. In addition, compared to standalone deep learning models, which may have difficulty in fully extracting features when dealing with noisy or highly overlapping pollutant data, the hybrid architecture of AirTrace-SA enhances the deep feature representation capability and improves robustness.
Overall, AirTrace-SA offers advantages in accuracy, generalizability, and interpretability, providing an innovative and efficient solution in the field of air pollution source tracing.
3. Method
This section presents a detailed explanation of the AirTrace-SA model for air pollution source tracing. AirTrace-SA integrates three key components in an end-to-end framework to transform chemical concentration data into precise pollution source contribution predictions.
Figure 1 shows the schematic diagram of the proposed method. First, the input chemical concentration data in tabular form are fed into the hierarchical feature extractor (HFE). The HFE processes these data through a series of shared layers and then step-dependent layers, each using a ReLU activation function [41]. These layers progressively extract multi-scale features from raw chemical data, capturing both general patterns and specific chemical markers relevant to pollution source identification. The output layer of HFE delivers a comprehensive feature representation that encodes the complex relationships between chemical components.
This feature representation is then passed to the source association bridge (SAB), which forms the core of AirTrace-SA’s innovative approach. The SAB employs a multi-step decision mechanism that iterates for
steps to establish robust associations between chemical features and pollution sources. In each step, the process begins with a sparse attention mechanism that selectively focuses on the most relevant features for a particular source [
42], followed by a sequential processor that enhances these selected features. This sequential processing allows the model to progressively refine its understanding of complex pollution signatures. The outputs from all steps are then combined through an elementwise sum operation, creating an output aggregation that captures multi-faceted relationships between chemical components and pollution sources.
Finally, the output aggregation is fed into the source contribution quantifier (SCQ), which employs a TabNet Regressor with its own multi-step processing capability [36]. Within the SCQ, each step involves feature selection to identify the most relevant aspects of the aggregated features, followed by feature processing to transform these selections into predictive insights. After multiple processing steps, the final output is compared with the target contributions through the MSE loss function, which is minimized during training [43]. The SCQ ultimately generates quantitative predictions of the contribution of each pollution source.
This cascaded architecture enables AirTrace-SA to perform complex feature extraction, association building, and contribution quantification in a unified framework. The following subsections will elaborate on each component’s internal structure and functional mechanisms.
4. Experiment
In this section, relevant research experiments are presented to verify the effectiveness of this model when applied to air pollution source analysis. First, the sources and contents of the datasets used in this experiment are introduced. Next, the research model AirTrace-SA is compared with 1D CNN, Decision Tree, Random Forest, XGBoost, LightGBM [47] and TabNet in a multi-dimensional manner using three evaluation methods ([48], MAE [49], and RMSE [50]). Then, the performance of the models is evaluated more intuitively by generating corresponding scatter plots of predicted and true values for all air pollution sources. Subsequently, feature importance analysis is performed on the raw features to quantify the extent to which each feature contributes to pollution source identification, generating a clear importance ranking. Finally, a comprehensive ablation study is conducted to validate the necessity and contribution of each component within the AirTrace-SA architecture.
The experiment is conducted on a computer configured with a 12th Gen Intel(R) Core(TM) i5-12400F processor (2.5 GHz), 32,768 MB RAM, and an NVIDIA GeForce RTX 4060 Ti graphics card.
4.1. Dataset
The dataset used in this study is from [25], which contains real air pollution data from five cities in China (Lanzhou, Luoyang, Haikou, Urumqi, and Hangzhou). These data consist of 1402 ambient samples with PM2.5 chemical composition and their corresponding pollution source contributions, collected as daily measurements from December 2013 to November 2014.
The data characteristics include 17 chemical constituents of the environmental samples as shown in
Table 1. They are
,
,
, Na, Mg, Al, Ca, K, Si, Fe, Mn, Ti, Cu, Zn, Pb, OC, and EC. These chemical components serve as the input features for the AirTrace-SA model. The model takes the measured concentrations of these chemical species (in μg/m
3) as inputs to predict pollution source contributions.
The data collection and chemical analysis protocols for all five cities followed the methodologies detailed in [51], which included stringent quality assurance/quality control (QA/QC) measures to ensure a high degree of data reliability and consistency [52]. To visually illustrate the distinct pollution characteristics that justify the selection of these diverse cities, Figure 4 presents the average mass concentrations and relative contributions of the chemical components of PM2.5, highlighting the significant regional variations our model is designed to handle.
To effectively utilize this rich but varied dataset in the AirTrace-SA model, it was essential to address the issue of varying scales among different input variables. Before being fed into the model, the entire set of input features from all cities was normalized using the Z-score method. This procedure, implemented via the StandardScaler function from the Python 3.8.10 scikit-learn library, transforms each feature to have a mean of zero and a standard deviation of one. This standardization is crucial as it ensures that variables with larger numerical ranges do not disproportionately influence model training compared to variables with smaller ranges. This step guarantees that our model’s performance is evaluated on a fair and comparable basis across all variables and cities, focusing on the underlying patterns rather than the raw data magnitudes. The Z-score transformation is defined as:
$$z = \frac{x - \mu}{\sigma}$$
where $z$ is the normalized value, $x$ is the original value, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.
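The column-wise standardization described above can be sketched with the Python standard library alone. The function name and the concentration values are hypothetical; the population standard deviation is used because that is also what scikit-learn's StandardScaler computes.

```python
from statistics import fmean, pstdev
from typing import List


def zscore_columns(rows: List[List[float]]) -> List[List[float]]:
    """Standardize each column to zero mean and unit population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column (std == 0) is mapped to 0.0 to avoid division by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical concentrations (ug/m3) of two species in three samples.
samples = [[12.0, 0.10], [18.0, 0.30], [15.0, 0.20]]
scaled = zscore_columns(samples)
```

After scaling, both columns have the same spread even though their raw magnitudes differ by two orders of magnitude, which is the effect the normalization step is meant to achieve.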
As shown in Table 2, the pollution sources to which the chemical composition corresponds include urban dust, coal, sea salt, motor vehicle, metallurgical dust, secondary nitrate, secondary sulfuric, SOC, construction dust, and other. These 10 pollution source contributions constitute the target variables that the AirTrace-SA model predicts. For each PM2.5 sample, the model outputs the percentage contribution (0–100%) of each pollution source, with all contributions constrained to sum to 100%.
In environmental science, the source apportionment of air pollution is inherently based on model estimations rather than direct measurements, as a direct quantitative measurement of the percentage contribution from each emission source to ambient PM2.5 at receptor sites is not achievable with current analytical methods. Therefore, receptor models such as chemical mass balance (CMB) are widely accepted as the standard methodology for quantifying source contributions based on chemical composition analysis [53,54].
The source contributions in our dataset were calculated using the CMB model and CMB-Iteration method as described in [51]. Specifically, the CMB model employs the mass balance equation:
$$C_{m \times n} = F_{m \times k} \, S_{k \times n}$$
where $C$ represents the concentration matrix of chemical species at the receptor (μg/m³), $F$ denotes the source profile matrix (μg/μg), and $S$ indicates the source contribution matrix (μg/m³), with $m$, $k$, and $n$ representing the quantities of chemical species measured, pollution sources identified, and ambient samples collected, respectively.
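To make the mass balance concrete, the following sketch recovers the contributions of two sources from three measured species for a single sample. It deliberately swaps in ordinary least squares for the effective-variance weighted solution used by CMB, and the source profiles, species, and function name are hypothetical illustrations rather than values from the dataset.

```python
from typing import List


def solve_cmb_two_sources(profiles: List[List[float]], receptor: List[float]) -> List[float]:
    """Ordinary least-squares estimate of two source contributions.

    profiles[i][j] is the mass fraction (ug/ug) of species i in source j;
    receptor[i] is the ambient concentration (ug/m3) of species i.
    """
    # Normal equations (F^T F) s = F^T c, solved with the closed-form 2x2 inverse.
    a11 = sum(row[0] * row[0] for row in profiles)
    a12 = sum(row[0] * row[1] for row in profiles)
    a22 = sum(row[1] * row[1] for row in profiles)
    b1 = sum(row[0] * c for row, c in zip(profiles, receptor))
    b2 = sum(row[1] * c for row, c in zip(profiles, receptor))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("source profiles are collinear")
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


# Hypothetical profiles: rows are species (Si, OC, EC), columns are sources (dust, coal).
profiles = [[0.20, 0.02], [0.05, 0.30], [0.01, 0.10]]
# Receptor concentrations generated from contributions of 10 and 5 ug/m3.
receptor = [0.20 * 10 + 0.02 * 5, 0.05 * 10 + 0.30 * 5, 0.01 * 10 + 0.10 * 5]
contributions = solve_cmb_two_sources(profiles, receptor)
```

Because the receptor values here are generated without noise, the solver returns the contributions used to build them; with real measurements, the weighting by measurement uncertainty in CMB matters.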
Subsequently, the contribution from SOC (secondary organic carbon) is determined using the CMB-Iteration approach [
55]. Since SOC lacks a direct source profile for the CMB model, the mass balance equation between receptors and sources can be expressed as:
where
denotes the original receptor matrix,
represents the
component in the receptor,
is the source fingerprint matrix, and
indicates the source contribution matrix. The expression
corresponds to the primary
concentration, which can be represented by a modified receptor matrix
and formulated as:
When we denote
(μg/m
3) as the total organic carbon and substitute
, the estimated primary organic carbon
(μg/m
3) is given by:
The concentration remains unknown and must be determined through the CMB-Iteration procedure. This iterative method enables the separation of primary and secondary organic carbon contributions, which is essential for comprehensive source apportionment analysis.
These methods are based on the effective variance weighted least squares solution implemented by the EPA (U.S. Environmental Protection Agency) CMB 8.2 [56,57], which is the standard approach for source apportionment studies. Their performance was validated through established metrics (i.e., R², χ², and % of PM mass apportioned), all meeting the EPA recommended targets [58].
While these contribution values are model-based estimates rather than direct measurements, they represent a widely accepted scientific approach for source apportionment. The use of CMB-derived source contributions as training data is appropriate for our study, as these values encapsulate the complex relationships between chemical compositions and source contributions that our AirTrace-SA model aims to learn.
The dataset used in this study includes five cities distributed across the southeastern coast to the northwestern interior, encompassing Lanzhou in Gansu Province, northwestern China, a typical inland industrial city characterized by significant winter heating demands and coal combustion patterns; Luoyang in Henan Province, central China, known for its heavy industrial base, particularly in metallurgical sectors; Haikou in Hainan Province, southern China, a coastal city with minimal industrial activity but notable marine aerosol influences; Urumqi in the Xinjiang Uygur Autonomous Region, northwestern China, situated near desert regions and subject to frequent dust storms; and Hangzhou in Zhejiang Province, eastern China, representing a rapidly developing metropolitan area with complex mixed pollution sources typical of modern urban environments.
To illustrate the distinct pollution characteristics across these cities,
Figure 5 presents the categorical distribution of pollution sources. The ten individual sources were grouped into five major categories: natural sources (urban dust and sea salt), combustion sources (coal and motor vehicle), industrial sources (metallurgical dust and construction dust), secondary pollutants (secondary sulfuric, secondary nitrate, and SOC), and other unclassified sources. The distribution reveals clear city-specific patterns: Lanzhou shows balanced contributions across categories with relatively high secondary pollutants (28.0%); Luoyang exhibits the highest secondary pollutant proportion (44.9%) reflecting its industrial chemistry; Haikou demonstrates the highest natural source contribution (25.5%) due to marine influence and the highest “other” category (26.7%); Urumqi presents the highest combustion source proportion (31.5%) associated with an extreme continental climate; while Hangzhou shows the highest secondary pollutants after Luoyang (39.1%) but the lowest “other” category (5.2%), indicating well-characterized urban pollution. These diverse pollution profiles ensure that our model is tested against a comprehensive range of air quality scenarios, from marine-influenced to combustion-dominated and industrially complex environments.
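The grouping of the ten sources into five categories can be written as a simple lookup, sketched below. The category assignments follow the text above; the sample contributions and the function name are hypothetical.

```python
from typing import Dict

# Mapping of each individual source to its category, as listed in the text.
CATEGORY_OF_SOURCE: Dict[str, str] = {
    "urban dust": "natural sources",
    "sea salt": "natural sources",
    "coal": "combustion sources",
    "motor vehicle": "combustion sources",
    "metallurgical dust": "industrial sources",
    "construction dust": "industrial sources",
    "secondary sulfuric": "secondary pollutants",
    "secondary nitrate": "secondary pollutants",
    "SOC": "secondary pollutants",
    "other": "other",
}


def group_contributions(sample: Dict[str, float]) -> Dict[str, float]:
    """Sum per-source percentage contributions into the five categories."""
    grouped: Dict[str, float] = {}
    for source, share in sample.items():
        category = CATEGORY_OF_SOURCE[source]
        grouped[category] = grouped.get(category, 0.0) + share
    return grouped


# Hypothetical percentage contributions for one sample (they sum to 100).
sample = {
    "urban dust": 12.0, "sea salt": 3.0, "coal": 15.0, "motor vehicle": 14.0,
    "metallurgical dust": 4.0, "construction dust": 6.0, "secondary sulfuric": 16.0,
    "secondary nitrate": 12.0, "SOC": 8.0, "other": 10.0,
}
grouped = group_contributions(sample)
```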
In order to effectively evaluate the generalization ability of the model, reduce the risk of overfitting, and improve the performance robustness, we use K-fold cross-validation on the dataset. The underlying principle is to randomly divide the dataset into $K$ equal-sized subsets; in each round, $K-1$ subsets are used as the training set and the remaining subset serves as the independent test set. The procedure is repeated $K$ times so that each subset is used as the test set exactly once [59]. The final model performance is calculated by averaging the results of the $K$ evaluations:
$$P = \frac{1}{K} \sum_{i=1}^{K} P_i$$
where $P_i$ is the performance index of the $i$-th fold. In this study, $K = 10$ was chosen because it provides sufficient training data (about 90% for training) and test data (about 10% for testing), while maintaining computational efficiency and ensuring the reliability of the results, especially in the case of a limited sample size [60].
4.3. Evaluation of Prediction Error
To further evaluate the prediction accuracy and generalization ability of the AirTrace-SA model, we compared the prediction errors of the seven models across the five cities. Within each city, the prediction ability of the models is evaluated using the mean absolute error (MAE) over the 10 pollution sources; the lower the value, the more accurate the model prediction.
The MAE measures the average absolute difference between the predicted and true values [49] and is calculated as:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
The MAE assigns the same weight to all errors. In pollution source analysis, the MAE can directly reflect the prediction bias and help to evaluate the prediction stability of the model among different pollution sources and different cities.
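As a small illustration of the mean absolute error, the sketch below computes it for a handful of hypothetical true and predicted source contributions (in percent); the function name and the values are assumptions for demonstration only.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical contributions (%) of three sources for one sample.
true_contributions = [20.0, 15.0, 5.0]
predicted_contributions = [21.0, 14.5, 5.5]
mae = mean_absolute_error(true_contributions, predicted_contributions)
```

The three absolute errors are 1.0, 0.5, and 0.5 percentage points, and each counts equally toward the average.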
As shown in Figure 7, the line chart provides a clear comparison of the average MAE for seven models across five cities. When examining performance at the city level, the differences in model performance reveal unique environmental challenges across locations. In Lanzhou, AirTrace-SA leads with the lowest MAE of 0.61, significantly outperforming other models, while 1D CNN and TabNet have the highest MAE values of 1.03 and 1.01, respectively. In Luoyang, the differences in MAE are minimal, with values ranging from AirTrace-SA’s 0.62 to 1D CNN’s 0.80, suggesting more predictable pollution characteristics. Conversely, Haikou exhibits the greatest variation, with AirTrace-SA maintaining a low MAE of 0.51 while Random Forest reaches 2.89, indicating substantial performance gaps. In Urumqi, AirTrace-SA achieves an MAE of 0.46, whereas Random Forest reaches 2.32. In Hangzhou, AirTrace-SA leads with 0.78, while 1D CNN has the highest MAE of 1.80. These results show AirTrace-SA’s adaptability to diverse city-specific conditions, while other models struggle in certain locations.
Examining trends and variability provides further insight into model stability across the cities. Haikou and Urumqi display the greatest variability, with Random Forest performing notably poorly at 2.89 in Haikou and 2.32 in Urumqi, potentially due to complex pollution dynamics in these locations. Luoyang, however, shows the least variability, with MAEs ranging from 0.62 to 0.80, suggesting stable prediction conditions that favor consistent performance. AirTrace-SA maintains low MAEs, between 0.46 and 0.78, across all cities, demonstrating remarkable robustness, while Random Forest and 1D CNN exhibit significant fluctuations. Moderate performers such as TabNet, XGBoost, Decision Tree, and LightGBM show less variability but fail to match AirTrace-SA’s consistency, reinforcing its adaptability to diverse environmental challenges.
Overall, AirTrace-SA emerges as the standout model with an average MAE of 0.60 across all five cities, showcasing superior accuracy and stability for air pollution source tracing. Random Forest lags behind with the highest average MAE of 1.64, indicating substantial difficulties in maintaining prediction quality. The moderate-performing models (TabNet at 1.24, XGBoost at 1.38, Decision Tree at 1.30, and LightGBM at 1.33) occupy an intermediate range, with TabNet leading this group, while 1D CNN’s average MAE of 1.55 places it closer to Random Forest. AirTrace-SA’s consistent outperformance across varied urban contexts underscores its reliability, making it a highly effective tool for addressing the complexities of pollution source analysis. To reveal the error characteristics of AirTrace-SA for each city and pollution source more comprehensively, we plotted error distribution graphs, which show in detail the absolute error frequency distributions and MAE values of the 10 pollution sources in the five cities. These graphs help to analyze how concentrated the errors are and whether there is a long-tailed distribution [61], as shown in Appendix A.
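The reported cross-city average for AirTrace-SA can be checked directly from the city-level values quoted in this section. The short sketch below only repeats that arithmetic; the variable names are hypothetical.

```python
from statistics import fmean

# City-level MAE values of AirTrace-SA as quoted in the text.
city_mae = {
    "Lanzhou": 0.61,
    "Luoyang": 0.62,
    "Haikou": 0.51,
    "Urumqi": 0.46,
    "Hangzhou": 0.78,
}

average_mae = fmean(city_mae.values())  # 2.98 / 5 = 0.596, which rounds to 0.60
lowest, highest = min(city_mae.values()), max(city_mae.values())
```

The mean of the five values rounds to the stated 0.60, and the per-city values span 0.46 to 0.78.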
To observe the overall MAE performance of AirTrace-SA more intuitively, we made a heat map [62] of the experimental results, which is illustrated in Figure 8. Its MAE is concentrated in the low value range for most of the source categories, with a predominantly lighter color distribution, indicating an overall low level of prediction error. The color blocks are especially light for the source categories of Haikou and Urumqi, which suggests higher prediction accuracy in these cities. In contrast, some categories in Hangzhou (e.g., other) show darker color blocks in the heat map, reflecting a relative increase in their errors. This observation aligns with the patterns identified in the tabular results. In general, AirTrace-SA possesses excellent error control capability, which stems primarily from the source association bridge’s multi-step decision mechanism. The SAB’s iterative refinement of feature representations through sparse attention allows the model to selectively focus on the most relevant chemical markers for each pollution source. This targeted feature selection is particularly effective when dealing with the heterogeneity of pollution sources across different cities, enabling AirTrace-SA to maintain stable prediction performance even in complex urban environments with varying pollution profiles.
4.4. RMSE Performance Comparison
In order to evaluate the error performance of the models from another angle and to exploit the sensitivity of the root mean square error (RMSE) to larger errors, this study compiled the RMSE of the seven models in the five cities for the 10 pollution sources. The results are grouped by pollution source and list the RMSE and average RMSE of each model in the five cities under each source.
The RMSE assigns higher penalty weights to large errors compared to MAE because the errors are squared, making RMSE particularly suitable for assessing model sensitivity to outliers [50]. In the following analysis, RMSE can help to identify significant deviations of the model for some specific pollution sources, exploring inter-model differences as well as inter-city error features.
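The different treatment of large errors by the two metrics can be seen in a toy comparison. The sketch below uses two hypothetical error patterns with the same mean absolute error, one spread evenly and one concentrated in a single outlier; the function names and values are illustrative assumptions.

```python
import math
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square root of the average squared difference between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [0.0, 0.0, 0.0, 0.0]
even_errors = [1.0, 1.0, 1.0, 1.0]  # four errors of size 1
one_outlier = [0.0, 0.0, 0.0, 4.0]  # a single error of size 4

mae_even = mean_absolute_error(y_true, even_errors)
mae_outlier = mean_absolute_error(y_true, one_outlier)
rmse_even = root_mean_square_error(y_true, even_errors)
rmse_outlier = root_mean_square_error(y_true, one_outlier)
```

Both patterns have the same mean absolute error, but the root mean square error of the outlier pattern is twice that of the even pattern, which is why the squared metric is informative about occasional large deviations.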
As shown in Figure 9, the line chart provides a comprehensive comparison of the average RMSE for seven models across ten pollution sources. AirTrace-SA excels in specific categories, achieving an RMSE of 0.57 for construction dust and 0.60 for sea salt, indicating its precision in handling these sources effectively within the dataset. In contrast, Random Forest struggles notably with secondary sulfuric at 3.04 and other at 3.74, suggesting vulnerability to sources with higher variability or mixed contributions. 1D CNN performs relatively well with motor vehicles at 1.64 but falters with secondary nitrate at 2.81, while TabNet shows a balanced approach with a low 1.49 for sea salt but a higher value of 2.00 for secondary sulfuric. These source-specific insights highlight how AirTrace-SA’s design may better address the predictability of certain pollution sources compared to the inconsistent performance of other models across individual categories.
The chart further illustrates the models’ sensitivity to outliers and variability across the pollution sources, a critical aspect given RMSE’s emphasis on penalizing large errors. Secondary sulfuric and other sources exhibit the highest peaks, with Random Forest reaching 3.04 and 3.74, respectively, indicating its susceptibility to significant deviations in these complex categories. AirTrace-SA maintains a robust response with values of 1.38 and 1.62, respectively, showcasing its ability to mitigate outlier impacts. Urban dust and construction dust also show moderate variability, where AirTrace-SA’s 1.54 and 0.57 outperform Random Forest’s 3.01 and 0.85, respectively. This pattern suggests that models such as TabNet (average RMSE of 1.76) and LightGBM (1.81) offer moderate resilience, but their fluctuations (e.g., TabNet’s 2.00 for secondary sulfuric) indicate less consistency than AirTrace-SA when facing outlier-heavy sources.
Finally, the analysis based on Figure 9 underscores AirTrace-SA’s superiority as a tool for air pollution source tracing, with an average RMSE of 1.06 across ten pollution sources reflecting its ability to handle a wide range of pollution types effectively. The pronounced errors of Random Forest, with an average RMSE of 2.21, and 1D CNN, with an average RMSE of 2.03, in challenging sources such as secondary sulfuric and other suggest these models may be less suitable for applications requiring high precision under variable conditions. Moderate performers, namely TabNet (1.76), XGBoost (1.91), Decision Tree (1.81), and LightGBM (1.81), provide viable alternatives but lack the consistently low-error profile of AirTrace-SA. This comprehensive evaluation emphasizes AirTrace-SA’s robustness and adaptability, making it well-suited for environments where accurate prediction across diverse pollution sources is crucial.
We further examine the model’s error performance using the RMSE heat map of AirTrace-SA, which summarizes its prediction ability across cities and pollution sources. As shown in Figure 10, the RMSE distribution of AirTrace-SA for five cities and 10 pollution sources is rendered in color shades from light yellow to dark red, indicating errors from low to high over a range of 0.28 to 2.83. The RMSE of AirTrace-SA is mainly light in color for most pollution sources and cities. From the city perspective, predictions for Haikou and Urumqi are particularly accurate, demonstrating the model’s ability to control prediction errors. From the perspective of pollution sources, the model performs well and is stable for sea salt, metallurgical dust, and construction dust, which show the lightest colors and error values generally lower than 0.60. Relatively high RMSE values are observed for categories such as secondary sulfuric and secondary nitrate, particularly in cities such as Luoyang, Lanzhou, and Hangzhou, as indicated by the darker color patterns. This suggests that the model’s performance can be further improved in handling pollutants with complex formation processes and sensitivity to diverse environmental factors. Overall, AirTrace-SA demonstrates superior performance compared to the other models, likely due to its hybrid architecture, which enables it to effectively capture essential data patterns.
4.5. Prediction and Truth Scatter Plot
To visually evaluate the prediction performance of the AirTrace-SA model in air pollution source analysis, Figure 11 presents scatter plots of the model’s predictions for the 10 pollution sources. Each plot shows the correspondence between the predicted contribution (y-axis in %) and the true contribution (x-axis in %); ideally, the data points should be tightly distributed around the identity line [63]. The R² value above each subplot reflects the model’s ability to account for data variability, with R² values closer to 1 indicating higher prediction accuracy. The following analysis focuses on the overall performance of the model on different pollution sources and potential prediction challenges.
The R² values of AirTrace-SA on the 10 pollution sources range from 0.796 to 0.949, indicating that the model’s predictive ability varies across pollution sources. Overall, the model performs best in predicting motor vehicles (R² = 0.949), other sources (R² = 0.945), and secondary nitrate (R² = 0.938), for which the data points are closely distributed around the ideal line and show high consistency. The relatively stable chemical composition and contribution patterns of these sources may help the model to capture their features more accurately; motor vehicle emissions, for example, are usually highly correlated with components such as EC. In contrast, SOC has the lowest R² value (0.796), and its data points are more dispersed, with significant overestimation and underestimation. This may reflect the complexity of the SOC formation process, which involves a variety of atmospheric chemical reactions and environmental factors and increases the difficulty of prediction.
From the overall pattern of the scatter plots, the prediction accuracy of the model depends on the contribution level. In the low contribution range (0–10%), the data points are generally more dispersed and deviate more from the ideal line; for example, urban dust (R² = 0.828) and sea salt (R² = 0.855) show obvious overestimation or underestimation in the 0–5% range. As the contribution increases (10–20% and above), the data points move towards the ideal line with a more concentrated distribution, and this trend is particularly noticeable for coal combustion (R² = 0.918) and secondary nitrate (R² = 0.938). This suggests that high-contributing sources usually have more distinctive feature patterns for the model to identify, while the low-contribution range may be affected by data noise or feature overlap.
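One way to quantify this contribution dependence is to compute the error separately for each range of true contribution. The sketch below assumes hypothetical predicted and true contributions and bin edges of 0, 10, 20, and 30%; the helper name `mae_by_bin` is introduced here for illustration and is not part of the study.

```python
def mae_by_bin(y_true, y_pred, edges):
    """Mean absolute error within each half-open bin [edges[k], edges[k+1])."""
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for t, p in zip(y_true, y_pred):
        for k in range(len(edges) - 1):
            if edges[k] <= t < edges[k + 1]:
                sums[k] += abs(t - p)
                counts[k] += 1
                break
    # Empty bins are reported as None rather than as zero error.
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical true and predicted contributions (%), for illustration only.
y_true = [1.0, 4.0, 8.0, 12.0, 18.0, 25.0]
y_pred = [3.0, 2.0, 9.0, 12.5, 18.5, 24.5]
edges = [0.0, 10.0, 20.0, 30.0]

binned = mae_by_bin(y_true, y_pred, edges)
for lo, hi, err in zip(edges, edges[1:], binned):
    print(f"{lo:.0f}-{hi:.0f}%: MAE = {err}")
```

In this toy data the 0–10% bin has a larger mean absolute error than the higher bins, mirroring the pattern described for the scatter plots; with real predictions the same binning would show whether the effect holds per source.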
In addition, predicting secondarily produced pollutants presents some challenges. Although the overall performance for secondary sulfuric (R² = 0.874) and secondary nitrate (R² = 0.938) is good, the former has more scattered points in the 10–20% range, with both overestimation and underestimation. This may be because secondary production is affected by meteorological conditions (e.g., temperature and humidity) as well as precursor concentrations. The difficulty in predicting SOC is particularly prominent: the scatter plot shows large dispersion over the whole contribution range, and the model tends to underestimate, especially in the high contribution range (8–12%), indicating its limitations in capturing the relevant features of organic compounds.
It should be noted that the distribution of true source contributions shows clustering at certain values (particularly 0% for absent sources), reflecting real-world conditions where many sources have zero or minimal contributions in specific samples. These “true values” are derived from CMB model calculations rather than direct measurements, as direct quantitative measurements of individual source contributions at receptor sites are not currently possible with the available monitoring techniques [53]. The apparent discretization results from CMB mass balance constraints and source-specific contribution patterns (e.g., consistently zero sea salt in inland cities). The vertical spread of predictions at low contribution levels represents the inherent uncertainty in distinguishing between absent and minimal sources. Despite this boundary condition challenge, the model maintains strong predictive performance for substantial contributions (>10%), which are most relevant for pollution control decisions.
It is particularly noteworthy that the other category achieves an exceptionally high R² value of 0.945 despite containing a variety of unclassified pollution sources and showing higher absolute errors across cities (MAE ranging from 0.42 to 1.69, RMSE from 0.61 to 2.83). This seemingly paradoxical result of high explanatory power alongside larger prediction errors occurs because the other category spans a wide contribution range (0–27%), creating a large natural variance in the data. When the total variation in the dependent variable is substantial, the model can maintain a high R² value even in the presence of larger absolute errors, as it effectively captures relative patterns and trends within this heterogeneous category. This indicates that AirTrace-SA explains a large portion of the variance, despite discrepancies in absolute predictions. Its multi-step decision mechanism enhances its ability to identify latent structures across diverse pollution sources, facilitating the recognition of broad relationships between chemical features and their contributions. Nonetheless, the elevated MAE and RMSE values underscore the persistent difficulty in producing accurate quantitative estimates, particularly in this complex and unclassified category. Overestimation in the high contribution range (20–27%) further reflects the challenges of modeling heterogeneous pollution sources.
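The relationship between R² and absolute error can be made concrete with the standard definition R² = 1 − SS_res/SS_tot. The sketch below applies identical residuals to a narrow and a wide range of hypothetical true contributions; the numbers are assumptions for illustration, not values from Figure 11.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error of the predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# The same residuals (in percentage points) are applied to both ranges.
residuals = [1.5, -1.5, 1.5, -1.5, 1.5]
narrow_true = [0.0, 2.0, 4.0, 6.0, 8.0]    # hypothetical narrow range
wide_true = [0.0, 7.0, 14.0, 21.0, 27.0]   # hypothetical wide range (0-27%)

narrow_pred = [t + e for t, e in zip(narrow_true, residuals)]
wide_pred = [t + e for t, e in zip(wide_true, residuals)]

print(f"narrow: RMSE={rmse(narrow_true, narrow_pred):.2f}, R2={r2(narrow_true, narrow_pred):.3f}")
print(f"wide:   RMSE={rmse(wide_true, wide_pred):.2f}, R2={r2(wide_true, wide_pred):.3f}")
```

Both cases have an RMSE of 1.5, yet R² is roughly 0.72 for the narrow range and roughly 0.98 for the wide one, because the same residual sum of squares is divided by a much larger total variance.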
A joint analysis of the scatter plots for the ten pollution sources indicates that AirTrace-SA exhibits strong predictive capability in air pollution source attribution. The plots highlight the model’s effectiveness in handling high-contribution and stable-source categories, while also revealing its limitations in low-contribution ranges and complex secondary pollutants. This trend may suggest that high-concentration pollutants possess more distinguishable features, making them easier for the model to identify. Overall, the scatter plot analysis underscores both the strengths and the constraints of AirTrace-SA, affirming its potential in source apportionment while pointing to areas requiring further refinement.
4.6. Feature Importance Analysis
We analyze feature importance using the TabNet regressor in the AirTrace-SA model to explore the intrinsic mechanism behind its strong performance in the air pollution source tracing task in five cities.
By accumulating the attention weights at each decision step, the model is able to generate global importance scores for each feature. These scores can explain the model’s decision-making process and help us understand which features play key roles in the regression task [64]. Its formula is:
These importance scores indicate the relative contribution of each feature across all decision-making steps, providing additional interpretability to this study. Feature importance analysis not only reveals the basis for modeling decisions but also provides a scientific explanation for air pollution causes and helps us understand which chemical components play a decisive role in identifying different pollution sources.
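As a rough illustration of how such scores can be accumulated, the sketch below sums per-step attention masks over all steps and samples and normalizes the result so that the scores sum to one. The mask layout, the function name `global_importance`, and the unweighted sum are assumptions made for this example; TabNet-style implementations typically also weight each step by its contribution to the output, which is omitted here.

```python
def global_importance(step_masks):
    """Accumulate attention weights over steps and samples, then normalize.

    step_masks[i][b][j] is the (assumed) attention weight that decision
    step i assigns to feature j for sample b.
    """
    n_features = len(step_masks[0][0])
    totals = [0.0] * n_features
    for mask in step_masks:            # loop over decision steps
        for row in mask:               # loop over samples
            for j, weight in enumerate(row):
                totals[j] += weight
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

# Hypothetical masks: 2 decision steps, 2 samples, 3 features.
step_masks = [
    [[0.7, 0.3, 0.0], [0.5, 0.5, 0.0]],
    [[0.6, 0.0, 0.4], [0.2, 0.0, 0.8]],
]

importance = global_importance(step_masks)
print([round(v, 3) for v in importance])
```

Because the scores are normalized, they can be read as relative shares of attention, which matches how the importance values reported for the 17 chemical species are compared with one another.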
As shown in Figure 12, SO₄²⁻ dominates the model with the highest importance value of 0.117, far exceeding the other indicators; NO₃⁻ ranks second with an importance value of 0.083; and Cl⁻ ranks third with an importance value of 0.073. These three indicators together constitute the dominant factors in the model decision. Na (0.064), Mg (0.059), Al (0.054), Ca (0.053), and K (0.051) form the second group, with importance values ranging from 0.050 to 0.065, which have a significant impact on the model prediction. The remaining features, from Si (0.050) to EC (0.049), form the base tier of feature importance with closely spaced values, indicating that the model gave similar, although still not negligible, attention to these features.
The importance of SO₄²⁻ and NO₃⁻ as the main secondary inorganic aerosol (SIA) components is much higher than that of other indicators, which shows the central position of secondary pollution processes in the analysis of air pollution sources. SO₄²⁻ mainly comes from the oxidation of SO₂, which is an important marker of coal combustion and industrial emissions, while NO₃⁻ mainly comes from the oxidation of NOₓ, which is closely related to motor vehicle emissions and combustion processes. The high importance of these two indicators is consistent with the strong performance of AirTrace-SA in the secondary sulfuric (R² = 0.874) and secondary nitrate (R² = 0.938) predictions. From a physical consistency perspective, these importance rankings agree with atmospheric chemistry principles. The SO₄²⁻ dominance (0.117) reflects the oxidation pathway SO₂ + OH → H₂SO₄ → SO₄²⁻, a fundamental process in urban atmospheres. Similarly, the high importance of NO₃⁻ (0.083) corresponds to the NOₓ-to-nitrate conversion through both gas-phase (NO₂ + OH) and heterogeneous reactions [65].
Cl⁻ and Na, as typical markers of sea salt, rank third and fourth, respectively, indicating that the model correctly identifies the important impact of marine sources on air quality [66]. The coupled importance of these two species (Cl⁻: 0.073, Na: 0.064) reflects their co-occurrence in marine aerosols, with their similar importance values demonstrating the model’s ability to recognize physically related species. This chemical association, learned without explicit constraints, supports the physical consistency of our approach. The high importance of these two markers explains the good performance of the model in sea salt (R² = 0.855) predictions, especially in data from coastal cities such as Haikou.
Moderate importance is assigned to elements with high crustal abundance such as Mg, Al, Ca and K, which are commonly associated with urban dust, construction dust and wind-sand dust sources. Their moderate importance reflects the balanced performance of the model in urban dust (R² = 0.828) and construction dust (R² = 0.861) predictions.
Although of relatively low individual importance, heavy metal elements such as Fe, Mn, Ti, Cu, Zn and Pb, as well as OC and EC, together provide the necessary information for the identification of sources such as industrial emissions, motor vehicle emissions and biomass combustion. This ability to integrate collective information explains the model’s excellent performance in the prediction of complex source categories such as motor vehicles (R² = 0.949). Meanwhile, the relatively low importance of OC and EC partly explains the relative weakness of the model in SOC prediction. It can be inferred that the current feature set lacks sufficient organics-related data, which likely restricts the model’s ability to accurately represent this complex secondary process.
The clear hierarchical structure of feature importance distribution demonstrates AirTrace-SA’s effectiveness in distinguishing the diagnostic value of different features. This ability stems from the combined effect of the HFE extracting multi-scale patterns and the sparse attention mechanism in SAB prioritizing the most relevant chemical components for each pollution source. Additionally, the multi-step decision process integrates information across different processing stages, further enhancing the model’s ability to identify key chemical markers. These insights deepen our understanding of AirTrace-SA’s internal mechanisms and provide a scientific foundation for future environmental monitoring and model optimization.
4.7. Ablation Study and Component Analysis
To validate the necessity of each component in AirTrace-SA and understand their individual contributions to source apportionment performance, we conducted a systematic ablation study. This analysis provides insights into how different architectural choices impact the model’s ability to capture complex pollution formation mechanisms.
As shown in Table 3, we evaluated four model variants through 10-fold cross-validation: (1) full AirTrace-SA with all components intact, (2) AirTrace-SA without the hierarchical feature extractor (w/o HFE), where raw chemical concentration data (17 species per sample) directly enter subsequent modules, (3) AirTrace-SA without the source association bridge (w/o SAB), bypassing the multi-step attention mechanism, and (4) Simple SCQ, replacing the TabNet regressor with linear regression to assess the importance of non-linear modeling.
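The evaluation protocol can be sketched as a generic k-fold harness that fits each variant on the training folds and reports the mean and standard deviation of a held-out metric. The two variants below (a mean predictor and a one-variable least-squares line) and the toy data are hypothetical stand-ins chosen to keep the example self-contained; they are not the AirTrace-SA variants of Table 3.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices once and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(fit, xs, ys, k=10, seed=0):
    """Return (mean, std) of the held-out MAE over k folds.

    fit(train_xs, train_ys) must return a function mapping x to a prediction.
    """
    fold_scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(ys[i] - predict(xs[i])) for i in held_out]
        fold_scores.append(sum(errors) / len(errors))
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

def fit_mean(train_xs, train_ys):
    """Stand-in reduced variant: ignores the input, predicts the training mean."""
    mean_y = sum(train_ys) / len(train_ys)
    return lambda x: mean_y

def fit_line(train_xs, train_ys):
    """Stand-in fuller variant: one-variable least-squares line."""
    mx = sum(train_xs) / len(train_xs)
    my = sum(train_ys) / len(train_ys)
    sxx = sum((x - mx) ** 2 for x in train_xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(train_xs, train_ys))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

# Hypothetical toy data, not the study's measurements.
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]

results = {name: cross_validate(fit, xs, ys, k=10)
           for name, fit in [("mean baseline", fit_mean), ("line", fit_line)]}
for name, (mean_mae, std_mae) in results.items():
    print(f"{name}: MAE = {mean_mae:.3f} ± {std_mae:.3f}")
```

Using the same seed for every variant keeps the fold assignment identical across variants, so differences in the reported mean ± standard deviation come from the model change rather than from the data split.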
Removing the HFE module results in the R² dropping from 0.887 to 0.826 and the MAE increasing from 0.591% to 0.862%. This performance degradation demonstrates that learning hierarchical representations of chemical components is crucial for accurate source apportionment. The HFE captures multi-scale chemical relationships that reflect atmospheric transformation processes, for instance the correlation between primary emissions (SO₂, NOₓ) and their secondary products (SO₄²⁻, NO₃⁻). Without this component, the model cannot effectively learn these transformation pathways, leading to reduced predictive accuracy, particularly for secondary pollutants.
The absence of SAB causes similar performance deterioration, with R² dropping to 0.828 and the RMSE increasing from 1.167% to 1.470%. Interestingly, the MAE for this variant (0.789%) is slightly lower than that of w/o HFE, suggesting that, while SAB is critical for capturing complex source patterns, it may introduce some prediction variance. The SAB’s multi-step attention mechanism enables the progressive refinement of source–receptor relationships, learning that Na–Cl⁻ combinations indicate marine sources while Al-Si-Ca-Mg clusters represent crustal emissions. This iterative process mirrors how atmospheric scientists identify pollution sources through the systematic analysis of chemical markers.
The most dramatic performance collapse occurs with Simple SCQ, where R² drops from 0.887 to 0.504 and the MAE increases from 0.591% to 1.776%. The RMSE of 2.592 ± 0.174% indicates severe prediction errors across all pollution sources. This stark contrast highlights that linear models cannot capture the non-linear interactions between chemical species and their sources. Complex air pollution processes, such as photochemical reactions, meteorological influences, and source mixing, require sophisticated modeling approaches. TabNet’s ability to perform instance-wise feature selection and multi-step decisions proves essential for maintaining both accuracy and physical constraints.
The ablation results reveal that AirTrace-SA’s architecture directly corresponds to physical processes in pollution source apportionment. The progression from raw chemical concentrations (input) through hierarchical feature learning (HFE), source pattern recognition (SAB), to quantitative apportionment (SCQ) mirrors the actual workflow of receptor modeling. Each component addresses specific challenges: HFE handles chemical transformations, SAB manages source–receptor complexity, and SCQ ensures physically plausible contributions. The relatively small performance gap between w/o HFE (R² = 0.826) and w/o SAB (R² = 0.828) compared to the large drop for Simple SCQ (R² = 0.504) emphasizes that, while feature extraction and association are important, the non-linear quantification mechanism is absolutely critical for accurate source apportionment.
These findings demonstrate that AirTrace-SA achieves superior performance not through unnecessary complexity but through principled design, where each component serves an essential, physically grounded purpose in the source apportionment process.