1. Introduction
Airborne particulate matter (PM) consists of tiny solid or liquid particles suspended in the air [1]. These particles are typically classified by their aerodynamic diameter into several key size fractions: PM1 (particles smaller than 1 µm), PM2.5 (particles smaller than 2.5 µm), and PM10 (particles smaller than 10 µm). These particles pose considerable health risks, including lung cancer, stroke, asthma, and cardiovascular disease. Studies have particularly highlighted that PM2.5, due to its ability to penetrate deeply into the lungs and enter the bloodstream, poses the most significant health hazard [2,3,4].
Beyond health implications, PM also plays a critical role in climate dynamics by modifying the atmospheric balance of incoming and outgoing electromagnetic radiation. This modification affects various atmospheric conditions, including temperature, wind patterns, and precipitation. Particulate matter can also lead to the formation of fog and acid rain and contributes to the greenhouse effect [5,6,7,8,9,10,11].
Because PM is strongly linked to a range of health problems and varies significantly over time and between locations, comprehensive studies are needed to characterize its distribution with high temporal and spatial resolution [3,11]. Although ground-based monitoring stations are vital, their sparse and uneven distribution makes continuous nationwide coverage difficult to achieve. To overcome these limitations, numerous studies have explored remote sensing techniques and the expansion of ground observation networks. Consequently, contemporary aerosol detection technologies are mainly categorized into remote sensing and in situ observation systems [12].
A significant hurdle in expanding precise ground-based monitoring networks is the associated expense. Consequently, research has focused on calibration techniques for affordable airborne particulate sensors. These methods leverage machine learning to improve the accuracy with which the sensors measure particulate matter [13]. The calibrated sensors can complement the data collected by environmental agency monitoring networks [14]. Part of our ongoing research involves the development and deployment of an environmental sensing system. This initiative aims to fill geographical gaps in data collection by establishing ground observation stations that provide high-temporal-resolution data, specifically in the Dallas area, thus augmenting existing environmental monitoring efforts.
Research indicates that useful information on surface-level PM2.5 concentrations can be obtained from satellite-derived aerosol optical depth (AOD) data in conjunction with multivariate nonlinear machine learning. This approach accounts for a variety of contextual factors, such as weather conditions and location-specific geographical information. Incorporating seasonal information and additional data can therefore uncover temporal patterns and spatial characteristics, making it possible to identify changes in the relationship between AOD values and PM2.5 concentrations [3,15].
Lary et al. [3] developed a machine learning model to provide daily distributions of PM2.5 by combining remote sensing and meteorological datasets with ground-based particulate matter measurements spanning 1997 to 2014. Their study outlines the methodology and presents global average results for this period, showing that the resulting PM2.5 data product closely reproduces global PM2.5 observations and is therefore a valuable resource for epidemiological studies.
In a separate study, Yu et al. [10] enhanced the modeling of PM2.5 concentrations at high spatiotemporal resolution. They combined data from the Next Generation Weather Radar (NEXRAD), information from the European Centre for Medium-Range Weather Forecasts (ECMWF), AOD measurements from the Geostationary Operational Environmental Satellite (GOES-16), and PM2.5 concentrations measured by in situ sensors of the Environmental Protection Agency (EPA) across the United States. This approach was designed to improve the accuracy and detail of PM2.5 concentration modeling.
Objectives
This study is driven by two main goals. The first goal is to highlight the importance of collecting high-temporal-resolution data and feature variable observations that are synchronized both spatially and temporally with particulate matter (PM) measurements for accurate PM modeling. We used an especially designed system of IoT sensors, both solar and grid-powered, to detect particulate matter and other environmental parameters, deployed extensively in a densely populated area of North Texas. Our system, named MINTS-AI (Multiscale Multiuse Multimodal Integrated Interactive Intelligent Sensing for Actionable Insights), provides access to a wide range of PM sizes, including PM, PM, PM, PM, PM, and PM. These sizes have been carefully modeled using available feature variables such as weather conditions and light intensity, directly collected at the location of PM data gathering, thus eliminating the need for data interpolation to match specific coordinates. The ability of the system to record data at exceptionally high frequencies (every second) is crucial for understanding the dynamic nature of PM concentrations and their interaction with environmental factors. This approach underscores the potential loss of critical PM distribution characteristics when the spatial and temporal alignment of the feature variables and the PM data are not precise. Moreover, incorporating a comprehensive range of light-intensity measurements, which include over ten distinct levels, significantly enhances the precision of PM modeling alongside other environmental variables.
The second goal is to broaden the detection capabilities for PM by blending on-site and remote sensing techniques, making use of a rich dataset augmented with relevant features. On-site detection involved collecting ground-level PM data from our own IoT sensor network (MINTS-AI), as well as data from the OpenAQ network and the United States Environmental Protection Agency (EPA). We also compiled aerosol optical depth (AOD) data from the Geostationary Operational Environmental Satellite-16 (GOES-16), meteorological information from the European Centre for Medium-Range Weather Forecasts (ECMWF), aerosol assimilation data with air pollutants from the GrADS Data Server, and additional solar and geographical data from 2020 to the present.
2. Materials
AOD, temperature, pressure, relative humidity, planetary boundary layer height, and wind speed and direction have been identified as crucial contextual variables for modeling and estimating PM2.5 concentrations from satellite-based remote sensing and meteorological data [16]. In addition, other data types were recognized as beneficial for accurately modeling PM2.5 levels. These include key meteorological parameters from the European Centre for Medium-Range Weather Forecasts (ECMWF), AOD products from the GOES-16 satellite, relevant air pollutants from the MERRA-2 database, solar variables, and various ancillary variables. The primary PM2.5 data were sourced from three platforms: the EPA Air Quality System (AQS), the OpenAQ global air quality data platform, and 30 sensors from the UTD MINTS monitoring network.
Data collection for this study, encompassing PM2.5, meteorological variables, AOD, and solar angles, varied in temporal and spatial resolution and spanned January 2020 to June 2023. These data were analyzed with tree-based machine learning methods [11], chosen for their effectiveness in handling the highly time-sensitive nature of the data, including the target variable PM2.5 and other influencing environmental factors.
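Tree-based regressors are assembled from repeated threshold splits on individual features. As a minimal illustration of that building block, and not the model configuration used in this study, the sketch below finds the single split on one feature that minimizes the squared error. The function name `best_split` and the toy AOD and PM values are hypothetical.

```python
def best_split(x, y):
    """Return (threshold, left_mean, right_mean, sse) for the best single split on x.

    Returns None when all feature values are identical.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical feature values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or sse < best[3]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, left_mean, right_mean, sse)
    return best


# Hypothetical toy data: AOD as the feature, PM concentration as the target.
aod = [0.05, 0.10, 0.12, 0.40, 0.45, 0.50]
pm = [4.0, 5.0, 6.0, 20.0, 22.0, 24.0]
threshold, low_mean, high_mean, sse = best_split(aod, pm)
print(threshold, low_mean, high_mean, sse)
```

A full tree applies this search recursively over all features, and ensemble methods combine many such trees.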
2.1. PM Ground Observations
2.1.1. MINTS Sensors
Temporal and spatial resolution play a critical role in air monitoring and modeling systems because air quality can change significantly across the different micro-environments encountered on very small temporal and spatial scales. Harrison et al. (2015) [17] demonstrated this point well, highlighting the challenges of accurately capturing these variations. One major obstacle is the significant maintenance cost of the sensing devices, coupled with the fact that the existing ground-based monitoring sites are too few to provide comprehensive spatial coverage. To address these challenges, numerous studies, including one by Xiaohoe et al. (2021) [11], have sought to improve the precision and coverage of PM2.5 data collection.
This study focuses on the development of environmental sensing systems and models to estimate particulate matter, using the foundation provided by the MINTS-AI platform. MINTS-AI, a project spearheaded by the Physics Department at the University of Texas at Dallas, is a collaborative initiative that champions open source and open data principles. The platform has been instrumental in the design and deployment of in situ environmental sensing systems across the Dallas–Fort Worth (DFW) metroplex (Figure 1). These systems, which combine affordable airborne particle sensors with machine learning techniques, have been strategically positioned to monitor environmental conditions effectively. The data collected by these sensors are available for real-time analysis via an online dashboard, as detailed in [18].
The central and UTD nodes are integral components of MINTS’s stationary sensor systems, playing a key role in environmental data collection via IoT sensors. These systems are equipped with a variety of sensors designed to measure particulate matter, gases, ambient light intensity, and climatic conditions. Particulate matter levels are monitored using IPS 7100 sensors from Piera Systems, which are noted for their affordability, precision, and high sensitivity. These laser-scattering sensors have a specified accuracy of ±10% for particulate counting (PC) and provide real-time measurements of airborne particulate matter, ranging from PM10 to ultrafine PM0.1, including particle counts and sizes. In particular, the IPS 7100 combines low power consumption with the ability to sample every second [19].
Additionally, the system incorporates cost-effective gas sensors such as the SCD30 for estimating CO2 levels and the MICS6814 for gauging concentrations of CO, NO2, H2, NH3, CH4, C3H8, C4H10, and C2H5OH. The BME280 sensor measures temperature, humidity, and pressure, thus aiding in climate analysis. Light intensity is tracked by a sensor capable of detecting peaks across a wavelength range of 300 to 1100 nm. The central node also features an ozone module that employs optical absorption spectroscopy to determine ozone levels. This expansive sensor network is actively deployed at various sites in the Dallas–Fort Worth metroplex, dedicated to measuring and reporting particulate matter concentrations [12].
For our first study, the primary data on all particulate matter (PM) size fractions and other relevant variables, as well as one of the key sources of ground-truth PM observations for PM modeling, were obtained from the central and UTD nodes of the UTD MINTS-AI platform. This platform oversees 32 monitoring locations distributed throughout North Texas in Dallas, Collin, and Tarrant counties. A significant number of these monitoring sites are located in Richardson, near the University of Texas at Dallas, with additional sites in Fort Worth, Carrollton, and Plano. At each site, sensors are configured to collect data on particulate matter, gases, and climatic conditions at high temporal resolution, capturing readings every 3 s. However, the scope of the PM reference data is constrained by the relatively limited number of monitoring locations within a confined area.
2.1.2. EPA
A primary source of PM2.5 data in the United States is the EPA’s in situ monitoring network, which includes more than 500 ground-based stations scattered throughout the country [20]. These networks are considered among the most reliable sources of aerosol information. The EPA Air Quality System (AQS) is a database that aggregates ambient air pollution data, including PM2.5 and PM10, collected by the EPA and by state, local, and tribal air pollution control agencies through hundreds of monitors nationwide [21]. However, negative values can occur in the AQS data due to equipment failures and measurement noise, particularly under very clean atmospheric conditions [11]. For this study, hourly PM2.5 data were retrieved using the AQS API (Figure 2). These datasets were then employed as ground-truth observations for model training and validation.
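As an illustration of the retrieval step, the sketch below builds a request URL for sample-level data from the AQS API without contacting the server. The endpoint path and query parameter names reflect our reading of the public AQS API documentation and should be verified there. The credentials are placeholders, `88101` is assumed to be the AQS parameter code for PM2.5, and `48` is the FIPS code for Texas; none of these values are taken from the study.

```python
from urllib.parse import urlencode

AQS_BASE = "https://aqs.epa.gov/data/api"  # assumed base URL of the AQS API


def build_sample_data_url(email, key, param, bdate, edate, state):
    """Build a sampleData/byState request URL; dates use the YYYYMMDD format."""
    query = urlencode({
        "email": email,   # registered account e-mail (placeholder below)
        "key": key,       # API key issued on registration (placeholder below)
        "param": param,   # pollutant parameter code
        "bdate": bdate,   # first day requested
        "edate": edate,   # last day requested
        "state": state,   # two-digit state FIPS code
    })
    return f"{AQS_BASE}/sampleData/byState?{query}"


url = build_sample_data_url(
    "user@example.com", "YOUR_KEY", "88101", "20200101", "20200131", "48"
)
print(url)
```

The resulting URL can be passed to any HTTP client, and the records returned can then be screened for the negative values mentioned above before model training.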
2.1.3. OpenAQ
In addition to the EPA, OpenAQ, a non-profit organization, facilitates global access to air quality data. It aggregates and standardizes air quality data from around the world and offers them through a free, open source data platform. Since its launch in 2015, OpenAQ has been collecting historical and real-time data from reference-grade government monitoring stations. The platform covers particulate matter (PM) and various gaseous pollutants, including NO, NO2, and CH4. As the largest open source air quality data repository worldwide, OpenAQ provides an API for programmatic access to its database.
The OpenAQ database incorporates data from approximately 1000 ground-based monitoring stations across the US, including stations from the EPA’s in situ monitoring networks [22]. For this study, OpenAQ serves as an additional source of hourly PM2.5 data, which are used for model training and validation.
2.2. GOES-16 AOD
In this research, AOD data from the GOES-16 satellite were used as one of the key input features. GOES-16, a geostationary weather satellite operated by the National Oceanic and Atmospheric Administration (NOAA) of the United States, is positioned in geostationary orbit above the Western Hemisphere [23,24,25,26,27]. AOD, with a spatial resolution as fine as 0.5 km and a temporal resolution of up to 30 s, plays a significant role in this study’s analysis.
The quality and reliability of the AOD data are indicated by a data quality flag (DQF), which ranges from 0 to 3 and helps users assess the confidence level of the AOD measurements. AOD retrieval is challenging in cloudy areas, and the accuracy of AOD data near clouds is less certain. The relationship between AOD and PM2.5 concentrations is influenced by various factors, including meteorological conditions such as relative humidity and the height of the planetary boundary layer [15,16], and can therefore change over time and between locations.
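A quality flag of this kind is typically applied as a mask before the AOD values enter a model. The sketch below is a minimal example under the assumption that lower flag values denote higher confidence (0 the best, 3 meaning no usable retrieval). The function name and the default threshold are hypothetical, and the threshold actually used in the study is not stated.

```python
def filter_aod_by_dqf(aod, dqf, max_flag=1):
    """Keep AOD values whose quality flag is <= max_flag; replace the rest with None.

    Assumes lower flags mean higher quality (an assumption, see the text above).
    """
    if len(aod) != len(dqf):
        raise ValueError("aod and dqf must have the same length")
    return [
        value if (value is not None and flag <= max_flag) else None
        for value, flag in zip(aod, dqf)
    ]


# Hypothetical pixel values and flags.
aod_values = [0.10, 0.20, 0.30, None]
flags = [0, 1, 2, 3]
print(filter_aod_by_dqf(aod_values, flags))
```

Masked pixels then appear as gaps, which is consistent with the missing retrievals expected in cloudy areas.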
2.3. ECMWF Meteorological Data
The levels of airborne particulate matter are significantly influenced by weather conditions, including wind speed, pressure, and temperature. Under elevated relative humidity (RH), particles undergo hygroscopic growth, a process in which water vapor condenses onto their surfaces and increases the particle diameter compared to drier conditions [28,29]. This growth enhances light scattering and significantly affects aerosol optical depth (AOD) values. Therefore, accounting for RH is crucial when modeling particulate matter (PM) concentrations based on AOD measurements.
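To make the RH dependence concrete, one widely used empirical one-parameter form for the scattering enhancement is f(RH) = (1 − RH/100)^(−γ). The sketch below evaluates it and uses it to scale AOD to a dry-equivalent value. This is illustrative only and is not taken from the study; the exponent `gamma` depends on aerosol type, and the default value and the RH cap are hypothetical.

```python
def hygroscopic_factor(rh_percent, gamma=0.5, rh_cap=95.0):
    """Empirical scattering enhancement f(RH) = (1 - RH/100) ** (-gamma).

    RH is clipped to [0, rh_cap] because the expression diverges as RH approaches 100%.
    """
    rh = min(max(rh_percent, 0.0), rh_cap)
    return (1.0 - rh / 100.0) ** (-gamma)


def dry_equivalent_aod(aod, rh_percent, gamma=0.5):
    """Scale AOD by 1 / f(RH) to approximate the optical depth of the dry aerosol."""
    return aod / hygroscopic_factor(rh_percent, gamma)


for rh in (30.0, 60.0, 90.0):
    print(rh, round(hygroscopic_factor(rh), 3), round(dry_equivalent_aod(0.3, rh), 3))
```

A machine learning model supplied with both AOD and RH can learn such a dependence implicitly, which is one reason RH is valuable as an input feature.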
For this study, historical weather data were acquired through the Climate Data Store (CDS) Application Programming Interface (API). The CDS is an extensive digital service that provides a unified web interface to a wide range of climate and environmental data, including historical, current, and projected future conditions from various sources [30]. This service is developed and managed by the European Centre for Medium-Range Weather Forecasts (ECMWF). The ECMWF has created ERA5-Land, a reanalysis dataset that offers a detailed collection of global atmospheric data spanning from 1979 to the present. ERA5-Land applies the reanalysis technique, which integrates model data with observations from around the world to produce a globally comprehensive and consistent dataset in accordance with physical laws. This dataset is structured on a fixed grid with a spatial resolution of 9 km and provides data at hourly intervals. The vertical extent of ERA5-Land ranges from 2 m above the ground to a soil depth of 289 cm [31]. The ERA5-Land meteorological variables used for PM2.5 modeling are detailed in Table 1.
2.4. MERRA-2 Data
The MERRA-2 dataset, developed by NASA, represents the second iteration of the Modern-Era Retrospective Analysis for Research and Applications. It is an atmospheric reanalysis dataset that combines observational data with sophisticated modeling techniques to create a continuous and high-quality historical account of the Earth’s climate system. MERRA-2 utilizes the Goddard Earth Observing System Model, Version 5 (GEOS-5) data assimilation system, which organizes data on a grid with a horizontal resolution of 0.625° by 0.5°. This dataset offers both instantaneous and time-averaged products, available in three-hour intervals [32].
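Because these products arrive at three-hour intervals while the ground observations are hourly, the MERRA-2 fields must be aligned in time before they can be combined with the other inputs. The sketch below shows linear interpolation in time for a single grid cell, one simple option. The function name and the sample values are hypothetical, and the sketch assumes a non-empty, time-ordered series with whole-hour spacing.

```python
from datetime import datetime, timedelta


def interpolate_hourly(times, values):
    """Linearly interpolate a coarser time series onto hourly timestamps."""
    out = []
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        v0, v1 = values[i], values[i + 1]
        n_hours = int((t1 - t0).total_seconds() // 3600)
        for h in range(n_hours):
            # Weight moves linearly from v0 (h = 0) towards v1.
            out.append((t0 + timedelta(hours=h), v0 + (v1 - v0) * h / n_hours))
    out.append((times[-1], values[-1]))
    return out


# Hypothetical three-hourly values for one grid cell.
start = datetime(2020, 8, 16, 0)
three_hourly_times = [start + timedelta(hours=3 * k) for k in range(3)]
three_hourly_values = [1.0, 4.0, 2.5]
for timestamp, value in interpolate_hourly(three_hourly_times, three_hourly_values):
    print(timestamp.isoformat(), round(value, 3))
```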
This study incorporates data on air pollutants such as black carbon, sulfate, and nitrate from the MERRA-2 database to improve the precision of its models. Anthropogenic atmospheric aerosols, such as black carbon, are known to adversely affect the global climate [33]. Studies, including that of Menon et al. (2002) [34], have shown that efforts to reduce black carbon emissions could slow the global temperature rise. Additionally, atmospheric aerosols influence atmospheric chemistry; sources such as coal-fired power plants, metal smelting operations, and vehicle emissions release sulfur and nitrogen oxides into the atmosphere. These oxides can react with photochemical products and airborne particles, resulting in the formation of acid aerosols [35].
Sulfate aerosols arise from the oxidation of sulfur dioxide (SO2) emitted by human activities, such as the burning of fossil fuels, and by natural events such as volcanic eruptions. They can significantly affect the climate by reflecting sunlight back into space [36], leading to cooling effects. Nitrate aerosols, produced by the oxidation of nitrogen oxides (NOx) from fossil fuel combustion and biomass burning, contribute to haze and reduced visibility. These aerosols also pose health risks to humans [37]. The formation and impact of these pollutants highlight their importance in understanding and modeling climate and air quality dynamics.
2.5. Solar Illumination
Essentially, AOD measures how much sunlight is prevented from reaching the Earth’s surface by aerosols in a vertical column of air extending from the surface to the top of the atmosphere. The geometry of solar illumination is crucial in defining the context of AOD measurements. Solar angles are closely related to the local time and strongly influence AOD quality; at extreme solar angles, AOD cannot be retrieved [10,38]. In PM2.5 estimation models, two significant solar-related variables are considered: the solar zenith angle and the solar azimuth angle. These angles influence the distance that sunlight travels through the Earth’s atmosphere to reach the surface.
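The link between solar geometry and path length can be sketched with a textbook approximation. The example below estimates the solar zenith angle from latitude, longitude, and UTC time using Cooper's declination formula, and then the plane-parallel relative air mass 1/cos(zenith). It neglects the equation of time and atmospheric refraction, does not compute the azimuth, and is not the solar-angle routine used in the study. The function names and the example location are hypothetical.

```python
import math
from datetime import datetime, timezone


def solar_zenith_deg(lat_deg, lon_deg, when_utc):
    """Approximate solar zenith angle in degrees (no equation of time, no refraction)."""
    day_of_year = when_utc.timetuple().tm_yday
    # Cooper's approximation of the solar declination.
    declination = math.radians(23.45) * math.sin(2 * math.pi * (284 + day_of_year) / 365)
    utc_hours = when_utc.hour + when_utc.minute / 60 + when_utc.second / 3600
    # Local solar time: 15 degrees of longitude correspond to one hour.
    solar_time = utc_hours + lon_deg / 15.0
    hour_angle = math.radians(15.0 * (solar_time - 12.0))
    lat = math.radians(lat_deg)
    cos_zenith = (math.sin(lat) * math.sin(declination)
                  + math.cos(lat) * math.cos(declination) * math.cos(hour_angle))
    cos_zenith = max(-1.0, min(1.0, cos_zenith))  # guard against rounding error
    return math.degrees(math.acos(cos_zenith))


def relative_air_mass(zenith_deg):
    """Plane-parallel path length relative to a vertical path; None if the sun is not up."""
    if zenith_deg >= 90.0:
        return None
    return 1.0 / math.cos(math.radians(zenith_deg))


# Hypothetical example: Dallas area (32.8 N, 96.8 W) at 18:00 UTC on 16 August 2020.
zenith = solar_zenith_deg(32.8, -96.8, datetime(2020, 8, 16, 18, 0, tzinfo=timezone.utc))
print(round(zenith, 1), relative_air_mass(zenith))
```

The plane-parallel expression grows without bound near the horizon, which reflects why retrievals at extreme solar angles are unreliable.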
2.6. Ancillary Data
In addition to rapidly changing data, slowly varying variables can provide valuable information on the environmental, geological, and socioeconomic factors that influence the spatial and temporal distribution of particulate matter concentrations [39]. This study incorporated slowly varying variables such as population density, elevation, soil type, lithology, land cover, crop type, building footprint, and livestock distribution as contextual ancillary data. These variables help capture the broader environmental and human factors that can affect particulate matter levels.
Population density can significantly influence particulate matter levels due to increased human activities, such as traffic and industrial operations that emit pollutants. The Socioeconomic Data and Applications Center (SEDAC) [40], a component of NASA, provides population density data as raster datasets. These datasets offer estimates of the population per square kilometer, aligned with figures from national censuses and population registers for the years 2000, 2005, 2010, 2015, and 2020. The available global raster files have a resolution of 30 arc seconds, roughly equivalent to 1 km at the equator.
Topographic features such as mountains and valleys play an important role in the dispersion and accumulation of particulate matter, while trees and other forms of vegetation serve as natural filters, capturing particulate matter and thus mitigating air pollution [41]. Geographic variables such as elevation, soil type, lithology, cropland, and land cover offer information on the geological characteristics that could affect particulate matter levels.
The Cropland Data Layer (CDL) is a geospatial product generated by the United States Department of Agriculture (USDA) using moderate-resolution satellite imagery combined with extensive agricultural ground truth, identifying around 250 different crop types. This dataset, with a spatial resolution of 30 m, covers the entire continental United States.
Soil data are provided by the National Cooperative Soil Survey through the Web Soil Survey (WSS), an initiative of the USDA Natural Resources Conservation Service (NRCS), which details approximately 100 soil suborder categories [42].
The National Land Cover Database (NLCD) offers detailed information on land cover and its changes over time within the United States. With a 30 m resolution, the NLCD categorizes land into 16 classes, including water bodies, urban areas, barren lands, forests, shrublands, grasslands, agricultural areas, and wetlands [43,44].
Bathymetric data, crucial for mapping ocean floors and land elevations, are provided by the General Bathymetric Chart of the Oceans (GEBCO), an international consortium of ocean mapping experts. This dataset presents elevation data on a grid with 15 arc second intervals [45].
Lithology, which encompasses the geochemical, mineralogical, and physical properties of rocks, influences numerous Earth surface processes, including the transport of materials to ecosystems, soils, rivers, and oceans. The Global Lithological Map (GLiM) was developed by Hartmann and Moosdorf (2012) [46] by synthesizing regional geological maps and literature, offering a representation of global rock types at a spatial resolution of 0.5°. This classification includes 16 lithological classes, providing a comprehensive view of the Earth’s surface composition.
Building footprint data are crucial for identifying the number of buildings around a specific location, which can influence wind dynamics and consequently affect PM concentration levels. Microsoft Maps offers a comprehensive open dataset of building footprints for the United States. This dataset was created by applying computer vision algorithms to satellite imagery, resulting in 129,591,852 polygonal representations of building footprints in all 50 states of the United States and the District of Columbia [47].
Gridded Livestock Data (GLD) provides a comprehensive overview of the global distribution of various livestock species in 2015, including cattle, sheep, goats, buffaloes, horses, pigs, chickens, and ducks. This dataset is freely accessible through the Harvard Dataverse repository. It has a spatial resolution of 5 arc minutes, which is roughly equivalent to 10 km at the equator, and gives the total number of each species per pixel. It is available in two formats: a dasymetric product and an areal-weighted product, both derived using redistribution methods. For this study, we chose the dasymetric product in the TIFF file format. Livestock data were included because of the significant environmental impact of livestock farming, especially in terms of greenhouse gas emissions from enteric fermentation and manure management, together with the disruption of nitrogen and phosphorus cycles [48].
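The angular resolutions quoted for the ancillary datasets (15 and 30 arc seconds, 5 arc minutes, and 0.5°) can be converted to approximate ground distances with simple arithmetic. The sketch below assumes a spherical Earth with an equatorial circumference of about 40,075 km. The function name and the Dallas latitude of 32.8° N used in the example are illustrative assumptions.

```python
import math

EQUATOR_KM = 40075.017           # approximate equatorial circumference of the Earth
KM_PER_DEGREE = EQUATOR_KM / 360.0


def arcsec_to_km(arcsec, lat_deg=0.0):
    """East-west ground distance spanned by an angular step at a given latitude."""
    return arcsec / 3600.0 * KM_PER_DEGREE * math.cos(math.radians(lat_deg))


for label, arcsec in [("15 arc seconds", 15), ("30 arc seconds", 30),
                      ("5 arc minutes", 300), ("0.5 degrees", 1800)]:
    print(label, round(arcsec_to_km(arcsec), 2), "km at the equator,",
          round(arcsec_to_km(arcsec, 32.8), 2), "km east-west at 32.8 N")
```

Thirty arc seconds is about 0.93 km and 5 arc minutes about 9.3 km at the equator, in line with the rounded figures of 1 km and 10 km quoted above. East-west spacing shrinks with the cosine of latitude.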
4. Results
4.1. MINTS All PM Size Fraction Modeling
In this section, we specifically focus on the use of data only from the MINTS sensing system. The modeling efforts are categorized into three main groups, each defined by a unique set of feature variables. Additionally, each main group is further divided into seven subcategories, targeting different PM size fractions.
Of these main groups, Group-2, which uses all the features available from the MINTS system, shows the highest correlation coefficients (R values) on the test data (Table 5). Within Group-2, the variation in R values between subcategories is relatively minor. Notably, the Group-1 models, which use just three meteorological variables (temperature, pressure, and humidity), still perform well on the test data, with R values of around 0.92. In Group-3, designed to explore the effect of light intensity from various frequency channels on different PM size fractions, models for PM
, relying solely on light intensity data, produced higher R values on the test data than those for other PM size fractions within the same group.
Scatter plots were created to illustrate the correlation between predicted and actual PM levels for all specified groups and across different PM size categories. This paper selectively features the most illustrative scatter plots for visual analysis.
Figure 3 shows the scatter plots for the smallest (PM0.1) and largest (PM10) PM size fractions within Group-2, which outperformed the other groups. Additionally, Figure 4 shows the relative importance of the features in these models. These plots indicate that carbon dioxide, pressure, temperature, and humidity are crucial factors for both the PM0.1 and PM10 size fractions. Furthermore, for the smallest particles (PM0.1), light intensities in the ultraviolet A and B spectrum play a vital role. In contrast, for the larger particles (PM10), light intensities in the violet and full spectrum ranges contribute significantly to the predictive accuracy of the models.
Figure 5 and Figure 6 illustrate the scatter and feature importance plots for PM
and PM
, focusing on Group-3 (which incorporates only the light-sensing variables of the MINTS system). These plots highlight the light intensity frequency ranges that significantly affect model development and clearly differentiate between particle sizes.
Consistent with the size-dependent light scattering properties of aerosols, our analysis reveals that for fine particle modeling (PM), light intensities in the ultraviolet A and B frequency ranges contain valuable information. On the other hand, for the larger particle size (PM), light intensities in the red and violet frequency ranges play a more critical role in the construction of predictive models. This clarification of the importance of the features provides insight into the unique characteristics and variables useful for modeling each PM size fraction.
4.2. Complementary In Situ and Remote Sensing PM Modeling
This section looks at the creation of four national PM estimation models, each notable for its high temporal resolution and distinguished by different target variables and PM observation sources. Additionally, two regional PM models were developed, categorized based on the observation sources used. The purpose of classifying these regional models is to demonstrate the benefits of improving PM estimation models with additional ground-based observations and to evaluate the effectiveness of incorporating MINTS data.
The national dataset comprises approximately 1,521,790 observations and 53 predictor variables. The regional dataset contains about 61,889 observations with the same set of feature variables. Both were used in the model training and testing phases. The data were split into training and testing subsets in a 90:10 ratio. The training data were used for model fitting, and model performance was evaluated on both the training and testing data.
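A 90:10 split of this kind can be sketched as follows. The example assumes a simple random split with a fixed seed for reproducibility; whether the study shuffled the records or split them in another way is not stated. The function name is hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.10, seed=0):
    """Randomly assign a fraction of rows to the test set; the rest form the training set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


# Hypothetical records standing in for rows of the feature table.
records = list(range(1000))
train_rows, test_rows = train_test_split(records)
print(len(train_rows), len(test_rows))
```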
Table 6 summarizes the essential evaluation metrics, namely the correlation coefficients (R) between actual observations and machine learning predictions, the model R² scores, and the root mean square error (RMSE), all based on test data. Together, these metrics allow the models’ accuracy and predictive capability to be evaluated.
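For reference, these evaluation metrics can be computed as in the sketch below: Pearson's correlation coefficient R, the RMSE, and, on the assumption that the model score denotes the coefficient of determination, R². The function names and toy values are hypothetical.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observations and predictions."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy values: predictions are perfectly correlated with the observations but biased by +1.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
print(pearson_r(observed, predicted), r2_score(observed, predicted), rmse(observed, predicted))
```

The toy case illustrates why the metrics are reported together: a constant bias leaves R at 1 while R² and RMSE both register the error.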
The base model, referred to as Model-1, utilizes PM data collected from a variety of sources, including the Environmental Protection Agency (EPA), OpenAQ, and the MINTS-AI environmental sensing system. This initial model exclusively relies on ECMWF meteorological data and Aerosol Optical Depth (AOD) feature variables from the GOES-16 satellite, achieving a correlation coefficient (R) of 0.793. The introduction of additional data to the base model leads to an improvement in the R-value, which climbs from 0.793 to 0.816. Following this, Model-3, which integrates both supplementary data and MERRA-2 data, reaches an R value of 0.849, indicating a further improvement in model performance. In contrast, removing the MINTS-AI environmental sensing data from Model-3 results in a decrease in the R value to 0.834. Importantly, incorporating MINTS data into the regional model, identified as Model-5, significantly improves the model performance, demonstrating the valuable impact of the MINTS data on the accuracy of PM estimations.
The scatter diagram of measured versus estimated values for Model-3 (Figure 7) shows the correlation between actual (measured) and predicted (estimated) values of the target variable. This plot helps pinpoint the strengths of the model and the areas that need refinement. To aid in the analysis of overlapping data points, marginal histograms are incorporated into the figure. Furthermore, the importance ranking of the predictors (Figure 8) highlights the contribution of each variable to Model-3’s predictive capability. Variables with higher importance scores exert a more substantial influence on the model predictions. According to the feature importance chart, the most critical variables include the aerosol optical depth (AOD) analysis from MERRA-2, specific humidity, AOD from GOES-16, dew point temperature, carbon monoxide, and carbon dioxide.
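One model-agnostic way to obtain such a ranking is permutation importance, which measures how much the error grows when a single feature column is shuffled. The sketch below illustrates the idea with a toy predictor. It is a swapped-in technique for illustration and not necessarily the importance measure behind the ranking reported here; all names and data are hypothetical.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            increases.append(rmse(y, [predict(row) for row in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Toy predictor that uses only the first feature."""
    return 10.0 * row[0]


features = [[i / 10.0, float((i * 7) % 5)] for i in range(20)]
target = [toy_model(row) for row in features]
importances = permutation_importance(toy_model, features, target)
print(importances)
```

The second feature receives zero importance because the toy predictor ignores it, whereas shuffling the first feature degrades the predictions.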
4.3. Nationwide PM Model Validation
Model-3, which incorporates all available features and PM data sources, performed best among the models and was used to map ground-level PM concentrations throughout the United States. The spatial detail of this PM mapping is limited by the resolution of the remote sensing data employed. The input dataset for the machine learning model was prepared through several preprocessing steps. To ensure uniformity across all ground-level PM concentration maps, the ECMWF meteorological data grid, which has a resolution of approximately 10 km × 10 km and covers the entire US, was used as the standard coordinate framework. This two-dimensional grid was flattened into a one-dimensional array and stored in a tabular structure (a dataframe) containing latitude and longitude, which served as the reference coordinate dataframe. Because data from other sources may follow different coordinate systems, they were aligned with the standard grid using linear interpolation. The slowly varying ancillary data were added to the reference coordinate dataframe by matching locations. Since all other feature variables vary with time, the reference coordinate dataframe with matched ancillary data was duplicated for each hourly timestamp. ECMWF meteorological data, time-interpolated MERRA-2 data, and AOD data were then merged into the corresponding hourly dataframes by matching location coordinates. Solar angles were computed for each timestamp from the spatial coordinates. The resulting dataframe for each timestamp contained spatially matched data for all feature variables.
These enriched dataframes were fed sequentially into the machine learning model to generate hourly dataframes of estimated PM concentrations at all location coordinates. The output dataframes were then reshaped into two-dimensional arrays of latitude, longitude, and estimated PM to visualize the PM reconstruction maps.
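As an illustration only, the gridding steps described above can be sketched in Python, with NumPy arrays standing in for the dataframes. The grid values, the elevation column, the three-hour spacing of the reanalysis fields, and `toy_model` are hypothetical placeholders, not the configuration or the trained Model-3 used in this study.

```python
import numpy as np


def reference_coords(lats, lons):
    """Flatten a regular lat/lon grid into an (n_cells, 2) table of [lat, lon] rows."""
    lat2d, lon2d = np.meshgrid(lats, lons, indexing="ij")
    return np.column_stack([lat2d.ravel(), lon2d.ravel()])


def interp_to_hour(t_before, field_before, t_after, field_after, t_hour):
    """Linearly interpolate a gridded field between two analysis times."""
    w = (t_hour - t_before) / (t_after - t_before)
    return (1.0 - w) * field_before + w * field_after


def hourly_table(coords, static_cols, dynamic_cols):
    """Join coordinates, slowly varying columns, and one hour of dynamic columns."""
    return np.column_stack([coords, *static_cols, *dynamic_cols])


def to_map(values, n_lat, n_lon):
    """Reshape per-cell values back into a two-dimensional (lat, lon) array."""
    return np.asarray(values).reshape(n_lat, n_lon)


# Hypothetical 3 x 4 grid; the real grid is the ~10 km ECMWF grid over the US.
lats = np.array([30.0, 30.1, 30.2])
lons = np.array([-97.0, -96.9, -96.8, -96.7])
coords = reference_coords(lats, lons)

# Synthetic ancillary column (assumed name: elevation), one value per cell.
elevation = np.linspace(100.0, 210.0, coords.shape[0])

# Synthetic AOD fields at hours 0 and 3, interpolated to hour 1.
aod_00 = np.full(coords.shape[0], 0.10)
aod_03 = np.full(coords.shape[0], 0.40)
aod_01 = interp_to_hour(0.0, aod_00, 3.0, aod_03, 1.0)

table = hourly_table(coords, [elevation], [aod_01])


def toy_model(features):
    """Placeholder for the trained regressor; uses only the AOD column."""
    return 5.0 + 20.0 * features[:, 3]


pm_map = to_map(toy_model(table), len(lats), len(lons))
```

Because the grid is flattened in row-major order, the same `reshape` call restores the map layout without a separate coordinate lookup, provided the row order of the table is never changed between the two steps.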
Wildfires significantly contribute to the increase and change in the composition of airborne particulate matter, including both primary and secondary pollutants, which can affect human health and the environment. Large wildfire events in the United States have been linked to specific weather conditions, such as droughts, high temperatures, low humidity, and strong winds, which are conducive to the ignition and propagation of wildfires.
Figure 9 illustrates the ground-level PM2.5 concentrations estimated by Model-3 during one of the most significant wildfire events in the US, the Santa Clara Unit (SCU) Lightning Complex fire in California in 2020. This fire, sparked by dry lightning on August 16, was eventually contained in early October.
Figure 9a,b show the ground-level PM2.5 concentrations estimated at two times, 9 PM and midnight on 2 October 2023. These maps were produced using a modified version of Model-3 trained without MERRA-2 aerosol optical depth (AOD) data. Figure 9c,d depict the PM2.5 concentrations at the same times but were generated using the original Model-3, which includes the full set of feature parameters. Both versions of the model identified areas of high PM2.5 concentrations in California, with the pollution spreading to the northeast over the three-hour interval. However, the modified version is limited by the absence of GOES-16 AOD data in areas covered by clouds, which leaves gaps in the PM2.5 concentration estimates. The original Model-3 fills these gaps by supplementing the missing GOES-16 AOD observations with MERRA-2 AOD data, yielding a more complete picture of PM2.5 concentrations throughout the region. The color scale follows the guidelines of the World Health Organization (WHO), which set a threshold of 25 µg/m³ for the annual mean concentration of PM2.5, beyond which there is a significant risk to health. This threshold is used as the upper limit of the color scale.
The coverage of the MINTS sensing system is limited to the north Texas region. To evaluate the model's PM2.5 reconstruction against in situ observations, our analysis focuses exclusively on results within the state of Texas. Specifically, we examine three timestamps on 1 January 2023 and compare the results with PM2.5 observations collected at two MINTS in situ sites located in Joppa and Austin, represented by solid black circles on the maps in Figure 10. This figure presents the PM2.5 reconstruction results generated by Model-3 at these three timestamps, which are separated by at least 11 h. Figure 11 provides a time series of the PM2.5 observations recorded by the MINTS ground sensors in Joppa (blue) and Austin (orange).
The three gray dashed lines in Figure 11 correspond to the timestamps of the PM2.5 reconstruction maps shown in Figure 10. Figure 10a depicts a relatively unpolluted environment at both locations around 7 PM Central Time on 31 December 2022 (01:00 UTC on 1 January 2023). This agrees with the lower concentrations observed by the Austin MINTS ground sensor at the same time (first gray dashed line). Approximately 13 h later, the model captures elevated PM2.5 concentrations near Austin, while concentrations in the Joppa area remain lower (Figure 10b). This pattern closely mirrors the observations recorded by the two MINTS ground sensors, with high PM2.5 concentrations observed in Austin and lower levels in Joppa. Approximately 24 h after the initial timestamp, the model indicates an expansion of higher PM2.5 concentrations, particularly in the Joppa area (Figure 10c). This trend is consistent with the higher concentrations observed simultaneously by the MINTS ground sensors at both locations.
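The comparison above pairs each map timestamp with the sensor reading closest in time. A minimal sketch of that lookup follows, assuming a fixed UTC-6 offset for Central Standard Time and a synthetic sensor series; the 10 min sampling interval and the values are invented for illustration and are not MINTS data.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Central Standard Time as a fixed UTC-6 offset (valid in winter).
CST = timezone(timedelta(hours=-6), "CST")


def nearest_reading(series, when):
    """Return the (utc_time, value) record closest in time to `when`."""
    return min(series, key=lambda rec: abs(rec[0] - when))


# First map time: 7 PM Central on 31 December 2022, converted to UTC.
first_map = datetime(2022, 12, 31, 19, 0, tzinfo=CST).astimezone(timezone.utc)

# The later maps are described as roughly 13 h and 24 h after the first one.
map_times = [first_map + timedelta(hours=h) for h in (0, 13, 24)]

# Synthetic sensor series: one made-up reading every 10 minutes.
series = [(first_map + timedelta(minutes=10 * i), float(i)) for i in range(150)]

matched = [nearest_reading(series, t) for t in map_times]
```

Working in UTC throughout avoids ambiguity when the satellite, reanalysis, and sensor records use different local time conventions.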
4.4. Fraction of Time PM Concentrations Exceeded Thresholds in 2022
Since 2000, overall PM2.5 levels in the United States have decreased by 42%, a decline attributed to the implementation of clean air regulations. Despite this progress, concerns remain about the need for further reductions. In February 2024, responding to these concerns, the Environmental Protection Agency (EPA) revised the National Ambient Air Quality Standards for PM. Specifically, the annual primary PM2.5 standard was lowered from 12 µg/m³ to 9 µg/m³, aiming to mitigate the adverse health impacts and associated costs. The EPA estimates that meeting this new standard could yield up to USD 46 billion in avoided healthcare and hospitalization costs by 2032 [52,53,54,55,56,57,58,59,60].
In this section, we used Model-3 to estimate hourly PM2.5 concentrations across the entire United States for the year 2022. From the resulting dataset, we calculated the fraction of time during which PM2.5 concentrations exceeded five threshold levels (8 µg/m³, 9 µg/m³, 10 µg/m³, 11 µg/m³, and 12 µg/m³) over the whole of 2022. Figure 12 maps the percentage of time that concentrations exceeded each threshold, with colors corresponding to the percentage values.
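The exceedance calculation can be sketched as follows. This is an illustrative NumPy version, not the code used in this study: it assumes the hourly estimates are stacked in an array of shape (hours, latitude, longitude), that missing estimates are stored as NaN, and that percentages are taken over the valid hours of each cell. The function name and the small input array are hypothetical.

```python
import numpy as np

THRESHOLDS = (8.0, 9.0, 10.0, 11.0, 12.0)  # µg/m³, the five levels in the text


def exceedance_percent(pm_hourly, thresholds=THRESHOLDS):
    """Percentage of valid hours in which each grid cell exceeds each threshold.

    pm_hourly: array of shape (n_hours, n_lat, n_lon); NaN marks missing hours.
    Returns a dict mapping threshold -> (n_lat, n_lon) array of percentages.
    Cells with no valid hours are returned as NaN.
    """
    pm_hourly = np.asarray(pm_hourly, dtype=float)
    valid = ~np.isnan(pm_hourly)
    n_valid = valid.sum(axis=0)
    # Replace missing values with -inf so they never count as exceedances.
    filled = np.where(valid, pm_hourly, -np.inf)

    result = {}
    for thr in thresholds:
        n_above = (filled > thr).sum(axis=0)
        pct = np.full(n_valid.shape, np.nan)
        np.divide(100.0 * n_above, n_valid, out=pct, where=n_valid > 0)
        result[thr] = pct
    return result


# Synthetic example: 4 hours, 1 x 2 grid; the second cell has no valid data.
pm = np.array([
    [[5.0, np.nan]],
    [[9.5, np.nan]],
    [[13.0, np.nan]],
    [[np.nan, np.nan]],
])
maps = exceedance_percent(pm)
```

Dividing by the number of valid hours rather than the total number of hours keeps cells with frequent data gaps from appearing artificially clean; whether that matches the normalization used for Figure 12 is an assumption.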
As shown in Figure 12a, certain areas in the eastern United States and California exhibit elevated percentage values, indicating that these regions experienced PM2.5 concentrations above the 12 µg/m³ threshold for more than 20% of the time in 2022. By contrast, Figure 12d shows elevated percentage values across the entire United States, indicating that the entire nation experienced PM2.5 concentrations above the 9 µg/m³ threshold for more than 20% of the time in 2022. In particular, the eastern United States and California sustained PM2.5 concentrations above 9 µg/m³ for more than 50% of the time during the same period. These estimates underscore the importance of regulatory measures that aim to keep annual PM2.5 concentrations below 9 µg/m³.
5. Conclusions
Environmental agencies often depend on a small set of airborne particulate monitoring stations that tend to be unevenly distributed, leading to low temporal resolution in PM observations. These inherent constraints limit the precision of PM modeling because PM concentrations vary significantly at fine spatial scales and over time. To address these issues, the UTD MINTS-AI platform has implemented a specialized environmental monitoring network tailored for use in local communities in Texas. This network is specifically designed to gather PM data, along with relevant environmental variables, with high temporal resolution and fine spatial detail.
In this paper, we concentrated on two distinct studies related to PM modeling. In the first study, we underscored the significance of collecting raw data within a synchronized temporal and spatial coordinate system for effective PM modeling. In the second study, we enhanced PM modeling by employing an asynchronous temporal and spatial coordinate system, leveraging pertinent remote sensing data.
In the first study, to underscore the significance of a synchronized temporal and spatial coordinate system, we used data exclusively from the MINTS sensing system recorded between September 2021 and June 2023. This restriction was intentional, as it provides both PM data and other pertinent environmental data at precisely the same location with synchronized timestamps. The extra tree regression model, chosen for its strong performance in prior research and its computational efficiency, proved successful in this setting. Models were grouped according to the environmental factors used as inputs; the group incorporating all available feature variables (all variables from the sensors embedded in the MINTS system) exhibited superior performance across the different PM size fractions. Specifically, carbon dioxide, pressure, temperature, and humidity emerged as the most influential variables. Moreover, high-frequency band light intensities played a secondary role in modeling fine PM sizes, whereas low-frequency band light intensities had a more significant impact on modeling larger PM sizes. Notably, in Group-3, which relied solely on the light intensity variables, modeling of the fine PM size fraction (PM) yielded higher correlation coefficient (R) values than modeling of the coarser PM size fractions. This result indicates that, for smaller particle sizes, Mie scattering can help capture specific particle characteristics. This can be attributed to the fact that the diameter of PM particles falls within the ultraviolet wavelength range, which improves the model’s capability to capture finer details of PM concentrations. 
Importantly, when a model is built solely on light intensity data from different frequency bands, it becomes clear that variations in the fine PM size fraction can be effectively captured by high-frequency band intensities.
Notably, using only three environmental factors (temperature, pressure, and humidity) proved effective in modeling various PM size fractions, as evidenced by high R values, provided that the data were collected in a synchronized temporal and spatial coordinate system. This effectiveness can be attributed to collecting the data at the exact geographical location where the PM observations are made. All data are gathered at the same coordinates with synchronized timestamps, eliminating the data alignment and interpolation steps that PM modeling otherwise requires. Additionally, the data are captured at a high temporal resolution, allowing a comprehensive representation of PM variations and related changes in the feature variables. The close synchronization of timestamps across variables also avoids the noise that data alignment often introduces, which enhances the model’s ability to detect subtle PM fluctuations. However, such ideal circumstances are often unattainable in real-world situations. When PM modeling integrates environmental data from different sources and therefore requires spatial and temporal alignment, a more extensive set of environmental factors is typically needed to achieve satisfactory model performance. This was demonstrated in the second study, where PM2.5 modeling incorporated complementary in situ and remote sensing approaches.
In developing the nationwide PM models in the second study, a diverse array of predictor variables was used. These included high-temporal-resolution AOD data from the GOES-16 geostationary satellite, meteorological variables from the ECMWF, ancillary data gathered from various external sources, location-specific solar angles, and reanalysis data on AOD and air pollutant gases from the MERRA-2 database. Model training was stratified into categories based on the feature variables included and the sources of ground PM observations. As noted above, these variables originate from disparate sources, each with its own coordinate system and temporal resolution. To align these datasets, linear interpolation was applied, albeit with noticeable consequences for model performance. The model that incorporated all available feature parameters and used data from all sources of PM observation performed best in the nationwide PM modeling, particularly in terms of R values. The most influential variables contributing to this performance were AOD, specific humidity, dew point temperature, carbon monoxide, and carbon dioxide.
Based on the comparative analysis of models, it becomes evident that the inclusion of auxiliary and MERRA-2 data as supplementary feature variables improves the accuracy of the model, as reflected in higher R values. This augmentation helps to better discern variations in PM concentrations with respect to both temporal and spatial dimensions. Furthermore, the integration of environmental sensing data from the MINTS-AI platform, although limited to a small number of sites within the Texas region, has a positive impact on the precision of nationwide PM models. These findings underscore the potential advantages of incorporating additional ground-based observations and their associated data into PM modeling, as they contribute to improved model accuracy.
Although the increase in the R value for the national model resulting from the integration of MINTS environmental sensing data may not be substantial, due to the limited number of MINTS sites located primarily in Texas, there is a discernible enhancement in regional models with the inclusion of MINTS data. This observation suggests that PM exhibits intricate variations on a very fine spatial scale. To capture more nuanced features or to achieve highly accurate PM estimates, it is imperative to expand the network of ground sensing systems, ensuring an even distribution in a broader geographical area.
Using our analysis approach to reconstruct the PM2.5 distribution across the entire United States at fine time resolution for our study period, we found that the entire nation encountered PM2.5 levels exceeding 9 µg/m³ for more than 20% of the analysis period. The eastern United States and California experienced concentrations exceeding 9 µg/m³ for over 50% of the time, highlighting the importance of regulatory efforts to maintain annual PM2.5 concentrations below 9 µg/m³.