Prediction of the Tropospheric NO2 Column Concentration and Distribution Using the Time Sequence-Based versus Influencing Factor-Based Random Forest Regression Model

Geng, Tunyang; Ju, Tianzhen; Li, Bingnan; An, Bin; Su, Haohai

doi:10.3390/su15032748


by Tunyang Geng ¹, Tianzhen Ju ¹,*, Bingnan Li ², Bin An ³ and Haohai Su ¹

¹ College of Geography and Environmental Sciences, Northwest Normal University, Lanzhou 730070, China
² Faculty of Atmospheric Remote Sensing, Shaanxi Normal University, Xi’an 710062, China
³ Meteorology of Zhangjiachuan Hui Autonomous County, Tianshui 741000, China
* Author to whom correspondence should be addressed.

Sustainability 2023, 15(3), 2748; https://doi.org/10.3390/su15032748

Submission received: 26 December 2022 / Revised: 19 January 2023 / Accepted: 30 January 2023 / Published: 2 February 2023


Abstract:

The prediction of air pollutants has always been an issue of great concern to the whole of society. In recent years, machine learning has been widely used for the prediction and simulation of air pollutants. In this study, we collected meteorological data and tropospheric NO₂ column concentration data for Beijing, China, between 2012 and 2020, and compared time sequence-based and influencing factor-based random forest regression in predicting the tropospheric NO₂ column concentration. The results showed that prediction of the tropospheric NO₂ column concentration using random forest regression was affected by changes in human activities, especially emergency events and policy variations. The advantage of time sequence analysis lies in its ability to calculate the distribution of air pollutants over a long prediction time scale, but it may produce large errors in numerical value. The advantage of influencing factor prediction lies in its high precision and its ability to identify the specific impact of each influencing factor on the NO₂ column concentration, but it needs more data and more preparatory work before it can make a prediction about the future.

Keywords:

nitrogen dioxide; air monitoring; meteorological factors; random forest regression model

1. Introduction

With the development of industry, air pollution has become a problem of increasing concern and has attracted widespread attention across society. Nitrogen oxides are major air pollutants and are directly or indirectly related to atmospheric environmental problems such as photochemical smog, acid deposition and stratospheric ozone depletion [1,2,3]. NO₂ is the main component of nitrogen oxides in the atmosphere, so its monitoring and prediction can guide the control of atmospheric nitrogen oxides and help formulate policies for their emission, reduction and control. Many mathematical and machine-learning models have been developed to calculate and describe the distribution and change of atmospheric NO₂. The Weather Research and Forecasting model coupled with the Community Multiscale Air Quality modeling system (WRF-CMAQ) and the Weather Research and Forecasting model coupled with Chemistry (WRF-Chem) have been used extensively [4,5,6]. Araki et al. (2018) [7] analyzed NO₂ in Japanese metropolises using a spatiotemporal land use random forest model and found that this model was advantageous for simulating spatiotemporal changes of NO₂. Zhan et al. (2018) [8] established a new model known as random forest spatiotemporal Kriging (RF-STK) and used it to assess the exposure risks of NO₂ and SO₂ in some regions of China.

The most critical issue in the management of air pollution is the prediction of the concentration and distribution of the pollutants, because air pollution cannot be controlled by analyzing only the pollution that has already occurred. Sharma et al. (2021) [9] established a modified model for extrapolating air pollutants based on historical and current meteorological datasets and evaluated various classifiers on data from 196 cities in India, finding that the accuracy of linear robust regression was 94–96%. This accuracy could be improved to some extent by using various types of clustering algorithms: the optimal accuracy of the decision-tree classifier was 99.7%, and the random forest classifier could raise the accuracy by 0.02%, indicating that machine-learning algorithms are more accurate than the linear model in predicting air pollutants. Yarragunta et al. (2021) [10] predicted the air quality index (AQI) in Delhi using the decision tree, support vector machine (SVM), naive Bayes classifier, logistic regression, random forest and K-nearest neighbor as supervised machine-learning algorithms, finding that the decision tree method produced the best results with an overall accuracy of 99.8%. The results of prediction models based on big data analysis and machine learning can help assess the current air quality and compare the assessments. In the present study, we established a NO₂ column concentration distribution prediction model based on random forest regression, using the time sequence analysis and influencing factor prediction methods, with the purpose of comparing the advantages and disadvantages of the two methods and their respective application settings.

Wang et al. [11] used TROPOMI and HRRR data to develop a random forest model for estimating ground-level ozone concentrations in California. This model allows the contribution of satellite data products to be assessed in a concise modelling framework, and their findings suggest that TROPOMI data improve the estimation of extremes in ground-level ozone modelling. It could also accelerate future research on the application of satellite data products and high-resolution meteorological data to predict ground-level ozone concentrations. Long et al. [12] developed models for estimating daily ground-level NO₂ in China using four tree-based machine learning models (decision tree (DT), gradient boosted decision tree (GBDT), random forest (RF) and extra tree (ET)). Through spatio-temporal analysis and comparison, they found that the estimated high-resolution results were consistent with ground-based observations of NO₂ and that, of the four models, the extra-tree model with spatio-temporal information (ST-ET) outperformed the remaining three for the 2019 estimation. Together with the large number of other studies based on tree models, these results demonstrate the generalizability of tree-based machine learning models for atmospheric pollution studies at a global scale.

Most past research discusses one model or compares several different models. Rarely has there been an analysis of different ideas and approaches applied to a single model. Moreover, in the traditional use of machine learning models, the results of a single model are mostly taken as the conclusion. In contrast to previous studies, we discuss two commonly used methods for prediction and analysis based on the random forest regression (RFR) model. The advantages, disadvantages and applicability of both methods are investigated, and we also provide a more detailed quantitative analysis of the relationship between influencing factors and atmospheric pollutants as an extension to the random forest regression model.

Feng et al. (2019) [13] showed that machine learning takes less than one percent of the computation time of traditional atmospheric models. Simulating hourly values of seven air pollutants for 4 months in 2018 with a WRF-based model would take more than 6 days, whereas the same data would take less than 1 h with machine learning on a personal laptop with four cores. Because the random forest model computes faster and has lower technical requirements than other models, such as the WRF and neural network models, it is more suitable for wide dissemination. Therefore, we chose the random forest regression model for our research discussion.

Beijing is a world-famous ancient capital and a modern international city, as well as the capital and the political, economic and cultural center of China. It is located in the north of China on the North China Plain, adjacent to Tianjin in the east and Hebei in the west, with its center at 116°20′ E and 39°56′ N (Figure 1).

Geographically, Beijing is high in the northwest and low in the southeast; its west, north and northeast sides are surrounded by mountains, and the southeast side is a plain gently inclining to the Bohai Sea. The climate of Beijing belongs to the warm temperate semi-humid and semi-arid monsoon climate, hot and rainy in summer and cold and dry in winter.

As the capital of China, Beijing is the city that responds most promptly to policy and is also the earliest to monitor air pollutants in China. The changes in air pollutants in Beijing are representative of most major cities in China.

2. Data and Methods

2.1. Data Sources

Satellite data were obtained from the Ozone Monitoring Instrument (OMI) aboard NASA’s Aura satellite (https://disc.gsfc.nasa.gov/ (accessed on 15 October 2021)) [14]. In the present study, we used the OMI/Aura NO₂ tropospheric column L3 global gridded 0.25 × 0.25 degree V3 product. As this product has already been filtered and only retains data with a cloud fraction <30%, no additional filtering is necessary. In addition, we used hourly real-time air quality monitoring data released by the National Urban Air Quality Real-time Publishing Platform of China’s Environmental Monitoring Station (http://www.cnemc.cn/ (accessed on 15 October 2021)). The ground-level data used in this study were daily mean values calculated from the hourly NO₂ data.
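The reduction of hourly station readings to daily means can be sketched as follows. This is a minimal illustration with synthetic values; the series name `hourly_no2` and the three-day period are assumptions made for the example, not the authors' data or code.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly NO2 record for one station (synthetic values: 0, 1, ..., 71).
timestamps = pd.Timestamp("2019-03-01") + pd.to_timedelta(np.arange(72), unit="h")
hourly_no2 = pd.Series(np.arange(72, dtype=float), index=timestamps)

# Daily mean of the hourly values; missing hours (NaN) would be skipped by mean().
daily_no2 = hourly_no2.resample("D").mean()
print(daily_no2)
```

Each daily value is the arithmetic mean of the 24 hourly readings of that calendar day.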

Meteorological data were taken from the reanalysis data released by the National Centers for Environmental Prediction (NCEP)/National Center for Atmospheric Research (NCAR) (https://psl.noaa.gov/data/gridded/data.ncep.reanalysis.html (accessed on 15 October 2021)). From this product, the lifted index (LI, °C) was selected, and tropospheric temperature (K), atmospheric pressure (Pa), precipitable water volume (PWV, kg/m²) and relative humidity (RH, %) were calculated.

2.2. Methods

In the Python Sklearn random forest regression module, the max depth limits how far each decision tree can grow downward: the deeper the max depth, the more closely the model fits the training data. However, an excessive max depth may result in overfitting. The number of trees determines the size of the random forest model: in general, the more trees, the more accurate the result obtained [15]. The random number (seed) controls the randomness used when building the forest. If no random number is specified, each calculation produces a different result, and therefore specifying the random number makes runs reproducible and helps the user find better hyperparameters. The learning curve of the model indicates that an excessively complex model reduces accuracy, meaning that an excessive number of trees and excessive depth increase the calculation time and reduce the accuracy of the model. For this reason, careful selection of the hyperparameters can greatly increase the accuracy and speed of the random forest model (Figure 2).

Based on the above, three main hyperparameters are required to establish a random forest: the number of decision trees to be produced (n_estimators), the depth of the tree model (max_depth) and the random number (random_state) [16].
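The role of the three hyperparameters can be illustrated with a small sketch using scikit-learn's `RandomForestRegressor`. The data are synthetic and the parameter values are assumptions chosen for the example, not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption: stands in for the real features and target).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 4))
y = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.2, size=400)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Deeper trees fit the training set more closely; the validation score shows
# whether the extra depth still helps or only adds overfitting.
scores = {}
for depth in (2, 6, None):
    model = RandomForestRegressor(n_estimators=100, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    scores[depth] = (model.score(X_train, y_train), model.score(X_val, y_val))
    print(f"max_depth={depth}: train R2={scores[depth][0]:.3f}, val R2={scores[depth][1]:.3f}")

# With a fixed random_state, two fits on the same data give identical predictions.
first = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=42).fit(X_train, y_train)
second = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=42).fit(X_train, y_train)
same_predictions = bool(np.allclose(first.predict(X_val), second.predict(X_val)))
print("reproducible:", same_predictions)
```

Comparing the training and validation scores across depths is one simple way to see where additional model complexity stops paying off.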

In this study, we used the Python GDAL, Pandas, Numpy, Scipy, Sklearn and Jupyter modules to process data and generate images; among these, the GDAL module is particularly powerful for grid image calculation. We used GDAL to read the rasters and then carried out the raster calculations as matrix operations. To ensure the accuracy of the model while avoiding overfitting, we selected hyperparameters that gave an R² score of less than 0.98 to establish the model.
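One possible reading of this selection rule is sketched below: among a small grid of candidate settings, keep those whose training R² stays below 0.98 and take the best of them. The grid values, the synthetic data and the tie-breaking by highest R² are assumptions for illustration; the study's actual search procedure is the one summarized in Figure 3.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestRegressor

R2_CAP = 0.98  # threshold stated in the text

# Synthetic data (assumption: placeholder for the real training set).
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(300, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(0.0, 0.3, size=300)

# Evaluate a small hypothetical grid and keep only settings under the cap.
candidates = []
for n_trees, depth in product((20, 50), (2, 4, 8, None)):
    model = RandomForestRegressor(n_estimators=n_trees, max_depth=depth, random_state=0)
    model.fit(X, y)
    train_r2 = model.score(X, y)
    if train_r2 < R2_CAP:
        candidates.append((train_r2, n_trees, depth))

# Among the admissible settings, take the one with the highest training R2.
best_r2, best_trees, best_depth = max(candidates, key=lambda item: item[0])
print(f"selected n_estimators={best_trees}, max_depth={best_depth}, train R2={best_r2:.3f}")
```

Capping the training score is a coarse guard against overfitting; a held-out validation score would be a common alternative.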

The time sequence prediction model was established by taking the NO₂ column distributions of n successive years as the features and the NO₂ concentration distribution of year n + 1 as the target (label) value; training was performed on this set to obtain the optimal hyperparameters. Using the trained model, we predicted the NO₂ concentration of year n + 2 and obtained good prediction results.
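The sliding-window layout described here can be sketched with a synthetic raster stack. The grid size, the number of years and the hyperparameter values are assumptions for the example; real inputs would be the monthly OMI rasters read with GDAL.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stack of yearly rasters for one month (assumption: 9 years, 12 x 15 grid).
rng = np.random.default_rng(1)
n_years, height, width = 9, 12, 15
base = rng.uniform(8.0, 18.0, size=(height, width))
stack = np.stack(
    [base * 0.95**k + rng.normal(0.0, 0.2, size=base.shape) for k in range(n_years)]
)

# One row per pixel, one column per year.
pixels = stack.reshape(n_years, -1).T

n = 7                          # number of feature years
X_train = pixels[:, 0:n]       # years 1..n as features
y_train = pixels[:, n]         # year n + 1 as target
X_pred = pixels[:, 1:n + 1]    # years 2..n + 1, shifted by one year
y_true = pixels[:, n + 1]      # year n + 2, kept aside for error checks

model = RandomForestRegressor(n_estimators=50, max_depth=8, random_state=0)
model.fit(X_train, y_train)

# Predicted map for year n + 2, back in raster shape.
predicted_map = model.predict(X_pred).reshape(height, width)
print(predicted_map.shape)
```

Each pixel is treated as one sample, so the trained model maps a pixel's history of n years to its value in the following year.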

As much of the human activity data, especially industrial and traffic data, do not exist as grid (raster) images and are available only as monthly or yearly means, we selected only part of the meteorological data as influencing factors in establishing the influencing factor prediction model, which does not mean that these are the only influencing factors.

This paper only discusses the application scenarios of the two methods and does not analyze the NO₂ column concentration in the study area in depth, so the influencing factors selected are only those that allow a relatively accurate model to be established.

The model R² and RMSE shown in this paper are only for the training set, and the RMSE for the predicted data set is discussed in detail in the paper.

Figure 3 shows the flow diagram for the adjustment of the model parameters used in this paper.

3. Results and Discussion

3.1. Changes of the Tropospheric NO₂ Column Concentration in Beijing from 2012 to 2020

As shown in Figure 4, the NO₂ column concentration in the target areas decreased gradually year by year from 2012 to 2020, the highest mean value being (17.49 ± 2.80) × 10¹⁵ molec/cm² in 2012 and the lowest mean value being (7.80 ± 1.66) × 10¹⁵ molec/cm². This was similar to the observations of Chi et al. (2021) [17]. Compared with 2013, the NO₂ column concentration in 2014 decreased significantly, mainly because of the publication of the “Action Plan of Prevention and Control of Air Pollution” in China during 2013 and 2014 [18]; its main elements are strengthening the treatment of air pollutants, limiting air pollutant emissions, requiring the use of clean energy and clean technology, improving the monitoring system and establishing an early warning system, etc. [19].

3.2. The Time Sequence Prediction Model

As air pollutants present a typical seasonal distribution, it is necessary to establish corresponding models of calculation according to the different months. We selected March, June, September and December to establish the model and used the NO₂ column concentration data from 2012 to 2019 to predict the NO₂ column concentration in 2020. The data engineering and prediction results are presented in Model 1/Table 1 and Figure 5/Table 2, respectively.

As shown in Figure 5 and Table 2, the error of the result obtained by Model 1 was relatively large, especially for the results obtained in March and December, in which the maximum error was 40.4% and 61.53%, respectively. Because the COVID-19 pandemic broke out in 2020, human activities may have been greatly restricted in that year. To verify this hypothesis, we established a second model to predict the NO₂ column concentration in 2019. The data engineering and prediction results are presented in Model 2/Table 3 and Figure 6/Table 4, respectively.

As shown in Figure 6 and Table 4, Model 2 was superior to Model 1, especially in the result error; the maximum error appeared in March 2019, being 30.65%. As shown in the distribution map, the error exceeded 20% in only a few areas. The mean error of the four months was less than 10%, and the maximum root mean square error (RMSE) of the four months was 6.71%. The prediction result of the NO₂ column concentration distribution was more accurate as compared with Model 1. These results confirmed the hypothesis that human activity changes in 2020 had a great impact on the time sequence-based prediction model. Other than emergency events, policy variations also had a huge impact on human activities and air pollutant emission.
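The Max, Min and RMSE percentages reported for each month can be computed along the following lines. The sketch assumes that the error is the per-pixel relative difference between the predicted and the OMI-observed column and that RMSE is taken over those relative errors; the paper does not spell out the formula, so this definition and the function name `relative_error_summary` are assumptions.

```python
import numpy as np


def relative_error_summary(predicted, observed):
    """Summarize per-pixel relative errors (in percent) between two rasters."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Ignore pixels with missing data or a zero observation.
    valid = np.isfinite(predicted) & np.isfinite(observed) & (observed != 0)
    relative = (predicted[valid] - observed[valid]) / observed[valid] * 100.0
    return {
        "max": float(np.abs(relative).max()),
        "min": float(np.abs(relative).min()),
        "rmse": float(np.sqrt(np.mean(relative**2))),
    }


# Tiny synthetic example: one pixel has no observation and is skipped.
observed = np.array([[10.0, 20.0], [8.0, np.nan]])
predicted = np.array([[11.0, 18.0], [8.0, 5.0]])
summary = relative_error_summary(predicted, observed)
print(summary)
```

Mapping the absolute relative error of each pixel back onto the grid would give an error distribution map of the kind shown in Figures 5 to 7.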

Given the great policy variations in 2014, the data engineering and result of the 2019 NO₂ column concentration prediction model based on 2014–2018 are shown in Model 3/Table 5 and Figure 7/Table 6, respectively.

As shown in Figure 7 and Table 6, the error of Model 3 was smaller than that of Model 2. The maximum RMSE of the four months appeared in March 2019, being 5.39%; the RMSE of the other three months was less than 5%. All other results of Model 3 were superior to those of Model 2. In principle, the more characteristic years used for learning, the better the prediction result should be. The fact that Model 3 was nevertheless superior to Model 2 demonstrates that a time sequence-based prediction model that takes human activities or emission policy variations into consideration is better than one that does not. In addition, fewer characteristic years mean faster calculation, indicating that policy variations and limitations on human activities should be considered when time-sequence prediction is performed. Although the prediction error was relatively high in some target areas when the time sequence was used to predict the NO₂ column concentration, the predicted NO₂ column concentration distribution is acceptable.

The accuracy is significantly improved compared with traditional models [6,17]. A comparison with previous studies using machine learning models found that the precision of our estimates was similar to the results of other studies, but slightly lower than that of similar studies that introduced other influencing factors [11,12].

3.3. Prediction of Influencing Factors

The model established based on the meteorological factors and NO₂ column concentration from 2014 to 2018 alone was unable to predict the NO₂ column concentration in 2019, and therefore data from the ground monitoring stations were added. The data engineering (Model 4/Table 7) and results are shown in Table 8.

As shown in Table 8, the result error was smaller than that of the time-sequence-based prediction model (Model 2/3) and the prediction result was closer to the actual value. However, the NO₂ concentration data obtained from the ground monitoring stations in 2019 were required during model establishment. As a result, it could only predict the pollution events that had occurred. If the time sequence-based prediction model was first used to predict the meteorological data followed by using the predicted data obtained to predict the pollutants, the error would be increased.

The method has predictive power and is more accurate than traditional studies [20,21]. The accuracy of the predictions is similar to that of previous studies using machine learning methods [16]. If the data obtained from the ground monitoring stations are used to predict the tropospheric NO₂ column concentration, the result to some extent loses its predictive meaning, because the predicted air pollution has already occurred at the time of prediction. Model 4 is therefore more similar to an inversion model. The influencing factor-based prediction model can obtain the impact of each influencing factor on the NO₂ column concentration within the time interval in the target area via the importance interface and identify which influencing factor has the greater impact on the NO₂ column concentration. The results are listed in Table 9.
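A factor ranking of this kind can be sketched with scikit-learn, reading the "importance interface" as the `feature_importances_` attribute of a fitted `RandomForestRegressor` (an assumption about the implementation). The data are synthetic and deliberately dominated by the ground NO₂ column, so the ranking printed here says nothing about the real factors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Feature order follows Equation (1); the names are labels for this example only.
feature_names = ["tropospheric_temperature", "LI", "PWV", "pressure", "RH", "ground_NO2"]

# Synthetic data (assumption): the target depends mostly on the last feature.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
y = 2.0 * X[:, 5] + 0.5 * X[:, 0] + rng.normal(0.0, 0.1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances describe how much each feature contributed to the splits of the fitted forest; they do not indicate the direction of the effect.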

Figure 8 is the partial dependence plot (PDP) of the impact of each influencing factor on the NO₂ column concentration in March 2019, drawn using the important parameters obtained through the importance interface. Using this PDP and multiple linear regression, we can establish a piecewise functional relationship specific to air pollutants.
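A partial dependence curve for one factor can be computed as in the sketch below, which uses `sklearn.inspection.partial_dependence` on synthetic data. The three features and the linear target are assumptions for the example; the curve in Figure 8 comes from the authors' Model 4, not from this code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

# Synthetic data (assumption): the target rises with feature 0 only.
rng = np.random.default_rng(5)
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.05, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Average model response as feature 0 is varied over a grid,
# with the other features left at their observed values.
result = partial_dependence(model, X, features=[0], grid_resolution=20, kind="average")
curve = result["average"][0]
print(curve.round(2))
```

A change in the slope of such a curve marks where the model's response to the factor changes, which is one way to choose the break points of a piecewise fit.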

f(x) = \begin{cases} 0.41 X_{1} - 0.74 X_{2} - 0.25 X_{3} - 0.06 X_{4} + 0.46 X_{5} + 0.31 X_{6}, & X_{1} \leq 32.98 \\ -1.28 X_{1} - 1.81 X_{2} + 3.72 X_{3} + 0.34 X_{4} - 1.76 X_{5} + 0.04 X_{6}, & X_{1} > 37.23 \end{cases} \qquad (1)

The results are normalized results. As there are not enough data when X_{1} lies in 32.98~37.23, we were unable to establish the functional relationship for that interval. Here, f(x): tropospheric NO₂ column concentration; X_{1}: tropospheric temperature; X_{2}: LI; X_{3}: PWV; X_{4}: atmospheric pressure; X_{5}: RH; X_{6}: ground monitoring station NO₂ concentration.
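Equation (1) can be evaluated directly, as in the following sketch. The coefficients and thresholds are copied from the equation; the function name `no2_column` and the choice to return NaN inside the undefined 32.98~37.23 interval are assumptions made for the example.

```python
import math

# Coefficients of X1..X6 from Equation (1).
LOW_COEFFS = (0.41, -0.74, -0.25, -0.06, 0.46, 0.31)    # branch for X1 <= 32.98
HIGH_COEFFS = (-1.28, -1.81, 3.72, 0.34, -1.76, 0.04)   # branch for X1 > 37.23


def no2_column(x):
    """Evaluate Equation (1) for x = (X1, ..., X6); NaN where no branch is defined."""
    if len(x) != 6:
        raise ValueError("expected six influencing factors X1..X6")
    x1 = x[0]
    if x1 <= 32.98:
        coeffs = LOW_COEFFS
    elif x1 > 37.23:
        coeffs = HIGH_COEFFS
    else:
        return math.nan  # no relationship was established for this interval
    return sum(c * value for c, value in zip(coeffs, x))


print(no2_column((30.0, 1.0, 2.0, 1.0, 0.5, 3.0)))
print(no2_column((35.0, 1.0, 2.0, 1.0, 0.5, 3.0)))
```

Inputs are expected in the same normalized form as used when the relationship was fitted.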

The NO₂ column concentration for March 2019 calculated from the obtained functional relationship is shown in Figure 9. The results are very close to the values measured by OMI, with a trend line slope of 0.96, an R² of 0.96 and an RMSE of 0.34 × 10¹⁵ molec/cm². This indicates that the obtained functional relationship can describe the relationship between the influencing factors and the NO₂ column concentration.
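The three agreement statistics quoted here (trend line slope, R² and RMSE) can be obtained as in the sketch below. It assumes the calculated values are regressed on the OMI measurements and that R² is the squared Pearson correlation; the paper does not state these choices, and the sample numbers are synthetic.

```python
import numpy as np


def agreement_stats(calculated, measured):
    """Return (slope, r_squared, rmse) comparing calculated against measured values."""
    calculated = np.asarray(calculated, dtype=float)
    measured = np.asarray(measured, dtype=float)
    # Least-squares trend line: calculated = slope * measured + intercept.
    slope, _intercept = np.polyfit(measured, calculated, 1)
    r_squared = np.corrcoef(measured, calculated)[0, 1] ** 2
    rmse = np.sqrt(np.mean((calculated - measured) ** 2))
    return float(slope), float(r_squared), float(rmse)


# Synthetic check: calculated values are exactly twice the measured ones.
measured = np.array([1.0, 2.0, 3.0, 4.0])
calculated = 2.0 * measured
slope, r_squared, rmse = agreement_stats(calculated, measured)
print(slope, r_squared, rmse)
```

A slope and R² close to 1 together with a small RMSE indicate that the calculated values follow the measurements without a systematic scale error.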

The above demonstrates that the functional relationship calculated by multiple linear regression differs somewhat from the result calculated by RFR, especially in the ranking of PWV and tropospheric temperature, mainly for the following reasons: (1) the relationship between NO₂ and the influencing factors is complex and not simply linear, and therefore the multiple linear regression model can only partially fit it; (2) the 32.98~37.23 interval is lost, but this is the interval in which the greatest change may occur; (3) the concentration ranges of the NO₂ concentration released by the ground monitoring stations are not clearly defined. The last point arises because classifying the concentration ranges requires sufficiently large amounts of data in each range to ensure the accuracy of the multiple linear regression result, whereas a finer classification of concentration ranges often means less data in each range. This conflict is usually difficult to resolve, because choosing the ranges so as to obtain a better functional relationship amounts to an a priori judgement, which is unacceptable in result analysis. The relationship between the various influencing factors and the NO₂ column concentration needs to be further explored in future research.
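The trade-off between finer intervals and the amount of data in each can be made concrete with a per-interval least-squares fit that refuses to fit sparse intervals. This is a sketch on synthetic data: the function name `fit_by_interval`, the `min_samples` threshold of 30, the three-feature layout and the no-intercept form are all assumptions, not the authors' procedure.

```python
import numpy as np


def fit_by_interval(X, y, edges, min_samples=30):
    """Fit one linear model (no intercept) per interval of the first feature."""
    models = {}
    for low, high in zip(edges[:-1], edges[1:]):
        mask = (X[:, 0] > low) & (X[:, 0] <= high)
        if mask.sum() < min_samples:
            models[(low, high)] = None  # too few samples for a reliable fit
            continue
        coeffs, _, _, _ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        models[(low, high)] = coeffs
    return models


# Synthetic data with a sparsely sampled middle interval (5 samples only).
rng = np.random.default_rng(4)
x1 = np.concatenate(
    [rng.uniform(25.0, 32.98, 200), rng.uniform(33.0, 37.0, 5), rng.uniform(37.3, 45.0, 200)]
)
X = np.column_stack([x1, rng.normal(size=x1.size), rng.normal(size=x1.size)])
true_low = np.array([0.4, -0.7, 0.3])
true_high = np.array([-1.3, 3.7, 0.04])
y = np.where(x1 <= 32.98, X @ true_low, X @ true_high)

edges = [-np.inf, 32.98, 37.23, np.inf]
models = fit_by_interval(X, y, edges)
for interval, coeffs in models.items():
    print(interval, coeffs)
```

Narrowing the intervals raises the number of `None` entries, which mirrors the loss of the 32.98~37.23 interval described above.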

The limitations of the modeling approach discussed in this paper can be avoided by selecting more detailed and richer influencing factors. For example, Brokamp et al. (2018) [22] and Hu et al. (2017) [23] developed daily PM₂.₅ prediction models for the U.S. using data that mainly included AOD, meteorology and land use. Predictions based on influencing factors for pollutants such as NO₂, SO₂ and O₃ can be made by adding local emission data, such as emission inventories, but the time scale of such predictions is short and long-time scale prediction is difficult to achieve. We will conduct research in this area in subsequent studies.

4. Conclusions

Human activities and emission policy variations should be taken into full consideration when using the time sequence-based air pollutant RFR model. Although the result obtained by this model is not accurate enough in numerical value, it can be used to predict air pollutant distributions and has positive significance for governments or enterprises in formulating pollutant emission policies.
The influencing factor-based air pollutant RFR prediction model is more accurate than the time sequence-based air pollutant RFR model in predicting pollutant concentrations, but it is unable to predict the overall pollutant distributions, and selecting the influencing factors and processing the data require a large amount of complex work. Nevertheless, it can calculate the impact of each influencing factor on air pollutants and is therefore of great significance in analyzing the specific impact of each influencing factor.

Author Contributions

All authors contributed to the study conception and design. Conceptualization, methodology were performed by T.J. Software, validation, formal analysis, writing—original draft were performed by T.G. and B.L. Writing—review and editing, data curation, visualization were performed by B.A. and H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (2016YFC0500907) and the Natural Science Foundation of Gansu Province (CN) (17YF1FA120) at the Key Laboratory of Resource Environment and Sustainable Development of Oasis, Gansu Province.

Institutional Review Board Statement

We declare that this study does not involve human participants, human data or human tissue.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and Python code that support the findings of this study are openly accessible on request.

Conflicts of Interest

The authors have no relevant financial or non-financial interests to disclose.

References

1. Carmona-Cabezas, R.; Gómez-Gómez, J.; Gutiérrez de Ravé, E.; Jiménez-Hornero, F.J. Checking complex networks indicators in search of singular episodes of the photochemical smog. Chemosphere 2020, 241, 125085.
2. Fan, H.; Zhao, C.; Yang, Y. A comprehensive analysis of the spatio-temporal variation of urban air pollution in China during 2014–2018. Atmos. Environ. 2020, 220, 117066.
3. Xie, Z.; Du, Y.; Zeng, Y.; Li, Y.; Yan, M.; Jiao, S. Effects of precipitation variation on severe acid rain in southern China. J. Geogr. Sci. 2009, 19, 489–501.
4. An, X.; Zhu, T.; Wang, Z.; Li, C.; Wang, Y. A modeling analysis of a heavy air pollution episode occurred in Beijing. Atmos. Chem. Phys. 2007, 7, 3103–3114.
5. Xu, R.; Tie, X.; Li, G.; Zhao, S.; Cao, J.; Feng, T.; Long, X. Effect of biomass burning on black carbon (BC) in South Asia and Tibetan Plateau: The analysis of WRF-Chem modeling. Sci. Total Environ. 2018, 645, 901–912.
6. Shen, Y.; Jiang, F.; Feng, S.; Zheng, Y.; Cai, Z.; Lyu, X. Impact of weather and emission changes on NO₂ concentrations in China during 2014–2019. Environ. Pollut. 2021, 269, 116163.
7. Araki, S.; Shima, M.; Yamamoto, K. Spatiotemporal land use random forest model for estimating metropolitan NO₂ exposure in Japan. Sci. Total Environ. 2018, 634, 1269–1277.
8. Zhan, Y.; Luo, Y.; Deng, X.; Zhang, K.; Zhang, M.; Grieneisen, M.L.; Di, B. Satellite-Based Estimates of Daily NO₂ Exposure in China Using Hybrid Random Forest and Spatiotemporal Kriging Model. Environ. Sci. Technol. 2018, 52, 4180–4189.
9. Sharma, M.; Samyak, J.; Sidhant, M.; Hussain, S.T. Forecasting and Prediction of Air Pollutants Concentrates Using Machine Learning Techniques: The Case of India. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1022, 012123.
10. Yarragunta, S.; Nabi, M.; Jeyanthi, P.; Revathy, S. Prediction of Air Pollutants Using Supervised Machine Learning. In Proceedings of the 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 6–8 May 2021; pp. 1633–1640.
11. Wang, W.; Liu, X.; Bi, J.; Liu, Y. A machine learning model to estimate ground-level ozone concentrations in California using TROPOMI data and high-resolution meteorology. Environ. Int. 2022, 158, 106917.
12. Long, S.; Wei, X.; Zhang, F.; Zhang, R.; Xu, J.; Wu, K.; Li, Q.; Li, W. Estimating daily ground-level NO₂ concentrations over China based on TROPOMI observations and machine learning approach. Atmos. Environ. 2022, 289, 119310.
13. Feng, R.; Zheng, H.-J.; Gao, H.; Zhang, A.-R.; Huang, C.; Zhang, J.-X.; Luo, K.; Fan, J.-R. Recurrent Neural Network and random forest for analysis and accurate forecast of atmospheric pollutants: A case study in Hangzhou, China. J. Clean. Prod. 2019, 231, 1005–1015.
14. Krotkov, N.A.; Lamsal, L.N.; Marchenko, S.V.; Celarier, E.A.; Bucsela, E.J.; Swartz, W.H.; Joiner, J.; the OMI Core Team. OMI/Aura NO₂ Cloud-Screened Total and Tropospheric Column L3 Global Gridded 0.25 Degree × 0.25 Degree V3, NASA Goddard Space Flight Center, Goddard Earth Sciences Data and Information Services Center (GES DISC); GES DISC: Greenbelt, MD, USA, 2019.
15. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
16. Probst, P.; Wright, M.N.; Boulesteix, A.L. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1301.
17. Chi, Y.; Fan, M.; Zhao, C.; Sun, L.; Yang, Y.; Yang, X.; Tao, J. Ground-level NO₂ concentration estimation based on OMI tropospheric NO₂ and its spatiotemporal characteristics in typical regions of China. Atmos. Res. 2021, 264, 105821.
18. Lu, Z.; Huang, L.; Liu, J.; Zhou, Y.; Chen, M.; Hu, J. Carbon dioxide mitigation co-benefit analysis of energy-related measures in the Air Pollution Prevention and Control Action Plan in the Jing-Jin-Ji region of China. Resour. Conserv. Recycl. X 2019, 1, 100006.
19. Central People’s Government of the People’s Republic of China. Action Plan of Prevention and Control of Air Pollution, 2013-9-10. Available online: https://www.gov.cn/zhengce/content/2013-09/13/content_4561.htm (accessed on 1 November 2022).
20. Zhang, H.; Wang, Y.; Hu, J.; Ying, Q.; Hu, X.-M. Relationships between meteorological parameters and criteria air pollutants in three megacities in China. Environ. Res. 2015, 140, 242–254.
21. Yang, J.; Kang, S.; Ji, Z.; Yin, X.; Tripathee, L. Investigating air pollutant concentrations, impact factors, and emission control strategies in western China by using a regional climate-chemistry model. Chemosphere 2020, 246, 125767.
22. Brokamp, C.; Jandarov, R.; Hossain, M.; Ryan, P. Predicting Daily Urban Fine Particulate Matter Concentrations Using a Random Forest Model. Environ. Sci. Technol. 2018, 52, 4173–4179.
23. Hu, X.; Belle, J.H.; Meng, X.; Wildani, A.; Waller, L.A.; Strickland, M.J.; Liu, Y. Estimating PM₂.₅ Concentrations in the Conterminous United States Using the Random Forest Approach. Environ. Sci. Technol. 2017, 51, 6936–6944.

Figure 1. Brief description of the situation in Beijing, China.

Figure 2. Learning curve of the random forest regression model.

Figure 3. Workflow diagram.

Figure 4. Distribution of the tropospheric NO₂ column concentration in Beijing, China between 2012 and 2020 (Annual average value).

Figure 5. Error distribution between the result obtained by the NO₂ column concentration model and the actual result obtained by OMI in 2020.

Figure 6. Error distribution between the result obtained by the NO₂ column concentration model and the actual result obtained by OMI in 2019.

Figure 7. Error distribution between the result obtained by the adjusted NO₂ column concentration model and the actual result obtained by OMI in 2019.

Figure 8. Partial dependence plot between various influencing factors and the NO₂ column concentration obtained by Model 4 using random forest regression. (A): the NO₂ concentration obtained by the ground monitoring station; (B): precipitable water volume; (C): tropospheric temperature; (D): lifted index; (E): atmospheric pressure; (F): relative humidity.

Figure 9. Comparison of the results calculated from the functional relationship with the measured values of OMI.

Table 1. Data engineering of the 2020 NO₂ column concentration prediction model (x for month).

	Feature							Target
Training set	2012-x	2013-x	2014-x	2015-x	2016-x	2017-x	2018-x	2019-x
Predictive dataset	2013-x	2014-x	2015-x	2016-x	2017-x	2018-x	2019-x	--
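The layout in Table 1 amounts to a one-year sliding window: for a given month x, the model is trained on seven consecutive years as features with the following year as target, and the prediction input is the same window shifted forward by one year. The sketch below illustrates only that windowing, in plain Python. The function name `make_datasets`, the `grids` dictionary, and the toy values are hypothetical and are not taken from the paper; each list position stands for one grid cell.

```python
from typing import Dict, List, Tuple


def make_datasets(
    grids: Dict[int, List[float]], first_year: int, n_features: int
) -> Tuple[List[List[float]], List[float], List[List[float]]]:
    """Build training and prediction inputs for one month (hypothetical helper).

    grids maps a year to the per-cell values of that month, one float per cell.
    """
    feature_years = list(range(first_year, first_year + n_features))
    target_year = first_year + n_features
    # The prediction window is the training window shifted forward by one year.
    shifted_years = [year + 1 for year in feature_years]

    n_cells = len(grids[target_year])
    x_train = [[grids[y][c] for y in feature_years] for c in range(n_cells)]
    y_train = [grids[target_year][c] for c in range(n_cells)]
    x_predict = [[grids[y][c] for y in shifted_years] for c in range(n_cells)]
    return x_train, y_train, x_predict


# Toy data (not real OMI values): three cells per year, 2012 to 2019.
grids = {year: [float(year - 2000 + c) for c in range(3)] for year in range(2012, 2020)}

# Table 1 layout: features 2012..2018, target 2019; prediction input 2013..2019.
x_train, y_train, x_predict = make_datasets(grids, first_year=2012, n_features=7)
print(x_train[0], y_train[0])
print(x_predict[0])
```

Calling the same helper with `first_year=2014` and `n_features=4` would give the shorter window listed in Table 3.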

Table 2. Result error of 2020 NO₂ column concentration prediction.

	2020-March	2020-June	2020-September	2020-December
Max	26.64%	40.40%	16.63%	61.53%
Min	<0.01%	<0.01%	<0.01%	<0.01%
RMSE	8.51%	9.74%	3.87%	23.98%

Table 3. Data engineering of the 2019 NO₂ column concentration prediction model (x for month).

	Feature				Target
Training set	2014-x	2015-x	2016-x	2017-x	2018-x
Predictive dataset	2015-x	2016-x	2017-x	2018-x	--

Table 4. Result error of 2019 NO₂ column concentration prediction.

	2019-March	2019-June	2019-September	2019-December
Max	24.86%	30.65%	15.38%	21.12%
Min	<0.01%	<0.01%	<0.01%	<0.01%
RMSE	5.55%	6.71%	3.05%	5.26%

Table 5. Adjusted data engineering of the 2019 NO₂ column concentration prediction model (x for month).

	Feature				Target
Training set	2014-x	2015-x	2016-x	2017-x	2018-x
Predictive dataset	2015-x	2016-x	2017-x	2018-x	--

Table 6. Result error of the adjusted 2019 NO₂ column concentration prediction.

	2019-March	2019-June	2019-September	2019-December
Max	19.64%	18.87%	24.93%	17.65%
Min	<0.01%	<0.01%	<0.01%	<0.01%
RMSE	5.39%	4.69%	2.94%	4.54%

Table 7. Data engineering of the influencing factor-based NO₂ column concentration prediction model.

		Feature						Target
Training set	2014	Tropospheric Temperature	LI	PWV	RH	Pressure	Ground NO₂	OMI NO₂
	2015	Tropospheric Temperature	LI	PWV	RH	Pressure	Ground NO₂	OMI NO₂
	2016	Tropospheric Temperature	LI	PWV	RH	Pressure	Ground NO₂	OMI NO₂
	2017	Tropospheric Temperature	LI	PWV	RH	Pressure	Ground NO₂	OMI NO₂
	2018	Tropospheric Temperature	LI	PWV	RH	Pressure	Ground NO₂	OMI NO₂
Predictive dataset	2019	Tropospheric Temperature	LI	PWV	RH	Pressure	Ground NO₂	--

Table 8. Result of the influencing factor-based NO₂ column concentration prediction model.

		A1001	A1002	A1003	A1004	A1005	A1006	A1007	A1008	A1009	A1010	A1011	A1012
March	RFR	9.55	10.18	11.81	9.08	10.36	10.7	13.65	11.53	10.16	10.62	13.68	11.34
	OMI	8.11	9.35	12.39	9.7	10.15	11.25	13.78	11.6	9.84	11.29	13.53	11.27
	Error	15.08%	8.15%	−4.91%	−6.83%	2.03%	−5.14%	−0.95%	−0.61%	3.15%	−6.31%	1.10%	0.62%
June	RFR	8.8	8.5	7.9	7.74	10.1	9.6	8.32	8.14	10.25	9.63	8.28	8.05
	OMI	9.28	8.58	7.46	7.42	10.52	9.69	8.22	8.3	10.13	9.47	8.08	7.92
	Error	−5.45%	−0.94%	5.57%	4.13%	−4.16%	−0.94%	1.20%	−1.97%	1.17%	1.66%	2.42%	1.61%
September	RFR	8.63	7.8	7.22	6.83	8.97	8.42	9.01	7.87	8.97	8.48	8.89	7.6
	OMI	7.37	7.61	7.84	6.3	8.91	8.76	9.12	7.41	8.84	8.51	8.94	7.23
	Error	14.60%	2.44%	−8.59%	7.76%	0.67%	−4.04%	−1.22%	5.84%	1.45%	−0.35%	−0.56%	4.87%
December	RFR	15.77	15.14	16.22	14.78	18.26	19.62	20.68	16.87	17.94	19.41	20.43	16.79
	OMI	16.13	15.98	18.1	15.32	18.52	19.35	20.83	16.98	17.85	19.14	20.69	16.51
	Error	−2.28%	−5.55%	−11.59%	−3.65%	−1.42%	1.38%	−0.73%	−0.65%	0.50%	1.39%	−1.27%	1.67%
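The Error rows in Table 8 are consistent with a signed relative error normalized by the model output, (RFR − OMI) / RFR, rather than by the OMI observation. This definition is inferred from the tabulated numbers and is not stated in this section, so it should be read as an assumption. The short check below recomputes the March row from the RFR and OMI values in the table; the function name is illustrative.

```python
# March row of Table 8, stations A1001 to A1012.
rfr = [9.55, 10.18, 11.81, 9.08, 10.36, 10.70, 13.65, 11.53, 10.16, 10.62, 13.68, 11.34]
omi = [8.11, 9.35, 12.39, 9.70, 10.15, 11.25, 13.78, 11.60, 9.84, 11.29, 13.53, 11.27]
reported = [15.08, 8.15, -4.91, -6.83, 2.03, -5.14, -0.95, -0.61, 3.15, -6.31, 1.10, 0.62]


def relative_error_pct(predicted: float, observed: float) -> float:
    """Signed error in percent, normalized by the predicted value (assumed definition)."""
    return (predicted - observed) / predicted * 100.0


computed = [round(relative_error_pct(p, o), 2) for p, o in zip(rfr, omi)]

# Compare against the published Error row, allowing for rounding in the table.
matches = [abs(c - r) < 0.006 for c, r in zip(computed, reported)]
print(computed)
print(all(matches))
```

Normalizing by the observation instead would give a different value for A1001 in March (about 17.8% rather than 15.08%), which is why the denominator matters when comparing these errors with other studies.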

Table 9. Importance of each parameter in the influencing factor-based NO₂ column concentration prediction model.

	Tropospheric Temperature	LI	PWV	Pressure	RH	Ground NO₂
March	0.13	0.08	0.35	0.03	0.02	0.38
June	0.05	0.27	0.07	0.14	0.33	0.14
September	0.04	0.02	0.02	0.13	0.16	0.63
December	0.03	0.07	0.03	0.49	0.2	0.28
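If the values in Table 9 are read as per-month feature importances from the random forest (an assumption, suggested by rows that sum to roughly 1), the dominant factor in each month can be read off directly. The sketch below transcribes the table, picks the largest entry per month, and reports each row sum as a sanity check. The variable names are illustrative only.

```python
# Values transcribed from Table 9 (rows: months, columns: influencing factors).
importance = {
    "March": {"Tropospheric Temperature": 0.13, "LI": 0.08, "PWV": 0.35,
              "Pressure": 0.03, "RH": 0.02, "Ground NO₂": 0.38},
    "June": {"Tropospheric Temperature": 0.05, "LI": 0.27, "PWV": 0.07,
             "Pressure": 0.14, "RH": 0.33, "Ground NO₂": 0.14},
    "September": {"Tropospheric Temperature": 0.04, "LI": 0.02, "PWV": 0.02,
                  "Pressure": 0.13, "RH": 0.16, "Ground NO₂": 0.63},
    "December": {"Tropospheric Temperature": 0.03, "LI": 0.07, "PWV": 0.03,
                 "Pressure": 0.49, "RH": 0.20, "Ground NO₂": 0.28},
}

# Factor with the largest tabulated value in each month.
dominant = {month: max(row, key=row.get) for month, row in importance.items()}

# Row sums; normalized importances would be expected to sum to about 1.
row_sums = {month: round(sum(row.values()), 2) for month, row in importance.items()}

for month in importance:
    print(month, dominant[month], row_sums[month])
```

With the tabulated values, ground NO₂ is the largest entry in March and September, relative humidity in June, and pressure in December. The December row sums to 1.10 rather than 1, so at least one of its entries may be affected by rounding or a transcription error in the table.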

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

