1. Introduction
With industrial development, air pollution has become a problem of increasing concern and has attracted widespread public attention. Nitrogen oxides are major air pollutants and are directly or indirectly related to atmospheric environmental problems such as photochemical smog, acid deposition and stratospheric ozone depletion [1,2,3]. NO2 is the main component of nitrogen oxides in the atmosphere, and its monitoring and prediction can, to a large extent, guide the control of atmospheric nitrogen oxides and therefore help formulate policies for their emission reduction and control. Many mathematical and machine-learning models have been developed to calculate and describe the distribution and change of atmospheric NO2. The Weather Research and Forecasting model coupled with the Community Multiscale Air Quality modeling system (WRF-CMAQ) and the Weather Research and Forecasting model coupled with Chemistry (WRF-Chem) have been used extensively [4,5,6]. Shin et al. (2018) [7] carried out a linear regression analysis of NO2 in Japanese metropolises using a spatiotemporal random tree model and found this model advantageous for simulating spatiotemporal changes of NO2. Zhan et al. (2018) [8] established a new model known as random forest space-time Kriging (RF-STK) and used it to assess the exposure risks of NO2 and SO2 in some regions of China.
The most critical issue in air pollution management is predicting the concentration and distribution of pollutants, because air pollution cannot be controlled merely by analyzing pollution that has already occurred. Moolchand et al. (2021) [9] established a modified model for extrapolating air pollutants based on historical and current meteorological datasets and evaluated various classifiers on data from 196 cities in India, finding that the accuracy of robust linear regression was 94–96%. This accuracy improved to some extent when various clustering algorithms were applied: the optimal accuracy of the decision-tree classifier was 99.7%, and the random forest classifier raised the accuracy by 0.02%, indicating that machine-learning algorithms are more accurate than the linear model in predicting air pollutants. Sriram et al. (2021) [10] predicted the air quality index (AQI) in Delhi using decision tree, support vector machine (SVM), naive Bayes, logistic regression, random forest and K-nearest neighbor classifiers as supervised machine-learning algorithms, finding that the decision tree method produced the best results, with an overall accuracy of 99.8%. The results of prediction models based on big data analysis and machine learning can help assess current air quality and compare assessments. In the present study, we established a NO2 column concentration distribution prediction model based on random forest regression, using mainly the time sequence analysis and influencing factor prediction methods, in order to compare the advantages and disadvantages of the two methods and their respective application settings. Wang et al. [11] used TROPOMI and HRRR data to develop a random forest model for estimating ground-level ozone concentrations in California. This model allows the contribution of satellite data products to be assessed in a concise modelling framework, and their findings suggest that TROPOMI data improve the estimation of extremes in ground-level ozone modelling. It could also accelerate future research on applying satellite data products and high-resolution meteorological data to predict ground-level ozone concentrations. Long et al. [12] developed models for estimating daily ground-level NO2 in China using four tree-based machine-learning models (decision tree (DT), gradient boosted decision tree (GBDT), random forest (RF) and extra tree (ET)). Through spatio-temporal analysis and comparison, they found that the estimated high-resolution results were consistent with ground-based observations of NO2 and that, of the four models, the extra-tree model with spatio-temporal information (ST-ET) outperformed the remaining three for the 2019 estimation. Together with the large number of other studies based on tree models, this demonstrates the generalizability of tree-based machine-learning models for atmospheric pollution studies at a global scale.
Much past research has compared one or several different models. Rarely has there been an analysis of different ideas and approaches applied to a single model. Moreover, in the traditional use of machine-learning models, the results of a single model are mostly taken as the conclusion. In contrast to previous studies, we discuss two commonly used methods for prediction and analysis, both based on the random forest regression (RFR) model. We investigate the advantages, disadvantages and applicability of both methods and, as an extension of the random forest regression model, provide a more detailed quantitative analysis of the relationship between influencing factors and atmospheric pollutants.
A study by Rui F et al. (2019) [13] showed that machine learning takes less than one percent of the computation time of traditional atmospheric models. Simulating hourly values of seven air pollutants for 4 months in 2018 with a WRF-based model would take more than 6 days, whereas the same data would take less than 1 h with machine learning on a personal laptop with four cores. Because the random forest model computes faster and has lower technical requirements than other models, such as the WRF and neural network models, it is more accessible for wider use. Therefore, we chose the random forest regression model for our research.
Beijing, the capital and the political, economic and cultural center of China, is a world-famous ancient capital and modern international city. It is located in the north of China on the North China Plain, adjacent to Tianjin in the east and Hebei in the west, with its center at 116°20′ E and 39°56′ N (Figure 1).
Geographically, Beijing is high in the northwest and low in the southeast; its west, north and northeast sides are surrounded by mountains, and the southeast side is a plain sloping gently toward the Bohai Sea. Beijing has a warm temperate, semi-humid to semi-arid monsoon climate, hot and rainy in summer and cold and dry in winter.
As the capital of China, Beijing is the city that responds most promptly to policy and was also the earliest in China to monitor air pollutants. The changes in air pollutants in Beijing are representative of most major cities in China.
2. Data and Methods
2.1. Data Sources
Satellite data were obtained from the Ozone Monitoring Instrument (OMI) aboard NASA’s Aura satellite (https://disc.gsfc.nasa.gov/ (accessed on 15 October 2021)) [14]. In the present study, we used the OMI/Aura NO2 tropospheric column L3 global grid 0.25 × 0.25 degrees V3 product. As this product has already been filtered and retains only data with a cloud fraction <30%, no additional filtering is necessary. In addition, we used hourly real-time air quality monitoring data released by the National Urban Air Quality Real-time Publishing Platform of the China National Environmental Monitoring Centre (http://www.cnemc.cn/ (accessed on 15 October 2021)). The data used in this study were daily mean values calculated from the hourly NO2 data.
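The daily averaging step described above can be sketched with pandas. This is only an illustration under stated assumptions: the hourly series is synthetic, the variable names are hypothetical, and the handling of missing hours (skipped by the mean) is an assumption rather than the procedure confirmed by the source.

```python
import numpy as np
import pandas as pd

# Synthetic hourly NO2 readings for one hypothetical station over two days.
hours = pd.date_range("2021-01-01", periods=48, freq=pd.Timedelta(hours=1))
no2_hourly = pd.Series(np.where(hours.day == 1, 40.0, 60.0), index=hours)

# Simulate one missing hourly reading; mean() skips NaN by default.
no2_hourly.iloc[5] = np.nan

# Aggregate the hourly values into one mean value per calendar day.
no2_daily = no2_hourly.resample("D").mean()
print(no2_daily)
```

In practice one might also require a minimum number of valid hours per day before accepting a daily mean; that threshold is not specified in the text.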
From the reanalysis data released by the National Centers for Environmental Prediction (NCEP)/National Center for Atmospheric Research (NCAR) (https://psl.noaa.gov/data/gridded/data.ncep.reanalysis.html (accessed on 15 October 2021)), we obtained the lifted index (LI, °C), tropospheric temperature (K), atmospheric pressure (Pa), precipitable water volume (PWV, kg/m²) and relative humidity (RH, %).
2.2. Methods
In the random forest regression module of the Python Sklearn library, the maximum depth (max_depth) determines how far each decision tree can grow downward: the greater the maximum depth, the more closely the model fits the training data. However, an excessive maximum depth may result in overfitting. The number of trees determines the size of the random forest model: in general, the more trees, the more accurate the result [15]. The random number (random_state) seeds the randomness used in building the forest. If no random number is specified, each run produces a different result; fixing it makes runs reproducible, which helps the user compare settings and find better hyperparameters. The learning curve of the model indicates that an excessively complex model loses accuracy, meaning that too many trees and too great a depth increase the computation time and reduce the accuracy of the model. For this reason, careful selection of the hyperparameters can greatly increase the accuracy and speed of the random forest model (Figure 2).
Based on the above, three main hyperparameters are required to establish a random forest: the number of decision trees to be produced (n_estimators), the depth of the tree model (max_depth) and the random number (random_state) [16].
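As a minimal sketch of how these three hyperparameters appear in scikit-learn (where the tree-count argument is spelled `n_estimators`), the example below fits a `RandomForestRegressor` twice with the same `random_state` and checks that the predictions agree. The training data are synthetic and the hyperparameter values are arbitrary placeholders, not the values used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for real predictors and NO2 values.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.1, size=200)


def fit_predict(seed):
    """Fit a forest with fixed size and depth, varying only the random seed."""
    model = RandomForestRegressor(
        n_estimators=50,    # number of decision trees (placeholder value)
        max_depth=6,        # maximum depth of each tree (placeholder value)
        random_state=seed,  # seed for bootstrap sampling and feature selection
    )
    model.fit(X, y)
    return model.predict(X[:5])


# The same seed gives identical forests, so results can be compared fairly.
same_seed_equal = np.allclose(fit_predict(42), fit_predict(42))
print(same_seed_equal)
```

Leaving `random_state` unset would generally give slightly different predictions on each run, which makes it harder to tell whether a change in accuracy comes from the hyperparameters or from chance.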
In this study, we used the Python GDAL, Pandas, Numpy, Scipy, Sklearn and Jupyter modules to process data and generate images. GDAL is particularly powerful for working with grid images, so we used it to read the rasters, which were then processed as matrices. To ensure the accuracy of the model while avoiding overfitting, we selected hyperparameters that gave an R2 score of less than 0.98 when establishing the model.
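One possible reading of this selection rule is sketched below: sweep a small grid of `n_estimators` and `max_depth` values, compute the training R2 for each combination, and keep the best-scoring combination that stays below the 0.98 cap. The grid, the synthetic data and the tie-breaking are all assumptions made for illustration; the source does not describe the search procedure in this detail.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic noisy data; real inputs would be raster values read as matrices.
rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 4))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.3, size=300)

R2_CAP = 0.98  # training scores at or above this are treated as overfitting
best = None    # (training R2, n_estimators, max_depth)

for n_estimators in (20, 50, 100):       # hypothetical candidate values
    for max_depth in (2, 4, 8, 16):      # hypothetical candidate values
        model = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0
        )
        model.fit(X, y)
        train_r2 = model.score(X, y)     # R2 on the training set
        # Keep the highest training R2 that is still below the cap.
        if train_r2 < R2_CAP and (best is None or train_r2 > best[0]):
            best = (train_r2, n_estimators, max_depth)

print(best)
```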
The time sequence prediction model was established by using the NO2 column distributions of n successive years as the input and the NO2 concentration distribution of year n + 1 as the target (label) value; training on these data yielded the optimal hyperparameters. Using the trained model, we predicted the NO2 concentration of year n + 2 and obtained good prediction results.
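The time sequence setup can be illustrated as follows, treating every grid cell as one sample whose features are its values in n successive years. The yearly grids here are synthetic arrays rather than OMI rasters, and shifting the input window forward by one year to predict year n + 2 is an assumption about how the trained model is applied; the hyperparameter values are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 3                      # number of input years (hypothetical choice)
height, width = 20, 20     # synthetic grid size

# Build n + 2 synthetic yearly grids with a slow downward trend plus noise.
base = rng.uniform(2.0, 10.0, size=(height, width))
years = np.stack([
    base * (1.0 - 0.03 * k) + rng.normal(scale=0.1, size=base.shape)
    for k in range(n + 2)
])


def pixels_as_rows(stack):
    """Reshape (years, H, W) into (H*W, years): one row per grid cell."""
    return stack.reshape(stack.shape[0], -1).T


# Train: years 1..n are the features, year n + 1 is the target.
X_train = pixels_as_rows(years[:n])
y_train = years[n].ravel()
model = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0)
model.fit(X_train, y_train)

# Predict: shift the window to years 2..n + 1 to estimate year n + 2.
X_next = pixels_as_rows(years[1:n + 1])
pred_next = model.predict(X_next).reshape(height, width)

# Compare with the synthetic "observed" grid for year n + 2.
rmse = float(np.sqrt(np.mean((pred_next - years[n + 1]) ** 2)))
print(pred_next.shape, rmse)
```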
In the influencing factor prediction model, much of the human activity data, especially industrial and traffic data, is not available as grid (raster) images but only as monthly or yearly means. We therefore selected only part of the meteorological data as influencing factors, which does not mean that these are the only influencing factors. Because this paper discusses only the scenarios in which the two methods are used and does not analyze the NO2 column concentration of the study area in depth, the influencing factors selected are simply those that allow a relatively accurate model to be established.
The model R2 and RMSE values shown in this paper refer only to the training set; the RMSE for the predicted data set is discussed in detail elsewhere in the paper.
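To make the distinction between training-set and prediction-set metrics concrete, the sketch below computes R2 and RMSE separately for the data a model was trained on and for data it has not seen. The five synthetic feature columns are hypothetical stand-ins for the five meteorological factors, and the 75/25 random split is an assumption for illustration only, not the evaluation design of this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Five synthetic columns standing in for LI, temperature, pressure, PWV and RH.
rng = np.random.default_rng(2)
X = rng.uniform(size=(400, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0)
model.fit(X_train, y_train)


def r2_and_rmse(y_true, y_pred):
    """Return (R2, RMSE); RMSE is the square root of the mean squared error."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return r2_score(y_true, y_pred), rmse


train_r2, train_rmse = r2_and_rmse(y_train, model.predict(X_train))
test_r2, test_rmse = r2_and_rmse(y_test, model.predict(X_test))
print(f"train: R2={train_r2:.3f}, RMSE={train_rmse:.3f}")
print(f"test:  R2={test_r2:.3f}, RMSE={test_rmse:.3f}")
```

Training-set scores of a random forest are usually more optimistic than scores on unseen data, which is why the two are best reported separately.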
Figure 3 shows the flow diagram for the adjustment of the model parameters used in this paper.