*4.3. Experimental Results and Interpretations*

For comparison, DTR, FFANN-BP and RFR models are trained to predict air pollutant concentrations at the three monitoring stations of Ningxia at a local scale, and are used to evaluate the ability of the two-layer random forest model to estimate air pollutant concentrations. The data from 1 January 2016 to 30 June 2017 are used for model training, and the remaining data are used for model prediction. Each of the three models is trained on the training set, and its parameters are fine-tuned according to the experimental results. The flowchart of our method is shown in Figure 6.
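The chronological split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the column name `CO` and the end date of the record (31 December 2017) are placeholders, and `chronological_split` is a hypothetical helper rather than the code used in this study.

```python
import numpy as np
import pandas as pd

def chronological_split(df, train_end="2017-06-30"):
    """Split a date-indexed frame into rows up to train_end and the rows after it."""
    cutoff = pd.Timestamp(train_end)
    train = df.loc[df.index <= cutoff]
    test = df.loc[df.index > cutoff]
    return train, test

# Synthetic daily values standing in for the station records (not the real data).
dates = pd.date_range("2016-01-01", "2017-12-31", freq="D")  # assumed record length
rng = np.random.default_rng(0)
data = pd.DataFrame({"CO": rng.random(len(dates))}, index=dates)

train, test = chronological_split(data)
```

Splitting by date rather than at random keeps every test day strictly later than every training day, which avoids leaking future information into the fitted models.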

The initial values of the parameters are set according to the algorithmic characteristics of each model and prior parameter-tuning experience, and the grid search provided by scikit-learn is used for hyperparameter optimization. In this paper, the base model of the random forest is DTR, and the candidate values for the number of DTRs are 10, 20, 30, 40 and 50. Other hyperparameters, such as the maximum number of samples and the minimum number of samples for splitting at the leaf nodes, are kept at their default minimum values. The final number of DTRs is 20. The stopping criterion is met if there is no improvement in *R*<sup>2</sup> after ten iterations, combined with a maximum of 500 iterations. The optimal parameters of FFANN-BP are a least mean square error of 0.001, a maximum training time of 1000 and a learning rate of 0.15. The size of the network and the learning parameters greatly affect prediction performance. The best network structure found has 5 input nodes and 12 hidden nodes. The output layer has only one neuron, corresponding to the air pollutant concentration. It has been demonstrated that the BFGS algorithm is the most efficient method for optimizing the objective function because of its speed and robustness. Due to space constraints, this paper only shows the experimental results for the Ma Lian Kou air monitoring station.
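As an illustration of the grid search over the number of trees, the following sketch uses scikit-learn's `GridSearchCV` with the candidate values listed above. The data are synthetic, and the choice of `TimeSeriesSplit` with three folds and of *R*<sup>2</sup> as the selection score are assumptions made for the example, not settings reported in this paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic predictors and target (placeholders for the station data).
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(200)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20, 30, 40, 50]},  # candidate numbers of DTRs
    scoring="r2",
    cv=TimeSeriesSplit(n_splits=3),  # assumption: folds that preserve time order
)
search.fit(X, y)
best_n = search.best_params_["n_estimators"]
```

On real data, `best_n` would be compared with the value of 20 reported above; on the synthetic data used here the selected value carries no meaning.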

**Figure 6.** The flowchart of our method.

To verify the performance of the DTR, FFANN-BP and RFR models used in this study, Table 6 reports the RMSE, *R*<sup>2</sup>, MAE and MAPE between the measured and predicted air pollutant concentrations for the three models at the Ma Lian Kou, Sha Po Tou and Ma Yuan air monitoring stations. The *R*<sup>2</sup> of the three machine learning models lies between 0.44 and 0.99, which shows that these statistics are within the recommended range for all three models. The RMSE of the models lies between 0.25 and 126.7, and the RMSE of the RFR model is the lowest. In terms of MAE, the RFR model has the lowest value, 6.93, followed by the FFANN-BP model with 7.74, and the DTR model has the highest, 10.6. The RFR model also has the lowest MAPE of the three models, 17.56. RFR therefore shows good experimental results. The time series plots in Figure 7 depict the relationships between the observed and predicted data. These results indicate the good fit of the RFR to the observed data. Following the same methodology, fits were also made for the other air pollutants as dependent variables using DTR, FFANN-BP and RFR, with the results as follows. RFR is the best model for predicting air pollutant concentrations at the three air monitoring stations at a local scale, since its correlation coefficient reaches 0.99.
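For reference, the four statistics can be computed as in the sketch below. It assumes the standard definitions (with MAPE expressed as a percentage and *R*<sup>2</sup> as the coefficient of determination); the function name `regression_metrics` and the three sample values are hypothetical and are not taken from Table 6.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, R2, MAE and MAPE (in percent) for paired observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = 100.0 * np.mean(np.abs(err / y_true))  # undefined if any observation is zero
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSE": rmse, "R2": r2, "MAE": mae, "MAPE": mape}

# Toy example with made-up numbers.
metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

RMSE and MAE are in the units of the pollutant concentration, so their ranges across pollutants are not directly comparable, whereas *R*<sup>2</sup> and MAPE are unit-free.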

The time series plots of the ground-measured air pollutant concentrations and the predictions by DTR, FFANN-BP and RFR are shown in Figures 7–9. There is a high level of agreement between the observed and predicted data. The predicted concentrations of RFR are also closer to the observed data than those of DTR and FFANN-BP, meaning that RFR improves the prediction of air pollutant concentrations. We also employ histograms to provide further insight into the relationship of the predictors with air pollutant concentrations in Figures 10–12. The RFR predictions are very good, since the error histogram is very steep, and the fit is also considerable for the other pollutants in Figures 10–12. In addition, the models are evaluated by their construction time together with RMSE, MAE and MAPE, so that both the prediction accuracy and the construction efficiency of the different machine learning models are compared. Appropriate variables are selected for the prediction of air pollutant concentrations. In terms of prediction accuracy, the RFR model has the best prediction ability, followed by the FFANN-BP model and then the DTR model. RFR has stable accuracy and good prediction capability. The results show that RFR not only improves the prediction of air pollutant concentrations in Ningxia, but also identifies the influential factors and reduces the dimension of the data, and therefore reduces the time complexity of the algorithm.

**Table 6.** The prediction performance of the DTR, FFANN-BP and RFR models for the concentrations of six air pollutants at Ma Lian Kou, Sha Po Tou and Ma Yuan in 2016.


**Figure 7.** The observed concentrations and the concentrations predicted by DTR, FFANN-BP and RFR for *CO* and *NO*<sub>2</sub> at Ma Lian Kou, where the red line represents the observations and the black, green and light cyan lines represent the predictions of DTR, FFANN-BP and RFR, respectively.

**Figure 8.** The observed concentrations and the concentrations predicted by DTR, FFANN-BP and RFR for *O*<sub>3</sub> and *SO*<sub>2</sub> at Ma Lian Kou, where the red line represents the observations and the black, green and light cyan lines represent the predictions of DTR, FFANN-BP and RFR, respectively.

**Figure 9.** The observed concentrations and the concentrations predicted by DTR, FFANN-BP and RFR for *PM*<sub>2.5</sub> and *PM*<sub>10</sub> at Ma Lian Kou, where the red line represents the observations and the black, green and light cyan lines represent the predictions of DTR, FFANN-BP and RFR, respectively.

**Figure 10.** The error histogram of the prediction for *CO* and *NO*<sub>2</sub> at Ma Lian Kou.

**Figure 11.** The error histogram of the prediction for *O*<sub>3</sub> and *SO*<sub>2</sub> at Ma Lian Kou.

**Figure 12.** The error histogram of the prediction for *PM*<sub>2.5</sub> and *PM*<sub>10</sub> at Ma Lian Kou.

RFR uses the average reduction of node impurity to describe the importance of the variables. The greater the reduction of node impurity attributable to a factor, the more important the factor is. The importance of variables in the decision tree model is measured in the form of weights. The greater the weight of a factor, the stronger its influence on the concentration of air pollutants. In this research, the importance of each factor for the prediction of air pollutant concentrations is further analyzed. Figures 13–15 and Table 7 show the analysis of the most important features of DTR and RFR for the six air pollutants at Ma Lian Kou. The characteristic variables considered include meteorological factors and the air pollutant concentrations of the previous day.
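The impurity-based ranking described above corresponds to the `feature_importances_` attribute of a fitted scikit-learn random forest, as the following sketch shows. The data are synthetic, and the predictor labels are hypothetical names chosen only to resemble the variables in Table 7.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical predictor labels; the values are random placeholders.
names = ["NO2_prev", "PM10_prev", "mean_GST", "mean_RHU", "SSD"]
rng = np.random.default_rng(0)
X = rng.random((300, len(names)))
# The synthetic target depends mostly on the first predictor.
y = 3.0 * X[:, 0] + 0.5 * X[:, 2] + 0.05 * rng.standard_normal(300)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Mean decrease in impurity per predictor, normalised to sum to 1.
importances = forest.feature_importances_
ranking = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
```

Because the importances are normalised, they describe the relative contribution of each predictor within one model and should not be compared across pollutants as absolute effect sizes.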

For *CO*, the concentration of *NO*<sub>2</sub> ranks first and contributes the most. For *NO*<sub>2</sub>, the *PM*<sub>10</sub> concentration ranks first and contributes the most. For *O*<sub>3</sub>, ground surface temperature ranks first and contributes the most. For *SO*<sub>2</sub>, the *NO*<sub>2</sub> concentration ranks first and contributes the most. For *PM*<sub>2.5</sub>, the *PM*<sub>10</sub> concentration ranks first and contributes the most, and for *PM*<sub>10</sub>, the *PM*<sub>2.5</sub> concentration ranks first and contributes the most. As shown in Table 5, the weight importances of temperature, relative humidity and air pressure are 14 and 25 in turn, indicating that ground surface temperature and relative humidity have the greatest impact on the concentration of air pollutants predicted by DTR, followed by air pressure and precipitation, while wind speed has the least impact. Figures 13–15 and Table 7 show the importance analysis of the influencing factors when the decision tree and random forest algorithms are used to predict the concentrations of the air pollutants in 2016. For *CO*, *NO*<sub>2</sub> is the most important factor in both methods. For *NO*<sub>2</sub>, *PM*<sub>10</sub> is the most important factor. For ozone, the ground surface temperature is the most important factor. *PM*<sub>2.5</sub> and *PM*<sub>10</sub> are the most important influencing factors for each other. *NO*<sub>2</sub> is the most important factor in the prediction of *SO*<sub>2</sub>.

**Figure 13.** The importance of predictor variables of RFR for (**a**) *CO* and (**b**) *NO*<sub>2</sub> at Ma Lian Kou.

**Figure 14.** The importance of predictor variables of RFR for (**a**) *O*<sub>3</sub> and (**b**) *SO*<sub>2</sub> at Ma Lian Kou.

**Figure 15.** The importance of predictor variables of RFR for (**a**) *PM*<sub>2.5</sub> and (**b**) *PM*<sub>10</sub> at Ma Lian Kou.


**Table 7.** The importance of the influential factors at Ma Lian Kou in 2016.

Mean GST, Mean RHU and SSD represent the average ground surface temperature, average relative humidity and sunshine duration, respectively.

Table 8 shows the running time of the three algorithms for the concentrations of the six pollutants at Ma Lian Kou in 2016. The running time of DTR is the shortest because of its simple structure, while the FFANN-BP model takes the longest to build, followed by the RFR model. The running time of RFR is much lower than that of FFANN-BP, which indicates that RFR has low time complexity relative to FFANN-BP.
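A running-time comparison of this kind can be sketched with `time.perf_counter`, as below. The data are synthetic, and scikit-learn's `MLPRegressor` with the `lbfgs` solver (a limited-memory BFGS variant) is used here only as an assumed stand-in for FFANN-BP; the timings it produces are not those of Table 8.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (placeholders for the station records).
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = X.sum(axis=1) + 0.1 * rng.standard_normal(300)

models = {
    "DTR": DecisionTreeRegressor(random_state=0),
    "RFR": RandomForestRegressor(n_estimators=20, random_state=0),
    # Assumed stand-in for FFANN-BP: one hidden layer with 12 nodes.
    "FFANN-BP": MLPRegressor(hidden_layer_sizes=(12,), solver="lbfgs",
                             max_iter=500, random_state=0),
}

fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds[name] = time.perf_counter() - start  # wall-clock fitting time in seconds
```

Wall-clock timings depend on hardware and data size, so a fair comparison would repeat each fit several times on the same machine and report the average.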

Because all three methods involve randomness, their accuracy cannot be evaluated from a single experimental run. Therefore, this paper runs 1000 Monte Carlo experiments and takes the average of the results to evaluate the accuracy of the three methods. The results in Table 9 show that the accuracy and prediction stability of RFR are better than those of the other two methods.
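One way to organise such repeated runs is sketched below for the random forest only. It is an assumption of this example that each run differs only in its random seed, that the summarised quantity is the mean predicted concentration over the test period, and that the interval is a normal-approximation 95% confidence interval; `monte_carlo_summary` is a hypothetical helper, and the demonstration uses 20 runs on synthetic data to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def monte_carlo_summary(X_train, y_train, X_test, n_runs=1000, z=1.96):
    """Refit the forest with n_runs different seeds and summarise the mean prediction."""
    run_means = np.empty(n_runs)
    for seed in range(n_runs):
        model = RandomForestRegressor(n_estimators=20, random_state=seed)
        model.fit(X_train, y_train)
        run_means[seed] = model.predict(X_test).mean()
    mean = run_means.mean()
    variance = run_means.var(ddof=1)  # sample variance across runs
    half_width = z * np.sqrt(variance / n_runs)  # normal approximation for the mean
    return mean, variance, (mean - half_width, mean + half_width)

# Synthetic data (placeholders for the station records).
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X.sum(axis=1) + 0.1 * rng.standard_normal(200)

mean_pred, var_pred, ci = monte_carlo_summary(X[:150], y[:150], X[150:], n_runs=20)
```

A small variance across runs indicates that the predictions are insensitive to the random seed, which is the sense of prediction stability used above.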


**Table 8.** The runtime of the three algorithms at Ma Lian Kou in 2016 (in seconds).

**Table 9.** The mean, variance and the confidence interval of the predicted concentrations for six air pollutants based on 1000 Monte Carlo experiments at Ma Lian Kou in 2016.


These results highlight that, for extreme concentrations of air pollutants, DTR does not perform well. The reason is that construction activity in Ningxia was particularly intense during this period. However, RFR still performs acceptably even when such events occur suddenly. For particulate matter, the performance of DTR and FFANN-BP decreases, which makes the variance of the predicted particulate matter concentrations larger, whereas RFR remains more adaptable than FFANN-BP and DTR. This shows that the DTR model has poor ability to predict air pollutant concentrations from the meteorological elements, and the RFR model is recommended for predicting air pollutant concentrations.

#### **5. Conclusions**

In this study, Ningxia Province, where air pollution has been increasing in recent years, is selected as the research area. The concentrations of *CO*, *PM*<sub>10</sub> and *PM*<sub>2.5</sub> were higher in the cold and dry winter than in summer because of the combustion of fossil fuels for heating. The aim of this study was to propose a modelling procedure that yields satisfactory results for the prediction of ambient air pollutant concentrations. In this work, DTR, FFANN-BP and RFR models were proposed for predicting the air pollutant concentrations in Ningxia, China. Air pollutant concentrations were observed at three air monitoring stations, covering the capital of Ningxia and its rural areas.

The collected data for air pollutant concentrations and meteorological variables were used for the development of DTR, FFANN-BP and RFR. The data were prepared by calculating the average of the air pollutant concentrations for each day of the study period. The comparison shows that RFR is superior to DTR and FFANN-BP. Furthermore, the proposed method has been successfully applied to the analysis of the importance of the predictors, and an uncertainty analysis was conducted based on Monte Carlo experiments. The proposed method worked well in predicting the air pollutant concentrations, and the results reveal a close relationship between air pollutant concentrations and meteorological variables. Hence, the developed model is capable of better forecasting performance for air pollutant concentrations. Because of the generality of the algorithm, it can be applied to other areas and databases.

The model can be incorporated into control and management for cleaner air and a better environment in many cities. Furthermore, we will consider other ways of using the spatial and meteorological information. Our future research will focus on the improvement and optimization of machine learning models. Multimodal analysis can effectively decompose the periodic trend and the noise of air pollutant concentrations over time. Therefore, introducing multimodal analysis into the random forest regression model may improve the prediction accuracy, including the prediction of air pollutant concentrations in extreme pollution weather, which remains a problem to be solved in the future.

**Author Contributions:** Funding acquisition, W.D.; Investigation, W.D.; Methodology, X.Q.; Resources, X.Q.; Supervision, X.Q.; Visualization, W.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Ningxia Natural Science Foundation under Grant no. 2021AAC03223, the National Natural Science Foundation of China under Grant no. 11761002, First-Class Disciplines Foundation of Ningxia under Grant NXYLXK2017B09, Western light project of Chinese Academy of Sciences: Application of big data analysis technology in air pollution assessment.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.
