**1. Introduction**

With industrialization and urbanization, air pollution in most countries has worsened over the years. Many areas, including north China and the region south of the Yangtze River, have suffered severe and persistent haze. High air pollutant concentrations not only degrade the environment but also cause respiratory diseases [1–6]. To enable the government to put forward reasonable measures for mitigating air pollution, it is necessary to predict air pollutant concentrations accurately in real time or near real time.

Forecasting techniques can generally be divided into deterministic and stochastic approaches. Deterministic models are suited to trend forecasting over a wide range, whereas stochastic models are suited to single-site prediction. Deterministic air quality models based on numerical models mainly include Chem models, the Community Multiscale Air Quality (CMAQ) model [7] and the Nested Air Quality Prediction Model System (NAQPMS) [8]. They use meteorological data and emission source data to estimate the diffusion of air pollutants through physical and chemical processes. Such models have a solid theoretical foundation and are relatively transparent. However, their accuracy is strongly influenced by the boundary and initial conditions. Furthermore, historical data are not used in the model. At the same time, the computations are complex and demand substantial computing resources, so these models are difficult to fully understand and quantify [9–11].

**Citation:** Ding, W.; Qie, X. Prediction of Air Pollutant Concentrations via RANDOM Forest Regressor Coupled with Uncertainty Analysis—A Case Study in Ningxia. *Atmosphere* **2022**, *13*, 960. https://doi.org/10.3390/atmos13060960

Academic Editors: Duanyang Liu, Kai Qin and Honglei Wang

Received: 24 April 2022 Accepted: 8 June 2022 Published: 14 June 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Stochastic methods use machine learning to mine the relationships between air pollutant concentrations and influential factors, including meteorological variables and human activities, and then predict future air pollutant concentrations [12–14]. Statistical methods, including principal component analysis (PCA), kriging, inverse distance weighting [21,22], land-use regression (LUR) and artificial neural networks (ANNs) [23–26], are considered more reliable tools for predicting air pollutant concentrations than deterministic approaches [15–20]. Regression methods can learn the intrinsic relationships between the influential factors and air pollutant concentrations [27]. Harishkumar [28] proposed using the geographically weighted regression (GWR) method to study the relationships between air pollutant concentrations and the influential factors, and achieved good results. LUR is technically simple, easy to fit and offers high spatial resolution. Since its emergence in 1997, it has been applied to predicting air pollutant concentrations. However, regression methods do not consider the spatial correlation in air pollution data and overestimate the importance of covariates. In addition, because the errors do not satisfy the assumption of being independent and identically distributed, the predictive ability of regression methods is low in the space-time domain. ANNs generally perform better than the numerical air quality models CMAQ and NAQPMS. ANNs have the advantages of requiring less sample data, simple modeling, convenient operation and small relative error [17,20]. However, ANNs also have disadvantages, such as poor generalization ability, overfitting and a tendency to become trapped in local optima.

Geostatistics is based on the principle that observations closer together in the space-time domain are more similar than observations farther apart [29]. Geostatistics does not assume sample independence, and it obeys the constraint of a normal distribution to obtain a good fit to the data. However, adding the time dimension introduces spatiotemporal heterogeneity, which makes the visualization and analysis of spatiotemporal data quite challenging. In addition, spatiotemporal data usually contain long time series of air pollution [18]. It is necessary to impose strong assumptions on the process [21,22].
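The proximity principle can be illustrated with inverse distance weighting, one of the statistical methods listed above. The following is a minimal sketch, not a geostatistical (kriging) model and not the method used in this paper: the function name `idw`, the `power` parameter and the station coordinates and values are hypothetical and chosen only for illustration.

```python
import math


def idw(stations, target, power=2.0):
    """Estimate a value at `target` as a distance-weighted mean of station values.

    stations: list of (x, y, value) tuples (hypothetical coordinates and readings).
    target:   (x, y) location to estimate.
    power:    exponent controlling how quickly influence decays with distance.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            # The target coincides with a station: return its observation directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two made-up stations 10 units apart with different concentrations.
stations = [(0.0, 0.0, 10.0), (10.0, 0.0, 30.0)]
print(idw(stations, (5.0, 0.0)))  # midpoint: both stations contribute equally
print(idw(stations, (1.0, 0.0)))  # near the first station: its value dominates
```

Nearer stations receive larger weights, so the estimate close to a station stays close to that station's observation. Unlike kriging, this scheme does not model the spatial covariance of the data.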

In this paper, random forest regressors (RFRs) are employed to predict air pollutant concentrations. RFRs feature adaptive training and tuning, effectively establish the relationships between meteorological variables and air pollutant concentrations, suppress overfitting well and improve prediction accuracy. In addition, the limitation of machine learning to single-site prediction is overcome.
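To make the idea of a random forest regressor concrete, the sketch below grows a small forest of regression trees on bootstrap samples and averages their predictions. It is an illustrative, standard-library-only toy, not the authors' implementation: the function names, the hyperparameters and the synthetic "temperature, wind speed, humidity" data are all assumptions introduced here.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def grow_tree(X, y, rng, max_features, depth=4, min_leaf=5):
    """Grow one regression tree; each split considers a random subset of features."""
    if depth == 0 or len(y) < 2 * min_leaf:
        return _mean(y)  # leaf: predict the mean target of the samples in the node
    best = None  # (sum of squared errors, feature index, threshold)
    for f in rng.sample(range(len(X[0])), max_features):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            ml, mr = _mean(left), _mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, t)
    if best is None:
        return _mean(y)  # no admissible split was found
    _, f, t = best
    li = [i for i, row in enumerate(X) if row[f] <= t]
    ri = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            grow_tree([X[i] for i in li], [y[i] for i in li], rng, max_features, depth - 1, min_leaf),
            grow_tree([X[i] for i in ri], [y[i] for i in ri], rng, max_features, depth - 1, min_leaf))


def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are floats
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=20, max_features=2, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], rng, max_features))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return _mean([predict_tree(tree, row) for tree in forest])


# Synthetic data (NOT from the paper): concentration falls with wind speed.
data_rng = random.Random(42)
X, y = [], []
for _ in range(150):
    temp = data_rng.uniform(-10, 30)   # hypothetical temperature
    wind = data_rng.uniform(0, 6)      # hypothetical wind speed
    hum = data_rng.uniform(20, 80)     # hypothetical relative humidity
    X.append([temp, wind, hum])
    y.append(80 - 10 * wind + 0.3 * hum + data_rng.gauss(0, 3))

forest = fit_forest(X, y)
calm = predict_forest(forest, [15.0, 1.0, 50.0])
windy = predict_forest(forest, [15.0, 5.0, 50.0])
print(calm, windy)
```

Averaging many trees, each trained on a different bootstrap sample with a random subset of features per split, is what reduces the variance of a single tree and helps limit overfitting. In practice an established library implementation would normally be used instead of this toy.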

The remainder of this paper is organized as follows. The next section presents the study area and the data collected. Section 3 presents the basic concepts of FFANN-BP, DTR and RFR, and explains how the validity indices can be used to identify and compare the predicted results. A critical analysis follows, based on predictions of air pollutant concentrations using data from 2016–2017. Finally, we discuss the results of our experiments and conclude our work.

### **2. Area Description**

#### *Study Area*

Ningxia is located in the inland area of northwest China, between latitudes 35°14′ N–39°14′ N and longitudes 104°17′ E–109°39′ E, adjacent to Shaanxi in the east, Inner Mongolia in the west and Gansu in the north, with a total area of 66,400 square kilometers and a permanent population of 6.8179 million. The topography of Ningxia slopes gradually downward from southwest to northeast. It is divided into three parts: the irrigation area of the Yellow River in the north, the arid zone in the middle and the mountain area in the south. Located within the Yellow River system, Ningxia has a temperate continental arid and semi-arid climate, with high terrain in the south and low terrain in the north. The Liupan Mountains in the south are wet and rainy, with low temperatures and a short frost-free period. The northern part has abundant sunshine, strong evaporation and a large temperature difference between day and night, with annual sunshine reaching 3000 h.

Ningxia, located on the western margin of China's monsoon area, is affected by the southeast monsoon in summer and receives little precipitation; July is the hottest month, with an average temperature of 24 °C. In winter, it is strongly affected by the northwest monsoon, with large fluctuations in temperature and an average temperature as low as −9 °C. The annual precipitation in the whole region ranges from 150 mm to 600 mm. The average annual water surface evaporation in Ningxia is 1250 mm, ranging from 800 to 1600 mm. Furthermore, the prevailing north wind lowers the humidity level [30].

In Ningxia, the extremely hot and dry climatic conditions play an important role in the resuspension of fine particles, and both sandstorms and domestic fuel use are sources of air pollution. According to the Ningxia annual reports on air quality, O3 and particulate matter (PM) are the most important air pollutants in the region [31]. Ningxia has 15 air monitoring stations of the China National Environmental Protection Agency and 12 meteorological stations.
