**Regression Analyses of Air Pollution and Transport Based on Multiple Data Sources—A Decision Support Example for Socially Integrative City Planning**

**Mingyue Liu, Buyang Cao, Mengfan Chen, Otthein Herzog, Edna Pasher, Annemie Wyckmans and Zhiqiang Wu**

#### **1. Introduction and Related Work**

Socially integrative cities are defined as "socially mixed, cohesive, liveable and vibrant. Compactness, functional mix, and intra-urban connectivity as well as equal rights regarding the access to municipal services play an important role. Environmental quality, the quality of public spaces and the quality of life contribute to the well-being of the population. Strengthening a sense of community and fostering a sense of place as well as preserving cultural heritage shape the city's in- and outward-bound image. Investments into neighborhood improvement, service delivery, infrastructure and the quality of housing are important supportive measures. Empowerment and participation of the population, as well as social capital, are indispensable." (Müller et al. 2019, p. 1, emphasis added).

Air quality is one of the most important environmental qualities of a city, which is underlined by the air pollution measurement stations installed in cities all over the world. The impact of outdoor air pollution on the health of city populations is huge (Cohen et al. 2005). The air quality in a city influences the health of its inhabitants in general (WHO n.d.), particularly where living quarters are in close proximity to busy roads and/or industry conglomerates. A recent research preprint (Wu et al. 2020) reports that even small increases in fine particulate matter (PM2.5) had an outsized effect in the US, and that an increase of 1 µg/m<sup>3</sup> corresponded to a 15% increase in COVID-19 deaths. This result is supported by another recent preprint (Travaglio et al. 2020), in which SARS-CoV-2 cases and deaths recorded for several sites across England were compared with regional and subregional air pollution data from public databases. The levels of nitrogen oxide and sulphur dioxide, as markers of poor air quality, were found to be associated with increased numbers of COVID-19-related deaths across England. Particulate matter could also contribute to increased infectivity. The relative contributions of individual fossil fuel sources to key air pollutant levels were also analysed, and the levels of some air pollutants were found to be linked to COVID-19 cases and adverse outcomes.

The formation of air pollution is complex and has yet to be fully understood (Yu et al. 2014). Motor vehicle traffic emissions contribute a significant proportion of pollutants in cities globally, particularly in some developing countries. In China, the situation is serious, especially due to the high PM2.5 and PM10 concentrations in the ambient air of a number of regions (Chen et al. 2017). The linkage between air quality and transportation has been evidenced by previous studies. Hu et al. (2017) proposed an index called the Mutual Information of Air Quality-Traffic-Meteorology Index to describe the combined effects of meteorology and traffic restrictions. Karner et al. (2010) found that different pollutants had significantly different near-roadway dispersion mechanisms. Wang et al. (2019) proposed the mechanism of an air pollution-terrain nexus. Research has pointed to the complexity of air pollution and the multiple influential factors in cities (Liu et al. 2019). Studies disclosed that the concentration of PM2.5 had a strong spatial correlation with SO2 emissions, inversion temperature, GDP, and population density (Yao et al. 2019). Emission control has lately reduced PM concentrations to some extent, but unfavourable weather and climate partially counteract the emission control effects (Wang et al. 2019).

To monitor and improve the air quality in cities, it is important to analyse the huge amount of available air quality data and to determine the spatiotemporal features and causes of pollution.

Kang et al. (2018) give an overview of the current methods for the analysis of air quality and concentrate on reviewing Big Data analytics and machine learning approaches to determine the multidimensional factors influencing air pollution and to make air quality predictions. They describe five data-driven approaches from South Africa, Western USA, Malaysia, and two from China, all of which concentrate on singular air pollution components such as smoke, NO2 or PM10. All of them use statistical models. For the prediction of air pollution, the machine learning approaches discussed include Artificial Neural Networks (ANNs), a combination of an ANN and a genetic algorithm, a random forest model, a decision tree model, a least squares support vector machine model, and a spatiotemporal deep learning model, applied to areas in Greece, Japan, Macau, and three in China. Five of them used one air quality factor and one used two factors. Based on the comparably meagre results, the authors (Kang et al. 2018, p. 8) describe a "Need #2: Research and development of real-time air quality monitor and evaluation systems supporting air quality evaluation and analysis on multiple levels. This demand is caused by the lack of the existing research work addressing the air quality impacts on different levels due to air pollution from a special air source. This suggest[s] the demand on an integrated real-time air quality monitor and evaluation system based on sensor networks and IoT infrastructures at the different levels".

Ye and Ou (2019) used statistical methods to analyse Air Quality Index (AQI) data for determinants and spatiotemporal patterns of air quality in the Yangtze Delta region of China, a densely populated urban agglomeration with a population of more than 220 million, in the years from 2014 to 2016. For the examined areas, they could determine that industrialization, urbanization, total energy consumption and population agglomeration were the most important factors causing air pollution.

Xu et al. (2019) examined the spatiotemporal patterns of air pollution in north China and the influence of meteorological and socio-economic factors on it, based on the daily Air Quality Index of 96 cities from 2014 to 2016. They used statistical analysis and the exploratory spatial data analysis-geographically weighted regression (ESDA-GWR) model. Their analysis shows that on an annual scale, car ownership and industrial production are positively correlated with air pollution, whereas increases in wind speed, per capita gross domestic product (GDP), and forest coverage are associated with reductions in pollution.

The next sections will show that the urban air quality is correlated with urban size, population, industrial infrastructures, shopping centres, and transportation facilities. The insights gained through the models and analyses provide an evidence base for decision-making to ensure a sustainable urban development with respect to air pollution. The analytical results also form the basic framework for testing, monitoring, benchmarking and assessing impacts of the digital urban transition in China, and the associated technologies may be extended to other parts of the world, even if they would be used only for an early warning of potentially dangerous air pollution.

#### **2. Data and Analysis Methods**

#### *2.1. Air Quality and Transportation Data in Tianjin*

Using Tianjin as an example, we study the correlations between air pollution and transportation (traffic) as well as other factors based on multiple data sources. In addition to transportation, the analytics take further factors into account in order to explore the inter-relationships of air quality, industrial entities, daily-life activities, and transportation with annual, monthly and real-time data.

Tianjin, located in the east-central coast of China, is one of four municipalities directly under the Central Government of China with a permanent population urbanization rate of 84% in 2020 over an area of 11,760 km<sup>2</sup> (TJ People 2020), with 16 districts and 240 towns and townships. Tianjin is one of the core cities in the Beijing–Tianjin–Hebei Metropolitan Region, and one of the regions bearing the most air pollution in China. Furthermore, from 2013 to 2017, Tianjin experienced a rise in transportation infrastructure development and an expansion in mechanized road cleaning as the built area of Tianjin expanded to about 145% the size of that in 2013 (National Bureau of Statistics 2019). Based on the data statistics, over 85% of all roads in Tianjin were covered by mechanized road cleaning in 2017 (National Bureau of Statistics 2019).

With respect to air quality data, concentrations of gaseous pollutants and fine particles (NO2, O3, SO2, CO, PM2.5, PM10) were obtained from the Platform for AQ Intelligent Management (Zenqi 2019). Monthly air quality data ranging from December 2013 to February 2019 and real-time air quality data of Tianjin were collected three times a day (8:00, 13:00, and 18:00) during a period of half a month. Annual transportation data from 2013 to 2017 were obtained from the China Statistical Yearbook. Figure 1 illustrates that from 2014 to 2018, the Air Quality Index (AQI) in Tianjin reveals a significant decreasing trend and the annual minimum value also tends to decline.

Furthermore, the pattern of the average monthly AQI (the values were determined according to the standards issued by the Chinese government) in Tianjin is U-shaped over the course of a year—that is, it is high in winter and falls to low values in spring and summer. In October, the AQI value starts to show an upward trend, and it reaches its peak in December of the same year; then it gradually declines from February of the next year onwards. However, the U shape becomes inconspicuous in 2017 and 2018.

The real-time air quality data were collected from 15 air quality monitoring stations in Tianjin (Table 1), whose locations are depicted in Figure 2.

**Figure 1.** Monthly Air Quality Index (AQI) in Tianjin (2014–2018). Source: Figure by authors based on data from (Zenqi 2019).


**Table 1.** Real-time air quality data samples. Source: (Air Pollution in Tianjin 2019).

**Figure 2.** The air quality monitoring stations in Tianjin. Source: Figure by authors based on (Zenqi 2019).

The website AMap provides real-time average traffic speeds (in km/h) on 1843 road segments in Tianjin, which are depicted on the map shown in Figure 3.

#### *2.2. Industrial POI Data in Tianjin*

From AMap, the locations of industrial Points of Interest (POIs) of construction, machinery and electronics, chemical and metallurgy, mining, and other types of manufacturers in Tianjin can be obtained. According to the collected data, there are over 9000 industrial POIs including more than 4000 factories, 505 chemical and metallurgy companies, 30 mining companies, 2644 machinery and electronics companies, and 1806 construction entities. The results of a kernel density analysis for each kind of POI are shown in Figure 4 below (the cell size is 2.3 × 2.3 km); the darker the colour, the more facilities are located there.


**Figure 3.** Real-time traffic of Tianjin on 2019-04-26 at 7 p.m. Source: AMap (2019).

**Figure 4.** The kernel densities of various industrial POIs of Tianjin. Source: Figure by the authors.


#### *2.3. Correlation Analyses*

Correlation analyses were carried out to reveal the relationship between air quality (AQI) and transportation (traffic). However, the correlation analyses did not present very useful information if only transportation (traffic) data were used as the factor impacting the AQI. In order to gain better insights into the causes of air pollution, more data had to be incorporated into the analytical model. All analytical results will be discussed in more detail in Section 3.

#### 2.3.1. Multisource Data Processing and Integration

Figure 5 depicts the real-time traffic data for 1843 road segments together with the industrial POIs in the city of Tianjin.

To facilitate the analysis, the urban area of Tianjin was divided into 3398 grid cells with a size of 2.3 × 2.3 km each, as previously explained. Among the 15 air quality monitoring stations, 12 stations located in the central area and the Binhai new district of Tianjin were chosen to create 12 Tyson polygons (also known as Thiessen or Voronoi polygons; Figure 6). These Tyson polygons cover an area of 2827.38 km<sup>2</sup>, including 537 grid cells. Within these grid cells, there are 27 mining companies, 359 construction companies, 73 machinery companies, 284 chemical companies, and 924 manufacturing factories.

**Figure 5.** Real traffic and industrial POIs in Tianjin. Source: Figure by authors.

Then, real-time traffic and AQI data were assigned to the grid cells where grid cells in the same Tyson polygon were assigned the same AQI value provided by the monitoring station in this polygon. Grid cells intersecting with multiple road segments were assigned the average traffic speed of those road segments at that moment. POIs in each grid cell were counted and also integrated into the data model.
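The integration step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes planar coordinates, uses the fact that a point lies in the Tyson (Thiessen/Voronoi) polygon of its nearest station, and all names and toy values (`nearest_station`, `build_cell_records`, the coordinates and AQI numbers) are hypothetical.

```python
import math


def nearest_station(cell_xy, stations):
    """Return the id of the station closest to a grid-cell centre.

    Nearest-station assignment is equivalent to locating the cell centre
    in the Thiessen (Voronoi) polygon built around the stations.
    """
    return min(stations, key=lambda sid: math.dist(cell_xy, stations[sid]))


def build_cell_records(cell_centres, stations, station_aqi,
                       cell_segment_speeds, cell_poi_counts):
    """Attach AQI, mean traffic speed and POI count to every grid cell."""
    records = {}
    for cell_id, xy in cell_centres.items():
        sid = nearest_station(xy, stations)
        speeds = cell_segment_speeds.get(cell_id, [])
        records[cell_id] = {
            "station": sid,
            # every cell in a polygon inherits the AQI of that polygon's station
            "aqi": station_aqi[sid],
            # average over all road segments intersecting the cell (None if no road)
            "mean_speed": sum(speeds) / len(speeds) if speeds else None,
            "poi_count": cell_poi_counts.get(cell_id, 0),
        }
    return records


# Toy input (made-up values, not Tianjin data)
stations = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
station_aqi = {"A": 80, "B": 120}
cell_centres = {1: (1.0, 1.0), 2: (9.0, 0.5), 3: (4.0, 0.0)}
cell_segment_speeds = {1: [30.0, 50.0], 2: [20.0]}
cell_poi_counts = {1: 3, 3: 1}

records = build_cell_records(cell_centres, stations, station_aqi,
                             cell_segment_speeds, cell_poi_counts)
```

In a real workflow the polygon membership and the segment-to-cell intersection would typically come from a GIS library; the dictionary inputs here only stand in for those results.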

To compare the similarities (patterns) between the air quality time series and the ones of traffic (speeds) in every grid cell, the dynamic time warping (DTW) (Keogh and Pazzani 2001) method was applied. DTW is a time series comparison technique that can essentially be employed to compare any data that are represented as one-dimensional sequences. Here, DTW was utilized to compare the similarity between traffic (speed) and AQI time series; the results will be discussed in Section 3.
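For readers unfamiliar with DTW, a minimal sketch of the classic dynamic-programming formulation is given below. The chapter does not state the local distance or any windowing or normalization used, so the absolute difference and the unconstrained warping path are assumptions, and the function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two one-dimensional sequences.

    cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched again (b stays)
                cost[i][j - 1],      # b[j-1] matched again (a stays)
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Two series with the same shape but different timing align at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
offset = dtw_distance([0, 0], [1, 1])
```

Because traffic speeds and AQI values are on different scales, both series would normally be normalized before the comparison; a smaller distance then indicates a more similar temporal pattern.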

**Figure 6.** Tyson polygons built upon the air monitoring stations. Source: Figure by authors.

#### 2.3.2. Other Impact Factor Considerations

The AQI in a region may be impacted by traffic conditions and the activities of the industrial entities in the neighbourhood. Motor vehicles are known for their emissions of CO, NOx and particulate matter (PM). Industrial entities are also known as the main sources of air pollution in built areas.

To study their influences on the AQI in Tianjin, a multivariate regression model was conceived to gain insight into how the traffic and industrial entities impact the AQI. All data were normalized into the [0, 1] interval. Furthermore, the data were grouped by three points of time of a day—namely, 8 a.m., 1 p.m., and 6 p.m. The data groups of different time periods were fed to the multivariate regression model to determine the impacts of traffic and industrial entities on the air quality in Tianjin.
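The procedure described above (min-max normalization, grouping by time of day, and a multivariate linear regression per group) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: ordinary least squares via the normal equations is assumed as the estimator, and the helper names and the toy data are hypothetical.

```python
def min_max(values):
    """Scale a list of numbers into the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ols(X, y):
    """Least-squares fit with intercept; returns [intercept, b1, ..., bp]."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)


def fit_by_time(groups):
    """groups maps a time label to (X, y); one regression is fitted per label."""
    return {label: ols(X, y) for label, (X, y) in groups.items()}


# Made-up, already normalized toy data; columns: (traffic speed, chemical POI count)
X_demo = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.25)]
y_demo = [0.2 + 0.5 * x1 + 0.3 * x2 for x1, x2 in X_demo]
coefs = fit_by_time({"08:00": (X_demo, y_demo)})["08:00"]
```

Since all inputs are scaled to the same interval, the fitted coefficients of one time group can be compared with each other as rough indicators of relative influence.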

#### *2.4. Prediction Model for AQI*

According to the average monthly AQI data, the AQI has a seasonal pattern. Therefore, the Holt–Winters method was applied to smooth the data. Seasonality here refers to the recurring pattern in the time series, i.e., the monthly AQI repeating its U-shape every year (Exponential Smoothing 2021).

The Holt–Winters method calculates the dynamic estimates for the three components: level, trend, and seasonal components, which are based on the following formulae:

$$a_t = \alpha [y_t - c_{t-s}] + (1 - \alpha)[a_{t-1} + b_{t-1}] \tag{1}$$

$$b_t = \beta [a_t - a_{t-1}] + (1 - \beta) b_{t-1} \tag{2}$$

$$c_t = \gamma [y_t - a_t] + (1 - \gamma) c_{t-s} \tag{3}$$

where *a<sub>t</sub>* is the intercept (level), *b<sub>t</sub>* indicates the trend, and *c<sub>t</sub>* represents the seasonal factor. Then, the predicted value *ŷ<sub>t+k</sub>* is determined by the following formula (supposing that the *k*th time period after *t* is to be predicted):

$$
\hat{y}_{t+k} = a_t + b_t k + c_{t+k-s} \tag{4}
$$

The three smoothing parameters in the formulae, α, β and γ, each ranging from 0 to 1, were selected through multiple experiments, whereas *s* is the length of the season, chosen to be 12 (months) here. The prediction results are presented and discussed in detail in Section 3.3.
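Equations (1) to (4) translate almost directly into code. The sketch below reads the seasonal term in Equation (1) as the seasonal factor from one season earlier, and it assumes a simple initialization (level from the mean of the first season, trend from the difference between the first two seasonal means), since the chapter does not describe how the recursion was started. The function name and the parameter values in the example are placeholders, not the values selected by the authors.

```python
def holt_winters_additive(y, s, alpha, beta, gamma, horizon):
    """Additive Holt-Winters smoothing and forecasting following Eqs. (1)-(4)."""
    if len(y) < 2 * s:
        raise ValueError("need at least two full seasons of data")
    first = sum(y[:s]) / s
    second = sum(y[s:2 * s]) / s
    level = first                                 # assumed initial a
    trend = (second - first) / s                  # assumed initial b
    seasonal = [y[i] - first for i in range(s)]   # assumed initial c_0 .. c_{s-1}

    for t in range(s, len(y)):
        c_prev = seasonal[t - s]
        new_level = alpha * (y[t] - c_prev) + (1 - alpha) * (level + trend)  # Eq. (1)
        trend = beta * (new_level - level) + (1 - beta) * trend              # Eq. (2)
        seasonal.append(gamma * (y[t] - new_level) + (1 - gamma) * c_prev)   # Eq. (3)
        level = new_level

    n = len(y)
    forecasts = []
    for k in range(1, horizon + 1):
        # Eq. (4); for k > s the most recent seasonal factors are reused
        idx = n - 1 + k - s * ((k - 1) // s + 1)
        forecasts.append(level + k * trend + seasonal[idx])
    return forecasts


# Toy series with a repeating pattern of length 4 and no trend (made-up values)
pattern = [10.0, 20.0, 30.0, 20.0]
forecast = holt_winters_additive(pattern * 3, s=4, alpha=0.3, beta=0.1,
                                 gamma=0.2, horizon=4)
```

For the monthly AQI series, `s` would be 12, and α, β and γ could be chosen by a grid search that minimizes an error measure on held-out months.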

#### **3. Analytics of Air Quality**

#### *3.1. Correlation Analysis*

According to the collected data and our study, the air quality of Tianjin has improved year by year over the last several years owing to the actions taken for environmental protection. The AQI of Tianjin is negatively correlated with public transportation; that is, the more public transportation is provided, the lower the AQI and the better the air quality. In addition, it is interesting to note that public transportation construction projects in Tianjin had the strongest impact on SO2 and the weakest impact on NO2. The results of the associated correlation analytics are shown in Figure 8 (Figure 7 illustrates the legend used in the correlation matrix of Figure 8). In Figure 8, road\_area indicates the total road area (km<sup>2</sup>) in Tianjin, while road\_clean\_area represents the road area (km<sup>2</sup>) cleaned by street sweepers in Tianjin.
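A correlation matrix of this kind can be computed as sketched below. The chapter does not state which correlation coefficient underlies Figure 8; the Pearson coefficient is assumed here, and the column names and numbers are placeholders rather than Tianjin data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(columns):
    """columns maps an attribute name to its yearly values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Placeholder series: one attribute falls while the other rises
matrix = correlation_matrix({
    "pollutant": [5.0, 4.0, 3.0, 2.0, 1.0],
    "transport": [1.0, 2.0, 3.0, 4.0, 5.0],
})
```

With only a handful of annual observations, as is the case for a five-year yearbook series, such coefficients should be read as descriptive rather than as statistically robust estimates.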

#### *3.2. Real-Time and Multisource Data Analyses*

As discussed above (Section 2.3.1), the DTW model can be applied to identify the similarity between the AQI and real-time traffic time series (vehicle speeds on the underlying road segments). The results of the DTW analyses are listed in Table 2. For each attribute (AQI, PM2.5, PM10, SO2, NO2, CO), the smaller the DTW value, the more similar the attribute's time series is to the traffic time series. The results indicate that the pattern of the road speeds in Tianjin is most similar to those of PM2.5, AQI, and NO2.

To analyse more precisely the impact factors on AQI in a city, a multivariate linear regression model was applied to the real-time AQI data, road speeds, and industrial POI data to obtain the proportion of traffic and industrial POI impacts on the AQI in the morning, midday and in the evening of a day.

As the results in Table 3 show, at 8 a.m. the biggest impact on the AQI originates from chemical enterprises and manufacturing plants, while at 1 p.m. chemical enterprises and machinery enterprises have the greatest influence on the AQI. Furthermore, at 6 p.m., the biggest impact on the AQI originates from traffic and machinery enterprises, which is consistent with the work shifts of these types of businesses. In general, chemical and machinery enterprises, as well as manufacturing plants, have the most significant influence on the air quality, although the impact of chemical and machinery enterprises on air quality varies greatly over the course of a day. These outcomes provide insight into how the industrial structure impacts the air quality and which measures should be taken in order to improve it.

**Figure 7.** Legend of correlation matrix. Source: Figure by authors.


**Figure 8.** Correlation matrix of air quality and transportation attributes. Source: Figure by authors.




**Table 3.** Results of multivariate linear regression.

#### *3.3. AQI Predictions*

After the impact factors were analysed as described above, we were able to build a model to predict the AQI. The Holt–Winters approach-based prediction model was applied to the monthly average AQI values from December 2013 to February 2019, a total of 63 samples containing the trends and seasonality. Starting from the 50th data point, the AQIs for the next 20 points of time were predicted. According to the similarity between the predicted and the actual values, the accuracy was within 95%. The prediction also shows that the U-shaped trend is stable over the course of a year, and that the maximum and minimum values of the AQI are decreasing gradually, which also coincides with reality. The results are illustrated in Figure 9, where the x-axis denotes the months while the y-axis indicates the AQI. The accuracy measures used are the Mean Absolute Percentage Error (MAPE), the Mean Absolute Deviation (MAD) and the Mean Signed Difference (MSD).
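The three accuracy measures are named in the text without formulas. The sketch below uses common textbook definitions under the names given above; the exact definitions used for Figure 9, and in particular the sign convention of the signed difference (predicted minus actual here), are assumptions.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def mad(actual, predicted):
    """Mean Absolute Deviation between predictions and observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def msd(actual, predicted):
    """Mean Signed Difference; positive values mean over-prediction on average."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


# Made-up numbers for illustration only
actual_demo = [100.0, 200.0]
predicted_demo = [110.0, 180.0]
scores = (mape(actual_demo, predicted_demo),
          mad(actual_demo, predicted_demo),
          msd(actual_demo, predicted_demo))
```

MAPE and MAD describe the size of the errors, whereas the signed difference reveals a systematic bias, so the three measures complement each other.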

#### **4. Cost Model for Air Pollutants**

In 2014, the Chinese Government proposed a strategy for the coordinated development of Tianjin, Beijing and Hebei. The three regions are adjacent to each other, and all of them are densely populated. Therefore, through the implementation of measures such as transportation integration, ecological environmental protection and industrial upgrading and transfer, the economic development and environmental protection of the three regions can be improved.

In terms of air pollution control, considering the diffusion of air pollutants, there will be spill-over effects between adjacent areas (Keogh and Pazzani 2001). Therefore, comprehensive consideration of the three regions and the implementation of regional collaborative environmental governance can reduce the cost due to air pollutants. In order to guide cities in establishing a more effective mechanism of air pollution prevention and in minimizing the cost of air pollution control, we built a prototype cost model for reducing air pollutants in Tianjin. We hope that the resultant cost model can assist decision-makers in making more reasonable (both economic and efficient) decisions.

**Figure 9.** Prediction results for the AQI. Source: Figure by authors.

#### *4.1. Definition of Parameters*

According to the research of the World Bank (Johnson et al. 1997), we decided to choose three major factors to build the cost model for reducing air pollutants: the annual emissions of main pollutants (T), the total annual emission of air pollutants (E) and regional characteristics (W), which represent the regional economy, industrial structure as well as the pollution control technology levels and could be considered as a constant term.

$$C = f(T, E, W) \tag{5}$$

where *C* is the cost of emission/air pollution reduction.

The table below (Table 4) shows relevant data of the three factors (to build the model for air pollution reduction) of Tianjin from 2011 to 2017, together with the associated cost *C* released by the National Bureau of Statistics of China.

#### *4.2. Logarithmic Regression Result*

With reference to the results of the World Bank policy research bureau (Poon et al. 2006), a constant-elasticity function was selected to simplify this model:

$$C = \varphi \times T^{\alpha} \times E^{\beta} \times W \tag{6}$$

where φ, α and β are the parameters of the model. To keep the estimation simple, we adopted a logarithmic regression analysis (Formula (7)), in which the constant terms are combined into a single intercept ln θ (Formula (8)). Table 5 below summarizes our analysis results.

$$
\ln C = \ln \varphi + \alpha \ln T + \beta \ln E + \ln W \tag{7}
$$

$$
\ln \theta = \ln \varphi + \ln W \tag{8}
$$
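Taking logarithms turns the multiplicative model into a linear one, so the intercept ln θ and the exponents α and β can be estimated by least squares. The following sketch shows one way to do this for the two-predictor case; it solves the 3 × 3 normal equations with Cramer's rule, which is an implementation choice made here for brevity. The function names and the synthetic data are hypothetical and do not reproduce Table 4.

```python
import math


def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_log_cost_model(T, E, C):
    """Fit ln C = ln(theta) + alpha * ln T + beta * ln E; returns (theta, alpha, beta)."""
    rows = [(1.0, math.log(t), math.log(e)) for t, e in zip(T, E)]
    y = [math.log(c) for c in C]
    # normal equations A x = b of the log-linear regression
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    d = det3(A)
    solution = []
    for k in range(3):
        Ak = [row[:] for row in A]
        for i in range(3):
            Ak[i][k] = b[i]  # replace column k by the right-hand side
        solution.append(det3(Ak) / d)
    ln_theta, alpha, beta = solution
    return math.exp(ln_theta), alpha, beta


# Synthetic data generated from known parameters (theta=2, alpha=1.5, beta=-0.5)
T_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
E_demo = [5.0, 3.0, 4.0, 1.0, 2.0]
C_demo = [2.0 * t ** 1.5 * e ** -0.5 for t, e in zip(T_demo, E_demo)]
theta_hat, alpha_hat, beta_hat = fit_log_cost_model(T_demo, E_demo, C_demo)
```

With only seven annual observations (2011 to 2017) and three estimated parameters, the fitted exponents carry considerable uncertainty, which should be kept in mind when interpreting them.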

**Table 4.** Impact factors for cost model. Sources: National Bureau of Statistics (2019) and Zenqi (2019).



**Table 5.** Analytical results. Source: Data by authors.

According to the logarithmic regression analysis presented above, NOx has the most significant effect on the cost of reducing the pollutants, although PM is also an important factor. By applying the analytical results, we were able to derive the cost function associated with NOx emissions in Tianjin as follows:

$$C = e^{9.81} \times T^{3.21} \times E^{-3.92} \tag{9}$$

The formula states that the cost elasticity coefficient of NOx emissions in Tianjin is −3.92, which means that every 1% reduction in NOx emissions in Tianjin requires an increase of approximately 3.92% in the cost to control air pollution. To decrease the emission of air pollutants, stricter emission standards and the promotion of new energy vehicles should therefore be considered in order to control NOx emissions at the source and on a larger scale, instead of merely purifying the air after pollution has occurred, which might be less effective.
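The elasticity interpretation can be checked numerically. The sketch below evaluates Equation (9) and computes the factor by which the cost changes when E is reduced by 1% with T held fixed. The function names are hypothetical, and the units of T and E are those of the underlying statistics, which are not reproduced here.

```python
import math


def nox_cost(T, E, ln_theta=9.81, alpha=3.21, beta=-3.92):
    """Evaluate Equation (9): C = e^9.81 * T^3.21 * E^-3.92."""
    return math.exp(ln_theta) * T ** alpha * E ** beta


def cost_ratio_for_emission_change(pct_change, beta=-3.92):
    """Factor by which C changes when E changes by pct_change percent (T fixed)."""
    return (1.0 + pct_change / 100.0) ** beta


# A 1% cut in E multiplies the cost by 0.99 ** -3.92
ratio = cost_ratio_for_emission_change(-1.0)
```

The exact factor is about 1.040, i.e., slightly more than the 3.92% given by the elasticity, because the elasticity is a marginal quantity; for small changes the two agree closely. The factor is independent of the levels of T and E, which is the defining property of a constant-elasticity function.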

Based upon the model and the analysis, SO2 and PM pollutants from industry are not the biggest factors affecting Tianjin's air quality, even though some actions were taken in this respect at the request of the central government, such as industrial upgrades in the region, the regional transfer of heavily polluting industries, and the usage of renewable energy for heating. Nitrogen oxides emitted by motor vehicles are one of the main causes of regional air pollution, and according to our model an after-the-fact clean-up is very costly. To reduce air pollutants more effectively, we may consider adopting stricter emission standards and promoting new energy vehicles to lower NOx.

#### **5. Conclusions**

In this case study, methods for uncovering the inter-relationship between air quality, transportation, and industrial air pollutants were applied to annual, monthly and real-time data together with additional attributes drawn from the datasets collected for the city of Tianjin. The analyses reveal that transportation (traffic) time series data are very consistent with those of PM2.5, AQI and NO2 pollutants. This suggests that transportation has a big influence on air quality. In addition, the analyses based on real-time data plus relevant POIs of different industries reveal that the impacts of industrial entities on air quality vary significantly over the course of a day, and that they dominate the AQI. An AQI prediction model based on the Holt–Winters method is proposed and shown to predict accurately.

To support decision-makers, a cost model was developed that helps to determine how air pollutants in a city can be reduced more cost-effectively.

This case study provides a framework to assist a city administration to improve its air quality via the following steps:


It is conceivable that the combination of these steps delivers a comprehensive, data-driven, and evidence-based decision support procedure for a targeted improvement of the air quality, which would be impossible without these data analyses.

**Author Contributions:** Conceptualization, Zhiqiang Wu, Edna Pasher and Annemie Wyckmans; Methodology, Buyang Cao and Otthein Herzog; Software, Mingyue Liu and Mengfan Chen; Validation, Buyang Cao; Data Curation, Buyang Cao, Mingyue Liu and Mengfan Chen; Writing – Original Draft Preparation, Buyang Cao, Mingyue Liu, Mengfan Chen and Otthein Herzog; Writing – Review and Editing, Buyang Cao and Otthein Herzog. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 770141. The material reflects only the authors' views and the European Union is not liable for any way that the information contained herein is used.

**Conflicts of Interest:** The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

#### **References**


Zenqi. 2019. Available online: https://m.zq12369.com (accessed on 18 June 2019).

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
