1. Introduction
Since the Industrial Revolution, many countries have undergone large-scale urbanization. Over the past 40 years, China has experienced rapid economic growth and urbanization. The urbanization rate increased from 19.39% in 1980 to 58.52% in 2017, and over these 40 years the urban population grew at a rate of 16 million people per year [1,2]. China’s rural population fell by 17% between 2000 and 2010 [3]. In many countries around the world, including China, cities are therefore playing an increasingly important role in economic development [4]. A good urban development evaluation model can assess the current status of urban development and guide future planning, helping cities achieve healthy and sustainable development [5,6,7].
The widespread adoption of mobile phones and the availability of mobile phone data provide new possibilities for city evaluation, because mobile phone data has several advantages. First, the sample size is large; second, the acquisition cost is low; and finally, the spatial coverage is wide and complete. Mobile phone data can therefore reflect the characteristics of residents in a given area or the connections between residents of different areas, and can capture how residents’ activities are distributed in space [8]. As a result, many scholars have used mobile phone data to evaluate urban development from different angles. For example, Tang and Lin, based on Shenzhen’s mobile phone data and point-of-interest (POI) data, used geographically and temporally weighted regression to connect residents’ travel information with different types of infrastructure and evaluate the vitality of the city [9]. Liu et al. used Nanjing’s mobile phone data and POI data to extract the vitality of the city through hierarchical clustering [10]. Montero combined Catalan mobile phone data with survey data through principal component analysis to analyze travel characteristics in Catalonia [11]. Zhang et al. applied K-means clustering to Beijing’s mobile phone data and POI data to analyze the relationship between travel characteristics and land use, classifying Beijing’s land into four categories according to travel characteristics [12].
Existing studies often convert mobile phone data into resident travel data [13,14,15]. However, travel characteristics cannot directly show the development status of a city. In this article, mobile phone data is further processed to identify the residential and working places of citizens. The output value of the service industry in Beijing accounted for more than 79.65% of the total annual output value in 2015 [16]. Because jobs are mainly concentrated in the tertiary industry, the contributions of different jobs to urban development and the economy are similar. The working population density can therefore reflect, to a certain extent, the economic and development status of an area within the city, and this article uses the working population density estimated for different areas as an indicator of their economic vitality. The working population promotes urban development, creates fiscal revenue, and maintains social stability. For the sustainable development of cities, it is very important to maintain a reasonable and stable working population [17,18,19]. The growth of the working population is also a necessary condition for rapid urban development. Analyzing the working population with comprehensive land use indicators can not only reveal, from a new perspective, the factors restricting the growth of the working population, but also evaluate the current status of urban development and thereby help people plan and manage the city rationally [20].
This study uses Beijing as the research area. The density of the working and residential populations in different areas of Beijing was identified from mobile phone data. A model relating the working population to land use characteristics was then established using geographically weighted regression. Because the geographically weighted regression model takes spatial heterogeneity into account, it can reflect how the correlation between indicators differs from region to region. This model can be used to analyze the development status of Beijing and to predict the impact of infrastructure changes on the working population. This study therefore has the following characteristics: (1) estimating the working population density is taken as the research objective, so the current situation of urban development can be analyzed more intuitively; (2) the spatial heterogeneity of the impact of facilities on the working population is considered, which makes it possible to quantitatively analyze the impact of infrastructure on the working population in different regions.
2. Study Area and Datasets
This article takes Beijing as the study area. Beijing has 16 urban districts under its jurisdiction, with a total area of 16,410.54 square kilometers, of which the built-up area is 1485 square kilometers (Figure 1a). According to data from the National Bureau of Statistics of China, Beijing’s permanent population at the end of 2015 was 21.71 million, of which the urban population was 18.77 million and the rural population was 2.93 million [16]. Beijing consists of a main urban area with the five ring roads as the boundary and several satellite cities. Jobs are concentrated in the central part of the city, whereas most citizens live in the satellite cities. The main urban area and the satellite cities are connected by rail transit and urban expressways.
Mobile phone penetration among Beijing residents is high. In the second quarter of 2015, there were 193.7 mobile phones per 100 residents in Beijing [21]. Mobile phone data can therefore record the daily trajectories of Beijing residents in a straightforward way. This article uses Beijing mobile phone data from 1 June to 7 June 2015. The data was provided by China Mobile, Beijing’s largest telecommunications provider, whose users account for 67% of all mobile phone users. Each record contains, among other fields, an anonymous user ID, location information, and a timestamp.
The traffic analysis zones (TAZs) used in this paper are those delineated by the Beijing Urban Transport Institute for Beijing transportation planning purposes (Figure 1b). According to different land use characteristics, Beijing is divided into 2006 TAZs with an average area of 8.16 square kilometers. In this study, 18 million mobile phone records from anonymous users between 1 June and 5 June 2015 were used to obtain the residential and working populations of the different TAZs.
According to the Athens Charter, the main functions of the city include residence, work, entertainment, and travel [22]. In this study, 12 sets of POI data covering these 4 kinds of urban functions were selected from all available POI data sets to comprehensively describe the functions of the city. The POI data sets were collected through the Google Places API [23,24]. Each POI is linked to a TAZ, and the POI counts are converted to densities to objectively show the current status of infrastructure in the different TAZs. Table 1 lists the POI data selected in this article.
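The conversion from POI records to per-TAZ densities can be sketched as follows. This is a minimal illustration, assuming each POI has already been assigned to a TAZ by a spatial join; the function name, zone identifiers, categories, and areas are hypothetical and are not taken from the study's data.

```python
from collections import Counter

def poi_density_by_taz(poi_records, taz_area_km2):
    """Return {taz_id: {category: POIs per square kilometer}}.

    poi_records  : iterable of (taz_id, category) pairs (hypothetical layout)
    taz_area_km2 : dict mapping taz_id to zone area in square kilometers
    """
    counts = Counter(poi_records)
    categories = sorted({category for _, category in poi_records})
    density = {}
    for taz, area in taz_area_km2.items():
        # A zone with no POI of a given category gets density 0.0.
        density[taz] = {c: counts[(taz, c)] / area for c in categories}
    return density

# Made-up example: two zones, two POI categories.
poi_records = [
    ("T1", "restaurant"), ("T1", "restaurant"), ("T1", "bank"),
    ("T2", "restaurant"),
]
taz_area_km2 = {"T1": 2.0, "T2": 4.0}
density = poi_density_by_taz(poi_records, taz_area_km2)
```

Dividing by zone area keeps large suburban zones from appearing better served than small central ones simply because they contain more POIs in absolute terms.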
Different types of POI have different distribution characteristics in Beijing (Figure 2). For example, restaurants are the most densely distributed (Figure 2d). Office buildings are clustered; marked on the map are Zhongguancun Innovation Park in northwestern Beijing, Financial Street in the center, and the CBD in the east (Figure 2f). Parks are relatively sparsely distributed (Figure 2j). Bus stations are denser than subway stations (Figure 2k,l). The subway is the core of Beijing’s public transportation system, transporting large numbers of commuters every day.
3. Methodology
This study adopts the framework of a geographically weighted regression model, explores the correlation between land use characteristics and working population, and constructs a working population estimation model (Figure 3).
First, this study identifies the residential and working populations of Beijing from mobile phone data. To prevent differences in magnitude between variables from distorting the regression results, the 12 sets of POI data and the residential population density data are normalized. Next, the working population density is used as the dependent variable in a geographically weighted regression. To verify the reliability of the regression analysis, an OLS regression is also performed on the same data so that the goodness of fit of the two schemes can be compared. Finally, the regression results are described through land use difference analysis and spatial heterogeneity analysis.
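The normalization step can be illustrated with a short sketch. The text does not state which normalization was applied, so min-max scaling to [0, 1] is assumed here purely for illustration; the function name is hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1].

    Assumption: min-max scaling. The paper only says the variables
    were normalized and does not name the method.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up densities for one POI category across three zones.
normalized = min_max_normalize([2.0, 4.0, 6.0])
```

After this step every explanatory variable lies on the same scale, so the magnitudes of the regression coefficients can be compared across indicators.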
3.1. Work/Residence Location Identification
In this study, each user’s mobile phone records during residence hours and working hours are used to determine the user’s residential area and work location. For this purpose, the fuzzy discrimination method is employed. The result of this method does not include users without a fixed workplace or users who work at night.
First, the original mobile phone data is preprocessed to remove abnormal records. Then, according to the distribution of work and residence time, and in combination with the characteristics of mobile phone data and dwell time, feature indices describing each user’s work and residence behavior are extracted and a membership function is constructed. On this basis, a standard feature vector is constructed, a discrimination rule between the feature vector and the measured data is formed, and finally the user’s employment and residence information is obtained (Figure 4) [25]. Since most residents of Beijing work from 9:00 am to 5:00 pm, the working hours selected in this article are 9:00 to 11:00 am and 2:00 to 4:00 pm, and the residence hours are 0:00 to 5:00 am.
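The role of the time windows can be illustrated with a deliberately simplified sketch. It is not the fuzzy discrimination method of [25]: a plain dominant-location rule stands in for the membership function, and the record layout, the `min_share` threshold, and all names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

WORK_WINDOWS = [(9, 11), (14, 16)]   # 9:00-11:00 am and 2:00-4:00 pm
HOME_WINDOWS = [(0, 5)]              # 0:00-5:00 am

def in_windows(hour, windows):
    return any(start <= hour < end for start, end in windows)

def dominant_zone(counter, min_share):
    """Zone with the most records, or None if it is not dominant enough."""
    if not counter:
        return None
    zone, hits = counter.most_common(1)[0]
    return zone if hits / sum(counter.values()) >= min_share else None

def identify_home_and_work(records, min_share=0.5):
    """records: iterable of (user_id, timestamp, taz_id) tuples.

    Returns {user_id: (home_taz, work_taz)}; an entry is None when no
    zone holds at least `min_share` of the user's records in the window
    (a crude stand-in for "no fixed workplace").
    """
    home_hits = defaultdict(Counter)
    work_hits = defaultdict(Counter)
    for user, ts, taz in records:
        if in_windows(ts.hour, HOME_WINDOWS):
            home_hits[user][taz] += 1
        elif in_windows(ts.hour, WORK_WINDOWS):
            work_hits[user][taz] += 1
    users = set(home_hits) | set(work_hits)
    return {
        u: (dominant_zone(home_hits[u], min_share),
            dominant_zone(work_hits[u], min_share))
        for u in users
    }

# Made-up records for two anonymous users on 1 June 2015.
records = [
    ("u1", datetime(2015, 6, 1, 1, 0), "T1"),
    ("u1", datetime(2015, 6, 1, 2, 0), "T1"),
    ("u1", datetime(2015, 6, 1, 9, 30), "T2"),
    ("u1", datetime(2015, 6, 1, 14, 30), "T2"),
    ("u1", datetime(2015, 6, 1, 15, 0), "T3"),
    ("u2", datetime(2015, 6, 1, 3, 0), "T4"),
    ("u2", datetime(2015, 6, 1, 9, 10), "T4"),
    ("u2", datetime(2015, 6, 1, 10, 0), "T5"),
    ("u2", datetime(2015, 6, 1, 15, 0), "T6"),
]
result = identify_home_and_work(records)
```

In this toy data, user `u2` is seen in three different zones during working hours, so no work location is assigned, which mirrors the exclusion of users without a fixed workplace.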
3.2. Geographically Weighted Regression
Geographically weighted regression (GWR) is a spatial analysis technique [26]. By establishing a local regression equation at each point in the study area, GWR explores the spatial variation of the research object and its driving factors at a given scale, and it can be used to predict future results. Because it takes into account the local effects of spatial objects, its main advantage is high accuracy. According to the first law of geography proposed by Tobler, everything is spatially related, and the closer two things are, the greater their spatial correlation [27,28]. Therefore, unlike in traditional cross-sectional data, the spatial correlation of spatial data leads to spatial heterogeneity in the regression relationship. To explore this spatial heterogeneity, Brunsdon et al. first proposed the geographically weighted regression model in 1996, set as follows [29,30,31]:
where α is the baseline working population in the average area, β is the association between the residential population and the working (daytime) population, and γj is the effect of the density of POI type j (j = 1, 2, …, 12) on the working population.
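A model form consistent with these coefficient definitions, written in standard GWR notation, is sketched below. The symbols y_i (working population density of zone i), p_i (normalized residential population density), x_{ij} (normalized density of POI type j), (u_i, v_i) (zone coordinates), and ε_i (error term) are notational assumptions introduced here; they are not defined in the text.

```latex
\[
y_i = \alpha(u_i, v_i) + \beta(u_i, v_i)\, p_i
      + \sum_{j=1}^{12} \gamma_j(u_i, v_i)\, x_{ij} + \varepsilon_i
\]
```

Writing the coefficients as functions of location makes explicit that each zone receives its own estimate of α, β, and γj, which is what distinguishes GWR from a global OLS fit.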
The distance function selected in this paper is a Gaussian function because the dependent variable in the model is a continuous value with a large working population density. The Gaussian function is defined as follows:
where dij represents the Euclidean distance between observations i and j, and b represents the bandwidth.
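In one common convention, the Gaussian kernel built from these two quantities can be written as below. The weight symbol w_{ij} and the factor 1/2 in the exponent are assumptions; some references omit the 1/2, which only rescales the bandwidth.

```latex
\[
w_{ij} = \exp\!\left[-\frac{1}{2}\left(\frac{d_{ij}}{b}\right)^{2}\right]
\]
```

The weight equals 1 when d_{ij} = 0 and decays smoothly with distance, so nearby observations dominate the local regression at point i.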
Another important choice in specifying a geographically weighted regression model is whether to use a constant (fixed) or a variable bandwidth. This paper chooses the variable bandwidth scheme. Under this scheme, the bandwidth is larger where the data are sparsely distributed and smaller where they are densely distributed. Because the distribution of POI in the sample is uneven, this flexibility can lead to a more accurate model.
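The effect of a variable bandwidth can be sketched with a small local regression. Several assumptions are made for illustration: the bandwidth at each location is taken as the distance to its k-th nearest observation (a common adaptive scheme, not confirmed as the one used here), the Gaussian kernel includes a factor of 1/2, and a single explanatory variable replaces the 13 used in the study. All names and data are hypothetical.

```python
import math

def gaussian_weight(distance, bandwidth):
    """Gaussian kernel weight (the 1/2 factor is a convention choice)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def adaptive_bandwidth(target, coords, k):
    """Distance from `target` to its k-th nearest observation."""
    distances = sorted(math.dist(target, c) for c in coords)
    bandwidth = distances[k - 1]
    if bandwidth <= 0:
        raise ValueError("k is too small: the k-th nearest distance is zero")
    return bandwidth

def local_fit(target, coords, x, y, k):
    """Weighted least squares of y on x, centered on `target`.

    Returns (intercept, slope, bandwidth) for one location.
    """
    b = adaptive_bandwidth(target, coords, k)
    w = [gaussian_weight(math.dist(target, c), b) for c in coords]
    total = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean)
              for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope, b

# Made-up zone centroids: three close together, two far apart.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
x = [1.0, 2.0, 3.0, 5.0, 8.0]
y = [2.0 * xi + 1.0 for xi in x]   # exact linear relation for checking

dense = local_fit((1.0, 0.0), coords, x, y, k=3)
sparse = local_fit((20.0, 0.0), coords, x, y, k=3)
```

With k = 3, the location inside the cluster gets a bandwidth of 1 while the isolated location gets 18, so each local fit draws on a comparable number of effective neighbors.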
The corrected Akaike information criterion (AICc) method provides a trade-off between the degree of freedom and goodness of fit to optimize the bandwidth [26,32]. Fotheringham defined the AICc equation for GWR as:
where n represents the local sample size (according to bandwidth), σ̂ represents the estimated standard deviation of the error term, and tr(S) represents the trace of the hat matrix S. The hat matrix denotes the projection matrix from the observed y to the fitted values.
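For reference, the AICc expression commonly attributed to Fotheringham et al. for GWR, written with the symbols defined above, is sketched below; it is reproduced here from the general GWR literature rather than recovered from this text.

```latex
\[
\mathrm{AIC_c} = 2n \ln(\hat{\sigma}) + n \ln(2\pi)
  + n \left\{ \frac{n + \operatorname{tr}(\mathbf{S})}{n - 2 - \operatorname{tr}(\mathbf{S})} \right\}
\]
```

The first two terms reward a small residual variance, while the last term grows with tr(S), the effective number of parameters, and thus penalizes very small bandwidths.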
The smaller the bandwidth, the smaller the bias of the estimated regression function but the larger its variance, so a very small bandwidth leads to overfitting. In many cases, selecting the bandwidth by AICc can mitigate this overfitting; that is, it tends to determine a more reasonable bandwidth [33,34].
4. Results and Discussion
4.1. Reliability Analysis
To verify the reliability of the GWR model, the same data were subjected to both global (OLS) regression and geographically weighted regression. The working population was used as the dependent variable, and the normalized population density and the 12 groups of POI density were used as independent variables. The better regression model is identified by comparing parameters such as the goodness of fit of the OLS model and the GWR model. The results of the geographically weighted regression and the global regression are shown in Table 2 and Table 3:
The tables show that the R-squared and adjusted R-squared of the geographically weighted model are higher than those of the OLS model, so the geographically weighted regression model has a better goodness of fit.
In the summary table of the geographically weighted regression models, residual squares refers to the sum of squared residuals, where a residual is the difference between the actual value of the dependent variable and the value estimated by the GWR model. The smaller this measurement, the more closely the GWR model fits the observed data. The effective number of parameters depends on the choice of bandwidth and is larger when the bandwidth is smaller; it reflects the trade-off between the variance of the fitted values and the bias of the coefficient estimates. AICc is a measure of model performance that helps to compare different regression models: taking model complexity into account, the model with the lower AICc value fits the observed data better. AICc is not an absolute measure of goodness of fit, but it is useful for comparing models that share the same dependent variable and have different explanatory variables. These 4 parameters are used to compare the different geographically weighted regression schemes. After comparison, this paper chooses the variable bandwidth scheme.
This paper analyzes the coefficient distribution of the regression results of the least squares model. The bar chart (Figure 5) shows that the residential population has the most significant impact on the working population, followed by office building density. However, the indicators with residential attributes (school density, hospital density, pharmacy density, and restaurant density) have a slight negative impact on the working population. This is similar to the results of the geographically weighted regression analysis. Traditional global regression cannot analyze how coefficients differ between regions, whereas geographically weighted regression can better show and analyze the impact of infrastructure on the working population in different regions.
4.2. Land Use Difference Analysis
The box plot of the coefficient distribution of the geographically weighted regression results is shown in Figure 6. The coefficient distribution shows how strongly each land use characteristic affects the working population. The higher the coefficient of an indicator, the more significant its impact on the working population density. If the coefficient is negative, the indicator is negatively correlated with the working population density.
The impact of the residential population density on the working population density is the most significant, followed by the four indicators with job attributes: mall density, office building density, government agency density, and bank density. Among these, office building density has the highest coefficient and the highest degree of dispersion. This indicates that office building density can significantly affect the working population density and that office buildings in some districts have a stronger influence than in others. The average coefficients of the traffic indicators, bus station density and subway station density, are only slightly higher than 0, but in some areas the density of subway stations is significantly positively related to the working population density. The indicators with entertainment attributes, hotel density and park density, have little effect on the working population. Among the indicators with residential attributes, school density, hospital density, and restaurant density have no significant effect on the working population density, but pharmacy density is negatively correlated with it. This may be because pharmacies create little employment and are mostly built near residential areas, far from work areas.
4.3. Spatial Heterogeneity Analysis
One of the characteristics of geographically weighted regression is that it highlights the spatial heterogeneity of the effect of the independent variable on the dependent variable. This article analyzes the correlation of different indicators on the working population density in different regions by analyzing the coefficient differences in different regions. Then, combined with the real situation in Beijing, the reasons for different regional coefficients are analyzed.
The distribution map of the population density coefficient (Figure 7) shows that the coefficient is generally higher in the urban area than in the suburbs, and the two northern areas have the highest coefficients. These two areas are Haidian District and Chaoyang District, the most densely developed and economically active urban districts in Beijing. They have well-developed infrastructure and attract a large number of people from other urban areas to work there. Their coefficients are higher because the working population is denser and the residential population relatively sparser in these two districts.
In the spatial distribution of the residential index coefficients (Figure 8), the coefficients in the main urban area are generally lower than those in the suburbs, and this pattern is most pronounced for the school density coefficient. The stronger correlation in the suburbs may be due to their less developed infrastructure. In the main urban area, by contrast, a higher density of residential infrastructure means that the area mainly serves residential functions rather than generating more employment. In the main urban area and some satellite cities, these indicators are negatively correlated with the working population density.
The spatial distribution of the coefficients of the commercial indicators varies widely, depending on the infrastructure they represent (Figure 9). The area with the highest mall density coefficient is Haidian District, located northwest of the main urban area, which has the most technology companies and universities in Beijing. Its active economy and large population have generated a large demand for shopping. The correlation between office building density and working population density is second only to that of residential population density. In most areas of Beijing, the coefficient of office building density is relatively high, and it is highest in Tongzhou District, which is known as the sub-center of Beijing. Tongzhou District, located on the east side of Beijing’s main urban area, is the area with the greatest development potential in recent years and has taken over many functions of the main urban area. Government agencies have less influence on the working population density; the regions with relatively higher coefficients are located on the northern and southern outskirts of Beijing. The bank density coefficient is highest in Chaoyang District on the east side of Beijing. Beijing’s Central Business District is located in Chaoyang District, which has the most business activity in Beijing.
The impact of the recreational indicators on the working population is not significant, but it still shows a certain spatial heterogeneity. In terms of spatial distribution, the suburbs of Beijing have higher coefficients for the entertainment indicators (Figure 10).
Because Beijing’s main urban area and suburbs differ greatly in infrastructure and population density, this article mainly analyzes the traffic indicators within the main urban area. The east and south sides of the main urban area, which correspond to Chaoyang District and Fengtai District, have higher bus station density coefficients. The distribution map of bus stations shows that, compared with other areas of the city, the areas with higher coefficients have fewer bus stations (Figure 11). The demand for public transportation in these two districts is similar to that in other urban districts, and in Chaoyang District it is even higher. According to data from the Yikatong smart card, provided by the Transportation Operations Coordination Center, a total of 1.67 million passengers took the bus in Chaoyang District in 24 hours, while only 1.34 million passengers took the bus in Haidian District. However, bus stations are sparser in Chaoyang District. Therefore, the significant positive correlation between bus station density and working population density in these areas may be due to strong demand combined with relatively few bus stations.
A similar pattern appears in the correlation between subway station density and working population density. In the distribution of the subway coefficients, the coefficient is higher in the northwest of the city (Figure 12). There are some technology companies in this area, but the subway had not yet opened there in 2015; Beijing Metro Line 16 has since opened here. In areas with strong demand but relatively scarce supply, the coefficient is often higher.
5. Conclusions
As mobile communication technology matures, more information can be obtained from mobile phone signaling data. In this paper, the densities of the working and residential populations in Beijing were obtained from mobile phone signaling data. To fully describe the working population, we selected 12 groups of POIs according to the description of urban functions in the Athens Charter. Regression analysis of the working population yielded the impact of, and correlation with, different infrastructure densities and other indicators in different regions. This paper proposes a regression model with working population density as the dependent variable, which not only explains the impact of different indicators on the working population but also provides an estimation model of the working population. Additionally, because we chose the geographically weighted regression model, which takes into account the first law of geography, this paper studies the spatial heterogeneity of the different indicators. By studying how these facilities affect the working population density in different regions, the relationship between each indicator and the working population can be explored with greater accuracy. Urban planners and managers can use this approach to explore the impact of infrastructure adjustments on the working population and, ultimately, to support the scientific development of the city.
The achievements and innovations in this paper include the following.
The working population density extracted from mobile phone data is the object of the model. Compared with the travel intensity data used in other studies, the working population distribution provides more direct guidance for urban development.
The relationship between the working population and land use was analyzed based on the geographically weighted regression model. This model takes into account spatial heterogeneity and the attenuation of infrastructure effects over space, which makes it more accurate than traditional regression models, and it expresses the impact of each indicator on the working population quantitatively and by region.
This study has several limitations that deserve further exploration. First, due to data constraints, the types of POI extracted are insufficient; many other kinds of infrastructure also affect the distribution of the working population, so there is room for improvement in model accuracy. Second, due to the limitations of the method used to extract the working population, the spatial resolution of the model is limited to TAZs. This lowers accuracy when studying spatial heterogeneity, and the differences between regions are not obvious. In addition, the working population identification method assigns each user’s work location according to working hours, which is not fully accurate. Finally, the mobile phone data selected in this article is neither a full sample nor a random sample, so the description of the population may not be accurate. With more data, such as data from different periods, the model would become more accurate and more informative.