3.1. Data and Variables
We selected Zhongshan city, a prefectural city in China, to study the nonlinear and threshold effects of bus stop proximity on transit use and commuting-related carbon emissions in the central city and suburban areas (see Figure 1). Zhongshan is polycentric, lacking a strong city-level center. The central city houses 1,000,000 permanent residents, and the suburban area houses 3,000,000. Under China’s national official standards, Zhongshan is considered a developing city, lacking the qualifications to build rail transit. Accordingly, bus transit is the primary system of public transportation. Zhongshan also has a high vehicle volume, with 1,200,000 electric bikes, 600,000 motorcycles, and 1,000,000 private cars.
This study uses three datasets: (1) travel survey data, (2) land use data, and (3) Internet map data. The sample used in this analysis was collected from a travel survey conducted in April 2019, which covered 45,700 individuals in 1337 traffic analysis zones at a sampling rate of 1.5%. After data cleaning, 31,155 valid commute records were obtained, with 8411 in the central city and 22,744 in suburban areas. The survey records participants’ trip origin and destination, commuting mode, and demographics. According to the survey, the mode shares of cars, electric bicycles, motorcycles, buses, walking, and bicycles are 25.8%, 21.8%, 38.6%, 2.1%, 2.7%, and 9.0%, respectively.
The land use data were provided by the Bureau of Land Management in Zhongshan. They describe Zhongshan’s land use patterns and road distribution in 2019. Nine land uses can be identified in this dataset: public administration and public service, commercial service, green spaces and plazas, construction use, industrial use, residential use, transportation use, public facilities, and warehousing.
The Internet map data obtained from Baidu Map includes the street distance for each trip, the heatmap image data, and the location of each bus stop. Baidu Map is a product similar to Google Maps, offering users location-based services, including place inquiry and route search. In 2019, Baidu Map ranked as the most-used search engine in China, with the number of users reaching 400 million. Baidu Map is also officially recommended for its information services by local governments and transit operators in Zhongshan. Moreover, the data in Baidu Map is open to the public through its application programming interface (API), which has been extensively used in previous studies [38,44,45].
Baidu Map’s “route search” module outputs travel information according to the input origin (O) and destination (D) of residents’ trips. For each input transport mode, the returned information includes the travel distance. We designed a program that automatically submits the O and D information to Baidu Map to obtain the street distance of the 31,155 trips. Compared with ArcGIS, Baidu Map uses the most up-to-date traffic information to calculate the street distance, making it more intelligent [46] and more widely adopted in planning studies [18,47].
The heatmap raster data reports the real-time relative magnitude of the density scale based on the number of users. Because Baidu Map reaches more than 28% of the population, many scholars use the heat value as a proxy for population density when census data cannot be resolved to a small geographic area. After vectorizing the heatmap images and reclassifying the data into several scales, researchers usually aggregate the categorized data within a 1 km buffer around the house for each observation and obtain a scale value approximating density. Zhang et al. built a regression between the scale value and the “true” density value in the yearbook at the sub-district level and showed that the R-squared ($R^2$) is around 0.7, which suggests that the heatmap data fits the density well [45]. Therefore, we used the heatmap data to generate an intensity index as a proxy for density in a similar way. The heatmap image data were extracted from the Baidu Map API at the 8 × 8 m grid level between Tuesday, 18 August, and Thursday, 20 August 2020.
The location data showed the latitude and longitude of each bus stop and was collected on 6 October 2020. The location information of points of interest (POI) is stable and reliable after several years of self-adjustment and user feedback. The location data cover 2238 bus stops, which are identical to the lists obtained from the internet. The number of bus stops recorded by the city archive in Zhongshan is 2124, which is slightly less than that in Baidu Map. Considering that the city archive might not consider the adjustment of bus routes and stops within the year, the difference is acceptable, and the locations obtained from Baidu Map can be considered reliable.
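Given stop coordinates of this kind, the straight-line distance from a home to its nearest bus stop can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that homes and stops are expressed in the same latitude/longitude system and uses the haversine great-circle formula. The function names and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_stop_distance_m(home, stops):
    """Straight-line distance from a home (lat, lon) to its closest stop."""
    if not stops:
        raise ValueError("at least one bus stop is required")
    return min(haversine_m(home[0], home[1], s[0], s[1]) for s in stops)


# Hypothetical coordinates for illustration only (not survey data).
example_stops = [(22.520, 113.390), (22.530, 113.400)]
example_home = (22.521, 113.391)
print(round(nearest_stop_distance_m(example_home, example_stops)), "m")
```

For a city-wide sample, a spatial index would avoid comparing every home with every stop; the brute-force minimum is kept here for readability.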
The dependent variables are transit use and carbon emissions during participants’ commutes. Because congestion relief relies on private motorized travelers shifting to transit, only transit taking, car driving, electric bike riding, and motorcycle riding are considered when accounting for the effects of bus stop proximity on transit use. All valid samples are considered when accounting for the carbon emissions from commuting.
Emissions from commuting are calculated as the commuting distance $D_i$ multiplied by the emission rate $R_i$. The commuting distance $D_i$ refers to the street distance for each travel mode. This calculation was proposed in the 2006 IPCC Guidelines for National Greenhouse Gas Inventories [48] and is commonly utilized in transportation research. The emission rate (or emission factor) refers to the amount of carbon dioxide emitted per traveler per kilometer. Previous studies used emission factors from diverse sources, including departments in the United States, the United Kingdom, and the European Council (EC) [32,33,41,44]. Because the transport modes in the study area of Ao et al. are similar to those in Zhongshan [32], we use their emission factors in this work. The emission rate $R_i$ for each corresponding mode is presented in Table 1.
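The distance-times-rate calculation can be sketched in a few lines. The rates below are placeholders chosen only to make the example run; they are not the Table 1 values, and the mode labels and function name are hypothetical.

```python
# Placeholder emission rates in g CO2 per traveler-km.
# These are NOT the Table 1 values; they only illustrate the calculation.
PLACEHOLDER_RATES = {
    "car": 150.0,
    "motorcycle": 60.0,
    "electric_bike": 10.0,
    "bus": 30.0,
    "walk": 0.0,
    "bicycle": 0.0,
}


def commute_emissions(distance_km, mode, rates):
    """Emissions of one commute: street distance D_i times emission rate R_i."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    return distance_km * rates[mode]


# A hypothetical 10 km car commute under the placeholder rates.
print(commute_emissions(10.0, "car", PLACEHOLDER_RATES), "g CO2")
```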
In addition to proximity to a given bus stop, the independent variables include built environment features and demographics. Bus stop proximity is measured by the straight-line distance to the nearest bus stop. An entropy index is computed based on the land use data using the following equation to assess the mixture of nine land uses:
$$\text{Entropy} = -\frac{\sum_{i=1}^{s} p_i \ln p_i}{\ln s} \qquad (1)$$
where $s$ is the number of land uses, and $p_i$ is the proportion of the area in the $i$th land use. Employment density is indirectly measured by the average values of an intensity index derived from heatmap data collected at 10 a.m., 11 a.m., 3 p.m., and 4 p.m. Population density is indirectly measured by the average values generated from heatmap data collected at 11 p.m. and 12 a.m. Employment density and population density are averaged within a 1 km buffer around participants’ houses. Road density is indicated by the street length per square kilometer around the house. Distance to the nearest city center is also controlled. Demographic data cover gender, income level, educational background, family size, and the number of children. Table 2 presents the definitions and statistics of the aforementioned variables.
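As an illustration of the land use mix measure, the sketch below computes an entropy index from the areas of each land use. It assumes the commonly used normalized form, in which the index is divided by the logarithm of the number of land uses so that it ranges from 0 (a single use) to 1 (a perfectly even mix). The function name and inputs are hypothetical.

```python
import math


def entropy_index(areas):
    """Normalized land use entropy from a list of areas, one per land use.

    Assumes the form -sum(p_i * ln p_i) / ln(s); land uses with zero area
    contribute nothing to the sum.
    """
    s = len(areas)
    total = sum(areas)
    if s < 2 or total <= 0:
        return 0.0
    shares = [a / total for a in areas]
    h = -sum(p * math.log(p) for p in shares if p > 0)
    return h / math.log(s)


# Nine equally sized land uses give the maximum mix; one use gives zero.
print(entropy_index([1.0] * 9))
print(entropy_index([5.0] + [0.0] * 8))
```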
3.2. Modeling Approach
This study applies the GBDT model to explore the nonlinear and threshold effects of bus stop proximity. GBDT is a machine learning model that originated in computer science [49] and has increasingly been employed in urban studies and transportation planning [23,50]. Compared to traditional regression methods [51], GBDT relaxes the assumption of a predefined functional form and can flexibly predict dichotomous or continuous variables. Therefore, this model can better fit nonlinear associations between variables that might otherwise go uncaptured by variable transformations. It can also output a graphical depiction of how marginal effects change. Additionally, the GBDT model typically has higher predictive power than traditional regression methods.
GBDT builds decision trees to explain relationships between the response (dependent variable) and predictors (independent variables). Figure 2 shows an example of a single decision tree with a response $Y$ and two predictors, $X_1$ and $X_2$. All observations are first classified into two subsets based on whether $X_1$ is bigger than $c_1$. The subsets are further partitioned into two or more regions according to some rules, such as whether $X_2$ is smaller than $c_2$ or whether $X_1$ is bigger than $c_3$. We continue choosing the predictor and split-point for each classification until the sample in subsets is too small to split. In Figure 2, the sample is finally split into five regions ($R_1$, $R_2$, $R_3$, $R_4$, $R_5$) using four nodes ($c_1$, $c_2$, $c_3$, $c_4$). The predicted output $F_m(x)$ is modeled by the mean of $Y$ in each region for continuous response and is modeled by the most frequent response of $Y$ in the region for the discrete response. Because the classification rule does not have to be linear, decision-tree-based modeling can fit any irregular nonlinear and threshold effects.
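To make the splitting rule concrete, the sketch below fits the simplest possible regression tree: a single split on one predictor, with each region predicted by the mean of the response. It assumes a squared-error criterion for choosing the split point, which is one common choice and is not stated in the text. The data and function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean of the region."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (split_point, left_mean, right_mean) minimizing total SSE."""
    best = None
    distinct = sorted(set(xs))
    # Candidate split points are midpoints between neighbouring x values.
    for lo, hi in zip(distinct, distinct[1:]):
        c = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, c, mean(left), mean(right))
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1:]


# Hypothetical data: distance to stop (m) and a transit-use score.
distance = [100, 200, 300, 600, 700, 800]
transit_use = [0.9, 0.8, 0.85, 0.2, 0.1, 0.15]
split_point, near_mean, far_mean = best_split(distance, transit_use)
print(split_point, near_mean, far_mean)
```

A full tree applies the same search recursively within each region and across all predictors, which is how step-like threshold effects emerge without any assumed functional form.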
The predictive power of any one decision tree might be limited, so an iteration process based on the gradient descent direction is used to update $F_m(x)$. The updated model is shown below:
$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}) \qquad (2)$$
Here $R_{jm}$ refers to the regions partitioned by a decision tree $F_m(x)$, and $I(x \in R_{jm})$ equals 1 if $x$ falls into $R_{jm}$ and 0 otherwise. $J$ is the number of regions partitioned by a decision tree, and $\gamma_{jm}$ is the value of the optimal gradient for the region $R_{jm}$, which minimizes the loss function $L(y, F_m(x))$. $\gamma_{jm}$ and $L(y, F_m(x))$ are given by Equation (3) when the response is dichotomous and by Equation (4) when the response is continuous.
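The iteration process continues until the cross-validation error is minimized. To address the overfitting problem, a learning rate $\xi$ is incorporated as a weight in each iteration as follows:
$$F_m(x) = F_{m-1}(x) + \xi \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}), \quad 0 < \xi \le 1 \qquad (5)$$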
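The shrunken update can be illustrated with a small, self-contained boosting loop. This sketch assumes a continuous response with squared-error loss, in which case the optimal value for each region is the mean residual in that region; it uses single-split trees on one predictor for brevity. All names and data are hypothetical, and the settings are not those of the study.

```python
def fit_stump(xs, residuals):
    """Best single split on xs; returns (split_point, gamma_left, gamma_right)."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        c = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= c]
        right = [r for x, r in zip(xs, residuals) if x > c]
        g_left = sum(left) / len(left)      # mean residual in the left region
        g_right = sum(right) / len(right)   # mean residual in the right region
        cost = (sum((r - g_left) ** 2 for r in left)
                + sum((r - g_right) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, c, g_left, g_right)
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1:]


def boost(xs, ys, n_trees=100, learning_rate=0.1):
    """Gradient boosting with squared loss: each tree fits current residuals."""
    f0 = sum(ys) / len(ys)                  # initial constant model
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        c, g_left, g_right = fit_stump(xs, residuals)
        stumps.append((c, g_left, g_right))
        # Shrunken update: add only learning_rate times the region value.
        preds = [p + learning_rate * (g_left if x <= c else g_right)
                 for x, p in zip(xs, preds)]
    return f0, learning_rate, stumps


def predict(model, x):
    f0, learning_rate, stumps = model
    return f0 + sum(learning_rate * (gl if x <= c else gr)
                    for c, gl, gr in stumps)


# Hypothetical step-shaped data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
model = boost(xs, ys, n_trees=100, learning_rate=0.1)
print(predict(model, 2), predict(model, 5))
```

With a small learning rate, each tree moves the prediction only a fraction of the way toward the residuals, so more trees are needed but individual trees have less influence on the final fit.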
With these results, the relative contribution of each independent variable $x_i$ is quantified as the total improvement in the loss function from the splits made on that variable, that is, the sum of $d_j$ over $j = 1, \ldots, K$, where $K$ is the number of split points and $d_j$ is the reduction in the loss function when predictor $x_i$ is used as the splitting variable. The relative contributions of all predictors add up to 100%.
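The bookkeeping behind relative contributions can be sketched as below: loss reductions are summed per splitting variable across all splits and then rescaled to percentages. The split records and variable names are hypothetical, and the sketch assumes the reductions have already been extracted from a fitted model.

```python
from collections import defaultdict


def relative_contribution(splits):
    """Percentage contribution per variable from (variable, loss_reduction) pairs."""
    totals = defaultdict(float)
    for variable, reduction in splits:
        totals[variable] += reduction
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("total loss reduction must be positive")
    return {v: 100.0 * t / grand_total for v, t in totals.items()}


# Hypothetical split records collected over all trees.
example_splits = [
    ("bus_stop_distance", 4.0),
    ("road_density", 1.0),
    ("bus_stop_distance", 3.0),
    ("income", 2.0),
]
contributions = relative_contribution(example_splits)
print(contributions)
```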
GBDT can produce a partial dependence plot to visualize the nonlinear relationship between the response and a predictor $x_s$ by averaging the predicted response over the distribution of the other predictors $x_c$. It gives us a direct depiction of the predicted value of the dependent variable after accounting for the effects of all other variables. Using this plot, we can specify the impact threshold beyond which proximity to the bus stop ceases to boost transit usage or reduce carbon emissions. The partial dependence of $F(x)$ on $x_s$ can be formulated as follows:
$$F_s(x_s) = E_{x_c}\left[F(x_s, x_c)\right] \qquad (7)$$
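In practice, the expectation over the other predictors is usually approximated by averaging over the observed sample. The sketch below assumes that approach: the predictor of interest is fixed at each grid value for every observation, and the model's predictions are averaged. The toy model, feature names, and data are hypothetical and only serve to show how a threshold appears in the resulting curve.

```python
def partial_dependence(model_fn, rows, feature, grid):
    """Average prediction over all rows with `feature` fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data is untouched
            modified[feature] = value  # fix the predictor of interest
            total += model_fn(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical fitted model: effect of distance vanishes beyond 500 m."""
    return max(0.0, 1.0 - row["distance_m"] / 500.0) + 0.1 * row["density"]


# Hypothetical observations.
sample = [
    {"distance_m": 120.0, "density": 1.0},
    {"distance_m": 480.0, "density": 2.0},
    {"distance_m": 900.0, "density": 3.0},
]
grid = [0.0, 250.0, 500.0, 750.0]
curve = partial_dependence(toy_model, sample, "distance_m", grid)
print(list(zip(grid, curve)))
```

The curve flattens from 500 m onward, which is the kind of pattern read off a partial dependence plot to identify an impact threshold.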
In this study, five-fold cross-validation is used to build the model. The learning rate is set to 0.1. The maximum depth of each decision tree, the minimum number of terminal nodes, and the number of additive trees are set at 5, 10, and 1000, respectively, which reflects the complexity of the tree. All the parameters are within the range suggested in previous studies [30,38].
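The five-fold partition behind such a procedure can be sketched as follows. This is a generic illustration rather than the authors' implementation: the settings dictionary simply restates the reported values under hypothetical key names, and the splitter shuffles observation indices with a fixed seed before dealing them into folds.

```python
import random

# Reported settings restated under hypothetical key names.
GBDT_SETTINGS = {
    "learning_rate": 0.1,
    "max_tree_depth": 5,
    "min_terminal_nodes": 10,
    "n_trees": 1000,
    "cv_folds": 5,
}


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal indices into k folds
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


# Each observation is used for validation exactly once.
for train, validation in k_fold_indices(10, k=GBDT_SETTINGS["cv_folds"]):
    print(len(train), len(validation))
```

Averaging the validation error over the five folds for each candidate number of trees gives the cross-validation error whose minimum determines when to stop adding trees.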