1. Introduction
Smart transportation is an important component of smart city construction. As a safe, punctual, and convenient urban transportation tool, the metro is the backbone of passenger transportation in many cities [
1]. Big data can be used to analyze the micro-level mechanisms of residents’ metro travel behavior and spatial interaction, and such analysis plays an important role in promoting the construction of smart cities. With the advancement of information technology, real-time metro smartcard data enable us to track metro passengers’ travel patterns [
2,
3]. Previous studies have found that metro ridership varies regularly over time and space, as a result of residents’ time-varying travel demands and of differences in the areas surrounding different metro stations [
4,
5]. Therefore, classifying and examining changes in metro ridership across different periods helps not only to understand the spatiotemporal characteristics of residents’ metro travel but also to explore the relationship between residents’ metro travel and land use in metro stations’ catchment areas. This can further help planners and decision makers optimize land use in metro stations’ catchment areas in the construction of smart cities.
Ridership characteristics differ across metro stations because of the dual nature of metro stations [
2,
6]. On the one hand, metro stations are nodes in the metro network, allowing passengers to travel from one place to another [
7]. On the other hand, metro stations are also places in the city. Under the influence of the TOD (transit-oriented development) strategy, the areas around metro stations usually adopt high-density mixed-development patterns and become key areas for human activity aggregation [
8,
9,
10]. However, due to the differences in land use development around metro stations, metro ridership is significantly different across different metro stations [
6,
11,
12]. For example, because most of the city’s workforce follows a 9 am to 5 pm schedule, metro stations near residential areas may have larger inflow ridership during the morning peak and larger outflow ridership during the evening peak, whereas the pattern at employment centers is reversed. However, most existing studies have analyzed ridership uniformly across all stations, and less attention has been paid to measuring the differences in metro ridership among different types of stations.
In addition, the built environment has long been considered an important factor affecting metro ridership [
13,
14,
15]. Most existing studies have used the “5D” research framework to quantify the built environment [
16,
17,
18,
19], which measures the built environment according to five dimensions: density, diversity, design, distance to transit, and destination accessibility. While previous research has confirmed a high degree of correlation between the built environment and metro ridership, less attention has been paid to the possibility that the impact of the built environment on metro ridership may vary over time [
1,
20]. Furthermore, most previous studies on the relationship between the built environment and metro ridership have typically assumed a linear or generalized linear relationship, failing to reveal the nonlinear relationship between the two [
21,
22]. In summary, the nonlinear effects of the built environment on metro ridership at different periods have rarely been examined.
To address the above issues, this study collected metro smartcard data in Wuhan, China. Firstly, metro stations were classified into different types using expectation-maximization (EM) clustering analysis. Then, the ridership of each metro station was extracted for four time periods (morning peak, noon, evening peak, and night), and a gradient boosting decision tree (GBDT) model was applied to explore the relative importance of built environment variables and their nonlinear relationship with ridership at different types of stations.
This study contributes to the existing literature in both theory and practice. Firstly, it enriches the sparse existing literature on the relationship between the built environment and metro ridership by finely measuring the impact of the built environment on metro ridership at different types of stations and during different time periods. In addition, by exploring the relative importance and threshold effects of the built environment on metro ridership using the GBDT model, it provides a reference for optimizing the built environment around metro stations in smart cities and formulating relevant policies.
The rest of this paper is arranged as follows. In
Section 2, we review the literature related to the built environment and metro ridership. In
Section 3, we introduce the study area and data source and present the metro station classification method and the machine learning model used in this study. In
Section 4, we report the main findings of the study and conduct relevant discussions. In
Section 5, we present the conclusions and policy implications of this study and point out future research directions.
2. Literature Review
Traditional studies on residents’ travel behavior often rely on resident trip surveys [
23,
24,
25], which have the advantage of capturing residents’ social attributes and can reflect their detailed travel characteristics. However, these data also have disadvantages, such as high survey cost, long collection time, limited sample size, and most importantly, difficulty in acquiring real-time updates. With the significant development of real-time data collection technology through smartcard systems, smartcard data are widely used in travel behavior research due to their large sample size, high accuracy, and detailed spatiotemporal information [
3,
26,
27,
28]. Classifying real-time ridership in detail using smartcard data (SCD) can be helpful for understanding the relationship between the built environment and metro ridership at different types of stations.
Cluster analysis is an unsupervised classification method that groups observations with similar characteristics [
29]. The functional classification of metro stations can yield different results depending on the perspective taken. Some scholars classify metro stations’ functions from the perspective of the land use in the catchment area of a metro station. For instance, [
30] classified New York metro stations into five categories, including commercial, highly mixed use, moderately mixed use, residential, and transfer residential, based on the intensity of commercial land use in the catchment areas of metro stations. Furthermore, some scholars classify metro station types in terms of the travel patterns of metro station passengers. For example, [
29] classified Shanghai metro stations into six categories: employment stations, residential stations, mixed stations, mixed residential, mixed employment, and transportation hubs, based on the SCD of five consecutive weekdays. However, using land use to classify station types may not reflect the true travel patterns of metro ridership, since most cities have the characteristic of “city first, station later”, which causes some stations to have poor TOD guidance. Therefore, this study used SCD, which reflects the real travel patterns of metro passengers, to classify station types.
The built environment has long been proven to be an important factor affecting metro ridership, and the “5Ds” framework is often used to measure the built environment [
13,
14,
16]. Density is an important indicator that affects metro ridership and is usually measured by resident population and the plot ratio, as high population concentration and spatial density may directly translate into metro ridership [
31,
32,
33]. Diversity is mainly manifested in a mixture of land uses, which is more conducive to enhancing the attractiveness of the region and therefore promoting the demand for metro travel [
4,
22]. The number of street intersections is a commonly used indicator for measuring micro-level design, as more intersections indicate stronger road network connectivity, which enhances metro station accessibility and consequently promotes metro ridership growth [
12,
34,
35]. However, some studies have found that the more street intersections, the longer the waiting time at traffic lights, which negatively affects metro ridership [
11,
36]. Distance to transit is an indicator used to measure how convenient it is to reach metro stations and is usually represented by the number of bus stops in the catchment area of a metro station. Previous studies have found that the more bus stops, the more conducive an area is to bus–metro transfers, which further promotes metro ridership [
36,
37]. However, it has also been found that buses may divert metro ridership and in turn reduce metro usage [
4]. Distance from the city center is usually used as an indicator of regional accessibility, and most studies show that stations closer to the city center have higher ridership due to the city center’s core role in employment and commerce [
38,
39]. In addition, the higher the number of daily travel destinations such as enterprises, shopping facilities, and living service facilities around metro stations, the more likely residents are to choose metro travel [
36,
38,
40,
41]. Moreover, metro station characteristics also affect metro ridership. Previous studies have found that transfer stations, terminal stations, higher exit quantities, and higher betweenness centrality have a significant positive impact on metro ridership [
11,
20,
36,
42]. However, most existing studies treat all metro stations uniformly and seldom distinguish the relationship between ridership and the built environment by station type or by time period. As a result, the functional connection between station travel characteristics and land use, and its temporal heterogeneity, have been largely ignored.
In addition, in previous studies on the relationship between the built environment and metro ridership, it is usually assumed that there is a linear or generalized linear relationship between the two, and linear regression models, Poisson regression models, or negative binomial regression models are commonly used to explore this relationship [
1,
21]. Although these studies have laid an effective foundation for understanding the relationship between the two, they cannot capture the nonlinear effects between them. Some recent studies have used supervised machine learning techniques to explore the relationship between the two and found that the impact of the built environment on metro ridership is generally complex and nonlinear [
40,
43]. For example, [
22] used the GBDT model and found that betweenness centrality has a positive effect on metro ridership only between 0 and 0.2; beyond that range, further increases in betweenness centrality no longer raise ridership. Furthermore, [
1] used the random forest model to reveal the impact of the built environment on metro ridership during the morning peak, noon, and evening peak periods and found that the relationship between metro ridership and the built environment varies over time. However, the authors examined ridership only for all metro stations taken together and suggested that future research focus on the correlation between the built environment and ridership at different types of stations.
The studies reviewed above leave several research gaps. Firstly, many previous studies have treated the ridership of all metro stations as a single dependent variable, neglecting the variations in travel behavior associated with distinct metro station features. In particular, ridership differs according to station function, station attributes, and surrounding land use. Secondly, previous research has confirmed a significant nonlinear correlation between metro ridership and the built environment. However, although residents’ travel activities follow highly structured spatiotemporal regularities and their travel purposes vary markedly over time, the temporal heterogeneity of this potential nonlinear relationship has not been thoroughly discussed.
To address these gaps, our study makes two contributions. Firstly, leveraging a large smartcard dataset, we classified metro stations into different types through EM clustering, thereby revealing spatial disparities in travel characteristics among these station types. Secondly, we extracted the ridership of metro stations during four time periods: morning peak, noon, evening peak, and night. By employing the GBDT model, we investigated the relative importance and nonlinear effects of the built environment on metro ridership during these different time periods. This approach enables us to identify the temporal heterogeneity in the nonlinear correlation between the built environment and metro ridership.
3. Research Design
3.1. Research Area
Wuhan, the largest city in central China, was the study area for this paper.
Figure 1 shows the urban spatial structure of Wuhan, which is divided into the urban center within the Third Ring Road and the Wuhan Metropolitan Development Area (WMD) outside the Third Ring Road, where Wuhan has expanded in recent years. Due to the natural barrier of rivers and lakes, Wuhan has formed the three clusters of Hankou, Hanyang, and Wuchang, making it a typical polycentric city. In addition, the natural barriers have greatly restricted the organization of ground transportation in Wuhan, making metro travel popular among citizens. From 2010 to March 2021, the number of metro stations in Wuhan increased from 16 to 210 (transfer stations are counted only once), metro operating mileage increased from 28 km to 360 km, the share of the metro in public transportation increased from 2% to 51%, and daily ridership reached 3.1 million trips. Based on previous studies [
6,
16,
44,
45], this paper defines an 800 m buffer zone around each metro station as the station’s influence range, and overlapping parts of adjacent buffers are divided using the Thiessen polygon technique.
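The catchment rule described above can be illustrated with a small sketch. It assumes that overlaps are resolved by a nearest-station (Thiessen, or Voronoi) partition, so that a location belongs to a station's catchment when that station is the closest one and lies within 800 m. The function name `assign_catchment`, the station identifiers, and the coordinates are hypothetical and are not taken from the study's data or workflow.

```python
import math

BUFFER_M = 800.0  # catchment radius stated in the text


def assign_catchment(point, stations, radius=BUFFER_M):
    """Return the id of the station whose clipped catchment contains `point`, or None.

    A point lies in a station's Thiessen cell exactly when that station is the
    nearest one, so "buffer intersected with Thiessen cell" reduces to
    "nearest station, and no farther away than the radius".
    """
    best_id, best_dist = None, float("inf")
    for station_id, (sx, sy) in stations.items():
        dist = math.hypot(point[0] - sx, point[1] - sy)
        if dist < best_dist:
            best_id, best_dist = station_id, dist
    return best_id if best_dist <= radius else None


# Hypothetical stations in projected coordinates (metres), 1000 m apart,
# so their 800 m buffers overlap.
stations = {"A": (0.0, 0.0), "B": (1000.0, 0.0)}
poi_near_a = (400.0, 0.0)    # inside both buffers, closer to A
poi_near_b = (600.0, 0.0)    # inside both buffers, closer to B
poi_far = (500.0, 2000.0)    # outside every buffer

print(assign_catchment(poi_near_a, stations))
print(assign_catchment(poi_near_b, stations))
print(assign_catchment(poi_far, stations))
```

Counting POIs, bus stops, or intersections per station then amounts to applying this assignment to each feature and tallying by station id. A GIS workflow would normally do the same thing with polygon overlays.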
3.2. Data and Variables
The data used in this paper include the smartcard data of 211 metro stations in Wuhan for five consecutive working days in March 2021, Wuhan point of interest (POI) data in 2021, building contour vector data, resident population data, and 2017 land use data in Wuhan. The smartcard data records the cardholder’s card number, entry and exit stations, swipe time, etc. Based on the card number and travel time information, we constructed travel OD chains from the origin to the destination of residents’ trips. After deleting some invalid data, a total of 9,392,605 travel OD chains were constructed, with a data validity rate of over 99%. Subsequently, the ridership for each metro station can be obtained by counting the number of passengers getting on and off at each station during each hour based on entry and exit time. The focus of this paper is the ridership on weekdays, so the ridership on non-workdays was not considered. Referring to the travel characteristics of residents’ daily life and work, we used the average ridership during four periods on workdays as the dependent variable, including the morning peak (7:00–9:00), noon (11:00–13:00), the evening peak (17:00–19:00), and night (21:00–23:00).
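As a rough illustration of how period ridership could be derived from OD chains, the sketch below counts boardings and alightings per station within the four periods and averages them over the number of workdays. The record fields (`entry_station`, `entry_time`, `exit_station`, `exit_time`), the sample trips, and the choice to average per workday are assumptions made for illustration; the actual smartcard schema and averaging rule are not given in the text.

```python
from collections import defaultdict
from datetime import datetime

# The four periods from the text, as [start_hour, end_hour).
PERIODS = {
    "morning_peak": (7, 9),
    "noon": (11, 13),
    "evening_peak": (17, 19),
    "night": (21, 23),
}


def period_of(ts):
    """Return the name of the period containing timestamp `ts`, or None."""
    for name, (start, end) in PERIODS.items():
        if start <= ts.hour < end:
            return name
    return None


def period_ridership(trips, n_days):
    """Average boardings plus alightings per (station, period) over n_days workdays."""
    counts = defaultdict(int)
    for trip in trips:
        events = (
            (trip["entry_station"], trip["entry_time"]),
            (trip["exit_station"], trip["exit_time"]),
        )
        for station, ts in events:
            period = period_of(ts)
            if period is not None:
                counts[(station, period)] += 1
    return {key: total / n_days for key, total in counts.items()}


# Hypothetical OD chains on two workdays.
trips = [
    {"entry_station": "S1", "entry_time": datetime(2021, 3, 15, 7, 30),
     "exit_station": "S2", "exit_time": datetime(2021, 3, 15, 8, 5)},
    {"entry_station": "S2", "entry_time": datetime(2021, 3, 15, 17, 40),
     "exit_station": "S1", "exit_time": datetime(2021, 3, 15, 18, 10)},
    {"entry_station": "S1", "entry_time": datetime(2021, 3, 16, 8, 50),
     "exit_station": "S2", "exit_time": datetime(2021, 3, 16, 9, 20)},
]
result = period_ridership(trips, n_days=2)
print(result)
```

Keeping entries and exits as separate counters instead of summing them would give the inflow and outflow profiles that station classification typically relies on.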
To examine the relationship between the built environment of a catchment area and metro ridership, we used the “5D” framework to construct the built environment variables [
13]. Density included the resident population and the plot ratio of the catchment area; diversity was measured according to the land use mixture entropy score; the number of street intersections in the catchment area was used as a measure of design; the distance to transit was represented by the number of bus stops in the catchment area; and accessibility to destinations was measured by the number of enterprises, shopping facilities, living service facilities, sports facilities, educational facilities, and medical facilities. In addition, considering the polycentric urban characteristics of Wuhan, the distances from the city center and sub-city center were selected to measure the regional accessibility of metro stations. Furthermore, this study also considered five metro station characteristics: opening time, terminal station, transfer station, exit quantity, and betweenness centrality. Among them, terminal station and transfer station are set as dummy variables, with non-terminal and non-transfer stations as the reference categories. The specific indicator settings and definitions are shown in
Table 1.
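For readers unfamiliar with the land use mixture entropy score, the sketch below shows one commonly used normalised form: the Shannon entropy of the land-use area shares divided by the logarithm of the number of land-use categories. Whether the study uses exactly this normalisation is an assumption, and the function name and area values are hypothetical.

```python
import math


def land_use_entropy(areas):
    """Normalised entropy of land-use shares: 0 = single use, 1 = perfectly even mix.

    `areas` holds the area of each land-use category in a catchment area,
    including categories with zero area.
    """
    n_types = len(areas)
    total = sum(areas)
    if n_types < 2 or total <= 0:
        return 0.0
    shares = [a / total for a in areas if a > 0]
    return -sum(p * math.log(p) for p in shares) / math.log(n_types)


# Hypothetical catchment areas with four land-use categories (hectares).
even_mix = land_use_entropy([25.0, 25.0, 25.0, 25.0])
single_use = land_use_entropy([100.0, 0.0, 0.0, 0.0])
partial_mix = land_use_entropy([70.0, 20.0, 10.0, 0.0])

print(even_mix, single_use, partial_mix)
```

Under this form the score depends on how many categories are distinguished, so scores are comparable only across catchment areas that use the same land-use classification.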
3.3. Cluster Analysis
K-means clustering is widely used in existing classifications of metro station types because of its simplicity and efficiency [
6]. However, K-means clustering analysis requires the pre-setting of the number of categories, and different category values can lead to significant differences in the results. In contrast, the EM clustering analysis does not require pre-set category values and divides categories based completely on objective data, which have more objective and stable characteristics [
46]. Therefore, this study used EM clustering analysis to divide metro station types. Referring to the existing studies [
46], EM clustering alternates between two steps:
Step 1: Expectation (E). Given the current parameter values, calculate the expected values of the hidden variables.
Step 2: Maximization (M). Maximize the expected likelihood obtained in the first step to update the parameter values.
The result of the M step is used in the next E step, and the process is iterated, improving on the initial parameters through the hidden variables, until the parameters no longer change.
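Under the framework of the EM algorithm, we chose the Gaussian mixture model (GMM) to solve the EM clustering. The GMM refers to a model with the following probability distribution:
$$P(y \mid \theta) = \sum_{k=1}^{K} \alpha_k \, \phi(y \mid \theta_k)$$
In the formula, $\alpha_k$ is the mixing coefficient, with $\alpha_k \ge 0$ and $\sum_{k=1}^{K} \alpha_k = 1$, and the probability density of the k-th Gaussian distribution is:
$$\phi(y \mid \theta_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(y - \mu_k)^2}{2\sigma_k^2}\right)$$
where the model parameter $\theta_k = (\mu_k, \sigma_k^2)$.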
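The alternating E and M steps for a Gaussian mixture can be made concrete with a minimal one-dimensional sketch. It is illustrative only: it takes the number of components from the starting means that are passed in, uses a single hypothetical feature (each station's share of daily entries that occur in the morning peak), and is not the clustering software or feature set used in the study.

```python
import math


def normal_pdf(y, mu, var):
    """Density of a univariate Gaussian with mean `mu` and variance `var`."""
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, mus, n_iter=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the means in `mus`."""
    k, n = len(mus), len(data)
    mus = list(mus)
    alphas = [1.0 / k] * k                       # mixing coefficients
    mean = sum(data) / n
    overall_var = sum((y - mean) ** 2 for y in data) / n
    variances = [max(overall_var, var_floor)] * k
    resp = []
    for _ in range(n_iter):
        # E step: responsibility of each component for each observation.
        resp = []
        for y in data:
            weighted = [alphas[j] * normal_pdf(y, mus[j], variances[j]) for j in range(k)]
            total = sum(weighted) or 1e-300      # guard against underflow to zero
            resp.append([w / total for w in weighted])
        # M step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue                         # empty component: keep old parameters
            alphas[j] = nj / n
            mus[j] = sum(r[j] * y for r, y in zip(resp, data)) / nj
            var_j = sum(r[j] * (y - mus[j]) ** 2 for r, y in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return alphas, mus, variances, labels


# Hypothetical morning-peak entry shares for eight stations.
shares = [0.10, 0.12, 0.11, 0.13, 0.70, 0.72, 0.68, 0.71]
alphas, mus, variances, labels = em_gmm_1d(shares, mus=[0.0, 1.0])
print(mus, labels)
```

With these inputs the stations separate into a low-share group and a high-share group. A full analysis would use a multivariate ridership profile per station and would also have to settle the number of components, for example by comparing fitted models.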
3.4. GBDT Model
To better analyze the nonlinear impact of built environment features on metro ridership, this study constructed a gradient boosting decision tree (GBDT) model. Compared with traditional regression models, GBDT does not predefine any functional form for the relationship between the independent and dependent variables and can effectively identify nonlinear effects between them. Moreover, it can measure the relative importance of independent variables, which helps planners prioritize interventions when resources are limited. In addition, GBDT fits the data in stages, with each new tree correcting the errors of the previous ones, resulting in higher fitting accuracy than traditional regression models [
40,
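43]. GBDT generates the predictive models in the form of model ensembles, which in this study are regression trees. The goal of this algorithm is to minimize the loss function. Regression trees can be defined as follows:
$$T(x; \Theta_m) = \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$$
where the parameter $\Theta_m = \{(R_{mj}, c_{mj})\}_{j=1}^{J}$ represents the splitting position and the mean of the terminal node in each regression tree $T(x; \Theta_m)$, and $\Theta_m$ is estimated by minimizing the loss function $L(y, f(x))$. The optimization process involves several iterative steps.
First, initialize the weak learner $f_0(x)$:
$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$
Second, for iterations $m = 1, 2, \ldots, M$:
(a) Calculate the negative gradient (i.e., residual) $r_{mi}$ for each sample $i = 1, 2, \ldots, N$:
$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$
(b) Fit a regression tree to the residuals and obtain the leaf node regions $R_{mj}$ of the m-th tree, where $j = 1, 2, \ldots, J$, i.e., a tree composed of $J$ leaf nodes.
(c) Calculate the best fitting value $c_{mj}$ for each leaf region $R_{mj}$:
$$c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$$
(d) Update the strong learner $f_m(x)$:
$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$$
Finally, end the operation and obtain the final learner $f_M(x)$.
In this study, we introduced a learning rate factor $\xi$ ($0 < \xi \le 1$) to limit the residual learning results of each regression tree:
$$f_m(x) = f_{m-1}(x) + \xi \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$$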
We used the “gbm” package in R to fit the GBDT model and to export the relative importance of the independent variables and the partial dependence plot of each variable.
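The boosting loop described in this section can be sketched in a few lines for the special case of squared loss, one predictor, and two-leaf trees (stumps). This is a teaching sketch in plain Python, not the “gbm” implementation used in the study; the function names, the tree depth, the number of trees, the learning rate, and the data (a ridership index that rises with land use mixture between two thresholds and then levels off) are all assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Best two-leaf regression tree on one feature under squared loss.

    Requires at least two distinct values in `xs`.
    Returns (split, left_value, right_value); x < split goes left.
    """
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    return best[1:]


def fit_gbdt(xs, ys, n_trees=200, learning_rate=0.1):
    """Gradient boosting with stumps; for squared loss the negative gradient is y - f(x)."""
    f0 = sum(ys) / len(ys)                  # constant that minimises squared loss
    preds = [f0] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        split, left_value, right_value = fit_stump(xs, residuals)
        trees.append((split, left_value, right_value))
        # Shrink each tree's contribution by the learning rate before adding it.
        preds = [p + learning_rate * (left_value if x < split else right_value)
                 for x, p in zip(xs, preds)]
    return f0, trees, learning_rate


def predict(model, x):
    """Prediction of the boosted ensemble at a single value `x`."""
    f0, trees, learning_rate = model
    return f0 + learning_rate * sum(lv if x < s else rv for s, lv, rv in trees)


# Hypothetical data: the response is flat, rises across a threshold range, then is flat again.
mixture = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ridership = [10.0, 10.0, 10.0, 10.0, 10.0, 20.0, 30.0, 30.0, 30.0, 30.0]
model = fit_gbdt(mixture, ridership)
print([round(predict(model, x), 2) for x in (0.3, 0.6, 0.9)])
```

Evaluating `predict` over a grid of values for one variable gives the kind of curve from which threshold effects are read. With several predictors, the other variables would be averaged out to obtain a partial dependence curve.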
5. Conclusions
The purpose of this study was to better understand the spatiotemporal correlation between the built environment and resident metro travel through in-depth data mining. To achieve this, the study used smartcard data from the Wuhan metro system in China, combined with multi-source big data such as land use data and POI data, and applied an EM clustering model to divide metro stations into five clusters based on the spatiotemporal ridership characteristics of metro travel. The study then used the GBDT model to explore the nonlinear relationship between metro ridership at different types of stations and built environment factors during different times of the day. The results help fill an important research gap and yield several meaningful findings.
Firstly, based on the detailed travel spatiotemporal characteristics of each station, the EM clustering model was used to divide metro stations into five clusters: residential-oriented stations, mixed residential stations, employment-oriented stations, mixed employment stations, and comprehensive stations. Each type of station has different travel spatiotemporal characteristics, which provides a foundation for understanding the relationship between resident travel characteristics and urban land use functions. Although this study used Wuhan as an example, this classification method is also applicable to other cities. Secondly, the study confirms that the relative importance of the built environment on ridership at different types of stations varies significantly. For residential-oriented stations, the distance from the sub-city center is the most significant factor influencing ridership, while the number of enterprises plays the most crucial role in employment-oriented station ridership. Betweenness centrality emerges as the most pivotal variable impacting metro ridership in comprehensive stations, while the number of enterprises, as well as the distance from the sub-city center, are the most vital factors respectively influencing mixed residential and mixed employment station ridership. Additionally, the relative importance of these factors exhibits distinct disparities across stations of the same type during different time periods. For instance, in the case of residential-oriented stations, the number of medical facilities, number of shopping facilities, distance from the sub-city center, and number of enterprises were the most significant factors during the morning peak, noon, evening peak, and night periods, respectively. 
It is worth noting that resident population has a strong impact on metro ridership at all stations during different periods, which further confirms that high-density TOD development patterns are conducive to promoting public transportation travel [
9,
22]. However, land use mixture only has a significant impact on ridership in comprehensive stations, which may explain the difference between previous research results regarding the impact of land use mixture on metro ridership [
4,
21], as mixed land use may not be effective in all areas. Thirdly, most built environment variables have complex nonlinear effects on metro ridership in every period and in every cluster of stations and show significant threshold effects.
These findings have important planning and policy implications for urban planning and related departments regarding the optimization of land use at metro stations in the construction of smart cities. Firstly, the relative importance of the built environment to the metro ridership of different types of stations provides a reference for the priority order of built environment intervention in different regions. Therefore, urban planning authorities should formulate distinct land use development measures based on the diverse station types and characteristics of residents’ travel behaviors. For residential-oriented stations, the optimization of public service facilities catering to daily needs, such as medical and shopping facilities, should be prioritized. In the case of employment-oriented stations and mixed residential stations, there should be a concerted effort to attract enterprises to the vicinity of these metro stations while improving access between these enterprises and the metro stations. As for comprehensive stations and mixed employment stations, promoting population concentration through compact development proves most effective in bolstering metro ridership. Secondly, prevailing transit-oriented development (TOD) paradigms emphasize the significance of high-density and mixed-use development. However, our research demonstrates that population density exerts a pivotal influence across all station types, while land use mixture contributes significantly only at comprehensive stations. This suggests that a compact and intensive development model helps enhance metro ridership across all station types, but mixed land use significantly enhances ridership only for comprehensive stations. Thirdly, the threshold effect of the built environment on metro ridership provides an impact range for optimizing the built environment.
For example, for comprehensive stations, when the land use mixture reaches 0.58, the metro ridership reaches an inflection point and gradually increases. However, when the land use mixture further grows to 0.7, it no longer exhibits a significant promoting effect. This serves as a reminder to urban planners that planning interventions below or beyond the threshold range may not yield effective outcomes. It is essential to devise land use optimization measures within an effective influence range. Fourthly, the impact of the built environment on metro ridership has significant spatiotemporal heterogeneity, which should prompt urban planning and transportation management departments to pay attention to the characteristics of metro travel demand and the job–housing balance. Because ridership differs by station type and time period, and its association with the built environment differs as well, transportation planning and urban functional layout should not be based simply on daily ridership. Spatial organization and transportation planning should be carried out according to the travel demands of urban residents during different periods. Especially for the layout of urban employment centers and residential areas, avoiding long-distance commuting and job–housing imbalance is key.
By dividing metro stations into clusters based on their spatiotemporal travel characteristics and exploring the nonlinear relationship between ridership and the built environment at different times for different clusters, this study reveals the relationship between residents’ metro travel characteristics and urban land use, which will help optimize land use around metro stations in smart city construction and policy formulation. However, this study still has some shortcomings that are worth exploring further in future research. First, this study defines an 800 m buffer zone around the metro station as the station’s influence range based on previous research [
16,
44], but different types of metro stations may have different influence ranges. In the future, a more reasonable catchment area should be defined based on the classification results of metro stations and combined with residents’ travel survey data. In addition, this study did not consider the impact of residents’ social attributes on the ridership of different types of metro stations. This should be remedied in the future by increasing the use of questionnaire surveys, which will help to formulate more refined measures. Finally, the conclusions of this study may not generalize to other cities, especially those with medium- and low-density development. Therefore, future research should add cases from cities with different development orientations to verify this study’s results.