This study uses one week of SCD from the Beijing subway in March 2016, 2016 consumer data for shops around subway stations, and 2016 house sale and rental price data to implement the M2A framework.
4.1. Data Preparation
Urban big data are collected passively and contain many erroneous and redundant records, so they must be cleaned and preprocessed first. For simplicity, we assume that each passenger's origin or destination lies within walking range of a station, so that no other transportation mode is required.
From the many fields in the SCD, we extract the card ID, entry line and station ID, entry time, exit line and station ID, and exit time. The line and station IDs are then replaced by the station name, so that passengers who use entrances on different lines of the same interchange station are assigned to one station. Finally, records in which the entry and exit stations are identical or the exit time is earlier than the entry time are deleted. After preprocessing and data integration, the trip records of 50,141 passengers remain for analysis.
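The cleaning rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout, the timestamp format, and the line and station IDs in the lookup table are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical lookup: (line, station ID) -> station name. The IDs are invented.
STATION_NAMES = {
    ("2", "S201"): "Xizhimen",
    ("4", "S401"): "Xizhimen",
    ("4", "S409"): "Xidan",
}

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def clean_trips(raw_records, station_names=STATION_NAMES):
    """Return (card_id, origin, entry_time, destination, exit_time) for valid trips."""
    trips = []
    for card_id, in_line, in_id, in_time, out_line, out_id, out_time in raw_records:
        # Replace line and station ID by the station name, so that the
        # different lines of one interchange station collapse to one station.
        origin = station_names[(in_line, in_id)]
        destination = station_names[(out_line, out_id)]
        t_in = datetime.strptime(in_time, TIME_FORMAT)
        t_out = datetime.strptime(out_time, TIME_FORMAT)
        if origin == destination:
            continue  # entry and exit stations are identical
        if t_out < t_in:
            continue  # exit time is earlier than entry time
        trips.append((card_id, origin, t_in, destination, t_out))
    return trips


raw = [
    ("c1", "2", "S201", "2016-03-07 08:00:00", "4", "S409", "2016-03-07 08:20:00"),
    ("c2", "2", "S201", "2016-03-07 08:05:00", "4", "S401", "2016-03-07 08:09:00"),
    ("c3", "4", "S409", "2016-03-07 09:00:00", "4", "S401", "2016-03-07 08:50:00"),
]
valid = clean_trips(raw)  # only the record of card c1 survives
```

In this toy input, card c2 is dropped because both of its IDs map to Xizhimen, and card c3 is dropped because its exit time precedes its entry time.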
Shop consumer data are collected from dianping.com, a Chinese business review website similar to Yelp in the US. The data contain five items: station name, shop name, shop location, average price of each shop, and review score of each shop. First, we calculate the Euclidean distance between each shop and its nearest station and discard shops that are more than 800 m away. Then, for each shop category (c), namely catering, entertainment, and shopping, we calculate the average price (av) and price variance (v) of consumption for every station using Equations (10) and (11), where N_c is the number of shops in category c in the catchment area of a station, pr_i is the average price of the ith shop, and cf_i is the ratio of the review score of the ith shop to the sum score of all category c shops, which reflects the attractiveness of a shop.
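As an illustration of the two steps described here, the sketch below filters shops by the 800 m catchment and then computes an attractiveness-weighted average price and variance. Equations (10) and (11) are not reproduced in this text, so the weighted forms used here (av as the sum of cf_i · pr_i, and v as the sum of cf_i · (pr_i − av)²) are an assumption built only from the symbol definitions, not the published formulas. The shop records and the use of projected coordinates in metres are also hypothetical.

```python
import math

CATCHMENT_M = 800.0  # catchment radius stated in the text


def shops_in_catchment(shops, station_xy, radius=CATCHMENT_M):
    """Keep shops whose Euclidean distance to the station is at most `radius`.

    Coordinates are assumed to be projected, in metres.
    """
    sx, sy = station_xy
    return [s for s in shops if math.hypot(s["x"] - sx, s["y"] - sy) <= radius]


def weighted_price_stats(shops):
    """Attractiveness-weighted mean and variance of price (assumed form).

    cf_i is the review score of shop i divided by the summed score of all
    shops in the same category, as defined in the text.
    """
    total_score = sum(s["score"] for s in shops)
    cf = [s["score"] / total_score for s in shops]
    av = sum(w * s["price"] for w, s in zip(cf, shops))
    v = sum(w * (s["price"] - av) ** 2 for w, s in zip(cf, shops))
    return av, v


# Hypothetical catering shops around one station located at the origin.
catering = [
    {"x": 100.0, "y": 0.0, "price": 40.0, "score": 20.0},
    {"x": 0.0, "y": 300.0, "price": 60.0, "score": 30.0},
    {"x": 900.0, "y": 0.0, "price": 500.0, "score": 25.0},  # outside 800 m
]
nearby = shops_in_catchment(catering, (0.0, 0.0))
av_cater, v_cater = weighted_price_stats(nearby)
```

With these toy values, the third shop is excluded by the distance filter, and the two remaining shops receive weights 0.4 and 0.6.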
Shop consumer data of Xizhimen station, available online, are used to illustrate the preprocessing. First, we group the shops into three categories based on their types; then, we calculate the average price and price variance for each category. Taking catering shops as an example, each shop is reviewed on three items (taste, environment, and service), each on a 10-point scale. For each item, we calculate the average score and assign it to the shops for which that score is missing. We sum the item scores to obtain the total score ts_i of each shop; the average total score of the catering shops in the Xizhimen area is as = 22.29. The attractiveness of each shop is calculated as cf_i = ts_i/as. Combining the attractiveness with the price of each shop, Equation (10) gives the average catering price in the Xizhimen area, av_cater = 45.92, and Equation (11) gives the price variance, v_cater = 1261.67. The same processing for entertainment and shopping in the Xizhimen area yields av_enter = 36.02, v_enter = 999.82, av_shop = 477.5, and v_shop = 415,973.4.
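The score handling in this worked example can be sketched as follows. The three item names come from the text; the shop scores are invented for illustration and are not the Xizhimen data, and the function name is hypothetical.

```python
import math

ITEMS = ("taste", "environment", "service")  # review items named in the text


def total_scores_and_attractiveness(shops, items=ITEMS):
    """Impute missing item scores with the item mean, then compute ts_i and cf_i.

    A missing score is represented by None (an assumption of this sketch).
    Returns (ts, avg_total, cf) where cf_i = ts_i / avg_total.
    """
    item_means = {}
    for item in items:
        known = [s[item] for s in shops if s[item] is not None]
        item_means[item] = sum(known) / len(known)

    ts = [
        sum(s[item] if s[item] is not None else item_means[item] for item in items)
        for s in shops
    ]
    avg_total = sum(ts) / len(ts)
    cf = [t / avg_total for t in ts]
    return ts, avg_total, cf


# Hypothetical catering shops; None marks a missing item score.
shops = [
    {"taste": 8.0, "environment": 7.0, "service": 9.0},
    {"taste": 6.0, "environment": None, "service": 6.0},
    {"taste": 7.0, "environment": 8.0, "service": None},
]
ts, avg_total, cf = total_scores_and_attractiveness(shops)
```

Because cf_i is scaled by the average total score, the cf_i values average to one, so a shop with cf_i above one is more attractive than the typical shop in its category.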
The house price data are collected from a real estate website and contain three items: station name, house location, and rental or selling price. Houses within the 800 m catchment area of a station are selected for the analysis.
4.2. Model Implementation
First, we implement the HMM following the algorithm flow shown in Figure 4; a separate numerical simulation study was conducted [48]. From the results, we identified six observation clusters, as shown in Table 1. Based on the observation clusters, four activity types were inferred with three trip purposes, as presented in Table 2. Activities 1 and 4 both fall under the trip purpose “Work”, possibly because companies in Beijing manage attendance differently.
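To make the inference step concrete, the sketch below decodes the most likely sequence of hidden activities from a sequence of observation clusters with the Viterbi algorithm, a standard decoding method for HMMs. It is a toy example under stated assumptions: it uses two hidden states and three observation labels rather than the four activity types and six observation clusters of this study, all probabilities and label names are invented, and parameter estimation is not shown.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state path for `obs` (log-space Viterbi).

    All probabilities must be strictly positive in this sketch.
    """
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this step.
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (
                score[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][o])
            )
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


# Hypothetical parameters (not estimated from the SCD).
STATES = ("home", "work")
START = {"home": 0.3, "work": 0.7}
TRANS = {
    "home": {"home": 0.2, "work": 0.8},
    "work": {"home": 0.8, "work": 0.2},
}
EMIT = {
    "home": {"morning_long": 0.1, "evening_long": 0.8, "midday_short": 0.1},
    "work": {"morning_long": 0.8, "evening_long": 0.1, "midday_short": 0.1},
}

observed = ["morning_long", "evening_long", "morning_long", "evening_long"]
activities = viterbi(observed, STATES, START, TRANS, EMIT)
```

For this input, the decoded chain alternates between work and home, which is the kind of trip chain the subsequent mobility analysis builds on.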
Based on these results, each passenger’s trip purpose is inferred from the SCD using the HMM to generate trip chains; the percentages of passengers traveling to work and home at different times of the day are shown in Figure 5. Because detailed survey data are lacking for Beijing, the results of the Household Interview Travel Survey (HITS) and the Future Mobility Survey (FMS) in Singapore [49] are used to verify the results of this study. The HMM results are consistent with the FMS results during peak hours and match the HITS results better during off-peak hours. Because trips during peak hours are mainly to home or to work [6,7], they are more regular and predictable than trips in other periods [5]. In this study, the HMM captures these trips with high accuracy, as does the FMS [49]. During off-peak hours, trip purposes vary; however, this study only analyzed continuous trips by metro using smart cards. Therefore, its limited sample size may lead to under-reporting of related trips, as in the HITS [49]. Notably, the morning peak in the HITS and FMS results is earlier than in the HMM result, because work starts earlier in Singapore than in Beijing. Overall, the accuracy of the HMM results is similar to or even higher than that of the HITS, and the results are suitable for in-depth analysis.
Subsequently, individual mobility is derived from the trip chains. The location economic feature data are then combined, a comprehensive consumption indicator set is formulated for each passenger, and five principal components are extracted, as shown in Table 3. The first component carries the information on commuting distance and living cost, and commuting distance has a significant negative relationship with living cost. This result means that as income rises, people commute shorter distances by public transit, and it confirms that public transit is an inferior good for commuting trips. Catering, entertainment, and shopping consumption each form a separate component, which shows that they have different relationships with income. Finally, trip frequency and mobility diversity fall in the same component, which means they have the same relationship with income.
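The idea that two negatively related indicators can load on one component can be illustrated with a small principal component sketch. It standardizes a toy indicator table, builds the correlation matrix, and extracts only the first component by power iteration; the study itself extracts five components, and its extraction procedure is not specified here. The indicator values and function names are hypothetical.

```python
import math


def standardize(rows):
    """Z-score each column (sample standard deviation)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, d = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(d)] for a in range(d)]


def first_component(corr, iters=500):
    """Largest eigenvalue and unit eigenvector of `corr` by power iteration.

    Starts from the first unit vector, so it assumes the first indicator
    has a nonzero loading on the leading component.
    """
    d = len(corr)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[a] * sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    return eigenvalue, v


# Hypothetical passengers: (commuting distance, living cost, catering spend).
indicators = [
    (20.0, 3.0, 50.0),
    (18.0, 4.0, 30.0),
    (15.0, 5.0, 60.0),
    (10.0, 7.0, 40.0),
    (6.0, 9.0, 55.0),
    (4.0, 10.0, 35.0),
]
corr = correlation_matrix(standardize(indicators))
eigenvalue, loadings = first_component(corr)
```

In this toy table, commuting distance and living cost receive loadings of opposite sign on the first component, while catering spend contributes little to it.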
4.3. Results
Finally, passengers are classified into six consumption levels using the k-means method, as shown in Figure 6.
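For readers unfamiliar with the method, a minimal k-means sketch is given below. It is illustrative only: the study clusters passengers into six levels, whereas this toy example uses three clusters of two-dimensional points, and the deterministic farthest-first initialization is an assumption, since the initialization used in the study is not stated.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic initial centers: start at the first point, then
    repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centers)."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# Hypothetical component scores for nine passengers.
scores = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.8),
]
labels, centers = kmeans(scores, k=3)
```

In practice, the inputs would be each passenger's principal component scores rather than raw coordinates, and the number of clusters would be set to six.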
From Figure 6a,b, we infer that, under our assumptions, passengers in cluster 3 have the highest income among all metro passengers, followed by those in cluster 2. Furthermore, clusters 1 and 4 are the middle-income groups, and clusters 5 and 6 are the low-income groups. The distribution of commuting distance in Figure 6c shows a negative relationship with metro passenger income. As for superior consumption, clusters 3, 5, and 6 maintain low expenditure on catering, entertainment, and shopping. High-income passengers have more options, such as private cars or taxis, for flexible trips like shopping, while low-income passengers are constrained by the ratio of expenditure to income [50]. Therefore, both groups take fewer superior consumption trips via the metro than other passengers, and their expenditure is lower. Interestingly, the other passengers have different consumption preferences for superior goods. Passengers in cluster 2, who have higher income than those in clusters 1 and 4, prefer to take the metro to high-consumption shopping areas; passengers in cluster 1 prefer to take the metro to expensive dining areas and spend more on dining out; and passengers in cluster 4 prefer to travel by metro to expensive entertainment areas. Individual trip frequency and mobility diversity by metro have no significant correlation with passenger income, as is evident from Figure 6d,h. More specifically, the average trip frequency is about nine for all clusters except cluster 6, which implies that these passengers mainly use the metro for commuting.