1. Introduction
In recent years, China’s urban rail transit has developed rapidly. Its network scale and passenger flow rank first in the world, reaching 9206.8 km and 23.69 billion passenger trips, respectively, in 2021 [1]. In 2020, 5 cities in mainland China ranked among the top 10 for urban subway passenger flow out of 137 cities in 46 countries [2]. The vast network and passenger flow make the daily risk control of urban rail transit increasingly demanding. At the same time, incidents such as begging, peddling, soliciting, and theft still occur, seriously affecting the safety of the urban rail transit system and constituting a main source of risk. Therefore, identifying abnormal passenger behavior has become an important task in the safe operation and management of urban rail transit.
Many scholars have studied abnormal passenger behavior. Pan et al. [3] detected anomalies using human movement data and social media data. Zhao et al. [4] analyzed passenger travel patterns from temporal, spatial, and spatiotemporal perspectives to uncover hidden patterns and anomalies. Wang [5] explored the travel patterns of abnormal rail transit passengers based on Beijing’s rail transit card data, extracted travel pattern features, and identified passengers with abnormal travel. Zhao [6] used long-term smart card transaction data to explore passenger travel patterns and applied statistical methods to analyze the distribution characteristics of those patterns and the temporal and spatial anomalies of abnormal passengers.
Currently, the main method for identifying abnormal passenger behavior is the absolute threshold method, which ignores the large differences between origin-destination (OD) pairs and can easily miss abnormal passenger behavior. Liu [7] analyzed the abnormal behavior of typical card numbers in Beijing’s rail transit OD data, classified it into three categories, and judged abnormal behavior using a fixed threshold. Xue et al. [8] analyzed long-term stationary passengers based on a fixed time using the subway automatic fare collection (AFC) system in 2019. Yu et al. [9] analyzed long- and short-term passenger records based on a fixed threshold using passenger OD data from the subway AFC system.
To study critical values that account for different factors, Li et al. [10] proposed the concept of the relative threshold index in their research on extreme temperature events. Ouyang [11] used quantiles to analyze indicator risk. There are two main types of threshold-selection methods: qualitative and quantitative. Han [12] analyzed the trends of extreme high and low temperatures and extreme precipitation in Southern Xinjiang over the past 51 years using the percentile threshold method and studied their impact on agricultural production. Wang [13] used the percentile method to determine the spatial distribution of hourly extreme precipitation thresholds and considered the corresponding thresholds for national stations and recurrence periods.
Libardo et al. [14] analyzed the percentage changes in transportation demand caused by fluctuations in GDP per unit, taking an Italian region as the main study object, and used the results to make predictions and to derive the relationship between transportation demand and GDP. Rich et al. [15] argued that long-distance travel is becoming increasingly important and estimated different types of tourism elasticity from travel data, taking into account factors such as travel distance and purpose.
The quantitative method mainly selects the optimal threshold by analyzing one-sided data. Neil [16] used a parametric curve fitting method to analyze the tail of the distribution and study the pricing of high excess losses in insurance. Goegebeur et al. [17] determined the optimal threshold by analyzing the tail index of the curve. However, considering only one-sided data can bias the threshold too high or too low. To address the limitations of a one-sided curve, Tang [18] designed a segmented curve for the saddle-shaped performance curve of a water pump and fitted it with least squares polynomials. Chen [19] built a logistic curve of the urbanization level, divided into three stages: the initial stage, the rapid stage, and the saturation stage. Duan [20] proposed a curve segmentation method weighted by the starting point, which has small fitting errors and is easy to program. In terms of transportation quality evaluation, Nocera [21,22] emphasized the importance of passenger travel quality, which helps policy makers make sound judgments about future plans, and also proposed practical methods for quality evaluation.
The literature review shows that existing studies pay insufficient attention to the temporal characteristics of abnormal passenger behavior. The analysis is not detailed enough, and the methods are too simple and poorly targeted to identify abnormal travel times effectively, so a large share of abnormal passenger behavior can be missed. The existing threshold determination methods in various fields are mainly divided into qualitative and quantitative methods. Qualitative methods determine thresholds through observation and experience, so they are subjective, give varying results, and have poor stability. Quantitative methods either calculate thresholds directly from formulas or determine them from the distribution of one-sided tail data.
This paper considers the travel time distribution of different ODs and uses relative thresholds for discrimination. Building on current quantitative research, and because the distribution characteristics on the two sides of the quantile curve differ significantly, a bilateral fitting method [23,24,25,26,27] is proposed for threshold determination.
Figure 1 presents the overall technical roadmap of the research. The specific content is as follows:
(1) Passenger travel time analysis based on OD data. The OD data are preprocessed, passenger travel times are calculated, and the proportion of abnormal passenger behavior under the absolute threshold is analyzed in the spatial and temporal dimensions.
(2) Relative threshold as the discrimination criterion. Taking into account the travel time distribution of different ODs, the distribution characteristics of the travel time of individual ODs and of the total population are analyzed, the characteristics of the curve mutation are extracted, and the mutation range is analyzed. On this basis, the idea of using relative thresholds as the determination standard is proposed.
(3) Relative threshold calculation based on bilateral fitting. First, the data on the left and right sides of the curve are fitted separately, and the threshold of a single OD is determined from the combined goodness of fit. Then, the average percentile method is used to determine the relative threshold for multiple ODs.
(4) Case analysis and validation. The bilateral fitting results are visualized, and the relative threshold values for multiple ODs are calculated and analyzed. The consistency of the threshold quantile results for different ODs is tested to verify the rationality and effectiveness of the method.
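The bilateral fitting idea in step (3) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the quantile curve of one OD is given as paired lists `levels` (quantile levels) and `values` (travel times, all positive), that each side is fitted with an exponential function y = a·exp(b·x) by least squares on ln(y), and that the "combined goodness of fit" is the mean of the two coefficients of determination. The function names, the `min_points` parameter, and the averaging rule are hypothetical choices made for illustration.

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of y = a * exp(b * x), performed on ln(y); needs y > 0."""
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_ly = sum(log_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys))
    b = sxy / sxx
    a = math.exp(mean_ly - b * mean_x)
    return a, b

def r_squared(xs, ys, a, b):
    """Coefficient of determination of the fitted curve in the original scale."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

def bilateral_threshold(levels, values, min_points=3):
    """Try every split point, fit both sides separately, keep the best split.

    Returns the quantile level that starts the right-hand segment and the
    combined score (assumption: mean of the left and right R^2).
    """
    best_level, best_score = None, -math.inf
    for k in range(min_points, len(levels) - min_points + 1):
        score = 0.0
        for xs, ys in ((levels[:k], values[:k]), (levels[k:], values[k:])):
            a, b = fit_exponential(xs, ys)
            score += 0.5 * r_squared(xs, ys, a, b)
        if score > best_score:
            best_level, best_score = levels[k], score
    return best_level, best_score

def average_threshold_quantile(od_levels):
    """Average the per-OD threshold quantile levels (assumed aggregation rule)."""
    return sum(od_levels) / len(od_levels)
```

On a synthetic curve that grows slowly up to a break and steeply afterwards, the selected split coincides with the break, because any other split forces one of the two fits to cover both regimes and lowers its R².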
2. Travel Time Analysis
2.1. Data Preparation
Guangzhou Metro carries a large volume of passenger traffic, and our partnership with the company gave us access to the relevant data. This paper therefore selects Guangzhou Metro as the research object. As of 2021, Guangzhou Metro has 24 subway lines and 411 operating stations, with a network length of 744.5 km.
This paper uses passenger card-swipe data from 27 days in 2018, comprising over 30 million swipe records. Each swipe record contains 13 attributes, such as user card number, recharge time, balance, entry/exit station time, entry/exit station line, and entry/exit station name, and fully records the passenger’s entry/exit time and location. To simplify data processing while meeting the requirements of the study, 7 of the 13 attributes are retained: user card number, entry station, exit station, entry time, exit time, entry station line, and exit station line.
The original data contain some invalid records that may affect the analysis, such as records in which the entry time and exit time are identical because of system errors, and records with excessively long travel times that violate the subway travel requirements. Therefore, the original data are preprocessed to remove invalid and erroneous records and ensure a more accurate analysis of travel data. The main steps are as follows:
(1) Trip duration calculation. Trip duration is the time that a passenger spends in the rail transit system, determined from the times at which the passenger swipes the card to enter and exit the station. For greater precision, time is measured in seconds.
(2) Deleting erroneous data. The original card-swipe records contain a large number of records in which the entry time equals the exit time, which is inconsistent with an actual ride. These records are deleted.
(3) Determining passenger travel peak periods. In this paper, travel between 7:30 a.m. and 9:30 a.m. is referred to as the morning rush hour, and travel between 5:30 p.m. and 7:30 p.m. is referred to as the evening rush hour. Travel during other times is referred to as off-peak travel.
(4) Standardization. Because distances and numbers of stations differ greatly between OD pairs, the trip time data of different ODs are transformed into standardized time data.
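The four preprocessing steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pipeline used in the study. The record layout, the timestamp format, the use of the entry time to assign the peak period, half-open peak windows, and a per-OD z-score as the standardization are all assumptions, since the text does not specify them. The removal of excessively long trips is omitted because no cutoff is given.

```python
from collections import defaultdict
from datetime import datetime, time
from statistics import mean, pstdev

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def peak_period(entry):
    """Step (3): label a trip by its entry time (assumed half-open windows)."""
    t = entry.time()
    if time(7, 30) <= t < time(9, 30):
        return "morning_peak"
    if time(17, 30) <= t < time(19, 30):
        return "evening_peak"
    return "off_peak"

def preprocess(records):
    """records: iterable of (card, entry_station, exit_station, entry_time, exit_time)."""
    trips = []
    for card, origin, dest, t_in, t_out in records:
        entry = datetime.strptime(t_in, FMT)
        exit_ = datetime.strptime(t_out, FMT)
        duration_s = (exit_ - entry).total_seconds()  # step (1): seconds
        if duration_s <= 0:  # step (2): drop equal times (and, by assumption, negative ones)
            continue
        trips.append({
            "card": card,
            "od": (origin, dest),
            "duration_s": duration_s,
            "period": peak_period(entry),
        })

    # Step (4): standardize within each OD pair (z-score is an assumed choice).
    by_od = defaultdict(list)
    for trip in trips:
        by_od[trip["od"]].append(trip["duration_s"])
    for trip in trips:
        durations = by_od[trip["od"]]
        mu, sigma = mean(durations), pstdev(durations)
        trip["z"] = (trip["duration_s"] - mu) / sigma if sigma > 0 else 0.0
    return trips
```

Standardizing within each OD pair, rather than over the whole network, keeps a long trip on a long route from being treated the same as a long trip on a short route.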
2.1.1. Absolute Threshold Index of Travel Time
In this paper, abnormal behavior in subways is identified based on travel time, and the threshold value is the key factor in determining whether passenger behavior is abnormal. In line with the provisions of Guangzhou Metro Company, 270 min is taken as the critical value for judging abnormal passenger behavior. A method that uses such a fixed value as the critical value is called the absolute threshold method. Abnormal behavior under the absolute threshold is analyzed in two dimensions: space and time.
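As a small illustration of the absolute threshold method, the sketch below flags trips whose duration exceeds the fixed critical value. The 270 min value comes from the text. The function names and the choice of a strict comparison (a trip of exactly 270 min is not flagged) are assumptions. Durations are expected in seconds, matching the preprocessing unit.

```python
ABSOLUTE_THRESHOLD_S = 270 * 60  # 270 min expressed in seconds

def is_abnormal_absolute(duration_s, threshold_s=ABSOLUTE_THRESHOLD_S):
    """Return True when the trip duration exceeds the fixed threshold (strict, by assumption)."""
    return duration_s > threshold_s

def flag_trips(durations_s, threshold_s=ABSOLUTE_THRESHOLD_S):
    """Return the list of durations judged abnormal under the absolute threshold."""
    return [d for d in durations_s if is_abnormal_absolute(d, threshold_s)]
```

Because the same value applies to every OD pair, the rule cannot distinguish a trip that is long for a short route from one that is ordinary for a long route.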
2.1.2. Analysis of Abnormal Behavior Based on Spatial Dimension
(1) Spatial dimension indicator determination
The number of interval stations between the origin and the destination is selected as the spatial index for analysis. It reflects the spatial travel distance of passengers.
(2) Analysis of the anomaly ratio considering the number of interval stations
The proportion of abnormal behavior for ODs with different numbers of interval stations is shown as a bar chart in Figure 2. The abscissa represents the number of interval stations between the origin and destination, and the ordinate represents the proportion of abnormal behavior.
As can be seen in Figure 2, the proportion of abnormal behavior of passengers varies with the number of OD interval stations, showing an overall increasing trend. When the number of OD interval stations is between 1 and 15, the proportion of abnormal behavior is between 2% and 6%. When the number of interval stations is between 16 and 23, the proportion of abnormal behavior is generally high, between 8% and 18%.
Under the absolute threshold, the number of stations in the spatial dimension has a significant impact on the proportion of anomalies. Generally, as the number of stations increases, the proportion of anomalies gradually increases. This is because the threshold is set high to accommodate passengers whose travel times are longer because of the larger number of stations, so abnormal behavior is concentrated mainly in ODs with more stations. Conversely, when the number of stations is small, the proportion of anomalies is relatively low. This is both because the high threshold makes abnormal passenger behavior harder to detect and because differences in passenger behavior over short trips are small, so abnormal behavior is easily overlooked. Based on the above analysis, different thresholds need to be set for different numbers of stations.
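The proportion analysis described above amounts to a grouped ratio. The following sketch is hypothetical and assumes trips are given as (group key, duration in seconds) pairs, where the key is the number of interval stations. The function name and input layout are illustrative only.

```python
from collections import defaultdict

def anomaly_ratio_by_group(trips, threshold_s=270 * 60):
    """Share of trips above the fixed threshold within each group.

    trips: iterable of (group_key, duration_s), e.g. group_key = number of
    interval stations between origin and destination (assumed layout).
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for key, duration_s in trips:
        totals[key] += 1
        if duration_s > threshold_s:
            flagged[key] += 1
    return {key: flagged[key] / totals[key] for key in totals}
```

The same helper can be keyed by peak period instead of station count to obtain the proportions for the time dimension.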
2.1.3. Analysis of Abnormal Behavior Based on the Time Dimension
(1) Time dimension indicator determination
The operating day is divided into three periods: morning peak, evening peak, and off-peak. The morning peak is from 7:30 to 9:30, the evening peak is from 17:30 to 19:30, and the remaining subway operating hours are off-peak.
(2) Analysis of the anomaly ratio considering the peak period
The proportion of abnormal behavior in different peak periods is shown as a bar chart in Figure 3. The abscissa represents the periods (morning peak, evening peak, and off-peak), and the ordinate represents the proportion of abnormal behavior.
As can be seen in Figure 3, the proportion of abnormal behavior of passengers varies with the peak period. The proportions of abnormal behavior in the morning peak, evening peak, and off-peak periods are 0.93%, 2.14%, and 2.15%, respectively.
In addition, Figure 3 shows that the peak period in the time dimension also has a significant influence on the proportion of anomalies. Different thresholds therefore need to be set for different peak periods.
6. Research Conclusions and Outlook
6.1. Research Conclusions
To analyze the influence of the absolute threshold on the discrimination of abnormal passenger behavior, this paper used fixed values as thresholds and studied the proportion of abnormal passenger behavior in two dimensions: spatial and temporal. In both dimensions, the number of stations and the peak period were found to have significant effects on the proportion of abnormal behavior. The results imply that the absolute threshold cannot adequately identify abnormal behavior when passenger trips differ greatly in space and time.
To propose a more effective method for discriminating abnormal behavior, this paper analyzed the trends of the overall and individual OD quantile curves and found that the distribution curves were similar in shape. Both types of curve showed mutations, and the mutation ranges were consistent, which led to the idea of relative thresholds.
The distribution characteristics of the quantile curves were studied, and it was found that the curves showed obvious mutations and that the curves on the two sides of the mutation differed greatly. A bilateral fitting method was therefore proposed, and the relative threshold values were determined by combining it with the average thousandth quantile. The method can calculate relative thresholds for different ODs, so different ODs have different thresholds and different criteria for abnormal passenger behavior.
The threshold quantile reflects the proportion of abnormal passenger behavior in different ODs, and the proportion of abnormal passenger behavior seriously affects the quality of rail transit. The calculated threshold quantile mean for multiple ODs is 92.69%, which means that 7.31% of passenger behavior in rail transit is abnormal. This proportion is high, indicating low passenger service quality and generally poor transportation quality.
Compared to previous research, the relative threshold method is better suited for situations where there are significant spatiotemporal differences in passenger travel behavior, as it can calculate relative thresholds for different origin-destination pairs. This allows for different standards of abnormal behavior and better adaptation to different OD situations, improving the accuracy of anomaly detection. The bilateral fitting method, based on the analysis of right-tail data and the consideration of normal data distribution, enhances the stability and accuracy of threshold determination.
6.2. Outlook
In this paper, when the bilateral fitting method is used to determine the threshold, the fitting model is an exponential function, the parameters are determined by least squares, and the fitting effect is evaluated by the coefficient of determination. Future research can optimize these three aspects (the fitting model, the parameter determination method, and the evaluation index) to improve the fitting effect and the accuracy of threshold determination.