Regional Traffic Event Detection Using Data Crowdsourcing

Kim, Yuna; Song, Sangho; Lee, Hyeonbyeong; Choi, Dojin; Lim, Jongtae; Bok, Kyoungsoo; Yoo, Jaesoo

doi:10.3390/app13169422


by Yuna Kim ¹, Sangho Song ², Hyeonbyeong Lee ², Dojin Choi ³, Jongtae Lim ², Kyoungsoo Bok ⁴ and Jaesoo Yoo ²,*

¹ Department of Big Data, Chungbuk National University, Cheongju 28644, Republic of Korea
² School of Information & Communication Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea
³ Department of Computer Engineering, Changwon National University, Changwon-si 51140, Republic of Korea
⁴ Department of Artificial Intelligence Convergence, Wonkwang University, Iksan-si 54538, Republic of Korea
* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(16), 9422; https://doi.org/10.3390/app13169422

Submission received: 13 July 2023 / Revised: 14 August 2023 / Accepted: 18 August 2023 / Published: 19 August 2023

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract:

Accurate detection and state analysis of traffic flows are essential for effectively reconstructing traffic flows and reducing the risk of severe injury and fatality. For this reason, several studies have proposed crowdsourcing to resolve traffic problems, in which drivers provide real-time traffic information using mobile devices to monitor traffic conditions. Using data collected via crowdsourcing for traffic event detection has advantages in terms of improved accuracy and reduced time and cost. In this paper, we propose a technique that employs crowdsourcing to collect traffic-related data for detecting events that influence traffic. The proposed technique uses various machine-learning methods to accurately identify events and location information. Therefore, it can resolve problems typically encountered with conventionally provided location information, such as broadly defined locations or inaccurate location information. The proposed technique has advantages in terms of reducing time and cost while increasing accuracy. Performance evaluations also demonstrated its validity and effectiveness.

Keywords: machine learning; crowdsourcing; event detection; transportation systems

1. Introduction

The one-car-per-person era is imminent, and traffic is rapidly increasing. Thus, it is important to develop solutions to traffic problems, such as congestion and car accidents, resulting from these situations. Traffic problems have a variety of causes such as inefficient traffic systems or inadequate infrastructure. Traffic problems are inconvenient for both transportation users and residents in traffic areas. They also have wide-ranging effects, including economic losses. Thus, methods for resolving traffic problems, such as traffic event detection techniques, have attracted attention among researchers [1,2].

Traffic event detection plays an important role in increasing the safety and efficiency of transportation systems [3]. Traffic events refer to circumstances that influence the flow of traffic. Problems, including traffic congestion and accidents, can be prevented by quickly detecting and responding to traffic events. Furthermore, challenges due to road construction and special occasions can be handled in advance. In this way, reduced accident rates and less congestion increase the safety and efficiency of traffic flow. In addition, the operation and management of traffic systems can be improved by collecting and analyzing information about traffic events. Therefore, traffic event detection techniques play an important role in improving the safety and efficiency of traffic systems [4].

Crowdsourcing is a research method that recruits a large workforce via the Internet to help solve a problem. It is used for tasks requiring large amounts of labeled data, such as event identification. These large volumes of data are collected via crowdsourcing for machine learning, which analyzes the information needed to identify events and increases the accuracy of event detection. Crowdsourcing-based traffic event identification uses human intelligence to identify traffic events [5,6]. This method uses online platforms where ordinary people gather. For example, mobile applications are used to observe and record traffic events, and then the data are uploaded to a central server. These data are provided by human workers, who review them to distinguish events and assign accurate labels. Crowdsourcing-based traffic event identification can be used for various purposes. For example, the authors of [7] used crowdsourcing to design a driver assistance system, and the authors of [8] also used crowdsourcing to predict road traffic conditions. In these ways, crowdsourcing can measure the degree of road traffic congestion and rapidly respond to traffic accidents.

Traffic events occur under various circumstances. Manually classifying these events is difficult and costly. For example, it is difficult to manually count and classify all the cars in a traffic jam, where vehicles block a street. Consequently, manual classification cannot be used to accomplish real-time event detection and handling due to its time-consuming nature. Machine learning uses existing events to inform and identify new data based on what the models have learned. For example, the authors of [9] proposed an event detection technique that collects real-time tweets and uses text mining. The authors of [10] suggested collecting traffic-related tweets and preprocessing and analyzing the data using a big data processing platform known as SAP HANA. Traffic events can be quickly and accurately classified through such efforts, and various services such as traffic congestion prediction can be provided. Therefore, machine learning is an essential technology for traffic event identification. However, conventional techniques generally use entire datasets when collecting and classifying data. Since such methodologies use raw social data, they may include unnecessary data, which complicates analysis. In traffic event identification, the characteristics of the data vary greatly according to the region where traffic occurs and the type of event. Therefore, to distinguish events, relevant data and geographical locations must be selected. Currently, studies are being conducted on event detection techniques using the geotag functions provided by social media to extract regional information [11]. Geotagging involves tagging posts with geographically identifiable data accessible to other users. However, event detection techniques that rely on geotags are limited by poor accuracy because only 2% of social media users actually use geotags.

In this paper, we propose a traffic event detection technique that uses social media-based machine learning. Traffic events are identified via machine learning, and the keywords in the text are used to extract the locations where the events occurred. This study provides the following contributions:

Collecting data via crowdsourcing reduces the ratio of irrelevant data;
Self-learned word embedding makes event detection more effective and flexible regardless of the social media service used;
We compared several models and selected the model with the best performance;
Geographic locations can be determined through text mining rather than social media functions.

This paper is organized as follows: Section 2 analyzes and describes the problems of existing techniques. Section 3 introduces the processes and content of the proposed traffic event detection technique. Section 4 discusses the performance results of the proposed method. Finally, Section 5 presents the conclusions of this paper and suggestions for follow-up research.

2. Related Work

Existing traffic event detection techniques have adopted methods that improve detection accuracy by combining different kinds of data or using social media in conjunction with machine learning.

For example, the authors of [12] proposed an architecture that detects traffic events using social media and taxi GPS data to increase event detection accuracy. This technique combines different data to improve accuracy, i.e., social media data to distinguish traffic problems and GPS data to extract spatiotemporal information. It also uses density-based clustering to group related roads into single groups and analyze them. Taxi GPS data are used to determine the times and locations of traffic abnormalities, whereas social data are used to describe the causes of traffic abnormalities. However, the presented framework can only detect traffic events and not the precise causes of the events.

Meanwhile, the authors of [13] proposed the Smart Traffic Management Platform (STMP), which integrates and analyzes sensors and social media to detect traffic events and increase event detection accuracy. STMP detects concept drift and integrates heterogeneous big data streams, such as IoT, smart sensors, and social media, to distinguish repeating and non-repeating traffic events. This information can be used to monitor the spread of event influence, predict traffic flow, analyze commuter sentiment, and determine optimal traffic control. However, this system requires semantic information to process sensor and social media data.

The authors of [14] proposed a detection method for several emergency situations, including traffic situations, using machine learning to identify events. This technique uses binary classification to select data indicating emergency situations from social data and multi-class classification to classify event types. In addition, a bidirectional long short-term memory (BiLSTM) model is used to extract time and location information. Finally, grouping is performed based on similarity calculations of event type, time, and location. Through grouping, this technique recognizes different data points as a single event and tracks how events change over time. However, it performs poorly for traffic events, where location is important, because its similarity calculation formula uses the same weights for time and location.

The researchers of [15] used machine learning to distinguish complaints regarding road irregularities and poor road conditions. This approach extracts places and people through Twitter based on entity name recognition. The latitude and longitude of each entity name are obtained through the OpenStreetMap API. In addition, they used machine learning to classify tweets into three categories: useful, normal, and not useful. Tweets classified as “Normal” are converted into useful tweets using techniques proposed by the researchers. However, this approach has exhibited poor performance in terms of precision and recall.

In reference [16], the authors proposed a contextual word-embedding method that combines a convolutional neural network (CNN) model and a Bidirectional Encoder Representations from Transformers (BERT) model to detect traffic events. This technique shows that a CNN model created for image processing can also be used for text processing. Furthermore, it shows that two or more models can be combined and used together instead of a single model. This technique has demonstrated optimal performance. However, it can be improved to extract the times and locations of traffic events to perform additional analyses.

In reference [17], the researchers used Twitter to detect events influencing traffic congestion and proposed an automatic labeling approach that assigns labels automatically to cope with the large volume of data. However, because the method is based on an existing dictionary, it produces errors when confronted with keywords not included in the dictionary. As a result, it cannot detect events when posts use place names rather than cities, which is complicated by the fact that 98% of Twitter users do not use geotags. User profiles can also introduce location information errors, particularly when users post content from locations different from the ones specified in their profiles.

3. Proposed Regional Traffic Event Detection Technique

3.1. Structure of Proposed Technique

Existing event detection methods that use social media data have a few problems. First, detection accuracy decreases when data are insufficient or when the ratio of unrelated posts to posts about the event of interest is unbalanced. Second, geographical location information (geotags) on social media is limited. Therefore, accuracy can be poor in fields of application where location information is essential. Furthermore, when different data are combined and used, inconsistencies can occur. As the amount of data increases, accuracy can decrease because preprocessing becomes more complex and unnecessary information is included. In the case of sensor data, malfunctioning devices may collect unreliable values.

In this study, data collected via crowdsourcing are applied to a machine-learning model to perform event detection. The collected data are first preprocessed and the features are extracted. A machine learning model is trained based on the extracted features. The trained model later receives new data as input and judges whether the data are related to an event. The machine learning method uses a classification model. This model classifies the input data into predefined classes. It accomplishes this by learning the decision boundary, which is the standard by which the model makes decisions based on the training data. Later, when new data are entered as input, the model determines which class the data belongs to according to the decision boundary and outputs the classification results. In this paper, we present a machine learning method that preprocesses data collected via crowdsourcing and trains a classification model based on these data to perform event detection. This method improves the accuracy of event detection.

Figure 1 shows the proposed technique’s overall structure, which consists of three modules: Data Preprocessing, Event Type Extraction, and Event Place Extraction. Each module is necessary to increase the accuracy of event detection. The first module, the Data Preprocessing module, includes a Tokenizer, Remove Noise and Stopwords, and Noun Extraction stages. It refines the collected data by removing any noise and stopwords. Social data contain many grammatically incorrect expressions, which reduce event detection accuracy. The second module, the Event Type Extraction module, identifies events that the collected data represent. The Event Selection stage first selects traffic-related events; then, the Categorize Selected Event Types stage determines the traffic-event types of the selected events. The final module, the Event Place Extraction module, extracts the event locations represented in the data. Since traffic events occur in specific regions, the events’ location information must be known. Therefore, keywords representing regions are identified at the Keyword Selection stage, and the administrative districts where the events occurred are extracted at the Return Administrative District stage. In this way, regional information can be obtained.

3.2. Data Preprocessing

Social data are freely expressed by users and are irregular by nature. They vary in terms of structure and form and almost always contain typographical errors, special characters, etc. Analysis techniques that use machine learning are greatly influenced by these data forms. Raw data may incur errors or reduce accuracy during event detection. As such, data refinement processes are needed. The following data refinement methods represent nonstandard data in a standardized form and promote accurate event detection.

Tokenization is a process that converts complex text into word sets known as tokens. Data consisting of text include spaces, punctuation, mathematical symbols, special characters, and typographical errors. In the proposed system, characters other than Korean letters and numbers are removed, and an N-gram tokenization approach is used to divide each part of the text into words. After this stage, each text in the corpus is expressed as a series of words for additional processing.

Unnecessary data negatively impact machine learning performance. Therefore, the proposed approach removes unnecessary data, such as noise and stopwords, from the text. Noise includes symbols such as ‘@’, ‘*’, and ‘_’, as well as other special characters, punctuation, and typographical errors. The Korean language uses consonants and vowels, specifically in patterns of initial consonants, medial vowels, and final consonants. Text that does not follow this pattern is considered meaningless. Therefore, texts containing only consonants or only vowels are considered typographical errors and removed. Stopwords, on the other hand, are words that appear often but carry little meaning. For this study, words such as “urgent” or “first” were selected as stopwords and removed.
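The refinement steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the Unicode ranges, the stopword list (hypothetical Korean renderings of "urgent" and "first"), and the helper names `clean_tokens` and `ngrams` are all assumptions. It also strips runs of standalone consonants or vowels wherever they occur, which is slightly broader than removing texts that consist only of them.

```python
import re

# Keep digits, whitespace, Hangul syllables (U+AC00-U+D7A3) and
# Hangul compatibility jamo (U+3131-U+318E); everything else is noise.
NON_KOREAN = re.compile(r"[^0-9\s\uAC00-\uD7A3\u3131-\u318E]")
# Runs of standalone consonants/vowels, e.g. laughter typed as bare jamo.
LONE_JAMO = re.compile(r"[\u3131-\u318E]+")
# Hypothetical stopword list; the paper's actual list is not given.
STOPWORDS = {"긴급", "우선"}


def clean_tokens(text, stopwords=STOPWORDS):
    """Remove noise characters and bare jamo, then split and drop stopwords."""
    text = NON_KOREAN.sub(" ", text)
    text = LONE_JAMO.sub(" ", text)
    return [tok for tok in text.split() if tok not in stopwords]


def ngrams(tokens, n=2):
    """Return the list of consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = clean_tokens("긴급 @경부고속도로* 3중 추돌 사고 ㅋㅋ")
bigrams = ngrams(tokens, 2)
```

Noun extraction with a morpheme analyzer would follow this step; it is omitted here because it depends on an external analyzer.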

Index words or keywords representing sentences generally take the form of nouns. Therefore, by extracting nouns to convey meaning accurately, the proposed method can reduce the number of text expressions handled during machine learning and increase accuracy. In this study, we used a morpheme analyzer to perform noun extraction.

3.3. Event Type Extraction

Event type extraction is the process by which events in traffic data are classified into specific types. It plays an essential role in understanding traffic conditions and quickly responding to or preventing them. Traffic safety and efficiency can be increased through event type extraction, which is performed by machine learning. Traffic data should first be processed by a text classification algorithm to extract important information influencing the event type. For this purpose, natural language processing (NLP) and machine learning algorithms are generally used. Data refinement, preprocessing, and feature extraction processes are also performed to increase extraction accuracy. Together, these processes make accurate and reliable event type extraction possible.

Figure 2 shows an overall flowchart of the event classification process. Event type extraction consists of two stages: Event Selection and Categorize Selected Event Types. In the Word Embedding stage, term frequency–inverse document frequency (TF–IDF) values for each word are calculated, vectorized, and stored in the model. Vectorization is performed by the model when storing the words in each post. At the Event Selection stage, the data are divided into relevant and irrelevant data, and only data that influence traffic conditions are selected. This step prevents inaccuracies in the traffic event extraction stage. Subsequently, the Categorize Selected Event Types stage determines the event types of the selected traffic events.

Table 1 shows the event types and event meanings. In this study, traffic-related data were classified into six events (accident, construction, special events, weather, congestion, and others) according to traffic events defined by Korea’s National Transportation Information Center [18].

In this study, TF–IDF was used as the word-embedding method for machine learning. TF–IDF is a statistical method that calculates the relative importance of words in text data and assigns weight values based on word frequency throughout an entire document [19]. Thus, it reflects the importance of each word in a document. Through this approach, the model can know the meaning and relative importance of words to accurately identify event types.
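As a rough illustration of how TF–IDF weights arise, the sketch below uses the textbook definition, in which the weight of a term in a post is its relative frequency in that post multiplied by log(N / df), where N is the number of posts and df the number of posts containing the term. The paper does not state which variant was used, so this formula is an assumption; libraries such as scikit-learn apply a smoothed variant by default, which yields different values.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy posts (English stand-ins for preprocessed Korean nouns).
posts = [["accident", "congestion"], ["construction", "congestion"]]
w = tf_idf(posts)
```

In this toy corpus, "congestion" appears in every post and therefore receives a weight of zero, whereas "accident" and "construction" each receive a positive weight.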

During Event Selection, a binary classifier is used to classify the data into relevant and irrelevant data, to select only the data influencing traffic conditions. Figure 3 shows an example of Event Selection. Posts that have completed preprocessing are divided by word and maintain an array form. An embedding model is used to vectorize the words in these arrays into numbers. Machine learning is applied to the vectorized values, deriving a result of 1 if the post is related to a traffic event, and 0 if the post is not.

In this study, we performed event classification using five binary classification models: naïve Bayes, random forest, support vector classifier (SVC), linear SVC, and logistic regression [20,21,22,23,24]. Binary classification refers to the problem of classifying each data item as 1 (true) or 0 (false). In other words, it classifies data into two classes according to the data form.

After Event Selection, the Categorize Selected Event Types stage classifies events according to event type. Figure 4 shows an example of the Categorize Selected Event Types stage, depicting multi-class classification. The vector values of posts selected via Event Selection for their relevance to traffic events were input into the multi-class classifier. When the results are output by the multi-class classifier, they are converted to text referring to these event types.
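The two-stage flow (Event Selection followed by Categorize Selected Event Types) can be sketched as a simple cascade. The classifier arguments below are placeholders for trained models, and the ordering of `EVENT_TYPES` is an assumption made for illustration; the paper does not specify how class indices map to event types.

```python
# Assumed index-to-label mapping for the six event types.
EVENT_TYPES = ["accident", "construction", "special events",
               "weather", "congestion", "others"]


def detect_events(vectors, is_traffic, event_class):
    """Run the binary filter first, then the multi-class classifier.

    is_traffic(vec) -> 1 if the post is traffic related, else 0.
    event_class(vec) -> index into EVENT_TYPES.
    Posts rejected by the binary filter yield None.
    """
    results = []
    for vec in vectors:
        if is_traffic(vec) != 1:
            results.append(None)
            continue
        results.append(EVENT_TYPES[event_class(vec)])
    return results


# Toy stand-ins for trained models, for demonstration only.
toy_binary = lambda vec: 1 if sum(vec) > 0 else 0
toy_multi = lambda vec: vec.index(max(vec))

labels = detect_events(
    [[0, 0, 0, 0, 0, 0], [0.9, 0.1, 0, 0, 0, 0], [0, 0, 0, 0, 0.7, 0]],
    toy_binary,
    toy_multi,
)
```

Because irrelevant posts never reach the multi-class classifier, it only has to separate the six event types from one another.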

In this study, we performed classification using six models: naïve Bayes, random forest, SVC, linear SVC, BiLSTM, and TextCNN, which is a CNN for text classification [25,26]. Multi-class classification refers to classifying items into three or more classes. Multi-class classification strategies include one-versus-all (OvA) and one-versus-one (OvO). The OvA strategy creates a binary classifier for each class and selects the class that produces the highest score. By contrast, the OvO strategy creates binary classifiers for all possible pairs of classes and selects the class that wins the most pairwise comparisons.
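The difference between the two strategies can be made concrete with a small sketch. The scores and the `duel` function below are hypothetical stand-ins for trained binary classifiers; only the decision rules are the point.

```python
from collections import Counter
from itertools import combinations


def predict_ova(scores):
    """scores: {class: score of that class's 'class vs. rest' classifier}.
    OvA picks the class whose classifier gives the highest score."""
    return max(scores, key=scores.get)


def predict_ovo(classes, duel):
    """duel(a, b) returns the winner of the binary classifier for the pair (a, b).
    OvO picks the class that wins the most pairwise comparisons."""
    votes = Counter(duel(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Hypothetical classifier outputs for one post.
ova_scores = {"accident": 0.2, "weather": 0.9, "congestion": 0.4}
strength = {"accident": 1, "weather": 3, "congestion": 2}
toy_duel = lambda a, b: a if strength[a] > strength[b] else b

ova_label = predict_ova(ova_scores)
ovo_label = predict_ovo(list(strength), toy_duel)
# With six event types, OvO needs one classifier per pair of classes.
n_pairwise = len(list(combinations(range(6), 2)))
```

For k classes, OvA trains k classifiers while OvO trains k(k-1)/2, so six event types require 6 and 15 binary classifiers, respectively.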

3.4. Region Extraction from Event Occurrence

The importance of traffic events can vary according to the location of the user. Therefore, location detection is important. Extracting the region and place where the event occurred is necessary for location detection. In general, social data provide the location information of users. However, this information is limited to only 2% of all users. For this reason, selecting keywords that allow the region to be known from the text data is necessary. These keywords can then be used to identify administrative districts.

This study used an entity name recognition API based on the BERT model to extract keywords indicating regions from the text. The BERT model, a natural language processing model developed by Google, is unlike conventional models in that it can understand the context by learning sentences in both the left and right directions [27]. An entity name recognition API recognizes certain entity names in text and provides semantic information for words representing entities. Table 2 shows the keywords representing regions among the entity name tags [28]. These tags are used to extract region-related keywords from the data. We used an entity name recognition API to recognize and assign entity names to each word in the text data. Then, the words assigned to location-related entity names are extracted. Before the keywords are extracted, location-related entity names are defined, and entity names are assigned to each word. Then, words that include matching entity names are extracted.
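A minimal sketch of the keyword selection step is shown below. The `tag_word` argument stands in for a call to an entity name recognition service and is hypothetical, as are the example tag strings; the filter follows the description in the text by accepting tags under LOCATION (LC) and ORGANIZATION (OG) plus the two ARTIFACTS subclasses for roads and buildings.

```python
# Tag families treated as region-related (prefix match), plus two exact tags.
REGION_PREFIXES = ("LC", "OG")
REGION_TAGS = {"AF_ROAD", "AF_BUILDING"}


def is_region_tag(tag):
    """True if the entity tag denotes a location-like entity."""
    return tag.startswith(REGION_PREFIXES) or tag in REGION_TAGS


def extract_region_keywords(words, tag_word):
    """Keep the words whose entity tag is region-related, in original order.

    tag_word(word) is a placeholder for the recognition API and returns
    a tag string, or "" when the word is not a named entity.
    """
    return [word for word in words if is_region_tag(tag_word(word))]


# Illustrative tagger backed by a lookup table instead of a real API.
fake_tags = {"Cheongju": "LCP_CITY", "Gyeongbu Expressway": "AF_ROAD"}
keywords = extract_region_keywords(
    ["Cheongju", "Gyeongbu Expressway", "accident"],
    lambda word: fake_tags.get(word, ""),
)
```

Preserving word order matters here because the subsequent stage relies on which keyword appears last in the post.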

Following the Keyword Extraction stage is the Return Administrative District stage, which returns the administrative districts. If there are several keywords, the administrative district is returned based on the last keyword because Korean addresses are arranged with the most specific location listed last. Korean administrative districts consist of one special city, six metropolitan cities, eight provinces, one autonomous province, and one autonomous city, totaling 17 administrative districts at the regional local government level. The administrative subdistricts of regional local governments are called basic local governments, and they consist of cities and areas referred to by the Korean terms “gun” and “gu.” Geocoding is used to extract administrative districts based on the previously extracted keywords. Geocoding refers to converting addresses or unique place names into locations on the Earth’s surface; in other words, it generates geographic coordinate information from textual location information. The input address or location information is analyzed and mapped to a geographic information database. A geocoding API takes addresses or location information as text strings and converts them into geographic coordinates, so the user can obtain geographic information through simple API calls. Typical geocoding services include Google Maps API and Naver Maps API. Since the data provided by each API differ, a single API may fail to recognize some administrative districts when extracting them from location keywords; accuracy is therefore increased by using several APIs rather than a single API. Additionally, because the last keyword is the keyword closest to the event, geocoding is performed based on the last of the extracted keywords.
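The combination of "last keyword first" and "several geocoding APIs" might be sketched as below. The geocoder callables are hypothetical wrappers around real services, and falling back to earlier keywords when the last one cannot be resolved is an assumption of this sketch, not something the paper states.

```python
def resolve_district(keywords, geocoders):
    """Return the first district found, preferring the last keyword.

    keywords:  region keywords in the order they appear in the post.
    geocoders: callables mapping a keyword to a tuple
               (regional government, basic government, subdistrict),
               or None when the keyword is not recognized.
    """
    for key in reversed(keywords):
        for geocode in geocoders:
            district = geocode(key)
            if district is not None:
                return district
    return None


# Lookup tables standing in for two different geocoding services.
service_a = {"Cheongju": ("Chungbuk", "Cheongju-si", None)}
service_b = {"Gaesin-dong": ("Chungbuk", "Cheongju-si", "Gaesin-dong")}
geocoders = [service_a.get, service_b.get]

district = resolve_district(["Cheongju", "Gaesin-dong"], geocoders)
```

In this example the first service does not know the last keyword, so the second service supplies the more specific result.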

Algorithm 1 shows the event occurrence region extraction algorithm. The subclassification items of LOCATION(LC) and ORGANIZATION(OG), tags representing regions among the entity name tag set, and AF_ROAD and AF_BUILDING, subclassifications of ARTIFACTS(AF), are defined as the new array all_Local. To perform keyword extraction in each post, an entity name recognition API obtains entity name tags for each word. If the input entity name tag exists in all_Local, it is considered a keyword representing a region name and is stored in the keyword array. Administrative district extraction is performed based on the selected keywords. A geocoding API is used to convert the administrative districts, and the converted administrative districts are divided and stored as metropolitan local governments, basic local governments, and subdistricts.

Algorithm 1: Event Region Extraction.

Input: post_DF, [LC], [OG], [AF]
Output: Administrative_division_DF

all_Local = LC + OG + AF
ForEach post in post_DF[‘Content’] do

    //Keyword extraction: keep words whose entity tag is region-related
    keyword = [ ]
    For word in post do
        entity = ETRI_API(word)
        If entity in all_Local then
            keyword.append(word)
        end
    end

    //Region extraction: geocode each keyword; the last keyword determines the address
    address = “”
    For key in keyword do
        address = geocoding_API(key)

4. Performance Evaluation

We verified the superiority of the proposed regional traffic event detection technique by comparing its performance to conventional techniques. We used the sklearn and tensorflow libraries for machine learning and an entity name recognition API provided by ETRI for entity name recognition [28]. The geocoding APIs provided by Google and Kakao were used for region extraction [29,30].

Data provided by agencies such as TBN Korea Traffic Broadcast, Twitter, Korea Expressway Corporation, etc., were used for this evaluation [31]. TBN Korea Traffic Broadcast is one of South Korea’s terrestrial broadcasting stations and airs traffic information programs 24 h a day. Table 3 shows the characteristics of the collected data. For the performance evaluations, we used data from 5 March 2021 to 30 September 2021, excluding data for July, as the training dataset, whereas data from 1 July 2021 to 31 July 2021 were used as the test dataset. We did not collect data directly from external sources such as Twitter and broadcast stations; instead, we used the TBN Korea Traffic Broadcast data, which already include data from these sources, and collected and managed them through web crawling. The TBN dataset consists of data from external agencies such as Twitter and broadcasting stations, data from public agencies such as Korea Expressway Corporation and Central Traffic Information Center, and data from direct data providers such as avid viewers and citizens. Figure 5 shows the distribution of the TBN dataset. There are 2588 records from external agencies such as Twitter and broadcasting stations, 22,837 records from public agencies such as Korea Expressway Corporation and Central Traffic Information Center, and 79,182 records from direct data providers such as avid viewers and citizens.
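The date-based split described above can be expressed as a short sketch. The record layout (a dict with a `report_date` field) is an assumption made for illustration; the date boundaries and the per-source counts are taken from the text.

```python
from datetime import date

TRAIN_START, TRAIN_END = date(2021, 3, 5), date(2021, 9, 30)
TEST_START, TEST_END = date(2021, 7, 1), date(2021, 7, 31)

# Record counts per source group as reported for the TBN dataset.
SOURCE_COUNTS = {"external agencies": 2588,
                 "public agencies": 22837,
                 "direct providers": 79182}


def split_by_date(records):
    """July 2021 goes to the test set; the rest of the period is training data."""
    train, test = [], []
    for rec in records:
        day = rec["report_date"]
        if TEST_START <= day <= TEST_END:
            test.append(rec)
        elif TRAIN_START <= day <= TRAIN_END:
            train.append(rec)
    return train, test


sample = [{"report_date": date(2021, 3, 5)},
          {"report_date": date(2021, 7, 15)},
          {"report_date": date(2021, 9, 30)},
          {"report_date": date(2021, 10, 1)}]
train, test = split_by_date(sample)
total_records = sum(SOURCE_COUNTS.values())
```

Holding out a full calendar month keeps the test posts temporally separate from the training posts. The three source groups together account for 104,607 records.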

The collected traffic data include the region, ID, report date, and content. The regions are divided into all regions: Busan, Gwangju, Daegu, Daejeon, Gyeongin, Gangwon, Jeonbuk, Ulsan, Gyeongnam, Gyeongbuk, Jeju, and Chungbuk. The IDs include the reporter’s name or social media name. In the case of data reported by a public organization, the public organization’s name is the ID. The report date is the date and time at which the data were reported, given as the year, month, day, hour, and minute. The content is text describing traffic conditions such as accidents and construction. Table 4 shows an example of collected data. Since the original data were in Korean, they were translated into English for easier understanding.

The performance evaluation of the proposed regional traffic event detection technique consisted of an event type classification and event occurrence region extraction. The event classification performance evaluation compared the models’ performance and determined the most suitable model. To demonstrate the importance of binary classifiers, we verified the proposed technique by comparing the results of using binary classifiers and multi-class classifiers to the results of using only multi-class classifiers without binary classifiers. To evaluate the accuracy of the event occurrence region extraction, we extracted 100 random regions five times and calculated the accuracy based on the five repetitions. In this study, accuracy was evaluated based on the receiver operating characteristic (ROC) curve, area under the curve (AUC), precision, recall, and F-measure. In the proposed technique, the data were classified via binary classification into relevant data (those that influence traffic conditions) and irrelevant data (those that did not influence traffic conditions). The influential data were classified via multi-class classification according to six event types: construction, weather, accident, congestion, special events, and others. We determined which models were suitable for binary and multi-class classification using the classified data.
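For reference, the precision, recall, and F-measure used in this evaluation follow the standard definitions, and the region extraction accuracy is an average over repeated random samples. The sketch below shows both; the counts passed in are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_accuracy(correct_counts, sample_size=100):
    """Average accuracy over repeated samples of `sample_size` items each."""
    return sum(c / sample_size for c in correct_counts) / len(correct_counts)


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=15)
avg = mean_accuracy([90, 92, 88, 91, 89])
```

With these illustrative counts, precision is 0.9, recall is 90/105, and the F-measure equals 180/205, which lies between the two.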

The machine learning models used in the binary classifier are naïve Bayes, random forest, SVC, linear SVC, and logistic regression. The parameters for all of the models are set to default. The machine learning models used in the multi-class classifier are naïve Bayes, random forest, SVC, linear SVC, BiLSTM, and TextCNN. The parameters for naïve Bayes, random forest, and BiLSTM are set to default values. TextCNN constructs a Convolutional Neural Network (CNN) based on the model proposed in [14]. We applied dropout to the proposed technique to improve generalization and avoid model overfitting. We set the parameters of the linear SVC model to C = 1.0, penalty = “L2”, multi_class = “ovr”, and the parameters of the SVC model to C = 1.0, kernel = “rbf”, decision_function_shape = “ovr”. In the linear SVC and SVC models, we varied the value of C from 0.1 to 1000, but there was no significant change. Therefore, we performed the performance evaluation with C set to 1.0.

Figure 6 shows the ROC curves and AUC of the binary classifiers. Figure 6a shows the ROC curves of the naïve Bayes, random forest, SVC, linear SVC, and logistic regression binary classifiers. The curves indicate that the five models used for the proposed technique reach true positive rates (TPR) close to one at low false positive rates (FPR). The AUC results in Figure 6b show that all five models had performance levels of ≥0.9. SVC, linear SVC, and logistic regression had the best performance levels with values of ≥0.95.

Figure 7 shows the precision, recall, and F-measures of the binary classifiers. In terms of precision, random forest and linear SVC exhibited the highest values at 0.9, followed by the SVC and logistic regression models with values of 0.88 and 0.87, respectively. The naïve Bayes model exhibited the lowest precision among the examined models at 16%. In terms of recall, the random forest and linear SVC models also exhibited the highest values at 0.86, followed by SVC at 0.81 and logistic regression at 0.80. The naïve Bayes model had the worst performance among the tested models with a recall of 0.53. In terms of the F-measure, which is the harmonic mean of precision and recall, linear SVC had the highest value at 0.86, corresponding to a 43% performance increase over naïve Bayes. Based on the results of the ROC curves, AUC, precision, recall, and F-measures, we selected the linear SVC model as the binary classifier for traffic event detection.

Figure 8 shows ROC curves and AUC for the six multi-class classifiers. Figure 8a shows the ROC curves, in which the curve for the naïve Bayes model was the furthest from 1, whereas those of the linear SVC and SVC models were the closest to 1. Figure 8b shows the AUC values of all models, which exhibited values of 0.9 or greater. As such, their performance in this aspect was satisfactory. Of the evaluated models, the linear SVC model and SVC model performed best, with AUC values of 0.98.

Figure 9 shows precision, recall, and F-measures for each class. For the Construction class, linear SVC and SVC performed best, with precision, recall, and F-measure values of 0.92, 0.93, and 0.92, respectively. For the Weather class, the naïve Bayes model showed the best values, at 0.8, 0.82, and 0.86, respectively, whereas random forest showed the second-best values at 0.68, 0.93, and 0.78, respectively. For the Accident class, random forest, linear SVC, and SVC showed the highest precision, recall, and F-measure values at 0.98, 0.98, and 0.98, respectively. For the Congestion class, linear SVC and SVC had the highest performance, with values of 0.8, 0.98, and 0.88, respectively. For the Crowded Event class, linear SVC had the highest values, at 0.83, 0.68, and 0.75, respectively, whereas SVC had the second-highest values, at 0.7, 0.68, and 0.73, respectively. Finally, for the Others class, linear SVC had the highest values at 0.94, 0.6, and 0.73, respectively, followed by SVC at 0.94, 0.58, and 0.72, respectively.

To compare the overall classification performance of the six models, Figure 10 shows the average values of their F-measures. Linear SVC had the best performance, with an average value of 0.83, corresponding to a performance difference of approximately 19% compared to that of the naïve Bayes model. Our performance evaluations showed that the Linear Support Vector Machine (LSVM) achieved much better performance than Convolutional Neural Networks and naïve Bayes. These results suggest that models that perform predictions focusing on the correlations between specific words and labels, rather than the inter-relationships between words within sentences, are more consistent with our data characteristics. In other words, our crowdsourcing data are characterized by specific words rather than by relationships between words. Therefore, SVM models perform better on our crowdsourcing data. As a result of the performance evaluations, linear SVC was chosen as the most suitable model for use as both the binary classifier and the multi-class classifier for event classification.
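The per-class and averaged F-measures can be sketched as follows. This assumes that the average in Figure 10 is an unweighted (macro) mean of per-class F-measures, which the text does not state explicitly. The function names and toy labels are hypothetical.

```python
# Per-class F-measure (one class against all others) and its unweighted mean.
from collections import Counter
from typing import Dict, List


def per_class_f(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Return the F-measure of every class that appears in either label list."""
    tp: Counter = Counter()
    fp: Counter = Counter()
    fn: Counter = Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    result = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        result[label] = 2.0 * precision * recall / total if total else 0.0
    return result


def macro_f(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of the per-class F-measures (assumed averaging scheme)."""
    scores = per_class_f(y_true, y_pred)
    return sum(scores.values()) / len(scores)


toy_true = ["Accident", "Accident", "Congestion", "Congestion", "Weather", "Weather"]
toy_pred = ["Accident", "Congestion", "Congestion", "Congestion", "Weather", "Accident"]
toy_scores = per_class_f(toy_true, toy_pred)
toy_macro = macro_f(toy_true, toy_pred)
print(toy_scores, toy_macro)
```

An unweighted mean gives rare classes such as Crowded Event the same influence as frequent ones, so a single weak class lowers the average noticeably.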

As mentioned earlier, the proposed technique uses binary classification to determine data influencing traffic conditions. It then uses multi-class classification to classify the selected data according to the event type. This section demonstrates the importance of removing irrelevant data by comparing and evaluating the results of not using the binary classifier. During multi-class classification without binary classification, irrelevant data are classified as “others”. In this performance evaluation, cases that perform multi-class classification after binary classification are labeled AB, whereas cases that perform multi-class classification without binary classification are labeled WB.
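The two evaluation settings can be sketched as two small pipelines. The keyword-based classifiers below are hypothetical stand-ins for the trained models and exist only to make the control flow runnable; they are not the classifiers evaluated in this paper.

```python
# AB: binary filter first, then multi-class classification of the kept data.
# WB: multi-class classification only; irrelevant data end up in "Others".
from typing import Callable, List, Optional

KEYWORD_TO_EVENT = (
    ("accident", "Accident"),
    ("construction", "Construction"),
    ("congestion", "Congestion"),
)


def toy_is_traffic_related(text: str) -> bool:
    """Hypothetical binary classifier: relevant if any known keyword occurs."""
    return any(keyword in text for keyword, _ in KEYWORD_TO_EVENT)


def toy_event_type(text: str) -> str:
    """Hypothetical multi-class classifier with an "Others" fallback."""
    for keyword, label in KEYWORD_TO_EVENT:
        if keyword in text:
            return label
    return "Others"


def classify_ab(texts: List[str],
                is_relevant: Callable[[str], bool],
                event_type: Callable[[str], str]) -> List[Optional[str]]:
    """AB setting: data rejected by the binary classifier get no event type."""
    return [event_type(t) if is_relevant(t) else None for t in texts]


def classify_wb(texts: List[str],
                event_type: Callable[[str], str]) -> List[str]:
    """WB setting: every data point is passed to the multi-class classifier."""
    return [event_type(t) for t in texts]


reports = [
    "collision accident in lane 2",
    "lane 1 under construction",
    "drive-through civil complaint service",
]
ab_result = classify_ab(reports, toy_is_traffic_related, toy_event_type)
wb_result = classify_wb(reports, toy_event_type)
print(ab_result)
print(wb_result)
```

The difference between the settings is visible in the last report: AB drops it before event classification, whereas WB has to absorb it in the “Others” class.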

Figure 11 shows the AUC values for AB and WB. The naïve Bayes, random forest, and SVC models showed no difference between AB and WB, with AUC values of 0.95, 0.97, and 0.98, respectively. In the linear SVC model, the AB result was 0.98 and the WB result was 0.97, indicating that the AB value was higher by 0.01. In both the BiLSTM and TextCNN models, the AB result was 0.96 and the WB result was 0.94, indicating that the AB value was higher by 0.02.

Figure 12 shows the precision, recall, and F-measures of the AB and WB models. The precision results show that AB improved the performance of all evaluated models. For the linear SVC model, which we selected as the most suitable classifier in earlier parts of the study, the value for WB was 0.79, whereas the value for AB was 0.84, which was 0.05 (5 percentage points) higher. In terms of recall, WB generally showed higher performance. However, in the case of BiLSTM, AB’s performance was 2% higher, whereas TextCNN exhibited the same results for WB and AB. Finally, concerning the F-measure (the harmonic mean of precision and recall), random forest had a value of 0.73 for WB and 0.70 for AB, indicating WB’s superior performance for this model. For the remaining models, AB performed at least as well as WB. In the case of the linear SVC model, the value for WB was 0.81, whereas that for AB was 0.83, confirming AB’s superior performance.

Figure 13 shows the F-measures of the six models in the AB and WB cases compared to the F-measures without the “others” class. In the naïve Bayes model, the F-measure values were the same for AB and WB, whereas, in the random forest model, WB had better results than AB. For the remaining four models, however, the F-measure values for AB were higher than those for WB. In the case of the linear SVC model, which was selected as the most suitable classifier based on the earlier performance evaluation results, the F-measure value for AB was 2% higher than for WB. In the case of WB, data that did not influence traffic conditions were classified as “others”. When we measured the F-measure values without the “others” class, the values for AB increased in all six models. For WB, by contrast, the values increased only in the naïve Bayes and BiLSTM models and remained the same or decreased in the other models. Thus, we confirmed that accuracy is improved by using a binary classifier to determine data relevant to traffic conditions.

Our evaluations verified the performance of the proposed technique, which selects keywords from text data that infer regions and converts them to administrative district names. We performed the analysis based on linear SVC results, which exhibited the best performance after event classification. The total number of data points was 8100. Table 5 shows extraction accuracy percentages based on a random selection of 100 data points from the overall set of 8100 data points. We repeated this process five times and calculated the average accuracy values.
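The sampling procedure described above can be sketched as follows. The sketch assumes that every data point carries a flag stating whether its extracted district was judged correct; how that judgment is made is outside the sketch. The function name, the seed, and the synthetic flags are illustrative and are not the paper's data.

```python
# Repeated random-sample accuracy estimate: draw 100 points, repeat five times.
import random
from statistics import mean
from typing import List, Tuple


def sampled_accuracy(is_correct: List[bool], sample_size: int = 100,
                     repeats: int = 5, seed: int = 0) -> Tuple[List[float], float]:
    """Return the accuracy (%) of each random sample and their average."""
    rng = random.Random(seed)  # fixed seed so that the draw is reproducible
    per_round = []
    for _ in range(repeats):
        sample = rng.sample(is_correct, sample_size)  # without replacement
        per_round.append(100.0 * sum(sample) / sample_size)
    return per_round, mean(per_round)


# Synthetic correctness flags for 8100 data points (purely illustrative).
flag_rng = random.Random(42)
synthetic_flags = [flag_rng.random() < 0.8 for _ in range(8100)]
round_accuracies, average_accuracy = sampled_accuracy(synthetic_flags)
print(round_accuracies, average_accuracy)
```

With only 100 points per draw, individual rounds can differ by several percentage points, which is why averaging over repeated draws gives a steadier estimate.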

The average accuracy for the five extraction results was 80.4% at the non-autonomous region level (Table 5). Two factors reduced administrative district extraction accuracy. Firstly, it was difficult to extract administrative districts if a precise location name such as “National Highway 46” was not mentioned. Secondly, when sections of a single large road were described based on bridges, as in “construction on lane 4 of Seoul’s Gangbyeonbuk Expressway in the Guri direction from the Hannam Bridge to the Dongho Bridge”, the administrative district switched to the bridge location, which reduced accuracy for non-autonomous regions.

5. Conclusions

In this paper, we proposed a social-media-based machine-learning technique to detect regional traffic events. The proposed technique performs a data refinement process to increase the accuracy of machine learning. We trained various machine-learning models and selected the model with the best performance. We used a binary classifier to extract data relevant to traffic conditions and a multi-class classifier to classify relevant data into six event types that influence traffic conditions: accidents, construction, congestion, weather, special events, and others. To determine where the events in the classified data occurred, we used an entity name recognition API to extract region-identifying keywords from the text. We then used a geocoding API to convert the extracted keywords to administrative districts. Our performance evaluation showed that the linear SVC model had an AUC value of 0.96 and an F-measure of 0.87 in binary classification. Additionally, the linear SVC model had an AUC value of 0.98 and an F-measure of 0.83 in multi-class classification. Thus, the linear SVC was judged to be the most suitable model for event classification. Overall, the region extraction accuracy was 80.4%. Revising data with an interpolation technique based on entity name recognition and administrative region conversion values may improve accuracy. The proposed technique provides a fast and accurate traffic service. In future work, we will group events using the classified event types and administrative regions to calculate similarities between related events. We will also develop an entity name recognizer that can accurately detect regions. Furthermore, we plan to incorporate data from various social media platforms such as Facebook and Instagram.

Author Contributions

Conceptualization, Y.K., S.S., H.L., D.C., J.L., K.B. and J.Y.; methodology, Y.K., S.S., H.L., D.C., J.L., K.B. and J.Y.; validation, Y.K., S.S., H.L., D.C., J.L. and K.B.; formal analysis, Y.K., S.S., H.L., D.C., J.L. and K.B.; writing—original draft preparation, Y.K., S.S., H.L. and K.B.; writing—review and editing, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2023-00245650), AURI (Korea Association of University, Research Institute and Industry) grant funded by the Korean Government (MSS: Ministry of SMEs and Startups) (No. S3047889, HRD program for 2021), MSIT (Ministry of Science and ICT) under the Grand Information Technology Research Center support program (IITP-2023-2020-0-01462) supervised by the IITP (Institute for Information and Communications Technology Planning and Evaluation), and by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2022R1A2B5B02002456).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Liu, L.; Racz, D.; Vaillancourt, K.; Michelman, J.; Barnes, M.; Mellem, S.; Eastham, P.; Green, B.; Armstrong, C.; Bal, R.; et al. Smartphone-Based Hard-Braking Event Detection at Scale for Road Safety Services. Transp. Res. Part C Emerg. Technol. 2023, 146, 103949.
Essien, A.; Petrounias, I.; Sampaio, P.; Sampaio, S. A Deep-Learning Model for Urban Traffic Flow Prediction with Traffic Events Mined from Twitter. World Wide Web 2021, 24, 1345–1368.
Cai, Q. Cause Analysis of Traffic Accidents on Urban Roads Based on an Improved Association Rule Mining Algorithm. IEEE Access 2020, 8, 75607–75615.
D’Andrea, E.; Ducange, P.; Lazzerini, B.; Marcelloni, F. Real-Time Detection of Traffic from Twitter Stream Analysis. IEEE Trans. Intell. Transp. Syst. 2015, 16, 2269–2283.
Jang, B.; Yoon, J. Characteristics Analysis of Data from News and Social Network Services. IEEE Access 2018, 6, 18061–18073.
Subroto, A.; Apriyana, A. Cyber Risk Prediction through Social Media Big Data Analytics and Statistical Machine Learning. J. Big Data 2019, 6, 50.
Jeong, H.-Y. Design and Implementation of Mobile Crowdsourcing-Based Driver Assistance Systems (MC-DAS). J. IKEEE 2018, 22, 29–37.
Vij, D.; Aggarwal, N. Smartphone Based Traffic State Detection Using Acoustic Analysis and Crowdsourcing. Appl. Acoust. 2018, 138, 80–91.
Klaithin, S.; Haruechaiyasak, C. Traffic Information Extraction and Classification from Thai Twitter. In Proceedings of the 13th International Joint Conference on Computer Science and Software Engineering—JCSSE, Khon Kaen, Thailand, 13–15 July 2016.
Alomari, E.; Mehmood, R.; Katib, I. Sentiment Analysis of Arabic Tweets for Road Traffic Congestion and Event Detection. In EAI/Springer Innovations in Communication and Computing; Springer: Berlin/Heidelberg, Germany, 2020; pp. 37–54.
Choi, M.; Shin, S.; Choi, J.; Langevin, S.; Bethune, C.; Horne, P.; Kronenfeld, N.; Kannan, R.; Drake, B.; Park, H.; et al. TopicOnTiles: Tile-Based Spatio-Temporal Event Analytics via Exclusive Topic Modeling on Social Media. In Proceedings of the Conference on Human Factors in Computing Systems-Proceedings 2018, Montreal, QC, Canada, 21–26 April 2018.
Zheng, Z.; Wang, C.; Wang, P.; Xiong, Y.; Zhang, F.; Lv, Y. Framework for Fusing Traffic Information from Social and Physical Transportation Data. PLoS ONE 2018, 13, e0201531.
Nallaperuma, D.; Nawaratne, R.; Bandaragoda, T.; Adikari, A.; Nguyen, S.; Kempitiya, T.; De Silva, D.; Alahakoon, D.; Pothuhera, D. Online Incremental Machine Learning Platform for Big Data-Driven Smart Traffic Management. IEEE Trans. Intell. Transp. Syst. 2019, 20, 4679–4690.
Huang, L.; Liu, G.; Chen, T.; Yuan, H.; Shi, P.; Miao, Y. Similarity-Based Emergency Event Detection in Social Media. J. Saf. Sci. Resil. 2021, 2, 11–19.
Agarwal, S.; Mittal, N.; Sureka, A. Potholes and Bad Road Conditions-Mining Twitter to Extract Information on Killer Roads. In ACM International Conference Proceeding Series; ACM: New York, NY, USA, 2018; pp. 67–77.
Neruda, G.A.; Winarko, E. Traffic Event Detection from Twitter Using a Combination of CNN and BERT. In Proceedings of the International Conference on Advanced Computer Science and Information Systems, ICACSIS 2021, Depok, Indonesia, 23–25 October 2021.
Alomari, E.; Katib, I.; Albeshri, A.; Yigitcanlar, T.; Mehmood, R. Iktishaf+: A Big Data Tool with Automatic Labeling for Road Traffic Social Sensing and Event Detection Using Distributed Machine Learning. Sensors 2021, 21, 2993.
National Transport Information Center. Available online: https://www.its.go.kr/opendata/opendataList?service=event#moveData (accessed on 1 March 2021).
Aizawa, A. An Information-Theoretic Perspective of Tf–Idf Measures. Inf. Process. Manag. 2003, 39, 45–65.
Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 8 August 2001.
Pal, M. Random Forest Classifier for Remote Sensing Classification. Int. J. Remote Sens. 2007, 26, 217–222.
Tax, D.M.J.; Duin, R.P.W. Support Vector Data Description. Mach. Learn. 2004, 54, 45–66.
Ho, C.-H.; Lin, C.-J. Large-Scale Linear Support Vector Regression. J. Mach. Learn. Res. 2012, 13, 3323–3348.
Sperandei, S. Understanding Logistic Regression Analysis. Biochem. Med. 2014, 24, 12–18.
Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv 2014, arXiv:1408.5882.
Ali, F.; Ali, A.; Imran, M.; Naqvi, R.A.; Siddiqi, M.H.; Kwak, K.S. Traffic Accident Detection and Condition Analysis Based on Social Networking Data. Accid. Anal. Prev. 2021, 151, 105973.
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL HLT Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies-Proceedings of the Conference, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186.
AI API/DATA. Available online: https://aiopen.etri.re.kr/ (accessed on 1 August 2020).
Google Maps Platform. Available online: https://developers.google.com/maps/documentation/geocoding/overview (accessed on 1 May 2021).
Kakao Developers. Available online: https://developers.kakao.com/ (accessed on 1 May 2021).
TBN. Available online: http://www.tbn.or.kr/ (accessed on 1 March 2021).

Figure 1. Regional traffic event detection technique: overall structure and specific modules.

Figure 2. Event classification flowchart.

Figure 3. Event selection example.

Figure 4. Example of Categorize Selected Event Types stage.

Figure 5. Distribution of the TBN data set.

Figure 6. Binary classifier ROC curves and AUC: (a) ROC; (b) AUC.

Figure 7. Binary classifier precision, recall, and F-measure.

Figure 8. Multi-class classifier ROC curves and AUC: (a) ROC; (b) AUC.

Figure 9. Multi-class classifier precision, recall, and F-measure by class: (a) Construction; (b) Weather; (c) Accident; (d) Traffic Jam; (e) Crowded Event; (f) Others.

Figure 10. Multi-class classifier F-measure averages.

Figure 11. AUC with and without binary classifiers.

Figure 12. Precision, recall, and F-measures with and without binary classifiers.

Figure 13. F-measures with and without the “others” class.

Table 1. Event types and their meanings.

Event Type	Meaning
Traffic accident	Data representing accidents that occur when vehicles collide, from accident occurrence until settlement. Collision accidents, single-car accidents, etc.
Construction	All construction occurring on a road. Road construction, roadside tree work, etc.
Crowded Events	Data representing special events where many people gather. Assemblies, festivals, etc.
Weather	Data representing inclement weather. Fog, rain, wind, etc.
Congestion	Data representing congestion on certain road sections.
Others	Data that cannot be depicted as accidents, construction, special events, weather, or congestion. Fallen objects, animal carcasses, vehicle malfunctions, etc.

Table 2. Entity name tag definitions.

Classification	Subclassification	Definition
LOCATION (LC)	LC_OTHERS	Other places that are not specific LC-series types
	LCP_COUNTRY	Country name
	LCP_PROVINCE	Name of region such as province or state
	LCP_COUNTY	Name of Korean administrative subdistrict, e.g., Gun, Myeon, Eup, Ri, Dong
	LCP_CITY	City name
	LCP_CAPITALCITY	Capital city name
	LCG_RIVER	River, lake, pond
	LCG_OCEAN	Ocean, sea
	LCG_BAY	Peninsula, bay
	LCG_MOUNTAIN	Mountain, mountain range, ridge, pass/hill, peak
	LCG_ISLAND	Island, archipelago
	LCG_CONTINENT	Continent
	LC_TOUR	Tourist attractions
	LC_SPACE	Celestial body name
ORGANIZATION (OG)	OG_OTHERS	Other organizations/associations
	OGG_ECONOMY	Economic organization/association, company
	OGG_EDUCATION	Educational organization/association, education-related organization
	OGG_MILITARY	Military organization/association and type, national defense organization
	OGG_MEDIA	Media organization/association, broadcast-related organization/company
	OGG_SPORTS	Sports organization/association
	OGG_ART	Art organization/association
	OGG_MEDICINE	Medical/health organization/association
	OGG_RELIGION	Religious organization/association, including sects
	OGG_SCIENCE	Scientific organization/association
	OGG_LIBRARY	Library or library-related organization/association
	OGG_LAW	Legal organization/association
	OGG_POLITICS	Government/administrative organization, public organization, political organization
	OGG_FOOD	Food-related business/company
	OGG_HOTEL	Lodging-related business
ARTIFACTS (AF)	AF_BUILDING	Building/civil engineering structure, playground name, apartment, bridge, lighthouse, fountain
	AF_ROAD	Road/railway name

Table 3. Characteristics of the collected data.

Classification	Description	Content
Learning data set	data collection period: 1 March–30 September 2021 (excluding data for July)	96,815
Test data set	data collection period: 1–31 July 2021	11,379
TBN data set	external agency	Twitter, Broadcasting stations, Radio, etc.
	public agency	Korea Expressway Corporation, Central Traffic Information Center
	direct data provider	avid viewers, citizens, etc.

Table 4. Example of collected data.

Region	ID	Date and Time	Content
전북 (Jeollabuk-do)	김재 * (Kim Jae *)	5 March 2021 17:22	긴급 동부대로 발단리사거리 동산광장 전북여자고등학교 입구 좌회전차로 화물차 택시간의 사고 사고처리 안되고 있음 (Accident between truck and taxi in the left-hand lane of the Jeonbuk Girls’ High School entrance in Dongsan Plaza, Baldan-ri Intersection, Dongbu-daero)
경인 (Seoul-Incheon)	애청 * (avid vie *)	5 March 2021 21:26	최초 서울시 서강로 광흥창역 봉원교 2차로 추돌사고 일어났음 주의요망 (Please note that a collision occurred in the second lane of the Bongwon Bridge, Gwangheungchang Station, Seogang-ro, Seoul)
부산 (Busan)	이희 * (Lee Hee *)	5 March 2021 21:35	중앙대로 광무교 서면교차로 정체 (Jungang-daero, Gwangmu Bridge, Seomyeon Intersection congestion)
부산 (Busan)	김광 * (Kim Kwang *)	5 March 2021 21:36	중앙대로 좌천 삼거리 초량교차로 내 1차로 공사중임 (Lane 1 is under construction inside Choryang Intersection, Jwacheon three-way intersection, Jungang-daero)
경인 (Seoul-Incheon)	정보상황 * (Information Situation *)	5 March 2021 21:52	안내 인천시 교육청 코로나19 상황 인한 민원 대면 시간 최소화하기 위해 시민 대상 드라이브 스루 형식의 차 타Go 민원 Call 서비스 제공 (Guidance from the Incheon Metropolitan Office of Education: minimize face-to-face interactions due to the COVID-19 situation. A drive-through Car TaGo Civil Complaint Call service is being provided to citizens)

* is a mark to protect the informant’s personal information.

Table 5. Region extraction accuracy.

	Extraction Accuracy (%)
	Metropolitan Local Government	Basic Local Government	Non-Autonomous Region
1	92	88	80
2	92	84	78
3	95	84	80
4	98	88	85
5	95	86	79
Average	94.6	86	80.4

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
