A Glimpse at the Future Technological Trends of Road Infrastructure: Textual Information-Based Data Retrieval

Kim, Inyoung; Choi, Sungtaek; Lee, Hyejin; Park, Jeehyung; Yun, Ilsoo

doi:10.3390/infrastructures9120233


by Inyoung Kim ¹, Sungtaek Choi ²,*, Hyejin Lee ³, Jeehyung Park ⁴ and Ilsoo Yun ⁵

¹ Department of D.N.A. (Data, Network, Artificial Intelligence) Convergence, Ajou University, Suwon-si 16499, Republic of Korea

² Department of Urban Planning and Engineering, Hanyang University, Seoul-si 04763, Republic of Korea

³ Department of Civil and Environmental Engineering, Seoul National University, Seoul-si 08826, Republic of Korea

⁴ Department of the Private Investment SOC Management Support, Korea Transport Institute, Sejong-si 30147, Republic of Korea

⁵ Department of Transportation System Engineering, Ajou University, Suwon-si 16499, Republic of Korea

* Author to whom correspondence should be addressed.

Infrastructures 2024, 9(12), 233; https://doi.org/10.3390/infrastructures9120233

Submission received: 10 October 2024 / Revised: 2 December 2024 / Accepted: 11 December 2024 / Published: 13 December 2024

(This article belongs to the Section Smart Infrastructures)


Abstract:

Since the Fourth Industrial Revolution was announced in 2015, relevant key technologies have recently merged and have extensively affected our society. To provide empirical insights into the future and address expected issues in the context of transportation, this study investigates how future road infrastructure technology will shift. Going over the mainstream future road infrastructure inspired by the strategy implemented in the Korean New Deal 2.0, we extract central keywords explaining what specific technologies and political directions will prevail globally. In particular, a specific morphological analyzer, Mecab-ko, which is suitable for Korean, was selected after comparing a variety of packages. Then, a text mining approach was employed to collect textual online sources (news articles, research articles, and reports) written in Korean, whereas most studies gather information written in English. Using the term frequency-inverse document frequency (TF-IDF), 11 keywords were extracted from unstructured textual online sources. Topic modelling with latent Dirichlet allocation (LDA) was subsequently performed to classify them into four groups: an unmanned payment system, intelligent road infrastructure, connected automated driving road, and eco-friendly road. Based on these findings, we can take a glimpse into how the future road infrastructure in Korea will be reshaped. Evidently, a digitalized road without a human component is around the corner. Fully automated systems will soon become available, and the keyword sustainability will continue to receive critical attention in the transportation sector.

Keywords:

Industry 4.0; road technology; text mining; term frequency-inverse document frequency; latent Dirichlet allocation

1. Introduction

Since the Fourth Industrial Revolution, also referred to as Industry 4.0, was popularized by Klaus Schwab for the first time in 2015, associated core technologies, such as big data analysis, artificial intelligence (AI), and the Internet of things (IoT) have rapidly changed our modern society [1]. The importance of sustainability has recently attracted considerable attention in the context of global climate change and global warming. As a possible solution to embrace the upcoming future, many countries are seeking to establish strategies to respond to the dramatic transitions induced by Industry 4.0, calling for sustainable development [2].

In the transportation field, sustainable mobility and high-tech transport infrastructure have emerged in recent years [3,4]. The central keywords with growing interest include frontier technologies in transportation, such as electric vehicles, autonomous vehicles (AV), and shared cars. Information technology infrastructure based on vehicle-to-everything (V2X) and advanced communication technologies is also a critical keyword, as are other mobility services. At the UN Global Sustainable Transport Conference hosted by the United Nations Economic Commission for Europe in Beijing, China [5], a summit meeting was held to discuss the importance of sustainable transport, which is reflected in the 2030 Agenda for Sustainable Development. All members agreed that the transition to low-carbon green energy, digitalized road infrastructure, and global collaboration are essential for establishing a sustainable integrated transport system in the context of technology and the environment. In terms of sustainable road initiatives, possible strategies for improving social, environmental, and economic achievements based on cutting-edge technology were proposed at the World Economic Forum in 2021.

Several countries expect that major advanced technologies related to the Fourth Industrial Revolution will lead to dramatic changes in transportation. Accordingly, they have planned road investment strategies to proactively respond to such changes while proceeding with the Digital New Deal since the early 2000s [6]. For instance, the United States is focusing on the construction of intelligent transport systems (ITSs), including the digital revolution in road infrastructure with huge investments. Germany expects a substantial increase in demand for electric vehicles (EVs) resulting from the Green New Deal, which is a concept for the great ecological transition of the economy. Japan fully supports the invention and innovation of technology associated with disaster monitoring and facility management systems by integrating information and communication technology (ICT). China has expanded business areas to the core technology of the ITSs and manufactured business-oriented EVs.

The government of Korea recently announced the Korean New Deal 2.0 to promote the transformation of road infrastructure into a digital economy [7]. However, there is a lack of principal strategies for investment and trend forecasting research on specific technologies focusing on road infrastructure and the transport system in the context of sustainability and Industry 4.0.

In light of the background above, this study performed a text mining analysis to predict how future trends in road infrastructure technology, which is generally expected to change rapidly under high-tech digital applications inspired by the Korean New Deal, will evolve and what scientific knowledge will emerge. This approach is distinguished from previous studies, which mostly relied on traditional Delphi surveys, in that it overcomes the sample size and sampling bias issues that conventional methods usually encounter. Then, a popular term-weighting method, the term frequency-inverse document frequency (TF-IDF), was used to extract specific keywords related to future road infrastructure technology (in particular, roads built with private capital). Subsequently, we conducted topic modelling with latent Dirichlet allocation (LDA) to identify the specific technologies of road infrastructure that will receive attention and how their future trend will be shaped. We hope that our findings will help clarify the future directions of technical investment in Korea and strengthen the foundation for the provision of physical infrastructure.

2. Literature Review

2.1. Future Technology of Highway

Through a survey, Hamid and Zamzuri [8] introduced technologies that are viewed as major future technology keywords in the era of Industry 4.0, including cryptocurrencies (e.g., Bitcoin and Ethereum), shared mobility, and AV. In particular, they proposed the Internet of Vehicles (IoV), which can be applied to smart highways for AV. The key components of smart highways for safe driving of new mobility were also specified: IoT, V2X, connectivity, enhanced perception modules, collision avoidance systems, and blind spot regions. Singh and Sharma [9] proposed digitalized highways to enhance road safety and support a sustainable environment by applying intelligent IoT sensors and machine-learning approaches. Specifically, they developed a systematic framework including five specific technologies: smart highway lighting, smart traffic and emergency management, renewable energy resources on highways, smart display boards for vulnerable road user models, and AI on highways.

2.2. Text Analytics

Some scholars have conducted research on text mining. Putri [10] performed a systematic review to explore how future ITSs will change. Various natural language processing (NLP) methodologies, including named entity recognition, LDA, and word embedding, were used to extract central keywords. The results showed that the major knowledge areas in ITSs include detection systems, communication, and traffic, and that a substantial number of studies have employed mathematical, machine learning, neural network, and optimization approaches in the ITS discipline. Ali and El-Sappagh [11] utilized key features extracted from social media to gather transport information because current sources, including sensor network-based systems and mobile applications, are not sufficient to collect transportation information in the ITS field. As a key solution, fuzzy ontology-based semantic knowledge with the Word2vec model was adopted using bidirectional long short-term memory (Bi-LSTM). The results showed remarkable improvements in the extraction of key characteristics of relevant topics from social media and the classification of unstructured text. Salloum and Al-Emran [12] also applied text mining techniques to extract keywords related to mobile learning by collecting 300 references focusing on research topics published in six major scientific databases (e.g., Springer, Wiley, and Science Direct). Three specific techniques, namely word cloud/frequency analysis, association rule mining, and K-means clustering, were used to extract and visualize text information. Experimental results showed that “Learning” is the most frequent keyword across references, that “Education” is centrally placed on the tree structure to which most words are connected, and that six clusters were identified, with most keywords belonging to at least one of them. Zhang and Fleyeh [13] conducted an analysis to prepare a scientific risk control plan for workplace safety. 
Text mining and NLP techniques were used to extract keywords from the construction accident reports. The authors applied six specific methods, including support vector machine, linear regression, K-nearest neighbor, decision tree, Naive Bayes, and an ensemble model, to analyze the factors that cause accidents and classify them. The findings indicated that the most common objects that cause or are related to dangerous accidents include ladders, roofs, trucks, machines, and forklifts. Onan [14] proposed a two-stage procedure that combines word embedding and clustering techniques to extract information from the scientific literature. In particular, they proposed improved schemes for performing text collection by introducing Word2vec, POS2vec, word-position2vec, and LDA2vec schemes to improve word embedding and incorporate typical clustering methods for the clustering ensemble framework. By extensively reviewing a corpus in the agricultural engineering, economics, engineering, and computer science domains, they concluded that the newly developed approach outperformed previous approaches by comparing the performance of both methods (ensemble word embedding and framework vs. baseline approaches) in terms of predictive performance. Gupta and Agarwalla [15] sought to find optimal parameters by carrying out hyperparameter tuning of the LDA model with respect to the coherence score of given models. The best model was estimated with α = 0.01 and β = 0.909999999, which generated the coherence score of 0.408 whereas other conventional models, such as TF-IDF without hyperparameter tuning or Bag of Words, did not demonstrate a similar level of coherence. The parameter α, set to 0.01, indicates that documents are highly specific, with only a few dominant topics, making the topic distribution sparse. On the other hand, the parameter β, set to 0.909999999, implies that topics are more inclusive, encompassing a broader range of words, resulting in a less sparse word distribution. This combination reflects a balance that allows for distinct yet comprehensive topic modeling. Interestingly, the coherence score was further increased to nearly 0.483 when applying the TF-IDF corpus, highlighting the effectiveness of TF-IDF preprocessing in enhancing the model’s ability to extract meaningful and coherent topics by emphasizing significant terms and reducing noise.
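The sparsity effect of a small versus a large Dirichlet concentration parameter can be checked numerically. The following is a minimal sketch, not the procedure used in [15]: it draws samples from a symmetric Dirichlet distribution with the two reported values and compares how much weight the single largest component receives on average. The dimension of 5, the number of trials, and the helper names are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    while True:
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
        total = sum(draws)
        if total > 0.0:  # guard against underflow for very small concentrations
            return [x / total for x in draws]


def mean_largest_share(concentration, size=5, trials=2000, seed=0):
    """Average weight of the single largest component over many samples."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(concentration, size, rng)) for _ in range(trials)]
    return sum(largest) / trials


sparse = mean_largest_share(0.01)        # alpha value reported in [15]
dense = mean_largest_share(0.909999999)  # beta value reported in [15]
print(f"mean largest share: 0.01 -> {sparse:.2f}, 0.91 -> {dense:.2f}")
```

A value close to 1 means that one component dominates almost every sample (a sparse distribution), whereas a lower value means the weight is spread across several components.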

2.3. LDA Topic Modelling

There is an emerging technique called LDA-based topic modelling. This approach enables the computation of associations between words, which cannot be expressed through TF-IDF, thereby enhancing the explanatory power for topics. Specifically, the LDA process analyzes the importance of words within documents using TF-IDF and applies the words extracted from the TF-IDF results to the topic modeling. Thus, related words can be clustered together, providing meaningful interpretations that TF-IDF does not discover. In particular, it provides the importance of each topic in a document, so we can understand which words contribute the most to a given topic.

Some scientific efforts have been made by applying this technique. For instance, Roque et al. [16] aimed to identify keywords frequently appearing in road safety inspection reports published between 2012 and 2017 and to explore the relationship between safety problems in road design and management and their corresponding solutions. To achieve this, the LDA technique was employed, and the data were divided into two groups: road safety problems and responses. Topics for each group were extracted, resulting in 25 topics for each group. Topics 1, 2, 5, and 20 in the road safety problem group were highly related, while topics 11, 14, and 23 in the safety response group were strongly associated. Keywords such as “vehicles”, “walls”, “boundaries”, “edges”, “barriers”, and “risks” were common across both groups. These keywords were considered the most frequently mentioned in road safety-related issues and were interpreted as areas that should be prioritized for resolution. Sun and Yinupta [17] analyzed 17,163 papers published in 22 major transportation journals from 1990 to 2015 using the LDA model. They extracted 50 topics through LDA, demonstrating that these topics are both representative and meaningful in the field of transportation research. Changes in the distribution of topics over time were measured, revealing research trends such as sustainability, travel behavior, and non-motorized mobility. The results of this analysis suggest that researchers, journal editors, and funding organizations can identify promising research topics or projects, find suitable journals for thesis submissions, and adjust their focus for journal development. Hidayatullah et al. [18] applied the LDA technique to summarize the Indonesian government’s infrastructure development plans. They performed topic analysis by gathering papers and news articles published between 2014 and 2019. 
To determine the optimal number of topics, they calculated the coherence score, which revealed that 40 topics produced the highest coherence value. From these 40 topics, they identified key labels such as oil and gas infrastructure, power plant infrastructure, information technology and internet networks, and road infrastructure. This analysis facilitated the establishment of priorities for infrastructure development. Some scholars recently employed LDA-based topic modelling focusing on Asian languages. Liu et al. [19] conducted LDA topic modeling to reveal how the press plays a role in identifying the relationship between communication patterns related to health and COVID-19 in China. They collected media reports and news (11,220 cases in total) on COVID-19 from the WiseSearch database. A specific Python package, Jieba, was used for text preprocessing. By applying LDA topic modeling, nine specific topics were obtained, including confirmed cases, medical supplies, medical treatment and research, prevention and control procedures, Wuhan’s story, mental health, global/local social/economic influences, materials supplies and society support, and detection at public transportation. Yamamoto and Umenura [20] developed comprehensive Japanese vocabulary difficulty-level dictionaries to improve the accuracy of vocabulary level scoring. In particular, they collected text data in Japanese from Wikipedia and then applied TF-IDF and LDA topic modeling to find the word appearance probability. The result showed that using the proposed dictionaries can increase the accuracy of test scoring by 4.9% compared with manual scoring conducted by humans.

2.4. Implications

We can confirm that AI-based techniques have been extensively implemented in various research domains such as NLP, data mining, and computer vision. In particular, machine learning and deep learning, which are specific techniques of AI analysis, have been widely employed to analyze big data to identify trends and patterns.

Regarding the identification of trends, current topic modelling studies have mostly relied on machine learning-based NLP and word embedding to extract research topics from text information after cleaning the data by simply removing punctuation and lowercasing. However, there is a fundamental, practical limitation: those previous approaches are mostly specialized in English, which means that they cannot be directly applied to Asian languages such as Korean, Chinese, and Japanese. Each language has its own grammar, structure, and morphemes, meaning that a specific morphological analyzer must be used to accurately obtain grammatical information. In this regard, this study focuses on choosing the best analyzer for Korean by comparing the text preprocessing performance of various analyzers rather than using general-purpose NLP. In particular, several analyzers for Japanese which can be applied to Korean were tested since the two languages share strong similarities and come from Chinese. In most cases, a morphological analyzer for a different language is not used for analyzing Korean due to the dissimilarity between the two languages. However, this study attempts to apply Japanese-oriented analyzers to provide insights into the practical application of text tokenization and to encourage scholars to use a wider variety of analyzers in text mining.

In addition, the Korean language requires a more sophisticated data cleaning process and algorithm since Korean is an agglutinative language [21]. Specifically, there is a unique spacing unit, called eojeol, which is composed of one or more combined morphemes. Lee and Rim [22] explained that it makes text mining difficult compared with other languages. Theoretically, one verb can create over 5000 words with a combination of various eojeols, and each eojeol can have morphological ambiguity with multiple interpretations. This means that an additional preprocessing step that takes eojeols, a morphological analyzer, and syllable units into account is required to conduct a text mining analysis. In this study, we refined the text data by conducting an NLP process comprising three steps: text standardization, stop-words/punctuation elimination, and text tokenization.

3. Methodology

3.1. Text Mining

3.1.1. Overview

Text mining is a specific data-driven processing method for extracting relevant information and knowledge by applying information retrieval and extraction techniques and NLP from unstructured text documents; these approaches are then connected with advanced data-driven methods, such as data mining or machine learning [23,24]. The author proposed a process with two phases (text refining and knowledge distillation) that is commonly used as a typical methodology of text data mining. It first transforms text documents into a specific intermediate form and then deduces patterns or knowledge from such forms.

In this regard, we first collected textual data written only in the Korean language dealing with future technology of road infrastructure in Korea on private roads, and then performed preprocessing of the given text dataset. A novel method for analyzing and separating each sentence in the Korean language into morphemes was employed using several Python packages. Since Korean and Japanese grammatically share huge similarities in syntax and morphology, we tested various text tokenization tools specialized in either language (refer to Section 4.1 for details). To extract relevant key topics from the texts, we applied the TF-IDF technique. Finally, topic modelling with LDA was carried out to identify the latent semantic structure in the context of future road infrastructure technology. TF-IDF results are mostly used as an input source for LDA. That is, LDA provides a broader understanding of the main themes and allows a more detailed and nuanced analysis of the text corpus. This method defines the importance of each topic in a document to show which words contribute the most to a given topic, rather than focusing only on the importance of each word in a document based on its frequency and uniqueness.

The process of this study is illustrated in Figure 1.

3.1.2. Text Preprocessing

The first stage of text mining is text preprocessing, which is essential for applying NLP to textual databases. As mentioned above, it starts with text standardization, which supports the precise extraction of relevant keywords; without it, a simple comparison and extraction can yield multiple different keywords even though their meanings are identical. For instance, while car, cars, and vehicle seem to be three different keywords, they need to be treated as a single term in a broad sense. This task can minimize the randomness of text collection and definition, thereby allowing the collected terms to be closer to a predefined standard (a quick example of text standardization is presented in Table 1).

The next step is stop-words/punctuation elimination. It removes stop-words (such as articles) and punctuation marks, which do not contribute significantly to the meaning of the text and would otherwise reduce the accuracy of the text analysis. Generally, a customized dictionary is manually defined to include all irrelevant components.

Text tokenization is a mandatory step in NLP to estimate a good model and to better understand text information [25]. In the Korean language, morpheme analysis-based tokenization is necessary because more than two morphemes mostly constitute one word segment, whereas simple spacing can be used to determine the unit of tokenization in English [26]. This implies that morpheme analysis must be performed prior to text tokenization. The dictionary by Merriam-Webster defines a morpheme as a “distinctive collocation of phonemes having no smaller meaningful parts”. Parts of speech (POS), such as nouns, verbs, and adjectives, can be morphemes depending on the context. Since the Korean language fundamentally has a more complex sentence structure, additional tasks that identify the structure of morphemes by splitting a word phrase into morphemes and conducting POS tagging are required to perform such analyses [27]. In the present study, multiple Python packages were tested to compare their performances and choose the best solution (the performance comparison is described in detail in Section 4.1).
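The three preprocessing steps can be sketched on a short English sentence. This is an illustrative sketch only: the standardization map and stop-word list are hypothetical, and whitespace splitting stands in for tokenization, which works for English but would have to be replaced by a morphological analyzer for Korean. For simplicity, the sketch splits the text before standardizing it.

```python
import string

# Hypothetical standardization map and stop-word list, for illustration only.
STANDARD_FORMS = {"car": "vehicle", "cars": "vehicle", "vehicles": "vehicle"}
STOP_WORDS = {"a", "an", "the", "on", "of", "and"}


def preprocess(text):
    """Standardize terms, drop stop-words and punctuation, and return tokens."""
    # Remove punctuation marks and lowercase the text.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization (an English-only simplification).
    tokens = cleaned.split()
    # Map variant forms onto one standard term.
    tokens = [STANDARD_FORMS.get(token, token) for token in tokens]
    # Drop words that contribute little to the meaning of the text.
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The cars and a vehicle on the road.")
print(tokens)
```

Here car, cars, and vehicle collapse into the single term vehicle, which mirrors the standardization example given above.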

3.1.3. TF-IDF for Extracting Keywords

TF-IDF is useful for retrieving information by quantifying the importance of string representations after accounting for the importance of each word and how frequently it appears in a corpus [28].

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \tag{1}$$

where $n_{i,j}$ is the number of times the word $t_{i}$ appears in document $d_{j}$, and $\sum_{k} n_{k,j}$ is the number of all the words in document $d_{j}$.

The term frequency (TF) indicates the frequency of each term in a corpus. In other words, it counts the number of times each word appears in the given sources, as shown in Equation (1). A higher frequency value indicates the prominence of the term in the given text data [29].

$$\mathrm{DF}_{i} = \frac{|\{d_{j} \in D : t_{i} \in d_{j}\}|}{|D|}, \tag{2}$$

where $|D|$ is the total number of documents in the document set, and $|\{d_{j} \in D : t_{i} \in d_{j}\}|$ is the number of documents containing the word $t_{i}$.

IDF is the inverse document frequency. Document frequency (DF) refers to the extent to which each word is common in the corpus, and it is a measure of how much the given information (i.e., term) is common. Thus, the larger the DF is, the less important the word is. This can be formulated as shown in Equation (2).

$$\mathrm{IDF}_{i} = \log \frac{|D|}{|\{d_{j} \in D : t_{i} \in d_{j}\}|}, \tag{3}$$

IDF is the inverse value of DF, which is the log-transformed, weighted inverse fraction of the document (Equation (3)). Accordingly, IDF can be interpreted as a level of rarity [29].

$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i}, \tag{4}$$

Given these two factors, TF-IDF can be calculated as shown in Equation (4). It is based on the combination of the frequency of individual terms and the proportional presence of the rarity of the terms to identify the importance of terms given in a pooled text database [28]. The larger the TF-IDF value is, the higher the relevance score is [30].
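Equations (1)-(4) can be traced on a toy corpus. The sketch below is a plain-Python illustration rather than the implementation used in this study; it uses the log-transformed IDF described in the text, and the three short documents are hypothetical.

```python
import math


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Number of documents containing each term (numerator of Equation (2)).
    doc_count = {}
    for doc in documents:
        for term in set(doc):
            doc_count[term] = doc_count.get(term, 0) + 1
    scores = []
    for doc in documents:
        doc_scores = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)            # Equation (1)
            idf = math.log(n_docs / doc_count[term])   # log-transformed inverse of DF
            doc_scores[term] = tf * idf                # Equation (4)
        scores.append(doc_scores)
    return scores


# Hypothetical tokenized documents, for illustration only.
docs = [["road", "sensor", "road"], ["road", "toll"], ["road", "battery", "toll"]]
scores = tf_idf(docs)
for doc_scores in scores:
    print({term: round(value, 3) for term, value in doc_scores.items()})
```

Because road occurs in every document, its IDF is log(1) = 0 and its TF-IDF score vanishes, whereas sensor, which occurs in a single document, receives the highest score in the first document.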

The present study, as described at the beginning of this subsection, utilized TF-IDF to extract keywords related to road infrastructure technologies from the preprocessed text database and conducted a visual text analysis with a word cloud based on TF-IDF scores (refer to Section 4.2 for further details about the analysis).

3.1.4. LDA for Clustering Keywords

Topic modelling, also referred to as probabilistic latent semantic analysis, is an unsupervised learning method based on a latent class model [31]. It discovers specific topics that appear in a set of given texts and then identifies latent semantic structures. Several algorithms have been used in topic modelling. The present study adopted LDA, which is a “generative probabilistic model of a corpus” that follows known and fixed Dirichlet distributions. Topic modelling with LDA assumes that a corpus consists of various topics, and hidden keywords are predetermined when inferring relevant topics [32].

The overall process of document generation assumed by the LDA topic modeling is illustrated in Figure 2. In this model, α and β represent the Dirichlet distribution hyperparameters, K represents the number of topics, D represents the number of documents, and N represents the number of words in a document.

$\theta_{d}$ represents the topic distribution of document $d$, $\phi_{k}$ represents the word distribution of topic $k$, $Z_{d,n}$ is the topic to which the $n$-th word of document $d$ belongs, and $W_{d,n}$ is the actual observed word. Among these variables, the only observed variable is $W_{d,n}$, while the remaining variables are latent and must be estimated. The hyperparameters α and β are not directly estimated but are instead manually set or tuned [33]. The number of topics K can also be manually adjusted to optimize the model by comparing measures of goodness-of-fit. Once the topics are extracted, they need to be labeled to reflect the relationships and hierarchy among the associated words [34].
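The generative story that LDA assumes can be simulated directly. The following sketch samples a small synthetic corpus in the forward direction (topics to words); it does not perform inference, which is what a topic modelling library does when it estimates the latent variables from observed words. The vocabulary, the hyperparameter values, and the function names are hypothetical.

```python
import random


def dirichlet(concentration, size, rng):
    """Sample a symmetric Dirichlet vector via normalized gamma draws."""
    while True:
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
        total = sum(draws)
        if total > 0.0:  # retry in the rare case that every draw underflows
            return [x / total for x in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha, beta, seed=0):
    """Generate documents following the LDA generative process."""
    rng = random.Random(seed)
    # phi_k: word distribution of each topic, drawn from Dirichlet(beta).
    phi = [dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # theta_d: topic distribution of this document, drawn from Dirichlet(alpha).
        theta = dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # Z_{d,n}
            w = rng.choices(vocab, weights=phi[z])[0]           # W_{d,n}
            words.append(w)
        corpus.append(words)
    return corpus


# Hypothetical vocabulary and settings, for illustration only.
vocab = ["toll", "sensor", "autonomous", "battery", "road"]
corpus = generate_corpus(vocab, n_topics=2, n_docs=3, doc_len=8, alpha=0.5, beta=0.5)
for doc in corpus:
    print(doc)
```

Only the generated words correspond to observed data; the topic assignments and the two families of distributions are the latent quantities that model estimation has to recover.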

3.2. Data Collection

This study constructed a text dataset in Korean by collecting news articles, research articles, and official reports dealing with future road infrastructure in Korea via Google and Google Scholar web searches. The base date of the search was January 2016, meaning that all documents published in the past six years, ranging from 2016 to 2021, were collected. The salient keywords for web scraping included future road, road technology, and digital roadway, and the contents of top-ranked articles and reports on Google were selectively filed in the initial database. Subsequently, documents that were irrelevant or less associated with our target keywords (we reviewed the keywords and abstract of each document) were excluded from the dataset. Ultimately, 100 documents (57 news articles, 21 research articles, and 22 official reports) were retained. Some readers might question whether this sample size is sufficient to draw general implications and keywords. However, we believe that this was the best attainable sample size because relatively few sources satisfy both conditions: being written in Korean and dealing with the future trend in road infrastructure technology. When compiling the final set of documents, we did not review each document in its entirety; instead, we examined the abstract, results, and conclusion, considering that the aim of the study is to extract “future directions” of road infrastructure technology.

The final data size amounted to 293,251 bytes and was structured into a two-dimensional array (100, 7), resulting in a total of 700 data points. Columns of data include search keywords, types of documents (articles, papers, reports, etc.), title, content (introduction, result, and conclusion), category (topic), keywords related to infrastructure, and specific core keywords (purpose of infrastructure construction).
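The layout of one row of this dataset can be pictured as a record with seven fields. The sketch below is only an illustration of the (100, 7) structure; the field names are hypothetical English labels for the columns listed above, not identifiers from the actual dataset.

```python
from dataclasses import dataclass, fields


@dataclass
class DocumentRecord:
    """One row of the dataset; field names are illustrative assumptions."""
    search_keyword: str
    document_type: str            # e.g., news article, research article, report
    title: str
    content: str                  # introduction, result, and conclusion text
    category: str                 # topic
    infrastructure_keywords: str
    core_keywords: str            # purpose of infrastructure construction


N_DOCUMENTS = 100
n_columns = len(fields(DocumentRecord))
print(N_DOCUMENTS, n_columns, N_DOCUMENTS * n_columns)
```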

4. Model Estimation

Employing multiple methods for keyword extraction, visualization, topic modelling, and classification after text preprocessing of unstructured textual data sources in Korean, we obtained valuable findings and implications on future technological trends of road infrastructure as follows (Figure 3).

4.1. Data Manipulation

Text preprocessing was conducted to transform the collected unstructured information into analyzable keywords. Python 3.6 was used as the main programming language with various library toolkits, including nltk (a standard Python library providing various algorithms for NLP), KoNLPy (a Python package for NLP specialized in the Korean language), and the morpheme analyzer (Mecab-ko). For the text normalization, we manually predefined standardized text to reduce inflectional forms after translating each word into Korean. We then created a customized dictionary that includes user-defined stop-words, language-specific stop-word recognition provided by the Python library nltk, and a list of punctuation. All relevant stop words included in the dictionary and punctuation marks were eliminated from the initial text database. For example, some words less associated with road infrastructure technology, such as vehicle types, centrality, and proposal, as well as other minor irrelevant words, such as prepositional particles, affixes, and conjunctions, were eliminated.

Regarding text tokenization, which separates text into small pieces called tokens, several Python packages allow Korean morphological analysis to be performed: Hannanum, OKT, Komoran, KKma, and Mecab-ko, which is oriented toward Japanese but can be applied to Korean. After testing their performances by manually comparing them, the Mecab-ko package was selected as the best solution for text tokenization. The following paragraphs show a quick example of the comparison.

Table 2 presents an example of the primitive text data. Table 3 shows the comparison results of morphological analysis by package, where each sentence in Table 2 was separated into independent morphemes and POS tagging was applied to extract only common nouns (NNG) and proper nouns (NNP).

We confirmed that Hannanum could not split imperfect sentences containing a spacing error into the correct morphemes. Komoran and OKT seemed to outperform Hannanum, but they did not recognize proper nouns, such as autonomous vehicles and digital infrastructure. KKma’s outcome was inferior to that of the other packages (i.e., it recognized neither spacing errors nor proper nouns). In addition, all four packages above partially included unnecessary text (e.g., adverbs, prepositions, and adjectives). In contrast, Mecab-ko divided words into morphemes relatively well. This suggests that testing various morphological analyzers is worthwhile, even those that do not officially support the language of interest, when the two languages share strong similarities.

Based on the comparison results, we chose the Mecab-ko package for text segmentation. This package performs well in spacing recognition [26] and demonstrates high inference speed and stability [35]. In particular, this morphological analyzer not only provides predefined dictionaries but also allows users to create a customized user dictionary that helps recognize and tokenize proper nouns used in a specific research domain [36]. Thus, users can manually modify the dictionary by adding or discarding specific nouns, which prevents potential recognition errors. In this study, specific transportation and technical terms related to trends in the road industry (e.g., autonomous vehicles, digital infrastructure, and high-tech road systems) were manually added to the user dictionary provided by Mecab-ko.

4.2. Keyword Extraction and Visualization

To implement the TF-IDF analysis, the 100 words with the highest frequency in the corpus dealing with future road infrastructure were designated as input data. Based on these 100 keywords, we produced a word visualization, the so-called word cloud (or tag cloud), as illustrated in Figure 4 (all keywords were originally extracted in Korean and then translated into English for readability). The most used keywords include infrastructure, real time, artificial intelligence, and cooperative intelligent transport systems, which are presented with larger font sizes and noticeable colors. We can infer that the major technologies inspired by the Fourth Industrial Revolution have a significant impact on road infrastructure. In addition, several keywords, including autonomous vehicles and robots, may hint that travel modes will diversify in the near future, implying that additional immediate efforts will be required to prepare for the reshaped environment of road infrastructure.

Next, the top 20 keywords were extracted according to the TF-IDF score: infrastructure (1st keyword), artificial intelligence, real time, cooperative intelligent transport systems, automation, autonomous vehicles, environment, sensor, Internet of things, management, digital, robot, big data, high technology, prevention, structure, design, smart tolling, expectation, and intelligent transport system (20th keyword). From these, we selected 11 of the 20 keywords after discarding less relevant items, such as prevention, expectation, and smart tolling, as shown in Table 4.
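A keyword ranking of this kind can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it uses the classic weighting tf × log(N/df) and sums the score of each term over all documents, which may differ from the exact variant used in the study; the function names `tfidf_scores` and `top_keywords` are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Sum tf * idf over all documents for each term, with idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, tf in Counter(doc).items():
            scores[term] += tf * math.log(n_docs / df[term])
    return scores


def top_keywords(docs, k):
    """Return the k terms with the highest summed TF-IDF score."""
    return [term for term, _ in tfidf_scores(docs).most_common(k)]
```

Under this weighting, a term that occurs in every document receives a score of zero, so frequent but uninformative words drop out of the ranking even if they survive stop-word removal.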

4.3. Trends for the Future Technology of Road Infrastructure

4.3.1. Topic Modelling with LDA

Figure 5 depicts the initial result of LDA-based topic modelling, visualized using the Python package pyLDAvis. The initial number of topics was set to five, based on the five categories (road infrastructure, new transport, high technology, eco-friendliness, and road management) that were used when constructing the initial textual dataset. In Figure 5, the circles in the left panel represent the topics; the size of each circle is proportional to the significance of the corresponding topic in the given corpus [37], and the distance between circles can be interpreted as relatedness. The results show that the ranges of Topics 1, 3, and 4 partly overlap, indicating significant redundancy among those topics. In this case, researchers are recommended to minimize overlapping areas by manually adjusting the number of topics (see the next paragraph). As a side note, the horizontal bars in the right panel in Figure 5 denote the prevalence of each topic; that is, how useful and meaningful each topic is [37].

To determine the optimal number of topics (i.e., minimizing circle overlap), a heuristic approach to compare the estimated models based on the number of topics needs to be performed [38]. Two specific indicators, perplexity and coherence scores, are used to evaluate models (choosing one of them is a judgment call). In this study, the number of topics was determined based on the coherence score, which is a reference measure indicating the degree of semantic similarity between high-scoring words in the topic [39], p. 954. A larger score indicated that the selected topic was more logical and well classified. As shown in Figure 6, the coherence score was highest when the number of topics was four. Thus, an additional model estimation was performed to obtain the best model by fixing the number of topics to four.
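To make the idea of a coherence score concrete, the sketch below implements one common definition, the UMass measure, in plain Python. This is an assumption for illustration only: the text does not state which coherence measure was used, and the function names `umass_coherence` and `best_topic_count` are hypothetical. The sketch also assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass coherence of one topic; top_words are ordered from most to least probable."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # For every pair, condition the later (less probable) word on the earlier one.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


def best_topic_count(coherence_by_k):
    """Pick the number of topics whose model has the highest coherence."""
    return max(coherence_by_k, key=coherence_by_k.get)
```

In practice, one model would be fitted per candidate number of topics, the per-topic scores averaged for each model, and the averages passed to `best_topic_count`; scores closer to zero indicate that a topic's top words co-occur more often.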

Figure 7 shows the final result of LDA-based topic modelling with four topics. We confirmed that four circles were placed in four different dimensions without overlaps, and each circle was physically distant from the others with a large gap. It shows that (1) the previous classification has been improved; (2) the current classification of topics is distinguished from each other; and (3) particular words with similar definitions or expressions are clustered into the same latent semantic structure. The horizontal bar chart on the right side in Figure 7 demonstrates the extracted salient keywords, including autonomous vehicle, toll, infrastructure, artificial intelligence, and Internet of things.

Based on the topic modelling result and its classification, we classified keywords into four groups, as shown in Table 5 (note that each topic group only contains the top five most prevalent keywords). For example, topic group 1 includes toll, management, smart tolling, intelligent, and tollgate; green road, real time, energy, carbon, and eco-friendly are retained in topic group 4.
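Selecting the most prevalent keywords per topic amounts to sorting each topic's word weights and keeping the first few. The sketch below assumes the fitted model's output has been exported as a nested dictionary of topic to word to weight; that layout and the function name `top_terms` are hypothetical, not the study's actual data structure.

```python
def top_terms(topic_word_weights, n=5):
    """Return the n highest-weighted terms for each topic, in descending order of weight."""
    return {
        topic: [
            term
            for term, _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
        ]
        for topic, weights in topic_word_weights.items()
    }
```

Calling `top_terms(weights, n=5)` on such a dictionary would yield one five-word list per topic group, in the same form as Table 5.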

4.3.2. Labelling Topic Groups and Insights into Future Roads

Each topic group in Table 5 was labelled based on the extracted keywords. Table 6 presents the labels of each topic group, ranging from 1 to 4, followed by the share of each group. The next paragraph provides detailed descriptions and key insights into future roads.

The label for topic 1 is an unmanned payment system, also referred to as electronic toll collection. It is a novel solution for mitigating traffic congestion at toll plazas: using Radio Frequency Identification, vehicles can pass smoothly through a toll gate without interacting with a physical collector, as mentioned by Khder et al. [40]. This automated system is expected to significantly increase the efficiency of managing toll gates and reduce heavy congestion. Thus, as Chattopadhyay and Rasheed [41] pointed out, the future tolling system will probably be reshaped into open road tolling, also known as free-flow tolling, which requires neither toll booths nor slowing down. The keywords in topic group 1 (toll, management, smart tolling, intelligent, and tollgate) support this view.

Topic group 2 is labelled as intelligent road infrastructure. Keywords include infrastructure, structure, Internet of things, digital, and cooperative intelligent transport systems. This classification and its topic label imply that high-tech road infrastructure will be required to prepare for a new wave of Industry 4.0 and the upcoming future of AVs, which is consistent with Munirathinam’s [42] findings. In particular, this motivation and future movement will positively impact the paradigm shift from road construction to road operation and management, given that future road infrastructure will be beneficial to road inspection and maintenance.

Topic 3 is labelled connected automated driving road. The keywords include autonomous vehicles, artificial intelligence, real time, big data, and sensor. According to this finding and a prior study proposing strategic approaches for improving road infrastructure to respond to rapid global changes in transportation [43], providing road infrastructure that allows connected automated driving is an urgent task. In particular, it is imperative to provide an integrated management system that controls infrastructure, communication, and relevant technologies and monitors traffic conditions and road emergencies, thereby ensuring the safe travel of both AVs and regular cars.

Figure 7. Final result of topic modelling with the optimal number of topics.

Topic 4 is labelled eco-friendly road. It was determined by the extracted keywords, including green road, real time, energy, carbon, and eco-friendly. This result indicates that particular strategies for providing eco-friendly road systems are necessary as precautions against global warming, climate change, and other environmental crises. With global agreements to reduce greenhouse gas emissions, such as The Climate Pledge ‘Net Zero Carbon by 2040’, many countries and businesses continue their efforts on such issues [44], which calls for prompt actions, such as the adoption of carbon-free roads.

4.3.3. Suggestions for Improvement

The result identifies four distinct topics: unmanned payment systems, intelligent road infrastructure, connected automated driving roads, and eco-friendly roads. Based on these findings, further studies can be conducted to better understand the key components of digital roads and to identify the blueprint for digitalized road infrastructure in the near future.

Collecting a much larger body of textual sources appears to be the top priority for generalizing our findings on the direction of future roads. Because large-scale Korean-language data specific to the field of future roads are scarce, it was difficult to enhance the analytical reliability and performance of the model. This data scarcity can be mitigated through the use of state-of-the-art Large Language Model (LLM) techniques, which have gained significant attention as a robust alternative to traditional text mining methodologies. LLMs can generate additional data through prompting, increasing data diversity and improving model performance. By leveraging this technique, it is possible to address some data limitations by automatically extracting high-quality keywords from extensive datasets related to future roads. Consequently, future research should prioritize enhancing data quality and model accuracy through the augmentation of textual data using LLMs.

To verify the results of this study, an additional analysis of another language’s data is necessary. By comparing domestic and international data, we can assess whether domestic road policies align with global trends. This will help determine whether the keywords derived from this study can serve as valid indicators for predicting future trends. Specifically, comparing the major trends in domestic road policy changes with those in other countries will provide valuable insights. This analysis will not only enhance the study’s reliability but also demonstrate the potential of this method as a predictive tool for the future development of road infrastructure.

The combination of TF-IDF and Mecab-ko provided stable and valid results, but technical challenges emerged when applying the latest techniques such as Word2vec. This issue appears to be due to compatibility problems between Word2vec and Mecab-ko, highlighting the need for the following technical improvements moving forward. First, model compatibility must be ensured. To address the issues encountered with Word2vec, it is essential to explore other morpheme analyzers that are highly compatible with Word2vec in addition to Mecab-ko or seek ways to optimize Word2vec’s processing capabilities for the Korean language. Moreover, the use of the LLM technique, as previously discussed, is critical. TF-IDF has the limitation of not fully capturing context, which requires a complex preprocessing process for keyword extraction. On the other hand, LLMs are particularly effective at understanding context, helping to overcome the limitations of traditional text mining techniques. By addressing these technical challenges, more sophisticated keyword extraction and semantic analysis can be achieved, thereby enhancing the quality of road infrastructure research.

By applying these approaches to address the limitations of existing methodologies, richer and more reliable results can be achieved in understanding the future of road infrastructure.

5. Conclusions

This study explored how future road infrastructure technology (particularly private roads) in Korea will evolve. By examining the main trends in road infrastructure in Korea inspired by Digital Roads, a specific strategy implemented in the Korean New Deal 2.0, central keywords were identified. Specifically, textual sources written in Korean were first collected. We then applied the morphological analyzer Mecab-ko, which proved appropriate for text segmentation in Korean, and performed keyword extraction using TF-IDF on a dataset related to sustainability and the Fourth Industrial Revolution. The results show that 11 keywords (e.g., infrastructure, artificial intelligence, cooperative intelligent transport systems, autonomous vehicles, and big data) were identified via TF-IDF. Lastly, LDA-based topic modelling regrouped the extracted keywords into four independent topics, which are labelled as unmanned payment systems, intelligent road infrastructure, connected automated driving roads, and eco-friendly roads.

The contributions of this study are twofold. First, we developed a systematic approach for analyzing localized text and extracting keywords. While we acknowledge that our morphological analyzer (Mecab-ko) is specifically tailored to the Korean language (i.e., the study is localized), the proposed research framework is universally replicable. This approach can serve as a model for other countries where the primary language lacks support from official linguistic analysis packages. By refining and enhancing our methodology in the Korean context, we provide a foundation for future studies that can expand to localized, larger, or even global datasets. We believe this approach ensures both methodological rigor and broader relevance.

Second, we have identified specific keywords for central governments and municipalities, which can be referred to in defining future goals, strategies, and policy-oriented approaches. While the four identified topics—unmanned payment systems, intelligent road infrastructure, connected automated driving roads, and eco-friendly roads—may appear self-evident and consistent with prior findings, they provide practical value for policymakers and practitioners in shaping targeted policies. Although these keywords are globally applicable and well-established in the literature, they offer unique meaningful insights when contextualized within Korea, providing a solid theoretical foundation for decision making.

Based on these findings, we can glimpse how future road infrastructure in Korea will be reshaped, as illustrated by the topic modelling results, the specific labels, and their elements. Evidently, a digitalized road without a human component is just around the corner. Free-flow tolling services are currently available, and the development of advanced technologies focusing on digital roads is ongoing. Fully automated systems will soon become available, and the keyword sustainability will continue to receive critical attention in the transportation sector. Given that, more specific strategies for the provision of future road infrastructure and management of the entire system need to be planned. We believe that such preparation will contribute to the maximization of investment in roads in the future.

Author Contributions

Conceptualization, I.Y. and S.C.; methodology, I.K. and I.Y.; software, I.K. and H.L.; validation, S.C., J.P. and I.Y.; formal analysis, I.K., H.L. and J.P.; investigation, S.C. and I.Y.; resources, I.K. and I.Y.; data curation, I.K. and I.Y.; writing—original draft preparation, I.K. and S.C.; writing—review and editing, H.L., J.P. and I.Y.; visualization, I.K.; supervision, S.C. and I.Y.; project administration, I.Y.; funding acquisition, I.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant 22AMDP-C162184-02).

Data Availability Statement

The data used in this study will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Karabegović, I. Digital Technology as the Key Factor in the Fourth Industrial Revolution-Industry 4.0. Int. J. Adv. Res. Sci. Eng. Technol. 2017, 3, 17–22.
2. Ghobakhloo, M. Industry 4.0, digitization, and opportunities for sustainability. J. Clean. Prod. 2020, 252, 119869.
3. Donnellan, P.R. The Future of Mobility—Electric, Autonomous, and Shared Vehicles. IEEE Eng. Manag. Rev. 2019, 46, 16–18.
4. Erkollar, A.; Oberer, B. Sustainable cities need smart transportation: The Industry 4.0 Transportation Matrix. Sigma J. Eng. Nat. Sci. 2018, 9, 359–370.
5. United Nations Economic Commission for Europe (UNECE). Transport and the Sustainable Development Goals. Available online: https://unece.org/DAM/trans/conventn/UN_Transport_Agreements_and_Conventions.pdf (accessed on 7 April 2022).
6. Korea Trade-Investment Promotion Agency. What Is C-ITS for Smart Autonomous Driving? Available online: https://dream.kotra.or.kr/kotranews/cms/news/actionKotraBoardDetail.do?SITE_NO=3&MENU_ID=180&CONTENTS_NO=1&bbsGbn=243&bbsSn=243&pNttSn=190003 (accessed on 1 April 2022).
7. Ministry of Culture Agency. Korean Version of the New Deal Comprehensive Plan. Available online: https://www.moef.go.kr/com/cmm/fms/FileDown.do?atchFileId=ATCH_000000000014749&fileSn=3 (accessed on 20 April 2022).
8. Hamid, U.Z.A.; Zamzuri, H.; Limbu, D.K. Internet of vehicle (IoV) applications in expediting the implementation of smart highway of autonomous vehicle: A survey. In Performability in Internet of Things; Springer: Cham, Switzerland, 2019; pp. 137–157.
9. Singh, R.; Sharma, R.; Akram, S.; Gehlot, A.; Buddhi, D. Highway 4.0: Digitalization of highways for vulnerable road safety development with intelligent IoT sensors and machine learning. Saf. Sci. 2021, 143, 105407.
10. Putri, T.D. Intelligent transportation systems (ITS): A systematic review using a Natural Language Processing (NLP) approach. Heliyon 2021, 7, e08615.
11. Ali, F.; El-Sappagh, S.; Kwak, D. Fuzzy ontology and LSTM-based text mining: A transportation network monitoring system for assisting travel. Sensors 2019, 19, 234.
12. Salloum, S.A.; Al-Emran, M.; Monem, A.A.; Shaalan, K. Using text mining techniques for extracting information from research articles. In Intelligent Natural Language Processing: Trends and Applications; Springer: Cham, Switzerland, 2018; Volume 740, pp. 373–397.
13. Zhang, F.; Fleyeh, H.; Wang, X.; Lu, M. Construction site accident analysis using text mining and natural language processing techniques. Autom. Constr. 2019, 99, 238–248.
14. Onan, A. Two-stage topic extraction model for bibliometric data analysis based on word embeddings and clustering. IEEE Access 2019, 7, 145614–145633.
15. Gupta, R.K.; Agarwalla, R.; Naik, B.H.; Evuri, J.R.; Thapa, A. Prediction of Research Trends using LDA based Topic Modeling. Glob. Transit. Proc. 2022, 3, 298–304.
16. Roque, C.; Cardoso, J.L.; Connell, T.; Schermers, G.; Weber, R. Topic analysis of Road safety inspections using latent dirichlet allocation: A case study of roadside safety in Irish main roads. Accid. Anal. Prev. 2019, 131, 336–349.
17. Sun, L.; Yin, Y. Discovering themes and trends in transportation research using topic modeling. Transp. Res. Part C Emerg. Technol. 2017, 77, 49–66.
18. Hidayatullah, A.F.; Ma’arif, M.R.; Habibie, M.; Khomsah, S. Indonesia infrastructure development topic discovery on online news with latent Dirichlet allocation. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1077, 012012.
19. Liu, Q.; Zheng, Z.; Zheng, J.; Chen, Q.; Liu, G. Health communication through news media during the early stage of the COVID-19 outbreak in China: Digital topic modeling approach. J. Med. Internet Res. 2020, 22, e19118.
20. Yamamoto, M.; Umemura, N.; Kawano, H. Proposal of Japanese vocabulary difficulty level dictionaries for automated essay scoring support system using rubric. J. Oper. Res. Soc. 2020, 8, 601–617.
21. Kwon, I. Viewpoints in the Korean Verbal Complex: Evidence, Perception, Assessment, and Time; University of California: Berkeley, CA, USA, 2012.
22. Lee, D.G.; Rim, H.C. Probabilistic modeling of Korean morphology. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 945–955.
23. Hotho, A.; Nürnberger, A.; Paaß, G. A brief survey of text mining. J. Lang. Technol. Comput. Linguist. 2005, 20, 19–62.
24. Tan, A.H. Text mining: The state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, Beijing, China, 26–28 April 1999; pp. 65–70.
25. Kaplan, R.M. A method for tokenizing text. In Inquiries into Words, Constraints and Contexts; CSLI Publications: Stanford, CA, USA, 2005; p. 55.
26. Kang, H.; Yang, J. Selection of the Optimal Morphological Analyzer for a Korean Word2vec Model. In Proceedings of the Korea Information Processing Society Conference, Seoul, Republic of Korea, 31 October 2018; pp. 376–379.
27. Lee, J. Three-step probabilistic model for Korean morphological analysis. J. KIISE Softw. Appl. 2011, 38, 257–268.
28. Aizawa, A. An information-theoretic perspective of tf–idf measures. Inf. Process. Manag. 2003, 39, 45–65.
29. Ghag, K.; Shah, K. Senti TFIDF–Sentiment classification using relative term frequency inverse document frequency. Int. J. Adv. Comput. Sci. Appl. 2014, 5, 36–43.
30. Gottron, T. Document word clouds: Visualising web documents as tag clouds to aid users in relevance decisions. In Research and Advanced Technology for Digital Libraries—ECDL 2009; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5714, pp. 94–105.
31. Hofmann, T. Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 2001, 42, 177–196.
32. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
33. Andrzejewski, D.; Zhu, X. Latent dirichlet allocation with topic-in-set knowledge. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, Boulder, CO, USA, 4–5 June 2009; pp. 43–48.
34. Krestel, R.; Fankhauser, P.; Nejdl, W. Latent dirichlet allocation for tag recommendation. In Proceedings of the Third ACM Conference on Recommender Systems (RecSys’09), New York, NY, USA, 23–25 October 2009; pp. 61–68.
35. Park, C.; Eo, S.; Moon, H.; Lim, H. Should we find another model? Improving Neural Machine Translation Performance with ONE-Piece Tokenization Method without Model Modification. In Proceedings of the NAACL HLT 2021: Industry Papers, Online, 6–11 June 2021; pp. 97–104.
36. McCann, P. Fugashi, a tool for tokenizing Japanese in Python. arXiv 2020, arXiv:2010.06858.
37. Onah, D.F.; Pang, E.L. MOOC design principles: Topic modelling-PyLDavis visualization & summarisation of learners’ engagement. In Proceedings of the 13th International Conference on Education and New Learning Technologies, Online, 5–6 July 2021; pp. 1082–1091.
38. Kim, J.; Park, S.; Park, S.; Jeong, H.; Yun, I. Application of a Topic Model on the Korea Expressway Corporation’s VOC Data. J. Inf. Technol. Serv. 2020, 19, 1–13.
39. Stevens, K.; Kegelmeyer, P.; Andrzejewski, D.; Buttler, D. Exploring topic coherence over many models and many topics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Stroudsburg, PA, USA, 12–14 July 2012; pp. 952–961.
40. Khder, M.; Dawood, K.; Abdal, R.; Alqaisy, S. Automated Road Toll Collection and Vehicle Tracking (ARTCVT) Using Advanced RFID and GSM Technology. In Proceedings of the 2022 ASU International Conference in Emerging Technologies for Sustainability and Intelligent Systems (ICETSIS), Manama, Bahrain, 22–23 June 2022; pp. 86–90.
41. Chattopadhyay, D.; Rasheed, S.; Yan, L.; Lopez, A.; Farmer, J.; Brown, D. Machine Learning for Real-Time Vehicle Detection in All-Electronic Tolling System. In Proceedings of the 2020 Systems and Information Engineering Design Symposium (SIEDS), Charlottesville, VA, USA, 24 April 2020; pp. 1–6.
42. Munirathinam, S. Industry 4.0: Industrial internet of things (IIOT). Adv. Comput. 2020, 117, 129–164.
43. Wang, Z.; Sun, H.; Zhang, H.; Xie, H.; Chen, Z. A Novel Assessment and Administration Method of Autonomous Vehicle. SAE Int. J. Adv. Curr. Pract. Mobil. 2020, 2, 3312–3319.
44. Wang, C.; Miao, Z.; Chen, X.; Cheng, Y. Factors affecting changes of greenhouse gas emissions in Belt and Road countries. Renew. Sustain. Energy Rev. 2021, 147, 111220.

Figure 1. Representation of the research procedure.

Figure 2. Overall process of LDA topic modelling.

Figure 3. Algorithm diagram of text mining.

Figure 4. Word visualization using a word cloud.

Figure 5. Initial result of topic modelling with LDA.

Figure 6. Variation of coherence scores by number of topics.

Table 1. Examples of word standardization.

Standardization	Primitive Words
Autonomous vehicle	Autonomous car (vehicle), AV, Autonomous driving, Automated car (vehicle), Automated driving, Driverless car (vehicle), Self-driving car (vehicle)
Digital infrastructure	Digital infra, digital infra, Digital infrastructure, digital infrastructure
Smart highway	Smart highway, smart highway, Smart Highway, Smart expressway, smart expressway, Smart Expressway
Information and Communication Technology	Information and Communication Technology, information and communication technology, ICT, ict
Internet of Things	Internet of Things, internet of things, IOT, iot
Artificial Intelligence	Artificial Intelligence, artificial intelligence, AI, ai
Fourth Industrial Revolution	Industry 4.0(4), Fourth Industrial Revolution, fourth industrial revolution, The 4th Industrial Revolution, 4IR
Intelligent Transport System	Intelligent Transport System, intelligent transport system, ITS, its
Cooperative Intelligent Transport Systems	Cooperative Intelligent Transport Systems, cooperative intelligent transport systems, CITS, C-ITS, cits, c-its

Table 2. Example of text data for comparison of morphological analysis.

Contexts

Accordingly, the importance of establishing a digital infrastructure environment along with core autonomous vehicle technologies is also being emphasized. In terms of road infrastructure, various technologies and services are being developed in addition to providing congestion information and traffic signal controletc. For example, various technologies and services are being developed based on real-time bidirectional communication of vehicles and infrastructure, such as providing incident information, emergency vehicle access guidance services, digital virtual facility services, merge and diverge area accident prevention servicesetc.

Note: Underlined text in the original table denotes a spacing error; here, the spacing errors are the fused words “controletc” and “servicesetc”.

Table 3. Results of morphological analysis by the Python package for Korean NLP.

Python Package	Contexts
Mecab-ko	‘Autonomous vehicle’, ‘Technologies’, ‘Digital infrastructure’, ‘Emphasized’, ‘Road infrastructure’, ‘Congestion’, ‘Information’, ‘Traffic signal control’, ‘Vehicles’, ‘Infrastructure’, ‘Real-time’, ‘bidirectional communication’, ‘Incident’, ‘Emergency’, ‘Access’, ‘Digital’, ‘Virtual facility’
Hannanum	‘Autonomous vehicle’, ‘Core’, ‘Technologies’, ‘along with’, ‘Digital infrastructure’, ‘environment’, ‘establishing’, ‘importance’, ‘is also being emphasized’, ‘road infrastructure’, ‘In terms of’, ‘congestion’, ‘information’, ‘providing’, ‘traffic signal controletc’, ‘vehicles and’, ‘infrastructure’
OKT	‘Accordingly’, ‘Vehicle’, ‘Core’, ‘Technologies’, ‘Digital’, ‘infrastructure’, ‘environment’, ‘establishing’, ‘importance’, ‘emphasized’, ‘technologies’, ‘developed’, ‘road’, ‘Infrastructure’, ‘congestion’, ‘information’, ‘providing’, ‘traffic signal’, ‘control’, ‘etc’, ‘in addition to’, ‘vehicles’
Komoran	‘Accordingly’, ‘Accordingly’, ‘Vehicle’, ‘Core’, ‘Technologies’, ‘Digital’, ‘infrastructure’, ‘environment’, ‘establishing’, ‘importance’, ‘emphasized’, ‘technologies’, ‘developed’, ‘road’, ‘Infrastructure’, ‘congestion’, ‘information’, ‘providing’, ‘traffic signal’, ‘control’, ‘vehicles’, ‘infrastructure’
KKma	‘Accordingly’, ‘Accordingly’, ‘Vehicle’, ‘Core’, ‘Technologies’, ‘Digital’, ‘infrastructure’, ‘environment’, ‘establishing’, ‘importance’, ‘emphasized’, ‘technologies’, ‘developed’, ‘road’, ‘Infrastructure’, ‘congestion’, ‘information’, ‘providing’, ‘traffic’, ‘traffic signal’, ‘controletc’

Table 4. Extracted keywords using TF-IDF.

Weighted Importance	Keywords	TF-IDF Score
1	Infrastructure	6.181
2	Artificial intelligence	5.966
3	Real time	5.628
4	Cooperative intelligent transport systems	5.542
5	Autonomous vehicles	5.258
6	Sensor	5.096
7	Internet of things	4.828
8	Management	4.602
9	Digital	4.099
10	Robot	3.836
11	Big data	3.720

Table 5. Salient keywords by topic group.

Topic Group	Keywords
1	Toll, management, smart tolling, intelligent, and tollgate
2	Infrastructure, structure, Internet of things, digital, and cooperative intelligent transport systems
3	Autonomous vehicles, artificial intelligence, real time, big data, and sensor
4	Green road, real time, energy, carbon, and eco-friendly

Table 6. Profiles of topic groups.

Topic	Label	Share (%)
1	Unmanned payment systems	35.6
2	Intelligent road infrastructure	27.4
3	Connected automated driving road	23.0
4	Eco-friendly road	14.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
