Article

The Development of a Named Entity Recognizer for Detecting Personal Information Using a Korean Pretrained Language Model

1 Technology Strategy Research Institute, World Vertex Co., Ltd., Seoul 06748, Republic of Korea
2 Department of Edutech, Graduate School, Korea National Open University, Seoul 03087, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5682; https://doi.org/10.3390/app14135682
Submission received: 30 April 2024 / Revised: 20 June 2024 / Accepted: 25 June 2024 / Published: 28 June 2024
(This article belongs to the Special Issue Natural Language Processing: Theory, Methods and Applications)

Abstract:
Social network services and chatbots are susceptible to personal information leakage while facilitating language learning without time or space constraints. Accurate detection of personal information is paramount in avoiding such leaks. Conventional named entity recognizers commonly used for this purpose often fail owing to errors of unrecognition and misrecognition. Research in named entity recognition predominantly focuses on English, which poses challenges for non-English languages. By specifying procedures for the development of Korean-based tag sets, data collection, and preprocessing, we formulated directions for applying entity recognition research to non-English languages. Such research could significantly benefit artificial intelligence (AI)-based natural language processing globally. We developed a personal information tag set comprising 33 items and established guidelines for dataset creation, later converting the data into JSON format for AI learning. State-of-the-art AI models, BERT and ELECTRA, were employed to implement and evaluate the named entity recognition (NER) model, which achieved a 0.943 F1-score and outperformed conventional recognizers in detecting personal information. This advancement suggests that the proposed NER model can effectively prevent personal information leakage in systems processing interactive text data, marking a significant stride in safeguarding privacy across digital platforms.

1. Introduction

The Korean wave [1], which started with the popularity of Korean dramas and movies, has spread globally through K-Pop and expanded into various other K-contents. Owing to increasing globalization, the overseas expansion of Korean companies, policies aimed at increasing the number of international students, multicultural families resulting from international marriages, and immigrant workers, interest in the Korean language and the number of Korean language learners have been steadily increasing [2].
Based on data from the global language learning app Duolingo, the US news channel Cable News Network (CNN) reported a 38.3 percent increase in Korean language learners in US higher-education institutions between 2016 and 2021 (from 10,936 to 19,270 students) [3]. The number of UK higher-education students taking Korean courses more than tripled from 2012 to 2018. King Sejong Institute, in charge of overseas Korean language education and promoting Korean culture, operated 244 branches in 84 countries worldwide in 2022. Its number of students increased nearly 160-fold from 740 in 2007 to 117,636 in 2022 [4].
Social network services (SNSs), chatbots, and similar services can be convenient tools for educating and learning various languages without time and space constraints. YouTube, the world’s largest video platform, can also be used for language learning. Korean language learners use YouTube to acquire common conversation skills and pronunciations; this highlights the importance of using SNSs for teacher–student communication [5]. The Korean language department at a Chinese university conducted a Korean-speaking class using WeChat. Immediate information retrieval and sharing helped students understand learning materials and express their opinions, increasing their interest in learning [6]. A US university conducted vocabulary and grammar practice activities using Instagram for students taking beginner Korean classes. They provided feedback to the students on their posts upon completion. This increased the interest of students in the Korean language and encouraged their participation in learning [7]. The conversational Korean language education system [8] that uses chatbots involves engaging students in dialogs on various situational topics. It also enables precise recognition of speech and detailed evaluation of pronunciation and stress patterns, facilitating effective independent learning.
Chatbots and SNS-based education have numerous advantages; however, because they are universal and scalable and target many learners, they are vulnerable to personal information leakage. In 2021, a personal information leakage incident occurred in South Korea involving iLUDA, an artificial intelligence (AI) chatbot [9]. The iLUDA company used KakaoTalk dialogs collected from its app to develop and operate iLUDA. Data, including non-pseudonymized personal information, such as names, phone numbers, and addresses, were collected without explicit consent. The iLUDA service was terminated, and the company was fined owing to personal information leakage and sexual harassment issues. This was the first case in South Korea where the indiscriminate use of AI to handle personal information was sanctioned. Afterwards, the Ministry of Interior and Safety distributed a guide titled “ChatGPT Usage Guidelines and Precautions” [10] to help the public sector effectively use ChatGPT. It provides guidance on personal information protection, leakage of important information, and safety measures when using ChatGPT.
Meanwhile, HappyTalk, a messenger-based chat counseling solution, revealed that in July 2021, its server was forcibly breached by intruders, resulting in customer information leakage [11]. According to the developer, the server was breached through messages received via chat inquiries, and customer counseling information was accessed. A subsequent internal investigation revealed that customer personal information, such as names and phone numbers, had been leaked.
According to BleepingComputer, a hacking forum leaked the personal information of 2.6 million users of Duolingo, the world’s largest language learning site [12]. The data were scraped using a publicly available application programming interface (API) and offered for sale on the hacking forum. The leaked data contained public information, such as login IDs and real names, and private information, such as emails and completed courses. Furthermore, in 2021, Facebook experienced a data breach through a friend request API bug, exposing the phone numbers of over 500 million users. Similarly, Twitter also experienced personal information leakage, including the email addresses of millions of users.
These examples highlight the risks of using data generated through language-based services, such as SNSs and chatbots, for data analysis and AI tasks. These issues can be broadly categorized into two main groups. The first involves data collection and acquisition: data are primarily collected by scraping publicly available text on the internet, such as social networking sites, which may result in privacy breaches. The second pertains to using AI models trained on the collected data: there is a risk of personal information being regenerated from the inference results of AI models, and this risk increases significantly with the amount of data [13].
Accurately detecting personal information in data is paramount in addressing these issues, as are the crucial steps of de-identifying or pseudonymizing these data. Previous research has extensively used named entity recognizers for this task. However, these conventional methods are prone to errors of unrecognition and misrecognition, leading to incomplete or inaccurate detection of personal information [14]. Challenges arise when detecting unique identifiers, such as resident registration and passport numbers. Furthermore, certain attributes, including gender and blood type, remain undetectable. These limitations underscore the need for advanced personal information processing solutions.
Named entity recognition (NER) studies have primarily been conducted with data in English. Owing to the predominance of English in publicly available datasets, applying NER to other languages can present significant challenges [15]. NER research based on the Korean language is inadequate, although Korean is increasingly used globally. In recent research on personal information detection using a Korean named entity recognizer, attempts were made to use additional information, such as sentence intention or speaker information; however, those studies faced limitations owing to insufficient data [16,17]. Therefore, defining a privacy named entity (PNE) tag set and collecting and processing data will positively impact AI-based natural language processing research outside the English-speaking world.
We selected 33 personal information items to identify personal information in interactive texts and built a set of personal information tags. We then collected data using data-building guidelines and converted them into JSON format suitable for AI learning. We selected bidirectional encoder representations from transformers (BERT) and efficiently learning an encoder that classifies token replacements accurately (ELECTRA) as the pretrained language models. The implemented models outperformed general-purpose named entity recognizers in personal information detection.
The remainder of this paper is organized as follows. Section 2 explores research related to NER and personal information. Section 3 presents the establishment of a PNE tag set, interactive data collection and processing, the structuring and building of datasets, the construction of learning models, and the preprocessing of training data. In Section 4, a false detection case analysis, conducted to determine areas of improvement, is discussed, and the performance of the model is verified. Section 5 summarizes the research findings and proposes future research directions.

2. Related Works

2.1. Named Entity Recognition

Named entities are proper nouns, such as personal, place, and organization names. NER is a fundamental task in natural language processing, which automatically identifies and classifies predefined specific language expressions in an unstructured natural language text, primarily proper nouns. NER can be used for various purposes, such as Q&A, chatbots, reservation systems, and customer service interactions.
Various attempts have been made to address NER tasks (e.g., Q&A and chatbots). The three main approaches used before the advent of deep learning are rule-based, unsupervised, and feature-based supervised learning. Rule-based NER uses domain-specific dictionaries or syntax–lexicon patterns and achieves high accuracy in specific domains; however, it suffers from low recall and transfers poorly to other systems. Unsupervised learning mainly achieves NER through clustering, extracting named entities from groups based on contextual similarity. The main concept is that mentions of named entities can be identified through vocabulary resources, patterns calculated from large corpora, and statistics. Feature-based supervised learning casts NER as a multiclass classification or sequence labeling task [18].
Other studies have used deep learning models, such as convolutional neural networks and recurrent neural networks (RNNs), including long short-term memory. In South Korea, in 2018, a model that combined a bidirectional RNN and conditional random field with ensemble techniques showed significant performance improvement in the Naver NLP Challenge [19]. Transformer-based pretrained language models are a new paradigm to address the NER challenge and have significantly improved NER task performance [18].
Traditional personal information detection systems based on predefined patterns and morphological analysis suffer from low detection rates and high labor and time costs. To improve the detection rate through iterative experiments, a machine learning method was proposed to learn the patterns and types of personal information from a large number of electronic documents [20].
In atypical text data, personal information cannot be identified using regular expressions or existing NERs. Therefore, Dias et al. [15] proposed a method to train the BERT model using speaker information and to label two tags for a single phrase. The intention was to tag speaker A as NAME A (NMA), speaker B as NAME B (NMB), and the name of a person who did not participate in the dialog as NAME OTHERS (NMO) [17]. However, the scope of that study was limited to testing names among personal information items. Furthermore, the quality and size of the dataset were insufficient for personal information detection.
Seo et al. [16] proposed a model for detecting personal information. This model uses the intended information of a sentence as additional information in the named entity learning process and a de-identification technique that considers the utility of personal information. The model uses Korean pretrained language models to simultaneously classify the intent of sentences and detect personal information in atypical text data. However, the personal information tag set used for identification consisted of only seven tags, making it difficult for the model to detect various types of personal information.
Kim and Lim [21] used deep learning techniques to develop an NER model that focuses on criminal investigation. They collected texts from the criminal investigation domain and redefined the classification of named entities required for crime analysis. They conducted an experiment where they categorized the domain of criminal investigation into 9 main categories and 56 subcategories. Their experimental results showed that all categories were identified exceptionally well. Their study aimed to enhance the efficiency of crime prevention and investigation by automatically extracting the level of crime using the NER designed for crime investigation.
Go et al. [22] developed an NER model for efficient dialog information prediction, focusing on household chemicals. They defined a new named entity tag set comprising the manufacturer, name, detailed items, formulation classification, ingredients, and inflow routes of the product. They suggested that using a user dictionary for preprocessing on a specific domain is efficient and recommended reducing the number of filters in character-level convolutional neural networks to reduce model complexity.
Previous research has shown that in specific domains, creating a domain-specific named entity tag set can significantly improve accuracy compared with using a general NER. Transformer-based models, such as BERT and ELECTRA, have recently shown excellent performance and are now widely used in the NER field. AI-based personal information detection technology should be actively used to reduce the human resources and time required to verify false detections based on traditional patterns or regular expressions.

2.2. Personal Information Data

The Korea Internet and Security Agency, a subsidiary of the Personal Information Protection Commission of South Korea, operates a privacy portal [23]. The Personal Information Protection Act defines personal information as information related to a living individual, including the following [24]:
(a)
Personal information identifying an individual, such as name, resident registration number, and images.
(b)
Information that is insufficient on its own to identify a specific individual but can easily be combined with other information to do so.
(c)
Information resulting from pseudonymizing (a) or (b) so that a specific individual cannot be identified without additional information to restore the original state (pseudonymous information).
Therefore, a natural person must be the subject of personal information. Information about the name, address, executive information, and financial performance of a corporation is not personal information; therefore, it is not protected by the Personal Information Protection Act. Personal information includes various personal details, ranging from basic information (name, resident registration number, etc.) to more private information (social and economic status, education, health, property, cultural activities, political inclinations, etc.).
The privacy portal categorizes personal information into various categories, as listed in Table 1. Identity information encompasses general and family information. Physical information is categorized into body and medical information. Mental information includes preferences, disposition, and inner secrets. Social information consists of education, military service, employment, and legal. Property information encompasses income, credit, real estate, and other revenue information. Finally, miscellaneous information includes communication, location, habits, and hobbies.
The Personal Information Protection Commission categorizes personal information items into grades and types according to the guidelines listed in Table 2, Table 3 and Table 4. The categories consist of Grades 1–3. Grade 1 encompasses information regarding unique identification, sensitivity, certification, credit/financial, and location.
Grade 2 comprises personal identification, body, family, education and training, military service, real estate, income, employment, legal, medical, organizational, habits and hobbies, and personal image information.
Grade 3 includes telecommunications, processed, and limited personal identification information.

3. Materials and Methods

In this study, based on Table 1, Table 2, Table 3 and Table 4 and advice from industry professionals on information protection and security, we analyzed 39 items related to personal information to detect personal information that may arise in SNS and AI chatbots. We defined 33 items and constructed a personal information tag set. We developed guidelines for creating data on defined personal information items and collected data accordingly. We then transformed the collected data into JSON format suitable for AI learning based on the data structure definition document and conducted model training. Consequently, we developed a named entity recognizer to detect personal information.
First, we established a privacy-specific named entity tag set. Traditional NER tag sets that contain multiple items within a single tag fail to detect certain personal information items and detect others inaccurately. Consequently, we identified 39 personal information items as candidates for a privacy-specific named entity tag set. We evaluated the detectability of these items using NER and assessed their exposure risk when used alone or in combination to determine the extent of de-identification required. A personal information tag set comprising 33 items, frequently used in interactive texts and posing significant exposure risks, was constructed. This set includes names, nicknames, gender, height, weight, account numbers, card numbers, and sensitive unique identification numbers, such as resident registration, passport, driver’s license, and alien registration numbers.
Second, we collected interactive text data containing personal information. We established guidelines for generating virtual dialog data incorporating personal information and refined them through an inspection process to ensure their appropriateness for natural dialogs.
Third, we created a training dataset using a data structure definition document as a guide. This involved labeling dialog topics, general named entities, PNEs, and speakers, with PNEs labeled according to the specified tag set.
Fourth, using the begin-inside-outside (BIO) tagging method commonly used in NER, we preprocessed data for deep learning training, where ‘B’, ‘I’, and ‘O’ denote the beginning of a named entity, interior, and external parts not included in the named entity, respectively.
Fifth, the related literature helped us select the deep learning model. We selected the transformer-based BERT and ELECTRA models because they demonstrated high performance in natural language processing.
Sixth, we conducted experiments using the KPF-BERT and KR-ELECTRA models for BERT and ELECTRA. We used PyTorch 1.13 as a deep learning framework and implemented the model on a server equipped with four NVIDIA RTX A5000 GPUs (Nvidia, Santa Clara, CA, USA) to train and evaluate the models. We then identified directions for improvement.

3.1. Construction of a Privacy Named Entity Tag Set

For the experiment, we constructed a tag set for PNEs to recognize personal information items that cannot be identified by a general NER. The 39 items, comprising the types of personal information defined by the privacy portal and the Personal Information Protection Commission, are listed in Table 5. We then categorized these items by detection technique, exposure risk, and de-identification scope. We categorized the detection methods into rule- and NER-based detection to investigate the scope of detection, and marked each personal information item as accurately detectable (v) or only partially detectable (△), the latter because multiple items could belong to a single category. Based on consultations with industry professionals in information protection and security, we categorized exposure risk into high (H), medium (M), and low (L) levels depending on whether a single factor or a combination of factors causes the exposure. We categorized the de-identification scope into pseudonymization/substitution, deletion, categorization, and masking methods.
Following discussions involving ten experts in personal information detection and de-identification solutions, university professors, researchers, and AI and system integration development specialists, we selected 33 items from the initial 39 items targeted for personal information collection, as listed in Table 6. We selected the items based on the criteria for collection and use in the interactive text. We added unique identification information, such as resident registration and passport numbers, financial information, credit card and bank account numbers, and personal information, such as workplace, department, and position. Although this information can be detected using regular expressions, it has a high false detection rate.
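To illustrate why rule-based detection of such items is error-prone, the sketch below matches only the surface shape of a resident registration number (six digits for the birth date, a hyphen, then seven digits). The function name and sample strings are our own illustrations, not from the paper; the pattern checks shape alone, which is precisely why structurally similar strings also match.

```python
import re

# Surface shape of a Korean resident registration number:
# six digits (YYMMDD birth date), a hyphen, seven digits.
# The pattern validates shape only, not content validity.
RRN_PATTERN = re.compile(r"\b\d{6}-\d{7}\b")

def find_rrn_candidates(text):
    """Return every substring that merely looks like a resident registration number."""
    return RRN_PATTERN.findall(text)
```

A genuine number and a structurally identical order number are both returned, so rule-based hits still require manual verification — the false detection problem noted above.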
A privacy NER tag set was developed by analyzing the existing NER tags for the 33 selected items. First, we added a new tag set for items without existing NER tags to align with the personal information category. These items included CV_SEX (gender), TM_BLOOD_TYPE (blood type), OG_DEPARTMENT (department), QT_IP (IP information), and CV_MILITARY_CAMP (military unit). Second, we refined the previously unified tag set into more specific categories to align with the personal information section. We divided PS_NAME into PS_NAME (name), PS_NICKNAME (nickname), and PS_ID (ID). The unique identification information, account number, and license plate number that were integrated under QT_OTHERS were subdivided into QT_RESIDENT_NUMBER (resident registration number), QT_ACCOUNT_NUMBER (account number), and QT_PLATE_NUMBER (license plate number), respectively. Third, we specified the tag sets that were previously vague in the personal information category. DT_OTHERS and OGG_OTHERS were specified as DT_BIRTH (date of birth) and OGG_CLUB (club/society). Fourth, we applied the remaining items similarly to the existing NER tag set and included QT_AGE (age), QT_LENGTH (height), TMI_EMAIL (email address), and OGG_EDUCATION (school).
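The refined tags above determine the label inventory of a token classifier: under BIO tagging, each entity tag yields a B- and an I- label, plus a single O label. The sketch below uses only the tags named in this section (a subset of the full 33); the constant and function names are illustrative, not from the paper.

```python
# Subset of the privacy tag set named in this section; the full set has 33 tags.
PNE_TAGS = [
    "PS_NAME", "PS_NICKNAME", "PS_ID", "CV_SEX", "TM_BLOOD_TYPE",
    "OG_DEPARTMENT", "QT_IP", "CV_MILITARY_CAMP", "QT_RESIDENT_NUMBER",
    "QT_ACCOUNT_NUMBER", "QT_PLATE_NUMBER", "DT_BIRTH", "OGG_CLUB",
    "QT_AGE", "QT_LENGTH", "TMI_EMAIL", "OGG_EDUCATION",
]

def build_label_list(tags):
    """Expand entity tags into the BIO label set a token classifier predicts."""
    labels = ["O"]
    for tag in tags:
        labels.extend([f"B-{tag}", f"I-{tag}"])
    return labels
```

With the full 33-tag set, this expansion yields 67 output labels (2 × 33 + 1).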

3.2. Interactive Data Collection and Processing

Considering the widespread use of SNS and AI chatbots, we collected dialog data. We conducted a preliminary independent investigation of various platforms, such as KakaoTalk, Twitter, Facebook, AI Hub, Naver News comments, YouTube comments, and online shopping mall comments. According to the investigation, personal information was rare in open spaces such as Naver News, YouTube, and online shopping mall comments. However, in the clothing category of online shopping malls, we observed some personal information related to body size. On Facebook, specific groups, such as job seekers or affiliated schools, often exposed information related to businesses for job opportunities, contact details of responsible individuals, emails, and associated information. As Twitter allows users to remain anonymous, the names and account numbers in transactional posts are often easily revealed. Although the AI Hub data contained some personal information, they were of limited type and quantity, masked, and therefore not useful.
Leaking personal information from interactive data is a serious problem; however, the amount of personal information in everyday dialogs is minimal. After examining the KakaoTalk data of six individuals (including three senior-level researchers with over ten years of experience, two manager-level researchers with over four years of experience, and one assistant-level researcher with less than four years of experience), we identified less than 20 instances of dialogs revealing personal information, excluding names and nicknames, over six months to one year. Furthermore, we collected data through crowdsourcing because the interactive text contained sensitive personal information. However, the small sample size prevented the collection of the desired amount of data. We collected interactive personal information and processed it as follows.
First, we constructed dialog data containing personal information based on various topics. Interactive data refer to dialogs on various platforms, such as messengers, social media, posts, comments, and call centers. We classified the dialog topics into eight types, as listed in Table 7, and defined the personal information items corresponding to each type. Dialog topics were categorized as “personal and relationships”, “housing and life”, “shopping and trading”, “public services”, “leisure and entertainment”, “work and occupation”, “beauty and health”, and “learning and career”. We categorized the prioritized dialog types into all categories except “personal and relationships” and classified dialogs that did not fall into any other category as “personal and relationships”. We defined the personal information categories for each dialog type to minimize duplication and included many items within each dialog set.
Second, we constructed the dialog data using everyday dialog sentences, with each dialog consisting of a minimum of three turns and an average of four turns. A dialog turn is one exchange between speakers 1 (P01) and 2 (P02). The tagging criteria for a dialog are listed in Table 8. The tagging scope encompassed PNEs, dialog topics, and speaker comments.
We collected personal information according to the data collection guidelines. The annotation guidelines for PNEs were defined based on the existing named entity annotation guidelines, following discussions with language-related university professors and researchers on the definition of the PNE tag set. All 15 annotators had majored in linguistics and completed training on the guidelines before beginning annotation work. Where an annotation definition was unclear, detailed inspection rules were established by consensus, and the annotation guidelines were revised. Data annotations were inspected against the relevant criteria by five inspectors, and uniformity of the annotations was secured based on their feedback. Quality was further improved by deploying five management and support personnel alongside the annotators and inspectors.
The quantity for each item, set based on the advice of a natural language processing expert, required a minimum of 100 data points for the learning process. The total collection target was set at 20,000, with individual quantities determined by the collection and detection difficulty of each item. Unique identification information, such as resident registration and passport numbers, is only occasionally revealed in dialogs and follows a consistent pattern, resulting in fewer instances than other items. The “name” item, which appeared most frequently in dialogs, exceeded its target quantity more than threefold. In total, we collected 22,309 data points.

3.3. Structuring and Building Datasets

To convert the collected data into JSON format, which is suitable for AI learning, we wrote a data structure definition document based on the JSON structure in the NER corpus of the National Institute of Korean Language, as listed in Table 9. We labeled the dialog_type based on Table 7, in which personal information items are listed by the dialog topic. The PNE for detecting personal information was labeled based on the items defined in Table 6, and a PNE tag set was created.
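As a rough illustration, one converted dialog record might look like the sketch below. The actual field names follow the National Institute of Korean Language NER corpus schema in Table 9, which is not reproduced here, so every key (`dialog_type`, `utterances`, `pne`, and so on) is a placeholder, and the account number is invented sample data.

```python
import json

# Placeholder record layout; real field names come from the Table 9 schema.
record = {
    "dialog_type": "shopping and trading",   # one of the eight topics in Table 7
    "utterances": [
        {
            "speaker": "P01",
            "text": "계좌번호는 110-123-456789 입니다.",
            # character offsets into "text", end exclusive
            "pne": [{"begin": 6, "end": 20, "tag": "QT_ACCOUNT_NUMBER"}],
        },
    ],
}

serialized = json.dumps(record, ensure_ascii=False, indent=2)
```

Storing explicit character offsets alongside each tag lets the BIO preprocessing step recover exact entity spans without re-running string matching.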

3.4. Learning Models and Training Data Preprocessing

3.4.1. BIO Tagging

Tokenization divides a given sentence into word fragments, called tokens, for mathematical calculations. Word embedding converts these tokens into vector representations for calculation. All sentences were transformed into operable vector representations using tokenization and word embedding. To incorporate named entity information into these vector representations for learning, we used BIO tagging (defined in Section 3), the most widely used tagging scheme in NER [21,27]. In this study, we annotated the collected data using BIO tagging, as shown in Figure 1.
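Because the pretrained models use WordPiece subword tokenization (Section 3.4.2), word-level BIO labels must be spread over subword pieces: the first piece keeps the word's label, and later pieces of an entity word become the matching I- label. The sketch below is a minimal illustration with a toy tokenizer standing in for the real WordPiece vocabulary; the function names are our own.

```python
def toy_tokenize(word):
    """Stand-in for a WordPiece tokenizer: split long words into two pieces."""
    return [word] if len(word) <= 2 else [word[:2], "##" + word[2:]]

def align_bio_to_subwords(words, word_labels, tokenize=toy_tokenize):
    """Spread word-level BIO labels over subword tokens.
    The first subword keeps the word's label; later subwords of an
    entity word get the matching I- label, and 'O' words stay 'O'."""
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        sub_tokens.extend(pieces)
        sub_labels.append(label)
        cont = "O" if label == "O" else "I-" + label[2:]
        sub_labels.extend([cont] * (len(pieces) - 1))
    return sub_tokens, sub_labels
```

For example, `align_bio_to_subwords(["김민수", "입니다"], ["B-PS_NAME", "O"])` yields the subword labels `["B-PS_NAME", "I-PS_NAME", "O", "O"]`.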

3.4.2. Building a Learning Model

Traditional RNN-based models suffer from prolonged calculation times because they process words sequentially, one at a time. Introduced by Google in 2017, the transformer model addresses this issue by employing attention mechanisms to process entire sentences in parallel, thereby reducing memory and computational demands. This simultaneous processing enhances performance and facilitates efficient training [28]. We used the transformer-based BERT and ELECTRA models for the experiments.
In 2018, Google released BERT, a language model pretrained on a large amount of training data. During pretraining, the model learns using the masked language model and next sentence prediction objectives. It demonstrated high performance with the addition of a single layer, minimal data, and a short fine-tuning time; this is referred to as transfer learning. To enhance the performance of BERT in a specific field, collecting language data from that field and conducting additional training are necessary [29]. ELECTRA is more resource-efficient than BERT and learns efficiently with relatively few resources. It improves training efficiency by using a new pretraining task called replaced token detection, in which some tokens of a sampled sentence are replaced with plausible alternatives and the model learns to distinguish original tokens from replaced ones [30].
In this study, the experiments were conducted on a server equipped with four NVIDIA RTX A5000 GPUs. PyTorch served as the deep learning framework. KPF-BERT and KR-ELECTRA were used as the BERT and ELECTRA models, respectively.
KPF-BERT is a BERT model released by the Korea Press Foundation (KPF), trained on over 40 million articles from 20 years of Bigkinds data [31]. The Computational Linguistics Lab at Seoul National University released the KR-ELECTRA model after training it on 34 GB of Korean text data, including Wikipedia articles, news, legal texts, news comments, and product reviews [32].
We performed tokenization for the KPF-BERT and KR-ELECTRA models using the WordPiece tokenizer, a subword tokenizer. The vocabulary sizes were 36,439 for KPF-BERT and 30,000 for KR-ELECTRA. For the experiments, we used the AdamW optimizer, a learning rate of 5 × 10⁻⁵, 30 training epochs, and batch sizes of 4 and 24. We set the maximum sequence length to 128 tokens when training on sentences and 256 tokens when training on dialogs.
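WordPiece splits out-of-vocabulary words into known subword pieces by greedy longest-match-first search. The following minimal sketch uses a toy vocabulary (not the actual KPF-BERT or KR-ELECTRA vocabularies) to show the idea:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, as in WordPiece.
    Non-initial pieces carry the '##' continuation prefix."""
    subwords, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:  # no prefix of the remainder is in the vocabulary
            return [unk]
        subwords.append(piece)
        start = end
    return subwords

# Toy vocabulary; real vocabularies hold tens of thousands of entries.
vocab = {"token", "##iz", "##ation"}
print(wordpiece_tokenize("tokenization", vocab))  # → ['token', '##iz', '##ation']
```

During BIO annotation, only the first subword of a word typically receives the B-/I- label, with continuation pieces either ignored or given the I- label of the enclosing entity.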

3.4.3. Model Experiment Dataset

We used 4581 dialog sets, divided into training and test datasets of 4022 (19,650 tags) and 559 (2659 tags), respectively. The number of tags for each personal information item in the training and test datasets is listed in Table 10.
We divided the training into two settings to conduct the PNE detection experiments with the KPF-BERT and KR-ELECTRA models: training on each dialog line by line (sentence units) and training on entire dialogs (dialog units).

4. Experimental Results

In NER, the F1-score is commonly used as the evaluation metric. Because the quantities of the 33 personal information items differ significantly, the micro-average was used to compute a balanced score. Figure 2 illustrates an example of PNE detection. For convenience, the OUTPUT displays results as <personal information item: named entity tag>; in the actual detection, they were mapped to BIO tags and the corresponding named entity tags.
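The micro-average pools true positives, false positives, and false negatives across all tags before computing precision and recall, so frequent tags contribute proportionally more than rare ones. A sketch with hypothetical per-tag counts:

```python
def micro_f1(per_tag):
    """Micro-averaged F1 over per-tag (TP, FP, FN) counts: the counts are
    summed across tags first, then precision/recall/F1 are computed once."""
    tp = sum(c[0] for c in per_tag.values())
    fp = sum(c[1] for c in per_tag.values())
    fn = sum(c[2] for c in per_tag.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical (TP, FP, FN) counts for two tags of very different frequency.
counts = {"PS_NAME": (90, 10, 10), "QT_IP": (20, 0, 0)}
print(round(micro_f1(counts), 3))  # → 0.917
```

A macro-average would instead average per-tag F1-scores, letting a rare tag such as QT_IP weigh as much as PS_NAME.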

4.1. Performance Results of the KPF-BERT and KR-ELECTRA Models

The experimental results for each model are summarized in Table 11. With a batch size of 24, the KPF-BERT [33] model trained on dialog-level data achieved the highest F1-score of 0.934. Examined individually, KPF-BERT showed a PNE detection rate 0.9 percentage points higher when trained at the dialog level than at the sentence level, whereas KR-ELECTRA [34] showed a detection rate 9.9 percentage points higher when trained at the sentence level than at the dialog level. KPF-BERT was pretrained on articles longer than 512 subwords; to handle such long inputs, we trained it to process documents in independent chunks by providing a stride, and consequently it performed better in dialog-level training when the maximum sequence length was set to 512. In contrast, because KR-ELECTRA was pretrained with a maximum sequence length of 128, it performs better when trained on shorter sentence units.
Performance therefore depends on choosing a maximum sequence length suitable for each model. KPF-BERT, which can handle many tokens when trained at the dialog level, demonstrated the best performance, suggesting that capturing the context of the dialog helps in detecting PNEs.
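The stride-based handling of long inputs can be sketched as a sliding window over the token sequence. The window and overlap sizes below are illustrative; the exact chunking used in the experiments may differ:

```python
def sliding_windows(token_ids, max_len=512, overlap=128):
    """Split a token sequence into chunks of at most max_len tokens.
    Consecutive chunks share `overlap` tokens so that entities near a
    chunk boundary still see some surrounding context."""
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - overlap
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids) - overlap, step)]

chunks = sliding_windows(list(range(1000)))
print([len(c) for c in chunks])  # → [512, 512, 232]
```

Predictions for tokens that appear in two overlapping chunks must then be reconciled, e.g., by keeping the prediction from the chunk in which the token sits farthest from a boundary.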
After confirming the strong performance of KPF-BERT in dialog-level training, we tuned the hyperparameters to enhance it further. Experimenting with a batch size of 4 and a maximum sequence length of 512 for both models, KPF-BERT achieved an improved F1-score of 0.943, 0.9 percentage points higher than the previous best of 0.934.
The experimental results for each personal information category of KPF-BERT are summarized in Table 12. KPF-BERT achieved its highest overall performance with an F1-score of 0.943. Among the 33 specific items of personal information, numerical data with fixed patterns or restricted forms showed high identification performance, above 90 percent. Items such as passport and driver's license numbers, blood type, mobile phone number, general phone/fax number, email address, URL, and IP address achieved a 100 percent detection rate, with an F1-score of 1.0. Sixteen items, including date of birth, height, and weight, showed a recall of 1.0, reflecting the consistent patterns of these items (e.g., passport number: M123A4567; mobile phone number: 010-1234-5678). However, we observed a few false detections, such as mistaking a general date for a date of birth or identifying measurements unrelated to personal information as heights or weights. Nicknames, clubs/societies, and places take diverse forms and uses, which complicates prediction. Job title/position, although simpler in form than these three items, showed a relatively low detection rate because many cases that do not correspond to job titles/positions, such as occupations, titles, and relationship terms, were included.

4.2. Analysis Results for False Detection Cases

Examining the false detection cases in the personal information category, we found errors caused by missing annotations, inaccuracies in the annotation scope, and false detections of similar forms. Missing annotations are cases where labels were omitted owing to annotator error during data collection. Annotation scope errors mostly occur when vocative particles such as '~ah' and '~ya' or suffixes such as '~nim' and '~ssi' are included in names, nicknames, or job titles/positions. Finally, although the model effectively captures entities whose forms resemble personal information, it often misidentifies them as different named entities; these similar-form cases constituted most of the false detections. Their specific details are presented below.

4.2.1. Annotation Scope Errors

Incorrectly specifying the annotation range during labeling causes the model to learn incorrect entity boundaries; the model may also appear to make false detections simply because the ground-truth labels are wrong. Typically, names, nicknames, and job titles/positions that include vocative particles such as '~ah' and '~ya' or suffixes such as '~nim' and '~ssi' are annotated incorrectly. Some examples are listed in Table 13.

4.2.2. Similar Forms of False Detections

Although the model effectively identified entities with forms similar to personal information, it often misidentified them as other named entities; examples of such cases are listed in Table 14. These cases constitute the majority of false detections, highlighting the importance of establishing clear criteria for identifying each named entity as personal information, ensuring consistent and accurate annotation, and considering the context. The seven types of similar-form errors are as follows:
(1)
‘name’ and ‘nickname’ are interchangeably used to refer to a person. Because foreign or baptismal names resemble nicknames, confusion may occur when using them.
(2)
Nicknames and club/society names are similar in that both are freely coined words without a fixed format, which may cause false detections.
(3)
Like place names, which imply participation in or visits to a location, a club/society name can also indicate membership, leading to misinterpretation.
(4)
Because both club/society and workplace denote an individual's affiliation, the wrong label can be chosen during annotation.
(5)
The term ‘self-employed’ should be detected as a workplace depending on the context; however, it is sometimes misidentified as a place. Differentiating between places and workplaces in each context is important because insufficient data can lead to false detection.
(6)
Although the last digits of a resident registration number and an alien registration number follow different rules depending on the individual, the two are identical in form and usage, leading to cases of mistaken identification. Conclusions must therefore be drawn from the context of the preceding and following dialog.
(7)
Because card and account numbers appear in financial dialogs and are often accompanied by the mention of credit card companies or banks, they can be falsely detected. In this experiment, card numbers were occasionally mistaken for account numbers; however, account numbers were never mistaken for card numbers. Therefore, improving the data reduces the probability of errors.

5. Conclusions

In this study, to identify personal information in dialog-type texts, we analyzed 33 of 39 personal information items and constructed a personal information tag set, as some items could not be accurately detected with the general NER tag set. We established guidelines for the defined personal information categories to facilitate data construction and collected dialog-type text data. Following the data structure specification, we converted the collected data into JSON format, which is suitable for AI learning. We then performed PNE detection experiments using the BERT and ELECTRA models. The best performance, an F1-score of 94.3%, was achieved when KPF-BERT was trained with a batch size of 4 and a maximum sequence length of 512, with the training input extended from sentence units to dialog units [35]. In detecting the 33 specific items of personal information, performance above 90 percent was achieved on numerical data with fixed patterns or restricted forms. Unique identification information, blood type, mobile phone number, email address, URL, and IP address had a 100 percent detection rate, with an F1-score of 1.0. For 16 items, including date of birth, height, and weight, the recall was 1.0, indicating that these items follow consistent patterns. However, items such as nicknames, clubs/societies, places, and job titles/positions had lower detection rates than other categories owing to their diverse forms and usage, which makes prediction challenging. Additional training data for items with low detection rates must therefore be collected, and detection performance improved through further training.
Through this process, we established a foundation to flexibly accommodate changes to the personal information items or the need for additional training information, and to proactively respond to potential personal information leakage. Named entity recognition research has generally been conducted on English, with a tendency to focus on model performance while overlooking aspects such as data collection and preprocessing. In this study, we detailed procedures for data collection, preprocessing, and the conversion of training data to provide directions for applying entity recognition research to non-English languages. The developed NER model differs from previous recognizers in that it directly addresses personal information detection. However, detecting personal information alone does not assess the risk of leakage. Future research should move beyond simple detection and evaluate the leakage risk posed by the detected personal information. Further experiments with generative pretrained transformers and large language models are also needed. We anticipate that research building on our model will help prevent personal information leakage in the many systems that generate large-scale text data.

Author Contributions

Conceptualization, T.K. and S.J.; resources and data curation, Y.C. and H.S.; methodology, S.J., Y.C., and H.S.; validation, S.J. and H.W.; investigation, Y.C. and H.S.; formal analysis, H.S. and Y.C.; visualization, Y.C. and H.S.; writing—original draft, S.J.; writing—review and editing, S.J. and H.W.; project administration, T.K. and S.J.; funding acquisition, T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Personal Information Protection Commission of the Republic of Korea and the Korea Internet & Security Agency (KISA), grant number 1781000017.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to the email address [email protected].

Acknowledgments

This study was supported by the Personal Information Protection Commission of the Republic of Korea and the Korea Internet & Security Agency (KISA); the authors thank both organizations for their technical and financial support.

Conflicts of Interest

Authors Sungsoon Jang, Yeseul Cho, Hyeonmin Seong and Taejong Kim were employed by the company Technology Strategy Research Institute, World Vertex Co., Ltd., Seoul, Republic of Korea. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Available online: https://kofice.or.kr/b20industry/b20_industry_03_view.asp?seq=8009 (accessed on 15 March 2024).
  2. Available online: https://eiec.kdi.re.kr/publish/naraView.do?fcode=00002000040000100009&cidx=14502&sel_year=2023&sel_month=10 (accessed on 15 March 2024).
  3. Available online: https://edition.cnn.com/2023/01/17/asia/korean-language-learning-rise-hallyu-intl-hnk-dst/index.html (accessed on 15 March 2024).
  4. Available online: https://www.ksif.or.kr/com/cmm/EgovContentView.do?menuNo=10101100 (accessed on 23 February 2024).
  5. Lee, J.H. A study on foreign learner’s learning experience in Korean using YouTube. JHSS 2020, 11, 285–300. [Google Scholar]
  6. Kim, H.-J. A case study on a Korean speaking class using SNS, The Korean Association of Speech. Communication 2016, 34, 139–172. [Google Scholar]
  7. Choi, S.-J. A study on Korean education using Instagram as a mobile-assisted language learning tool: The case of beginning Korean class and learners’ perception in American College. J. Lang. Cult. 2021, 17, 383–415. [Google Scholar]
  8. Available online: https://www.boannews.com/media/view.asp?idx=101117 (accessed on 15 March 2024).
  9. Available online: https://www.boannews.com/media/view.asp?idx=119138 (accessed on 15 March 2024).
  10. Available online: https://www.mois.go.kr/frt/bbs/type010/commonSelectBoardArticle.do?bbsId=BBSMSTR_000000000008&nttId=100278 (accessed on 23 February 2024).
  11. Available online: https://www.boannews.com/media/view.asp?idx=99333 (accessed on 15 March 2024).
  12. Available online: https://www.bleepingcomputer.com/news/security/scraped-data-of-26-million-duolingo-users-released-on-hacking-forum/ (accessed on 6 April 2024).
  13. Kim, B.P. Legal challenges in large-scale language models. KAFIL 2022, 26, 173–217. [Google Scholar]
  14. Choi, D.; Kim, S.H.; Cho, J.-M.; Jin, S.-H.; Cho, H.S. Personal information exposure on social network service. KIISC 2013, 23, 977–983. [Google Scholar]
  15. Dias, M.; Boné, J.; Ferreira, J.C.; Ribeiro, R.; Maia, R. Named entity recognition for sensitive data discovery in Portuguese. Appl. Sci. 2020, 10, 2303. [Google Scholar] [CrossRef]
  16. Seo, D.-K.; Kim, G.-W.; Kim, J.-Y.; Lee, D.-H. Personal information detection and de-identification system using sentence intent classification and named entity recognition. In Proceedings of the Korea Institute of Information Security & Cryptology Conference, Online, 6–7 November 2020; Volume 27, pp. 1018–1021. [Google Scholar]
  17. Cha, D.H.; Know, B.K.; Youn, H.C.; Hyup Lee, G.; Joo, J.W.J. A study on identifying personal information on conversational text data. In Proceedings of the Korea Institute of Information Security & Cryptology Conference, Seoul, Republic of Korea, 3–5 November 2022; Volume 29, pp. 11–13. [Google Scholar]
  18. Li, J.; Sun, A.; Han, J.; Li, C. A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 2020, 34, 50–70. [Google Scholar] [CrossRef]
  19. Seo, Y.H. A Study on Improvement of Identification Rate of Personal Data Using Machine Learning. Master’s Thesis, Soongsil University Graduate School, Seoul, Republic of Korea, 2019; pp. 279–281. [Google Scholar]
  20. Available online: https://github.com/naver/nlp-challenge (accessed on 6 April 2024).
  21. Kim, H.-D.; Lim, H.-S. A named entity recognition model in the criminal investigation domain using a pretrained language model. J. Korea Converg. Soc. 2022, 13, 13–120. [Google Scholar]
  22. Go, M.-H.; Kim, H.-D.; Lim, H.-Y.; Lee, Y.-L.; Ji, M.-G.; Kim, W.I. A study on named entity recognition for effective dialogue information prediction. Broadcast. Eng. 2019, 24, 58–66. [Google Scholar]
  23. Available online: https://www.privacy.go.kr/ (accessed on 23 February 2024).
  24. Available online: https://www.privacy.go.kr/front/contents/cntntsView.do?contsNo=27 (accessed on 23 February 2024).
  25. Available online: https://www.privacy.go.kr/front/contents/cntntsView.do?contsNo=35 (accessed on 23 February 2024).
  26. Available online: https://www.law.go.kr/LSW/flDownload.do?flSeq=116296825&flNm=%5B%EB%B3%84%ED%91%9C+1%5D+%EA%B0%9C%EC%9D%B8%EC%A0%95%EB%B3%B4+%25E (accessed on 23 February 2024).
  27. Kim, W.-H.; Lee, S.-J.; Lee, J.-H. Improving the accuracy of extracting sentiment in Korean text through the BIO tagging and triplet methods. Int. J. Foreign Stud. 2021, 57, 345–366. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  29. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  30. Clark, K.; Luong, M.-T.; Le, Q.V.; Manning, C.D. ELECTRA: Pretraining Text Encoders as Discriminators Rather Than Generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  31. Son, H.-W.; Han, Y.-M.; Nam, K.-H.; Han, S.-B.; Yo, G.-S. Development of a News Trend Visualization System based on KPF-BERT for Event Changes and Entity Sentiment Analysis. Proc. JKIIT 2024, 22, 203–213. [Google Scholar] [CrossRef]
  32. Available online: https://huggingface.co/snunlp/KR-ELECTRA-generator/blob/main/README.md (accessed on 23 February 2024).
  33. Available online: https://github.com/KPFBERT/kpfbert (accessed on 23 February 2024).
  34. Cho, W.-J.; Shin, G.-P.; Lee, W.-J.; Son, S.-H.; Song, H.-W.; Lee, J.-H.; Lee, H.-J.; Jo, S.-Y. KoELECTRA-Based Named Entity Recognition Using Korean Morphological Analyzers. In Proceedings of the Korean Institute of Information Scientists and Engineers 2021, Jeju, Republic of Korea, 23–25 June 2021; pp. 1897–1899. [Google Scholar]
  35. Woo, H.-S.; Kim, J.-M.; Lee, W.-G. Validation of text data preprocessing using a neural network model. Math. Probl. Eng. 2020, 2020, 1958149. [Google Scholar] [CrossRef]
Figure 1. Example of BIO tagging.
Figure 2. Example of personal information detection in dialog texts.
Table 1. The types of personal information classified by the privacy portal [25].
Classification | Type | Personal Information Items
Identity information | General information | Full name, resident registration number, address, phone number, date of birth, place of birth, gender, etc.
Identity information | Family information | Family relations, family member information, etc.
Physical information | Body information | Face, iris, voice, genetic information, fingerprints, height, weight, etc.
Physical information | Medical and health information | Medical conditions, medical history, physical disabilities, disability ratings, medical history, and physical exam information, such as blood type, IQ, and drug tests.
Mental information | Preferences and disposition | Book and video rental records, magazine subscription information, purchase history, website browsing history, etc.
Mental information | Inner secrets | Ideology, creed, religion, values, political party or union membership, activities, etc.
Social information | Education | Education, grades, attendance, technical certifications and professional licenses, disciplinary records, student records, health records, etc.
Social information | Military service | Military service, number and rank, discharge type, military unit, specialties, etc.
Social information | Labor | Workplace, employer, place of employment, work history, reward and punishment records, job evaluation records, etc.
Social information | Legal information | Criminal records, court records, fines paid, etc.
Property information | Income | Salary, bonuses and commissions, interest income, business income, etc.
Property information | Credit | Loan and security pledge history, credit card numbers, passbook account numbers, credit information, etc.
Property information | Real estate | Owned homes, land, cars, other vehicles, stores, buildings, etc.
Property information | Other revenues | Insurance (health, life, etc.), enrollment status, vacation, sick leave, etc.
Miscellaneous information | Communication | Email addresses, phone calls, log files, cookies, etc.
Miscellaneous information | Location | Location of individuals by GPS and mobile phone.
Miscellaneous information | Habits and hobbies | Smoking, alcohol consumption, preferred sports and entertainment, leisure activities, gambling propensity, etc.
Table 2. The types of personal information according to the Personal Information Protection Commission’s guidelines (Grade 1) [26].
Grade | Type | Personal Information Items
Grade 1 | Unique identification information | Resident registration number, passport number, driver's license number, and alien registration number.
Grade 1 | Sensitive information | Personal information likely to result in a significant invasion of privacy, such as ideas, beliefs, membership in or withdrawal from a trade union or political party, political opinions, health, and sexual life; genetic information, criminal background information, medical history, physical and mental disabilities, sexual orientation, and disabilities (disability or not, disability class).
Grade 1 | Authentication information | Passwords and biometrics (fingerprint, iris, vein, etc.).
Grade 1 | Credit/financial information | Credit card numbers, account numbers, bank names, depository institutions, credit information, payment authorization numbers, loan balances and payment status, mortgages, late and missed payments, and records of wage garnishment notifications.
Grade 1 | Location information | Personal location using GPS or mobile phone.
Table 3. The types of personal information according to the Personal Information Protection Commission’s guidelines (Grade 2) [26].
Grade | Type | Personal Information Items
Grade 2 | Personal identification information | Personal name, personal address, personal phone number, mobile phone number, email address, date of birth, gender, place of birth, domicile, and nationality.
Grade 2 | Body information | Height, bust, weight, and DNA.
Grade 2 | Family information | Family situation, names of family members, resident registration number, date of birth, place of birth, occupation, phone number, mobile phone number, marital status, and hobbies.
Grade 2 | Education and training information | School attendance, final education, grades, technical certifications and professional licenses, completed training programs, extracurricular activities, rewards, and penalties.
Grade 2 | Military service information | Military number and rank, discharge type, specialties, and military unit.
Grade 2 | Real estate information | Owned homes, land, cars, other vehicles, stores, and buildings.
Grade 2 | Income information | Current salary, salary history, bonuses and commissions, other sources of income, interest income, business income, and other income revenues; insurance (health, life, etc.), enrollment status, company overhead, investment programs, retirement programs, vacations, and sick leave.
Grade 2 | Employment information | Current employer, company address, supervisor's name, performance evaluation records, training records, attendance records, punishment records, work attitude, and personality test results.
Grade 2 | Legal information | Criminal records, motor vehicle violation records, bankruptcy and collateral records, arrest records, divorce records, and tax records.
Grade 2 | Medical information | Past medical records, psychiatric records, physical disabilities, body information such as blood type, IQ, and drug tests, and family medical history.
Grade 2 | Organizational information | Union membership, religious affiliation, political party membership, and club membership.
Grade 2 | Habits and hobbies | Smoking, alcohol consumption, preferred sports and entertainment, leisure activities, video rental history, and gambling propensity.
Grade 2 | Personal video information | Personal video information stored on video surveillance equipment (CCTV).
Table 4. The types of personal information according to the Personal Information Protection Commission’s guidelines (Grade 3) [26].
Grade | Type | Personal Information Items
Grade 3 | Telecommunications information | IP information, MAC address, site visit history, phone call history, log files, and cookies.
Grade 3 | Processed information | Statistical information and subscriber tendency.
Grade 3 | Limited personal identification information | Membership information, employee number, and personally identifiable information for internal use.
Table 5. Analysis of targets for personal information collection.
Items | Detection Technique (Regular Expression, NER) | Exposure Risk (Sole Exposure, Combined Exposure) | De-Identification Scope (Pseudonyms/Substitutions, Deletion, Categorization, Masking)
PersonalName MHvv v
GeneralNickname MMvv v
Date of birth LMvvvv
Age vLMvvvv
Anniversaries LLvvvv
Nationality vMLvvvv
BodyGender LLvv v
Height vMLvvvv
Weight vMLvvvv
Blood type MLvv v
HealthMedical insurance numberv
(Many false detections)
HHvv v
Medical history vMMvvvv
Unique identification numberResident registration numbervHHvv v
Alien numbervHHvv v
Passport numberv
(Many false detections)
HHvv v
Driver’s license numbervHHvv v
General identification informationMobile phone numbervHHvv v
General phone/FAX numbervMMvv v
Card numbervHHvv v
Account numberv
(Many false detections)
HHvv v
Email addressvvHHvv v
License plate number HHvv v
WorkplaceWorkplace MMvv v
Department MMvv v
Job title/position vMMvvvv
SchoolSchool vMMvv v
Grade vMLvvvv
Major MLvvvv
LocationAddress vHMvvvv
Building name vLMvv v
Address (hometown) HMvvvv
House type LLvv v
v: accurately detectable, △: inaccurately detectable, H: high, M: medium, L: low.
Table 6. Privacy named entity (PNE) tag set.
Division | Named Entity Item | General Named Entity Tag Set | PNE Tag Set
1 | Name | PS_NAME | PS_NAME
2 | Nickname | PS_NAME | PS_NICKNAME
3 | Date of birth | DT_OTHERS | DT_BIRTH
4 | Age | QT_AGE | QT_AGE
5 | Gender | - | CV_SEX
6 | Height | QT_LENGTH | QT_LENGTH
7 | Weight | QT_WEIGHT | QT_WEIGHT
8 | Blood type | - | TM_BLOOD_TYPE
9 | Religion | OGG_RELIGION | OGG_RELIGION
10 | Nationality | LCP_COUNTRY | LCP_COUNTRY
11 | Club/society | OGG_OTHERS | OGG_CLUB
12 | Address | LC | LC_ADDRESS
13 | Place | LC, AF_BUILDING | LC_PLACE
14 | Resident registration number | QT_OTHERS | QT_RESIDENT_NUMBER
15 | Alien number | QT_OTHERS | QT_ALIEN_NUMBER
16 | Passport number | QT_OTHERS | QT_PASSPORT_NUMBER
17 | Driver's license number | QT_OTHERS | QT_DRIVER_NUMBER
18 | Mobile phone number | QT_PHONE | QT_MOBILE
19 | General phone/FAX number | QT_PHONE | QT_PHONE
20 | Card number | QT_OTHERS | QT_CARD_NUMBER
21 | Account number | QT_OTHERS | QT_ACCOUNT_NUMBER
22 | Email address | TMI_EMAIL | TMI_EMAIL
23 | License plate number | QT_OTHERS | QT_PLATE_NUMBER
24 | Workplace | OG | OG_WORKPLACE
25 | Department | - | OG_DEPARTMENT
26 | Job title/position | CV_POSITION | CV_POSITION
27 | School | OGG_EDUCATION | OGG_EDUCATION
28 | Grade | QT_ORDER | QT_GRADE
29 | Major | FD | FD_MAJOR
30 | ID | PS_NAME | PS_ID
31 | URL | TMI_SITE | TMI_SITE
32 | IP information | - | QT_IP
33 | Military unit | - | CV_MILITARY_CAMP
Table 7. Personal information items by dialog topic.
Dialog Topic | Personal Information Items
Personal and relationships | Name, nickname, date of birth, age, gender, religion, nationality, military unit, etc. (almost all personal information items can be included)
Housing and life | Address, place, and license plate number
Shopping and trading | Name, date of birth, mobile phone number, card number, account number, email address, ID, and URL
Public services | Resident registration number, alien number, passport number, driver's license number, general phone/FAX number, and IP information
Leisure and entertainment | Nickname, club/society, and ID
Work and occupation | Workplace, department, and job title/position
Beauty and health | Age, gender, height, weight, and blood type
Learning and career | School, grade, and major
Table 8. Dialog tagging conditions.
Criteria | Conditions
The answer is not correct | If the speaker talks about estimates, not exact numbers, etc., it is excluded from tagging.
The answer is not correct | If the speaker provides an incorrect answer about personal information, it is tagged.
The answer is not correct | If any option (including alternatives and drafts) contains personal information, it is tagged.
This applies to organizations and not individuals | It is tagged if the speaker talks about an organization with the same personal information.
The same personal information is repeated within a set | If a name/nickname, etc., is called or the same information appears repeatedly, it is tagged.
Table 9. Data structure definitions.
Division | Item Name | Type | Description | Remarks
1 | id | string | Document ID | This is created with certain rules and forms the basis of the child IDs (3.1, 3.3.1).
2 | metadata | | | Enter the established metadata information when collecting data.
3 | document | | |
3.1 | id | string | Dialog dataset ID | [document ID].1; [document ID].2.
3.2 | metadata | | | Inclusion of the dialog type is required.
3.2.1 | dialog_type | string | Dialog type | Eight types of dialog topics.
3.3 | sentence | | |
3.3.1 | id | string | Sentence ID | [document ID].1.1; [document ID].1.2.
3.3.2 | form | string | Sentence | Only one sentence is received as input.
3.3.3 | pid | string | Speaker ID | P01, P02, etc.
3.3.4 | NE | | General named entities | Labeled to 15 major categories according to the TTA standard named entity tag set (TTAK.KO-10.0852).
3.3.4.1 | id | integer | Named entity ID | 1, 2, 3, etc.
3.3.4.2 | form | string | Named entity | Hong Gil-Dong, etc.
3.3.4.3 | label | string | Named entity tag | PS, LC, etc.
Table 10. Number of tags for personal information items in training and test sets.
No. | Named Entity Tag | Train | Test | No. | Named Entity Tag | Train | Test
1 | PS_NAME | 1697 | 210 | 18 | QT_MOBILE | 542 | 66
2 | PS_NICKNAME | 1138 | 163 | 19 | QT_PHONE | 550 | 71
3 | DT_BIRTH | 563 | 69 | 20 | QT_CARD_NUMBER | 729 | 77
4 | QT_AGE | 536 | 79 | 21 | QT_ACCOUNT_NUMBER | 719 | 84
5 | CV_SEX | 403 | 61 | 22 | TMI_EMAIL | 542 | 77
6 | QT_LENGTH | 540 | 66 | 23 | QT_PLATE_NUMBER | 398 | 59
7 | QT_WEIGHT | 525 | 78 | 24 | OG_WORKPLACE | 907 | 119
8 | TM_BLOOD_TYPE | 365 | 38 | 25 | OG_DEPARTMENT | 682 | 101
9 | OGG_RELIGION | 375 | 51 | 26 | CV_POSITION | 913 | 136
10 | LCP_COUNTRY | 531 | 80 | 27 | OGG_EDUCATION | 867 | 110
11 | OGG_CLUB | 1037 | 162 | 28 | QT_GRADE | 415 | 58
12 | LC_ADDRESS | 964 | 118 | 29 | FD_MAJOR | 539 | 98
13 | LC_PLACE | 1007 | 124 | 30 | PS_ID | 528 | 76
14 | QT_RESIDENT_NUMBER | 182 | 18 | 31 | TMI_SITE | 393 | 67
15 | QT_ALIEN_NUMBER | 182 | 18 | 32 | QT_IP | 178 | 22
16 | QT_PASSPORT_NUMBER | 177 | 23 | 33 | CV_MILITARY_CAMP | 346 | 60
17 | QT_DRIVER_NUMBER | 180 | 20 | | Total | 19,650 | 2659
Table 11. Experiment results of models.
Model | Unit | Batch Size | Max Seq Length | Precision | Recall | F1-Score
KPF-BERT | Sentence unit | 24 | 128 | 0.916 | 0.934 | 0.925
KPF-BERT | Dialog unit | 24 | 256 | 0.923 | 0.945 | 0.934
KPF-BERT | Dialog unit | 4 | 512 | 0.932 | 0.955 | 0.943
KR-ELECTRA | Sentence unit | 24 | 128 | 0.919 | 0.944 | 0.931
KR-ELECTRA | Dialog unit | 24 | 256 | 0.749 | 0.936 | 0.832
KR-ELECTRA | Dialog unit | 4 | 512 | 0.757 | 0.944 | 0.834
Table 12. Experimental results for each KPF-BERT personal information item (F1-score: 0.943).
Table 12. Experimental results for each KPF-BERT personal information item (F1-score: 0.943).
| No. | Named Entity Tag | Precision | Recall | F1-Score | No. | Named Entity Tag | Precision | Recall | F1-Score |
|-----|------------------|-----------|--------|----------|-----|------------------|-----------|--------|----------|
| 1 | PS_NAME | 0.932 | 0.915 | 0.923 | 18 | QT_MOBILE | 1.000 | 1.000 | 1.000 |
| 2 | PS_NICKNAME | 0.792 | 0.840 | 0.815 | 19 | QT_PHONE | 1.000 | 1.000 | 1.000 |
| 3 | DT_BIRTH | 0.946 | 1.000 | 0.972 | 20 | QT_CARD_NUMBER | 0.987 | 1.000 | 0.994 |
| 4 | QT_AGE | 0.974 | 0.962 | 0.968 | 21 | QT_ACCOUNT_NUMBER | 0.988 | 0.988 | 0.988 |
| 5 | CV_SEX | 0.968 | 0.984 | 0.976 | 22 | TMI_EMAIL | 1.000 | 1.000 | 1.000 |
| 6 | QT_LENGTH | 0.957 | 1.000 | 0.978 | 23 | QT_PLATE_NUMBER | 1.000 | 0.983 | 0.992 |
| 7 | QT_WEIGHT | 0.929 | 1.000 | 0.963 | 24 | OG_WORKPLACE | 0.895 | 0.925 | 0.910 |
| 8 | TM_BLOOD_TYPE | 1.000 | 1.000 | 1.000 | 25 | OG_DEPARTMENT | 0.908 | 0.980 | 0.943 |
| 9 | OGG_RELIGION | 0.962 | 1.000 | 0.981 | 26 | CV_POSITION | 0.868 | 0.881 | 0.874 |
| 10 | LCP_COUNTRY | 0.988 | 0.988 | 0.988 | 27 | OGG_EDUCATION | 0.930 | 0.964 | 0.946 |
| 11 | OGG_CLUB | 0.861 | 0.920 | 0.890 | 28 | QT_GRADE | 0.967 | 1.000 | 0.983 |
| 12 | LC_ADDRESS | 0.942 | 0.966 | 0.954 | 29 | FD_MAJOR | 0.990 | 0.980 | 0.985 |
| 13 | LC_PLACE | 0.824 | 0.871 | 0.847 | 30 | PS_ID | 0.974 | 1.000 | 0.987 |
| 14 | QT_RESIDENT_NUMBER | 1.000 | 0.944 | 0.971 | 31 | TMI_SITE | 1.000 | 1.000 | 1.000 |
| 15 | QT_ALIEN_NUMBER | 0.947 | 1.000 | 0.973 | 32 | QT_IP | 1.000 | 1.000 | 1.000 |
| 16 | QT_PASSPORT_NUMBER | 1.000 | 1.000 | 1.000 | 33 | CV_MILITARY_CAMP | 0.908 | 0.983 | 0.944 |
| 17 | QT_DRIVER_NUMBER | 1.000 | 1.000 | 1.000 | | | | | |
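The per-tag scores in Table 12 follow from true-positive, false-positive, and false-negative counts. The paper does not report the raw confusion counts, but they can be checked against the tag totals in Table 10. For example, QT_RESIDENT_NUMBER has 18 gold mentions in the test set; a precision of 1.000 with a recall of 0.944 is consistent with 17 correct detections, 0 false positives, and 1 miss (an inferred split, not a reported one):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from mention-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# QT_RESIDENT_NUMBER: 18 gold mentions (Table 10); counts inferred.
p, r, f = prf(tp=17, fp=0, fn=1)
print(round(p, 3), round(r, 3), round(f, 3))  # 1.0 0.944 0.971
```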
Table 13. Examples of annotation scope errors.
| Division | Named Entity | Dialog |
|----------|--------------|--------|
| Original | Nickname | <Chilchil-ah:PS_NICKNAME>, did you lose your wallet again? |
| Prediction | Nickname | <Chilchil:PS_NICKNAME>ah, did you lose your wallet again? |
| Original | Job title/position | <Sajang-nim:CV_POSITION>, my child said he ate free food at your restaurant. |
| Prediction | Job title/position | <Sajang:CV_POSITION>nim, my child said he ate free food at your restaurant. |
Table 14. Examples of similar forms of false detection.
| Division | Named Entity | Dialog |
|----------|--------------|--------|
| Original | Name | Are you <Kimnusia:PS_NAME> on 11/17/88? |
| Prediction | Nickname | Are you <Kimnusia:PS_NICKNAME> on 11/17/88? |
| Original | Club/society | <Ding-ging:PS_NICKNAME>ah, did you <Do-that:OGG_CLUB> also apply for this audition? |
| Prediction | Nickname | <Ding-ging:PS_NICKNAME>ah, did you <Do-that:PS_NICKNAME> also apply for this audition? |
| Original | Place | Last year, you even performed at <Seoul Arts Center:LC_PLACE>. |
| Prediction | Club/society | Last year, you even performed at <Seoul Arts Center:OGG_CLUB>. |
| Original | Place | I called you because a payment I did not make was paid at <GladiolasNail:LC_PLACE> five minutes ago. |
| Prediction | Workplace | I called you because a payment I did not make was paid at <GladiolasNail:OG_WORKPLACE> five minutes ago. |
| Original | Workplace | Yes, this is <Chicken Syndrome Seoksan Branch:OG_WORKPLACE>. |
| Prediction | Place | Yes, this is <Chicken Syndrome Seoksan Branch:LC_PLACE>. |
| Original | Card number | No, because I wrote <KB 4906 2560 6232 1691:QT_CARD_NUMBER>. |
| Prediction | Account number | No, because I wrote <KB 4906 2560 6232 1691:QT_ACCOUNT_NUMBER>. |

Share and Cite

MDPI and ACS Style

Jang, S.; Cho, Y.; Seong, H.; Kim, T.; Woo, H. The Development of a Named Entity Recognizer for Detecting Personal Information Using a Korean Pretrained Language Model. Appl. Sci. 2024, 14, 5682. https://doi.org/10.3390/app14135682

