Article

The Development of a Named Entity Recognizer for Detecting Personal Information Using a Korean Pretrained Language Model

1 Technology Strategy Research Institute, World Vertex Co., Ltd., Seoul 06748, Republic of Korea
2 Department of Edutech, Graduate School, Korea National Open University, Seoul 03087, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5682; https://doi.org/10.3390/app14135682
Submission received: 30 April 2024 / Revised: 20 June 2024 / Accepted: 25 June 2024 / Published: 28 June 2024
(This article belongs to the Special Issue Natural Language Processing: Theory, Methods and Applications)

Abstract:
Social network services and chatbots are susceptible to personal information leakage while facilitating language learning without time or space constraints. Accurate detection of personal information is paramount in avoiding such leaks. Conventional named entity recognizers commonly used for this purpose often fail owing to errors of unrecognition and misrecognition. Research in named entity recognition predominantly focuses on English, which poses challenges for non-English languages. By specifying procedures for the development of Korean-based tag sets, data collection, and preprocessing, we formulated directions for applying entity recognition research to non-English languages. Such research could significantly benefit artificial intelligence (AI)-based natural language processing globally. We developed a personal information tag set comprising 33 items and established guidelines for dataset creation, later converting the data into JSON format for AI learning. State-of-the-art AI models, BERT and ELECTRA, were employed to implement and evaluate the named entity recognition (NER) model, which achieved a 0.943 F1-score and outperformed conventional recognizers in detecting personal information. This advancement suggests that the proposed NER model can effectively prevent personal information leakage in systems processing interactive text data, marking a significant stride in safeguarding privacy across digital platforms.

1. Introduction

The Korean wave [1], which started with the popularity of Korean dramas and movies, has spread globally through K-Pop and expanded into various other K-contents. Owing to increasing globalization, the overseas expansion of Korean companies, policies aimed at increasing the number of international students, multicultural families resulting from international marriages, and immigrant workers, interest in the Korean language and the number of Korean language learners have been steadily increasing [2].
Based on data from the global language learning app Duolingo, the US news channel Cable News Network (CNN) reported a 38.3 percent increase in Korean language learners in US higher-education institutions between 2016 and 2021 (from 10,936 to 19,270 students) [3]. The number of UK higher-education students taking Korean courses more than tripled from 2012 to 2018. King Sejong Institute, in charge of overseas Korean language education and promoting Korean culture, operated 244 branches in 84 countries worldwide in 2022. Its number of students increased nearly 160-fold from 740 in 2007 to 117,636 in 2022 [4].
Social network services (SNSs), chatbots, and similar services can be convenient tools for educating and learning various languages without time and space constraints. YouTube, the world’s largest video platform, can also be used for language learning. Korean language learners use YouTube to acquire common conversation skills and pronunciations; this highlights the importance of using SNSs for teacher–student communication [5]. The Korean language department at a Chinese university conducted a Korean-speaking class using WeChat. Immediate information retrieval and sharing helped students understand learning materials and express their opinions, increasing their interest in learning [6]. A US university conducted vocabulary and grammar practice activities using Instagram for students taking beginner Korean classes. They provided feedback to the students on their posts upon completion. This increased the interest of students in the Korean language and encouraged their participation in learning [7]. The conversational Korean language education system [8] that uses chatbots involves engaging students in dialogs on various situational topics. It also enables precise recognition of speech and detailed evaluation of pronunciation and stress patterns, facilitating effective independent learning.
Chatbots and SNS-based education have numerous advantages; however, because they are universal and scalable and target many learners, they are vulnerable to personal information leakage. In 2021, a personal information leakage incident occurred in South Korea involving iLUDA, an artificial intelligence (AI) chatbot [9]. The iLUDA company used KakaoTalk dialogs collected from its app to develop and operate iLUDA. Data, including non-pseudonymized personal information, such as names, phone numbers, and addresses, were collected without explicit consent. The iLUDA service was terminated, and the company was fined owing to personal information leakage and sexual harassment issues. This was the first case in South Korea where the indiscriminate use of AI to handle personal information was sanctioned. Afterwards, the Ministry of Interior and Safety distributed a guide titled “ChatGPT Usage Guidelines and Precautions” [10] to help the public sector effectively use ChatGPT. It provides guidance on personal information protection, leakage of important information, and safety measures when using ChatGPT.
Meanwhile, HappyTalk, a messenger-based chat counseling solution, revealed that in July 2021, its server was forcibly breached by intruders, resulting in customer information leakage [11]. According to the developer, the server was breached through messages received via chat inquiries, and customer counseling information was accessed. A subsequent internal investigation revealed that customer personal information, such as names and phone numbers, had been leaked.
According to BleepingComputer, a hacking forum leaked the personal information of 2.6 million users of Duolingo, the world’s largest language learning site [12]. The data were scraped using a publicly available application programming interface (API) and offered for sale on the hacking forum. The leaked data contained public information, such as login IDs and real names, and private information, such as emails and completed courses. Furthermore, in 2021, Facebook experienced a data breach through a friend request API bug, exposing the phone numbers of over 500 million users. Similarly, Twitter also experienced personal information leakage, including the email addresses of millions of users.
These examples highlight the risks of using data generated through language-based services, such as SNSs and chatbots, for data analysis and AI tasks. These issues can be broadly categorized into two main groups. The first involves data collection and acquisition: data are primarily collected by scraping publicly available text on the internet, such as social networking sites, which may result in privacy breaches. The second pertains to using AI models trained on the collected data: there is a risk of personal information being regenerated from the inference results of AI models, and this risk increases significantly with the amount of data [13].
Accurately detecting personal information in data is paramount in addressing these issues, as are the crucial steps of de-identifying or pseudonymizing these data. Previous research has extensively used named entity recognizers for this task. However, these conventional methods are prone to errors of unrecognition and misrecognition, leading to incomplete or inaccurate detection of personal information [14]. Challenges arise when detecting unique identifiers, such as resident registration and passport numbers. Furthermore, certain attributes, including gender and blood type, remain undetectable. These limitations underscore the need for advanced personal information processing solutions.
Named entity recognition (NER) studies have primarily been conducted with data in English. Owing to the predominance of English in publicly available datasets, applying NER to other languages can present significant challenges [15]. NER research based on the Korean language is inadequate, although Korean is increasingly used globally. In recent research on personal information detection using a Korean named entity recognizer, attempts were made to use additional information, such as sentence intention or speaker information; however, those studies faced limitations owing to insufficient data [16,17]. Therefore, defining a privacy named entity (PNE) tag set and collecting and processing data will positively impact AI-based natural language processing research outside the English-speaking world.
We selected 33 personal information items to identify personal information in interactive texts and built a set of personal information tags. We then collected data using data-building guidelines and converted them into JSON format suitable for AI learning. We selected bidirectional encoder representations from transformers (BERT) and efficiently learning an encoder that classifies token replacements accurately (ELECTRA) as the pretrained language models. The implemented models outperformed general-purpose named entity recognizers in personal information detection.
The remainder of this paper is organized as follows. Section 2 explores research related to NER and personal information. Section 3 presents the establishment of a PNE tag set, interactive data collection and processing, the structuring and building of datasets, the construction of learning models, and the preprocessing of training data. In Section 4, a false detection case analysis, conducted to determine areas of improvement, is discussed, and the performance of the model is verified. Section 5 summarizes the research findings and proposes future research directions.

2. Related Works

2.1. Named Entity Recognition

Named entities are proper nouns, such as personal, place, and organization names. NER is a fundamental task in natural language processing, which automatically identifies and classifies predefined specific language expressions in an unstructured natural language text, primarily proper nouns. NER can be used for various purposes, such as Q&A, chatbots, reservation systems, and customer service interactions.
Various attempts have been made to address NER tasks (e.g., Q&A and chatbots). The three main approaches used before the advent of deep learning are rule-based, unsupervised, and feature-based supervised learning. Rule-based NER uses domain-specific dictionaries or syntax–lexicon patterns and achieves high accuracy in specific domains; however, it suffers from low recall and transfers poorly to other systems. Unsupervised learning mainly achieves NER through clustering, extracting named entities from groups based on contextual similarity. The main concept is that mentions of named entities can be identified through vocabulary resources, patterns calculated from large corpora, and statistics. Feature-based supervised learning casts NER as a multiclass classification or sequence labeling task [18].
Other studies have used deep learning models, such as convolutional neural networks and recurrent neural networks (RNNs), including long short-term memory. In South Korea, in 2018, a model that combined a bidirectional RNN and conditional random field with ensemble techniques showed significant performance improvement in the Naver NLP Challenge [19]. Transformer-based pretrained language models are a new paradigm to address the NER challenge and have significantly improved NER task performance [18].
Traditional personal information detection systems based on predefined patterns and morphological analysis suffer from low detection rates and high labor and time costs. To improve the detection rate through iterative experiments, a machine learning method was proposed to learn the patterns and types of personal information from a large number of electronic documents [20].
In atypical text data, personal information cannot be identified using regular expressions or existing NERs. Therefore, Dias et al. [15] proposed a method to train the BERT model using speaker information and to label two tags for a single phrase. The intention was to tag speaker A as NAME A (NMA), speaker B as NAME B (NMB), and the name of a person who did not participate in the dialog as NAME OTHERS (NMO) [17]. However, the scope of that study was limited to testing names among personal information items. Furthermore, the quality and size of the dataset were insufficient for personal information detection.
Seo et al. [16] proposed a model for detecting personal information. This model uses the intended information of a sentence as additional information in the named entity learning process and a de-identification technique that considers the utility of personal information. The model uses Korean pretrained language models to simultaneously classify the intent of sentences and detect personal information in atypical text data. However, the personal information tag set used for identification consisted of only seven tags, making it difficult for the model to detect various types of personal information.
Kim and Lim [21] used deep learning techniques to develop an NER model that focuses on criminal investigation. They collected texts from the criminal investigation domain and redefined the classification of named entities required for crime analysis. They conducted an experiment where they categorized the domain of criminal investigation into 9 main categories and 56 subcategories. Their experimental results showed that all categories were identified exceptionally well. Their study aimed to enhance the efficiency of crime prevention and investigation by automatically extracting the level of crime using the NER designed for crime investigation.
Go et al. [22] developed an NER model for efficient dialog information prediction, focusing on household chemicals. They defined a new named entity tag set comprising the manufacturer, name, detailed items, formulation classification, ingredients, and inflow routes of the product. They suggested that using a user dictionary for preprocessing on a specific domain is efficient and recommended reducing the number of filters in character-level convolutional neural networks to reduce model complexity.
Previous research has shown that in specific domains, creating a domain-specific named entity tag set can significantly improve accuracy compared with using a general NER. Transformer-based models, such as BERT and ELECTRA, have recently shown excellent performance and are now widely used in the NER field. AI-based personal information detection technology should be actively used to reduce the human resources and time required to verify false detections based on traditional patterns or regular expressions.

2.2. Personal Information Data

The Korea Internet and Security Agency, a subsidiary of the Personal Information Protection Commission of South Korea, operates a privacy portal [23]. The Personal Information Protection Act defines personal information as information related to a living individual, including the following [24]:
(a)
Personal information identifying an individual, such as name, resident registration number, and images.
(b)
Information that is insufficient on its own to identify a specific individual but can easily be combined with other information to do so.
(c)
Information resulting from pseudonymizing (a) or (b) so that a specific individual cannot be identified without additional information to restore the original state (pseudonymous information).
Therefore, a natural person must be the subject of personal information. Information about the name, address, executive information, and financial performance of a corporation is not personal information; therefore, it is not protected by the Personal Information Protection Act. Personal information includes various personal details, ranging from basic information (name, resident registration number, etc.) to more private information (social and economic status, education, health, property, cultural activities, political inclinations, etc.).
The privacy portal categorizes personal information into various categories, as listed in Table 1. Identity information encompasses general and family information. Physical information is categorized into body and medical information. Mental information includes preferences, disposition, and inner secrets. Social information consists of education, military service, employment, and legal. Property information encompasses income, credit, real estate, and other revenue information. Finally, miscellaneous information includes communication, location, habits, and hobbies.
The Personal Information Protection Commission categorizes personal information items into grades and types according to the guidelines listed in Table 2, Table 3 and Table 4. The categories consist of Grades 1–3. Grade 1 encompasses information regarding unique identification, sensitivity, certification, credit/financial, and location.
Grade 2 comprises personal identification, body, family, education and training, military service, real estate, income, employment, legal, medical, organizational, habits and hobbies, and personal image information.
Grade 3 includes telecommunications, processed, and limited personal identification information.

3. Materials and Methods

In this study, based on Table 1, Table 2, Table 3 and Table 4 and advice from industry professionals on information protection and security, we analyzed 39 items related to personal information to detect personal information that may arise in SNS and AI chatbots. We defined 33 items and constructed a personal information tag set. We developed guidelines for creating data on defined personal information items and collected data accordingly. We then transformed the collected data into JSON format suitable for AI learning based on the data structure definition document and conducted model training. Consequently, we developed a named entity recognizer to detect personal information.
First, we established a privacy-specific named entity tag set. Traditional NER tag sets that contain multiple items within a single tag fail to detect certain personal information items and detect others inaccurately. Consequently, we identified 39 personal information items as candidates for a privacy-specific named entity tag set. We evaluated the detectability of these items using NER and assessed their exposure risk when used alone or in combination to determine the extent of de-identification required. A personal information tag set comprising 33 items, frequently used in interactive texts and posing significant exposure risks, was constructed. This set includes names, nicknames, gender, height, weight, account numbers, card numbers, and sensitive unique identification numbers, such as resident registration, passport, driver’s license, and alien registration numbers.
Second, we collected interactive text data containing personal information. We established guidelines for generating virtual dialog data incorporating personal information and refined them through an inspection process to ensure their appropriateness for natural dialogs.
Third, we created a training dataset using a data structure definition document as a guide. This involved labeling dialog topics, general named entities, PNEs, and speakers, with PNEs labeled according to the specified tag set.
Fourth, using the begin-inside-outside (BIO) tagging method commonly used in NER, we preprocessed data for deep learning training, where ‘B’, ‘I’, and ‘O’ denote the beginning of a named entity, interior, and external parts not included in the named entity, respectively.
Fifth, the related literature helped us select the deep learning model. We selected the transformer-based BERT and ELECTRA models because they demonstrated high performance in natural language processing.
Sixth, we conducted experiments using the KPF-BERT and KR-ELECTRA models for BERT and ELECTRA. We used PyTorch 1.13 as a deep learning framework and implemented the model on a server equipped with four NVIDIA RTX A5000 GPUs (Nvidia, Santa Clara, CA, USA) to train and evaluate the models. We then identified directions for improvement.

3.1. Construction of a Privacy Named Entity Tag Set

For the experiment, we constructed a tag set for PNEs to recognize personal information items that cannot be identified by a general NER. The 39 items, comprising the types of personal information defined by the privacy portal and the Personal Information Protection Commission, are listed in Table 5. We then categorized these items by detection technique, exposure risk, and de-identification scope. We categorized the detection methods into rule- and NER-based detection to investigate the scope of detection, and marked each personal information item as accurately detectable (v) or only partially detectable (△), the latter because multiple items could belong to a single category. Based on consultations with industry professionals in information protection and security, we categorized exposure risk into high (H), medium (M), and low (L) levels depending on whether a single factor or a combination of factors causes the exposure. We categorized the de-identification scope into pseudonymization/substitution, deletion, categorization, and masking methods.
Following discussions involving ten experts in personal information detection and de-identification solutions, university professors, researchers, and AI and system integration development specialists, we selected 33 items from the initial 39 items targeted for personal information collection, as listed in Table 6. We selected the items based on the criteria for collection and use in the interactive text. We added unique identification information, such as resident registration and passport numbers, financial information, credit card and bank account numbers, and personal information, such as workplace, department, and position. Although this information can be detected using regular expressions, it has a high false detection rate.
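To illustrate why rule-based detection of such items is error-prone, the sketch below matches only the surface shape of a resident registration number (six digits for the birth date, a hyphen, then seven digits). The function name and sample strings are our own illustrations, not from the paper; the pattern checks shape alone, which is precisely why structurally similar strings also match.

```python
import re

# Surface shape of a Korean resident registration number:
# six digits (YYMMDD birth date), a hyphen, seven digits.
# The pattern validates shape only, not content validity.
RRN_PATTERN = re.compile(r"\b\d{6}-\d{7}\b")

def find_rrn_candidates(text):
    """Return every substring that merely looks like a resident registration number."""
    return RRN_PATTERN.findall(text)
```

A genuine number and a structurally identical order number are both returned, so rule-based hits still require manual verification — the false detection problem noted above.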
A privacy NER tag set was developed by analyzing the existing NER tags for the 33 selected items. First, we added a new tag set for items without existing NER tags to align with the personal information category. These items included CV_SEX (gender), TM_BLOOD_TYPE (blood type), OG_DEPARTMENT (department), QT_IP (IP information), and CV_MILITARY_CAMP (military unit). Second, we refined the previously unified tag set into more specific categories to align with the personal information section. We divided PS_NAME into PS_NAME (name), PS_NICKNAME (nickname), and PS_ID (ID). The unique identification information, account number, and license plate number that were integrated under QT_OTHERS were subdivided into QT_RESIDENT_NUMBER (resident registration number), QT_ACCOUNT_NUMBER (account number), and QT_PLATE_NUMBER (license plate number), respectively. Third, we specified the tag sets that were previously vague in the personal information category. DT_OTHERS and OGG_OTHERS were specified as DT_BIRTH (date of birth) and OGG_CLUB (club/society). Fourth, we applied the remaining items similarly to the existing NER tag set and included QT_AGE (age), QT_LENGTH (height), TMI_EMAIL (email address), and OGG_EDUCATION (school).
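The refined tags above determine the label inventory of a token classifier: under BIO tagging, each entity tag yields a B- and an I- label, plus a single O label. The sketch below uses only the tags named in this section (a subset of the full 33); the constant and function names are illustrative, not from the paper.

```python
# Subset of the privacy tag set named in this section; the full set has 33 tags.
PNE_TAGS = [
    "PS_NAME", "PS_NICKNAME", "PS_ID", "CV_SEX", "TM_BLOOD_TYPE",
    "OG_DEPARTMENT", "QT_IP", "CV_MILITARY_CAMP", "QT_RESIDENT_NUMBER",
    "QT_ACCOUNT_NUMBER", "QT_PLATE_NUMBER", "DT_BIRTH", "OGG_CLUB",
    "QT_AGE", "QT_LENGTH", "TMI_EMAIL", "OGG_EDUCATION",
]

def build_label_list(tags):
    """Expand entity tags into the BIO label set a token classifier predicts."""
    labels = ["O"]
    for tag in tags:
        labels.extend([f"B-{tag}", f"I-{tag}"])
    return labels
```

With the full 33-tag set, this expansion yields 67 output labels (2 × 33 + 1).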

3.2. Interactive Data Collection and Processing

Considering the widespread use of SNS and AI chatbots, we collected dialog data. We conducted a preliminary independent investigation of various platforms, such as KakaoTalk, Twitter, Facebook, AI Hub, Naver News comments, YouTube comments, and online shopping mall comments. According to the investigation, personal information was rare in open spaces such as Naver News, YouTube, and online shopping mall comments. However, in the clothing category of online shopping malls, we observed some personal information related to body size. On Facebook, specific groups, such as job seekers or affiliated schools, often exposed information related to businesses for job opportunities, contact details of responsible individuals, emails, and associated information. As Twitter allows users to remain anonymous, the names and account numbers in transactional posts are often easily revealed. Although the AI Hub data contained some personal information, they were of limited type and quantity, masked, and therefore not useful.
Leaking personal information from interactive data is a serious problem; however, the amount of personal information in everyday dialogs is minimal. After examining the KakaoTalk data of six individuals (including three senior-level researchers with over ten years of experience, two manager-level researchers with over four years of experience, and one assistant-level researcher with less than four years of experience), we identified less than 20 instances of dialogs revealing personal information, excluding names and nicknames, over six months to one year. Furthermore, we collected data through crowdsourcing because the interactive text contained sensitive personal information. However, the small sample size prevented the collection of the desired amount of data. We collected interactive personal information and processed it as follows.
First, we constructed dialog data containing personal information based on various topics. Interactive data refer to dialogs on various platforms, such as messengers, social media, posts, comments, and call centers. We classified the dialog topics into eight types, as listed in Table 7, and defined the personal information items corresponding to each type. Dialog topics were categorized as “personal and relationships”, “housing and life”, “shopping and trading”, “public services”, “leisure and entertainment”, “work and occupation”, “beauty and health”, and “learning and career”. We categorized the prioritized dialog types into all categories except “personal and relationships” and classified dialogs that did not fall into any other category as “personal and relationships”. We defined the personal information categories for each dialog type to minimize duplication and included many items within each dialog set.
Second, we constructed the dialog data using everyday dialog sentences, with each dialog consisting of a minimum of three turns and an average of four turns. A dialog turn is one exchange between speakers 1 (P01) and 2 (P02). The tagging criteria for a dialog are listed in Table 8. The tagging scope encompassed PNEs, dialog topics, and speaker comments.
We collected personal information according to the data collection guidelines. The annotation guidelines for PNEs were defined based on the existing named entity annotation guidelines, following discussions with language-related university professors and researchers on the definition of the PNE tag set. All 15 annotators had majored in linguistics and completed training on the guidelines before beginning annotation work. Where an annotation definition was unclear, detailed inspection rules were established by consensus, and the annotation guidelines were revised. Data annotations were inspected against the relevant criteria by five inspectors, and uniformity of the annotations was secured based on their feedback. Quality was further improved by deploying five management and support personnel alongside the annotators and inspectors.
The quantity for each item, set based on the advice of a natural language processing expert, required a minimum of 100 data points for the learning process. The total collection target was set at 20,000, with individual quantities determined by the collection and detection difficulty of each item. Unique identification information, such as resident registration and passport numbers, is only occasionally revealed in dialogs and follows a consistent pattern, resulting in fewer instances than other items. The “name” item, which appeared most frequently in dialogs, exceeded its target quantity more than threefold. In total, we collected 22,309 data points.

3.3. Structuring and Building Datasets

To convert the collected data into JSON format, which is suitable for AI learning, we wrote a data structure definition document based on the JSON structure in the NER corpus of the National Institute of Korean Language, as listed in Table 9. We labeled the dialog_type based on Table 7, in which personal information items are listed by the dialog topic. The PNE for detecting personal information was labeled based on the items defined in Table 6, and a PNE tag set was created.
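As a rough illustration, one converted dialog record might look like the sketch below. The actual field names follow the National Institute of Korean Language NER corpus schema in Table 9, which is not reproduced here, so every key (`dialog_type`, `utterances`, `pne`, and so on) is a placeholder, and the account number is invented sample data.

```python
import json

# Placeholder record layout; real field names come from the Table 9 schema.
record = {
    "dialog_type": "shopping and trading",   # one of the eight topics in Table 7
    "utterances": [
        {
            "speaker": "P01",
            "text": "계좌번호는 110-123-456789 입니다.",
            # character offsets into "text", end exclusive
            "pne": [{"begin": 6, "end": 20, "tag": "QT_ACCOUNT_NUMBER"}],
        },
    ],
}

serialized = json.dumps(record, ensure_ascii=False, indent=2)
```

Storing explicit character offsets alongside each tag lets the BIO preprocessing step recover exact entity spans without re-running string matching.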

3.4. Learning Models and Training Data Preprocessing

3.4.1. BIO Tagging

Tokenization divides a given sentence into word fragments, called tokens, for mathematical calculations. Word embedding converts these tokens into vector representations for calculation. All sentences were transformed into operable vector representations using tokenization and word embedding. To incorporate named entity information into these vector representations for learning, we used BIO tagging (defined in Section 3), the most widely used tagging scheme in NER [21,27]. In this study, we annotated the collected data using BIO tagging, as shown in Figure 1.
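Because the pretrained models use WordPiece subword tokenization (Section 3.4.2), word-level BIO labels must be spread over subword pieces: the first piece keeps the word's label, and later pieces of an entity word become the matching I- label. The sketch below is a minimal illustration with a toy tokenizer standing in for the real WordPiece vocabulary; the function names are our own.

```python
def toy_tokenize(word):
    """Stand-in for a WordPiece tokenizer: split long words into two pieces."""
    return [word] if len(word) <= 2 else [word[:2], "##" + word[2:]]

def align_bio_to_subwords(words, word_labels, tokenize=toy_tokenize):
    """Spread word-level BIO labels over subword tokens.
    The first subword keeps the word's label; later subwords of an
    entity word get the matching I- label, and 'O' words stay 'O'."""
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        sub_tokens.extend(pieces)
        sub_labels.append(label)
        cont = "O" if label == "O" else "I-" + label[2:]
        sub_labels.extend([cont] * (len(pieces) - 1))
    return sub_tokens, sub_labels
```

For example, `align_bio_to_subwords(["김민수", "입니다"], ["B-PS_NAME", "O"])` yields the subword labels `["B-PS_NAME", "I-PS_NAME", "O", "O"]`.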

3.4.2. Building a Learning Model

Traditional RNN-based models suffer from prolonged calculation times because they process words sequentially, one at a time. Introduced by Google in 2017, the transformer model addresses this issue by employing attention mechanisms to process entire sentences in parallel, thereby reducing memory and computational demands. This simultaneous processing enhances performance and facilitates efficient training [28]. We used the transformer-based BERT and ELECTRA models for the experiments.
In 2018, Google released BERT, a language model pretrained on a large amount of training data. During pretraining, the model learns using the masked language model and next sentence prediction objectives. It demonstrated high performance with the addition of a single layer, minimal data, and a short fine-tuning time; this is referred to as transfer learning. To enhance the performance of BERT in a specific field, collecting language data from that field and conducting additional training are necessary [29]. ELECTRA is more resource-efficient than BERT and learns efficiently with relatively few resources. It improves training efficiency by using a new pretraining task called replaced token detection, in which some tokens of a sampled sentence are replaced with plausible alternatives and the model learns to distinguish original tokens from replaced ones [30].
In this study, the experiments were conducted on a server equipped with four NVIDIA RTX A5000 GPUs. PyTorch served as the deep learning framework. KPF-BERT and KR-ELECTRA were used as the BERT and ELECTRA models, respectively.
KPF-BERT is a BERT model released by the Korea Press Foundation (KPF), trained on over 40 million articles from 20 years of Bigkinds data [31]. The Computational Linguistics Lab at Seoul National University released the KR-ELECTRA model after training it on 34 GB of Korean text data, including Wikipedia articles, news, legal texts, news comments, and product reviews [32].
We performed tokenization for the KPF-BERT and KR-ELECTRA models using the WordPiece tokenizer, a subword tokenizer. The vocabulary sizes were 36,439 for KPF-BERT and 30,000 for KR-ELECTRA. For the experiments, we used the AdamW optimizer, a learning rate of 5 × 10⁻⁵, 30 training epochs, and batch sizes of 4 and 24. We set the maximum sequence length to 128 tokens when training on sentences and 256 tokens when training on dialogs.
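WordPiece splits out-of-vocabulary words into known subword pieces by greedy longest-match-first search. The following minimal sketch uses a toy vocabulary (not the actual KPF-BERT or KR-ELECTRA vocabularies) to show the idea:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, as in WordPiece.
    Non-initial pieces carry the '##' continuation prefix."""
    subwords, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:  # no prefix of the remainder is in the vocabulary
            return [unk]
        subwords.append(piece)
        start = end
    return subwords

# Toy vocabulary; real vocabularies hold tens of thousands of entries.
vocab = {"token", "##iz", "##ation"}
print(wordpiece_tokenize("tokenization", vocab))  # → ['token', '##iz', '##ation']
```

During BIO annotation, only the first subword of a word typically receives the B-/I- label, with continuation pieces either ignored or given the I- label of the enclosing entity.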

3.4.3. Model Experiment Dataset

We used 4581 dialog sets, divided into training and test datasets of 4022 (19,650 tags) and 559 (2659 tags), respectively. The number of tags for each personal information item in the training and test datasets is listed in Table 10.
We divided the training into two settings to conduct the PNE detection experiments with the KPF-BERT and KR-ELECTRA models: training on each dialog line by line (sentence units) and training on entire dialogs (dialog units).

4. Experimental Results

In NER, the F1-score is commonly used as the evaluation metric. Because the quantities of the 33 personal information items differ significantly, the micro-average was used to compute a balanced score. Figure 2 illustrates an example of PNE detection. For convenience, the OUTPUT displays results as <personal information item: named entity tag>; in the actual detection, they were mapped to BIO tags and the corresponding named entity tags.
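The micro-average pools true positives, false positives, and false negatives across all tags before computing precision and recall, so frequent tags contribute proportionally more than rare ones. A sketch with hypothetical per-tag counts:

```python
def micro_f1(per_tag):
    """Micro-averaged F1 over per-tag (TP, FP, FN) counts: the counts are
    summed across tags first, then precision/recall/F1 are computed once."""
    tp = sum(c[0] for c in per_tag.values())
    fp = sum(c[1] for c in per_tag.values())
    fn = sum(c[2] for c in per_tag.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical (TP, FP, FN) counts for two tags of very different frequency.
counts = {"PS_NAME": (90, 10, 10), "QT_IP": (20, 0, 0)}
print(round(micro_f1(counts), 3))  # → 0.917
```

A macro-average would instead average per-tag F1-scores, letting a rare tag such as QT_IP weigh as much as PS_NAME.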

4.1. Performance Results of the KPF-BERT and KR-ELECTRA Models

The experimental results for each model are summarized in Table 11. With a batch size of 24, the KPF-BERT [33] model trained on dialog-level data achieved the highest F1-score of 0.934. Examined individually, KPF-BERT showed a PNE detection rate 0.9 percentage points higher when trained at the dialog level than at the sentence level, whereas KR-ELECTRA [34] showed a detection rate 9.9 percentage points higher when trained at the sentence level than at the dialog level. KPF-BERT was pretrained on articles longer than 512 subwords; to handle such long inputs, we trained it to process documents in independent chunks by providing a stride, and consequently it performed better in dialog-level training when the maximum sequence length was set to 512. In contrast, because KR-ELECTRA was pretrained with a maximum sequence length of 128, it performs better when trained on shorter sentence units.
Performance therefore depends on choosing a maximum sequence length suitable for each model. KPF-BERT, which can handle many tokens when trained at the dialog level, demonstrated the best performance, suggesting that capturing the context of the dialog helps in detecting PNEs.
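The stride-based handling of long inputs can be sketched as a sliding window over the token sequence. The window and overlap sizes below are illustrative; the exact chunking used in the experiments may differ:

```python
def sliding_windows(token_ids, max_len=512, overlap=128):
    """Split a token sequence into chunks of at most max_len tokens.
    Consecutive chunks share `overlap` tokens so that entities near a
    chunk boundary still see some surrounding context."""
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - overlap
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids) - overlap, step)]

chunks = sliding_windows(list(range(1000)))
print([len(c) for c in chunks])  # → [512, 512, 232]
```

Predictions for tokens that appear in two overlapping chunks must then be reconciled, e.g., by keeping the prediction from the chunk in which the token sits farthest from a boundary.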
After confirming the strong performance of KPF-BERT in dialog-level training, we tuned the hyperparameters to enhance it further. Experimenting with a batch size of 4 and a maximum sequence length of 512 for both models, KPF-BERT achieved an improved F1-score of 0.943, 0.9 percentage points higher than the previous best of 0.934.
The experimental results for each personal information category of KPF-BERT are summarized in Table 12. KPF-BERT achieved its highest overall performance with an F1-score of 0.943. Among the 33 specific items of personal information, numerical data with fixed patterns or restricted forms showed high identification performance, above 90 percent. Items such as passport and driver's license numbers, blood type, mobile phone number, general phone/fax number, email address, URL, and IP address achieved a 100 percent detection rate, with an F1-score of 1.0. Sixteen items, including date of birth, height, and weight, showed a recall of 1.0, reflecting the consistent patterns of these items (e.g., passport number: M123A4567; mobile phone number: 010-1234-5678). However, we observed a few false detections, such as mistaking a general date for a date of birth or identifying measurements unrelated to personal information as heights or weights. Nicknames, clubs/societies, and places take diverse forms and uses, which complicates prediction. Job title/position, although simpler in form than these three items, showed a relatively low detection rate because many cases that do not correspond to job titles/positions, such as occupations, titles, and relationship terms, were included.

4.2. Analysis Results for False Detection Cases

Examining the false detection cases in the personal information category, we found errors caused by missing annotations, inaccuracies in the annotation scope, and false detections of similar forms. Missing annotations are cases where labels were omitted owing to annotator error during data collection. Annotation scope errors mostly occur when vocative particles such as '~ah' and '~ya' or suffixes such as '~nim' and '~ssi' are included in names, nicknames, or job titles/positions. Finally, although the model effectively captures entities whose forms resemble personal information, it often misidentifies them as different named entities; these similar-form cases constituted most of the false detections. Their specific details are presented below.

4.2.1. Annotation Scope Errors

Incorrectly specifying the annotation range during labeling causes the model to learn incorrect entity boundaries; the model may also appear to make false detections simply because the ground-truth labels are wrong. Typically, names, nicknames, and job titles/positions that include vocative particles such as '~ah' and '~ya' or suffixes such as '~nim' and '~ssi' are annotated incorrectly. Some examples are listed in Table 13.

4.2.2. Similar Forms of False Detections

Although the model effectively identified entities with forms similar to personal information, it often misidentified them as other named entities; examples of such cases are listed in Table 14. These cases constitute the majority of false detections, highlighting the importance of establishing clear criteria for identifying each named entity as personal information, ensuring consistent and accurate annotation, and considering the context. The seven types of similar-form errors are as follows:
(1)
‘name’ and ‘nickname’ are interchangeably used to refer to a person. Because foreign or baptismal names resemble nicknames, confusion may occur when using them.
(2)
Nicknames and club/society names are similar in that both are freely coined words without a fixed format, which may cause false detections.
(3)
Like place names, which imply participation in or visits to a location, a club/society name can also indicate membership, leading to misinterpretation.
(4)
Because both club/society and workplace denote an individual's affiliation, the wrong label can be chosen during annotation.
(5)
The term ‘self-employed’ should be detected as a workplace depending on the context; however, it is sometimes misidentified as a place. Differentiating between places and workplaces in each context is important because insufficient data can lead to false detection.
(6)
Although the last digits of a resident registration number and an alien registration number follow different rules depending on the individual, the two are identical in form and usage, leading to cases of mistaken identification. Conclusions must therefore be drawn from the context of the preceding and following dialog.
(7)
Because card and account numbers appear in financial dialogs and are often accompanied by the mention of credit card companies or banks, they can be falsely detected. In this experiment, card numbers were occasionally mistaken for account numbers; however, account numbers were never mistaken for card numbers. Therefore, improving the data reduces the probability of errors.

5. Conclusions

In this study, to identify personal information in dialog-type texts, we analyzed 33 of 39 personal information items and constructed a personal information tag set, as some items could not be accurately detected with the general NER tag set. We established guidelines for the defined personal information categories to facilitate data construction and collected dialog-type text data. Following the data structure specification, we converted the collected data into JSON format, which is suitable for AI learning. We then performed PNE detection experiments using the BERT and ELECTRA models. The best performance, an F1-score of 94.3%, was achieved when KPF-BERT was trained with a batch size of 4 and a maximum sequence length of 512, with the training input extended from sentence units to dialog units [35]. In detecting the 33 specific items of personal information, performance above 90 percent was achieved on numerical data with fixed patterns or restricted forms. Unique identification information, blood type, mobile phone number, email address, URL, and IP address had a 100 percent detection rate, with an F1-score of 1.0. For 16 items, including date of birth, height, and weight, the recall was 1.0, indicating that these items follow consistent patterns. However, items such as nicknames, clubs/societies, places, and job titles/positions had lower detection rates than other categories owing to their diverse forms and usage, which makes prediction challenging. Additional training data for items with low detection rates must therefore be collected, and detection performance improved through further training.
Through this process, we established a foundation to flexibly accommodate changes to the personal information items or the need for additional training information, and to proactively respond to potential personal information leakage. Named entity recognition research has generally been conducted on English, with a tendency to focus on model performance while overlooking aspects such as data collection and preprocessing. In this study, we detailed procedures for data collection, preprocessing, and the conversion of training data to provide directions for applying entity recognition research to non-English languages. The developed NER model differs from previous recognizers in that it directly addresses personal information detection. However, detecting personal information alone does not assess the risk of leakage. Future research should move beyond simple detection and evaluate the leakage risk posed by the detected personal information. Further experiments with generative pretrained transformers and large language models are also needed. We anticipate that research building on our model will help prevent personal information leakage in the many systems that generate large-scale text data.

Author Contributions

Conceptualization, T.K. and S.J.; resources and data curation, Y.C. and H.S.; methodology, S.J., Y.C., and H.S.; validation, S.J. and H.W.; investigation, Y.C. and H.S.; formal analysis, H.S. and Y.C.; visualization, Y.C. and H.S.; writing—original draft, S.J.; writing—review and editing, S.J. and H.W.; project administration, T.K. and S.J.; funding acquisition, T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Personal Information Protection Commission of the Republic of Korea and the Korea Internet & Security Agency (KISA), grant number 1781000017.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to the email address [email protected].

Acknowledgments

This study was supported by the Personal Information Protection Commission of the Republic of Korea and the Korea Internet & Security Agency (KISA); the authors thank both organizations for their technical and financial support.

Conflicts of Interest

Authors Sungsoon Jang, Yeseul Cho, Hyeonmin Seong and Taejong Kim were employed by the company Technology Strategy Research Institute, World Vertex Co., Ltd., Seoul, Republic of Korea. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Available online: https://kofice.or.kr/b20industry/b20_industry_03_view.asp?seq=8009 (accessed on 15 March 2024).
  2. Available online: https://eiec.kdi.re.kr/publish/naraView.do?fcode=00002000040000100009&cidx=14502&sel_year=2023&sel_month=10 (accessed on 15 March 2024).
  3. Available online: https://edition.cnn.com/2023/01/17/asia/korean-language-learning-rise-hallyu-intl-hnk-dst/index.html (accessed on 15 March 2024).
  4. Available online: https://www.ksif.or.kr/com/cmm/EgovContentView.do?menuNo=10101100 (accessed on 23 February 2024).
  5. Lee, J.H. A study on foreign learner’s learning experience in Korean using YouTube. JHSS 2020, 11, 285–300. [Google Scholar]
  6. Kim, H.-J. A case study on a Korean speaking class using SNS, The Korean Association of Speech. Communication 2016, 34, 139–172. [Google Scholar]
  7. Choi, S.-J. A study on Korean education using Instagram as a mobile-assisted language learning tool: The case of beginning Korean class and learners’ perception in American College. J. Lang. Cult. 2021, 17, 383–415. [Google Scholar]
  8. Available online: https://www.boannews.com/media/view.asp?idx=101117 (accessed on 15 March 2024).
  9. Available online: https://www.boannews.com/media/view.asp?idx=119138 (accessed on 15 March 2024).
  10. Available online: https://www.mois.go.kr/frt/bbs/type010/commonSelectBoardArticle.do?bbsId=BBSMSTR_000000000008&nttId=100278 (accessed on 23 February 2024).
  11. Available online: https://www.boannews.com/media/view.asp?idx=99333 (accessed on 15 March 2024).
  12. Available online: https://www.bleepingcomputer.com/news/security/scraped-data-of-26-million-duolingo-users-released-on-hacking-forum/ (accessed on 6 April 2024).
  13. Kim, B.P. Legal challenges in large-scale language models. KAFIL 2022, 26, 173–217. [Google Scholar]
  14. Choi, D.; Kim, S.H.; Cho, J.-M.; Jin, S.-H.; Cho, H.S. Personal information exposure on social network service. KIISC 2013, 23, 977–983. [Google Scholar]
  15. Dias, M.; Boné, J.; Ferreira, J.C.; Ribeiro, R.; Maia, R. Named entity recognition for sensitive data discovery in Portuguese. Appl. Sci. 2020, 10, 2303. [Google Scholar] [CrossRef]
  16. Seo, D.-K.; Kim, G.-W.; Kim, J.-Y.; Lee, D.-H. Personal information detection and de-identification system using sentence intent classification and named entity recognition. In Proceedings of the Korea Institute of Information Security & Cryptology Conference, Online, 6–7 November 2020; Volume 27, pp. 1018–1021. [Google Scholar]
  17. Cha, D.H.; Know, B.K.; Youn, H.C.; Hyup Lee, G.; Joo, J.W.J. A study on identifying personal information on conversational text data. In Proceedings of the Korea Institute of Information Security & Cryptology Conference, Seoul, Republic of Korea, 3–5 November 2022; Volume 29, pp. 11–13. [Google Scholar]
  18. Li, J.; Sun, A.; Han, J.; Li, C. A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 2020, 34, 50–70. [Google Scholar] [CrossRef]
  19. Seo, Y.H. A Study on Improvement of Identification Rate of Personal Data Using Machine Learning. Master’s Thesis, Soongsil University Graduate School, Seoul, Republic of Korea, 2019; pp. 279–281. [Google Scholar]
  20. Available online: https://github.com/naver/nlp-challenge (accessed on 6 April 2024).
  21. Kim, H.-D.; Lim, H.-S. A named entity recognition model in the criminal investigation domain using a pretrained language model. J. Korea Converg. Soc. 2022, 13, 13–120. [Google Scholar]
  22. Go, M.-H.; Kim, H.-D.; Lim, H.-Y.; Lee, Y.-L.; Ji, M.-G.; Kim, W.I. A study on named entity recognition for effective dialogue information prediction. Broadcast. Eng. 2019, 24, 58–66. [Google Scholar]
  23. Available online: https://www.privacy.go.kr/ (accessed on 23 February 2024).
  24. Available online: https://www.privacy.go.kr/front/contents/cntntsView.do?contsNo=27 (accessed on 23 February 2024).
  25. Available online: https://www.privacy.go.kr/front/contents/cntntsView.do?contsNo=35 (accessed on 23 February 2024).
  26. Available online: https://www.law.go.kr/LSW/flDownload.do?flSeq=116296825&flNm=%5B%EB%B3%84%ED%91%9C+1%5D+%EA%B0%9C%EC%9D%B8%EC%A0%95%EB%B3%B4+%25E (accessed on 23 February 2024).
  27. Kim, W.-H.; Lee, S.-J.; Lee, J.-H. Improving the accuracy of extracting sentiment in Korean text through the BIO tagging and triplet methods. Int. J. Foreign Stud. 2021, 57, 345–366. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  29. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  30. Clark, K.; Luong, M.-T.; Le, Q.V.; Manning, C.D. ELECTRA: Pretraining Text Encoders as Discriminators Rather Than Generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  31. Son, H.-W.; Han, Y.-M.; Nam, K.-H.; Han, S.-B.; Yo, G.-S. Development of a News Trend Visualization System based on KPF-BERT for Event Changes and Entity Sentiment Analysis. Proc. JKIIT 2024, 22, 203–213. [Google Scholar] [CrossRef]
  32. Available online: https://huggingface.co/snunlp/KR-ELECTRA-generator/blob/main/README.md (accessed on 23 February 2024).
  33. Available online: https://github.com/KPFBERT/kpfbert (accessed on 23 February 2024).
  34. Cho, W.-J.; Shin, G.-P.; Lee, W.-J.; Son, S.-H.; Song, H.-W.; Lee, J.-H.; Lee, H.-J.; Jo, S.-Y. KoELECTRA-Based Named Entity Recognition Using Korean Morphological Analyzers. In Proceedings of the Korean Institute of Information Scientists and Engineers 2021, Jeju, Republic of Korea, 23–25 June 2021; pp. 1897–1899. [Google Scholar]
  35. Woo, H.-S.; Kim, J.-M.; Lee, W.-G. Validation of text data preprocessing using a neural network model. Math. Probl. Eng. 2020, 2020, 1958149. [Google Scholar] [CrossRef]
Figure 1. Example of BIO tagging.
Figure 2. Example of personal information detection in dialog texts.
Table 1. The types of personal information classified by the privacy portal [25].
Classification | Type | Personal Information Items
Identity information | General information | Full name, resident registration number, address, phone number, date of birth, place of birth, gender, etc.
Identity information | Family information | Family relations, family member information, etc.
Physical information | Body information | Face, iris, voice, genetic information, fingerprints, height, weight, etc.
Physical information | Medical and health information | Medical conditions, medical history, physical disabilities, disability ratings, medical history, and physical exam information, such as blood type, IQ, and drug tests.
Mental information | Preferences and disposition | Book and video rental records, magazine subscription information, purchase history, website browsing history, etc.
Mental information | Inner secrets | Ideology, creed, religion, values, political party or union membership, activities, etc.
Social information | Education | Education, grades, attendance, technical certifications and professional licenses, disciplinary records, student records, health records, etc.
Social information | Military service | Military service, number and rank, discharge type, military unit, specialties, etc.
Social information | Labor | Workplace, employer, place of employment, work history, reward and punishment records, job evaluation records, etc.
Social information | Legal information | Criminal records, court records, fines paid, etc.
Property information | Income | Salary, bonuses and commissions, interest income, business income, etc.
Property information | Credit | Loan and security pledge history, credit card numbers, passbook account numbers, credit information, etc.
Property information | Real estate | Owned homes, land, cars, other vehicles, stores, buildings, etc.
Property information | Other revenues | Insurance (health, life, etc.), enrollment status, vacation, sick leave, etc.
Miscellaneous information | Communication | Email addresses, phone calls, log files, cookies, etc.
Miscellaneous information | Location | Location of individuals by GPS and mobile phone.
Miscellaneous information | Habits and hobbies | Smoking, alcohol consumption, preferred sports and entertainment, leisure activities, gambling propensity, etc.
Table 2. The types of personal information according to the Personal Information Protection Commission’s guidelines (Grade 1) [26].
Grade | Type | Personal Information Items
Grade 1 | Unique identification information | Resident registration number, passport number, driver's license number, and alien registration number.
Grade 1 | Sensitive information | Personal information likely to result in a significant invasion of privacy, such as ideas, beliefs, membership in or withdrawal from a trade union or political party, political opinions, health, and sexual life; genetic information, criminal background information, medical history, physical and mental disabilities, sexual orientation, and disabilities (disability or not, disability class).
Grade 1 | Authentication information | Passwords and biometrics (fingerprint, iris, vein, etc.).
Grade 1 | Credit/financial information | Credit card numbers, account numbers, bank names, depository institutions, credit information, payment authorization numbers, loan balances and payment status, mortgages, late and missed payments, and records of wage garnishment notifications.
Grade 1 | Location information | Personal location using GPS or mobile phone.
Table 3. The types of personal information according to the Personal Information Protection Commission’s guidelines (Grade 2) [26].
Grade | Type | Personal Information Items
Grade 2 | Personal identification information | Personal name, personal address, personal phone number, mobile phone number, email address, date of birth, gender, place of birth, domicile, and nationality.
Grade 2 | Body information | Height, bust, weight, and DNA.
Grade 2 | Family information | Family situation, names of family members, resident registration number, date of birth, place of birth, occupation, phone number, mobile phone number, marital status, and hobbies.
Grade 2 | Education and training information | School attendance, final education, grades, technical certifications and professional licenses, completed training programs, extracurricular activities, rewards, and penalties.
Grade 2 | Military service information | Military number and rank, discharge type, specialties, and military unit.
Grade 2 | Real estate information | Owned homes, land, cars, other vehicles, stores, and buildings.
Grade 2 | Income information | Current salary, salary history, bonuses and commissions, other sources of income, interest income, business income, and other income revenues; insurance (health, life, etc.), enrollment status, company overhead, investment programs, retirement programs, vacations, and sick leave.
Grade 2 | Employment information | Current employer, company address, supervisor's name, performance evaluation records, training records, attendance records, punishment records, work attitude, and personality test results.
Grade 2 | Legal information | Criminal records, motor vehicle violation records, bankruptcy and collateral records, arrest records, divorce records, and tax records.
Grade 2 | Medical information | Past medical records, psychiatric records, physical disabilities, body information such as blood type, IQ, and drug tests, and family medical history.
Grade 2 | Organizational information | Union membership, religious affiliation, political party membership, and club membership.
Grade 2 | Habits and hobbies | Smoking, alcohol consumption, preferred sports and entertainment, leisure activities, video rental history, and gambling propensity.
Grade 2 | Personal video information | Personal video information stored on video surveillance equipment (CCTV).
Table 4. The types of personal information according to the Personal Information Protection Commission’s guidelines (Grade 3) [26].
Grade | Type | Personal Information Items
Grade 3 | Telecommunications information | IP information, MAC address, site visit history, phone call history, log files, and cookies.
Grade 3 | Processed information | Statistical information and subscriber tendency.
Grade 3 | Limited personal identification information | Membership information, employee number, and personally identifiable information for internal use.
Table 5. Analysis of targets for personal information collection.
Items | Detection Technique (Regular Expression, NER) | Exposure Risk (Sole Exposure, Combined Exposure) | De-Identification Scope (Pseudonyms/Substitutions, Deletion, Categorization, Masking)
PersonalName MHvv v
GeneralNickname MMvv v
Date of birth LMvvvv
Age vLMvvvv
Anniversaries LLvvvv
Nationality vMLvvvv
BodyGender LLvv v
Height vMLvvvv
Weight vMLvvvv
Blood type MLvv v
HealthMedical insurance numberv
(Many false detections)
HHvv v
Medical history vMMvvvv
Unique identification numberResident registration numbervHHvv v
Alien numbervHHvv v
Passport numberv
(Many false detections)
HHvv v
Driver’s license numbervHHvv v
General identification informationMobile phone numbervHHvv v
General phone/FAX numbervMMvv v
Card numbervHHvv v
Account numberv
(Many false detections)
HHvv v
Email addressvvHHvv v
License plate number HHvv v
WorkplaceWorkplace MMvv v
Department MMvv v
Job title/position vMMvvvv
SchoolSchool vMMvv v
Grade vMLvvvv
Major MLvvvv
LocationAddress vHMvvvv
Building name vLMvv v
Address (hometown) HMvvvv
House type LLvv v
v: accurately detectable, △: inaccurately detectable, H: high, M: medium, L: low.
Table 6. Privacy named entity (PNE) tag set.
Division | Named Entity Item | General Named Entity Tag Set | PNE Tag Set
1 | Name | PS_NAME | PS_NAME
2 | Nickname | PS_NAME | PS_NICKNAME
3 | Date of birth | DT_OTHERS | DT_BIRTH
4 | Age | QT_AGE | QT_AGE
5 | Gender | - | CV_SEX
6 | Height | QT_LENGTH | QT_LENGTH
7 | Weight | QT_WEIGHT | QT_WEIGHT
8 | Blood type | - | TM_BLOOD_TYPE
9 | Religion | OGG_RELIGION | OGG_RELIGION
10 | Nationality | LCP_COUNTRY | LCP_COUNTRY
11 | Club/society | OGG_OTHERS | OGG_CLUB
12 | Address | LC | LC_ADDRESS
13 | Place | LC, AF_BUILDING | LC_PLACE
14 | Resident registration number | QT_OTHERS | QT_RESIDENT_NUMBER
15 | Alien number | QT_OTHERS | QT_ALIEN_NUMBER
16 | Passport number | QT_OTHERS | QT_PASSPORT_NUMBER
17 | Driver's license number | QT_OTHERS | QT_DRIVER_NUMBER
18 | Mobile phone number | QT_PHONE | QT_MOBILE
19 | General phone/FAX number | QT_PHONE | QT_PHONE
20 | Card number | QT_OTHERS | QT_CARD_NUMBER
21 | Account number | QT_OTHERS | QT_ACCOUNT_NUMBER
22 | Email address | TMI_EMAIL | TMI_EMAIL
23 | License plate number | QT_OTHERS | QT_PLATE_NUMBER
24 | Workplace | OG | OG_WORKPLACE
25 | Department | - | OG_DEPARTMENT
26 | Job title/position | CV_POSITION | CV_POSITION
27 | School | OGG_EDUCATION | OGG_EDUCATION
28 | Grade | QT_ORDER | QT_GRADE
29 | Major | FD | FD_MAJOR
30 | ID | PS_NAME | PS_ID
31 | URL | TMI_SITE | TMI_SITE
32 | IP information | - | QT_IP
33 | Military unit | - | CV_MILITARY_CAMP
Table 7. Personal information items by dialog topic.
Dialog Topic | Personal Information Items
Personal and relationships | Name, nickname, date of birth, age, gender, religion, nationality, military unit, etc. (almost all personal information items can be included)
Housing and life | Address, place, and license plate number
Shopping and trading | Name, date of birth, mobile phone number, card number, account number, email address, ID, and URL
Public services | Resident registration number, alien number, passport number, driver's license number, general phone/FAX number, and IP information
Leisure and entertainment | Nickname, club/society, and ID
Work and occupation | Workplace, department, and job title/position
Beauty and health | Age, gender, height, weight, and blood type
Learning and career | School, grade, and major
Table 8. Dialog tagging conditions.
Criteria | Conditions
The answer is not correct | If the speaker talks about estimates, not exact numbers, etc., it is excluded from tagging.
The answer is not correct | If the speaker provides an incorrect answer about personal information, it is tagged.
The answer is not correct | If any option (including alternatives and drafts) contains personal information, it is tagged.
This applies to organizations and not individuals | It is tagged if the speaker talks about an organization with the same personal information.
The same personal information is repeated within a set | If a name/nickname, etc., is called or the same information appears repeatedly, it is tagged.
Table 9. Data structure definitions.
Division | Item Name | Type | Description | Remarks
1 | id | string | Document ID | This is created with certain rules and forms the basis of the child IDs (3.1, 3.3.1).
2 | metadata | | | Enter the established metadata information when collecting data.
3 | document | | |
3.1 | id | string | Dialog dataset ID | [document ID].1; [document ID].2.
3.2 | metadata | | | Inclusion of the dialog type is required.
3.2.1 | dialog_type | string | Dialog type | Eight types of dialog topics.
3.3 | sentence | | |
3.3.1 | id | string | Sentence ID | [document ID].1.1; [document ID].1.2.
3.3.2 | form | string | Sentence | Only one sentence is received as input.
3.3.3 | pid | string | Speaker ID | P01, P02, etc.
3.3.4 | NE | | General named entities | Labeled to 15 major categories according to the TTA standard named entity tag set (TTAK.KO-10.0852).
3.3.4.1 | id | integer | Named entity ID | 1, 2, 3, etc.
3.3.4.2 | form | string | Named entity | Hong Gil-Dong, etc.
3.3.4.3 | label | string | Named entity tag | PS, LC, etc.
Table 10. Number of tags for personal information items in training and test sets.
No. | Named Entity Tag | Train | Test | No. | Named Entity Tag | Train | Test
1 | PS_NAME | 1697 | 210 | 18 | QT_MOBILE | 542 | 66
2 | PS_NICKNAME | 1138 | 163 | 19 | QT_PHONE | 550 | 71
3 | DT_BIRTH | 563 | 69 | 20 | QT_CARD_NUMBER | 729 | 77
4 | QT_AGE | 536 | 79 | 21 | QT_ACCOUNT_NUMBER | 719 | 84
5 | CV_SEX | 403 | 61 | 22 | TMI_EMAIL | 542 | 77
6 | QT_LENGTH | 540 | 66 | 23 | QT_PLATE_NUMBER | 398 | 59
7 | QT_WEIGHT | 525 | 78 | 24 | OG_WORKPLACE | 907 | 119
8 | TM_BLOOD_TYPE | 365 | 38 | 25 | OG_DEPARTMENT | 682 | 101
9 | OGG_RELIGION | 375 | 51 | 26 | CV_POSITION | 913 | 136
10 | LCP_COUNTRY | 531 | 80 | 27 | OGG_EDUCATION | 867 | 110
11 | OGG_CLUB | 1037 | 162 | 28 | QT_GRADE | 415 | 58
12 | LC_ADDRESS | 964 | 118 | 29 | FD_MAJOR | 539 | 98
13 | LC_PLACE | 1007 | 124 | 30 | PS_ID | 528 | 76
14 | QT_RESIDENT_NUMBER | 182 | 18 | 31 | TMI_SITE | 393 | 67
15 | QT_ALIEN_NUMBER | 182 | 18 | 32 | QT_IP | 178 | 22
16 | QT_PASSPORT_NUMBER | 177 | 23 | 33 | CV_MILITARY_CAMP | 346 | 60
17 | QT_DRIVER_NUMBER | 180 | 20 | | Total | 19,650 | 2659
Table 11. Experiment results of models.
Model | Unit | Batch Size | Max Seq Length | Precision | Recall | F1-Score
KPF-BERT | Sentence unit | 24 | 128 | 0.916 | 0.934 | 0.925
KPF-BERT | Dialog unit | 24 | 256 | 0.923 | 0.945 | 0.934
KPF-BERT | Dialog unit | 4 | 512 | 0.932 | 0.955 | 0.943
KR-ELECTRA | Sentence unit | 24 | 128 | 0.919 | 0.944 | 0.931
KR-ELECTRA | Dialog unit | 24 | 256 | 0.749 | 0.936 | 0.832
KR-ELECTRA | Dialog unit | 4 | 512 | 0.757 | 0.944 | 0.834
Table 12. Experimental results for each KPF-BERT personal information item (F1-score: 0.943).
Table 12. Experimental results for each KPF-BERT personal information item (F1-score: 0.943).
| No. | Named Entity Tag | Precision | Recall | F1-Score | No. | Named Entity Tag | Precision | Recall | F1-Score |
|-----|------------------|-----------|--------|----------|-----|------------------|-----------|--------|----------|
| 1 | PS_NAME | 0.932 | 0.915 | 0.923 | 18 | QT_MOBILE | 1.000 | 1.000 | 1.000 |
| 2 | PS_NICKNAME | 0.792 | 0.840 | 0.815 | 19 | QT_PHONE | 1.000 | 1.000 | 1.000 |
| 3 | DT_BIRTH | 0.946 | 1.000 | 0.972 | 20 | QT_CARD_NUMBER | 0.987 | 1.000 | 0.994 |
| 4 | QT_AGE | 0.974 | 0.962 | 0.968 | 21 | QT_ACCOUNT_NUMBER | 0.988 | 0.988 | 0.988 |
| 5 | CV_SEX | 0.968 | 0.984 | 0.976 | 22 | TMI_EMAIL | 1.000 | 1.000 | 1.000 |
| 6 | QT_LENGTH | 0.957 | 1.000 | 0.978 | 23 | QT_PLATE_NUMBER | 1.000 | 0.983 | 0.992 |
| 7 | QT_WEIGHT | 0.929 | 1.000 | 0.963 | 24 | OG_WORKPLACE | 0.895 | 0.925 | 0.910 |
| 8 | TM_BLOOD_TYPE | 1.000 | 1.000 | 1.000 | 25 | OG_DEPARTMENT | 0.908 | 0.980 | 0.943 |
| 9 | OGG_RELIGION | 0.962 | 1.000 | 0.981 | 26 | CV_POSITION | 0.868 | 0.881 | 0.874 |
| 10 | LCP_COUNTRY | 0.988 | 0.988 | 0.988 | 27 | OGG_EDUCATION | 0.930 | 0.964 | 0.946 |
| 11 | OGG_CLUB | 0.861 | 0.920 | 0.890 | 28 | QT_GRADE | 0.967 | 1.000 | 0.983 |
| 12 | LC_ADDRESS | 0.942 | 0.966 | 0.954 | 29 | FD_MAJOR | 0.990 | 0.980 | 0.985 |
| 13 | LC_PLACE | 0.824 | 0.871 | 0.847 | 30 | PS_ID | 0.974 | 1.000 | 0.987 |
| 14 | QT_RESIDENT_NUMBER | 1.000 | 0.944 | 0.971 | 31 | TMI_SITE | 1.000 | 1.000 | 1.000 |
| 15 | QT_ALIEN_NUMBER | 0.947 | 1.000 | 0.973 | 32 | QT_IP | 1.000 | 1.000 | 1.000 |
| 16 | QT_PASSPORT_NUMBER | 1.000 | 1.000 | 1.000 | 33 | CV_MILITARY_CAMP | 0.908 | 0.983 | 0.944 |
| 17 | QT_DRIVER_NUMBER | 1.000 | 1.000 | 1.000 | | | | | |
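The per-tag scores in Table 12 follow from true-positive, false-positive, and false-negative counts. The paper does not report the raw confusion counts, but they can be checked against the tag totals in Table 10. For example, QT_RESIDENT_NUMBER has 18 gold mentions in the test set; a precision of 1.000 with a recall of 0.944 is consistent with 17 correct detections, 0 false positives, and 1 miss (an inferred split, not a reported one):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from mention-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# QT_RESIDENT_NUMBER: 18 gold mentions (Table 10); counts inferred.
p, r, f = prf(tp=17, fp=0, fn=1)
print(round(p, 3), round(r, 3), round(f, 3))  # 1.0 0.944 0.971
```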
Table 13. Examples of annotation scope errors.
| Division | Named Entity | Dialog |
|----------|--------------|--------|
| Original | Nickname | <Chilchil-ah:PS_NICKNAME>, did you lose your wallet again? |
| Prediction | Nickname | <Chilchil:PS_NICKNAME>ah, did you lose your wallet again? |
| Original | Job title/position | <Sajang-nim:CV_POSITION>, my child said he ate free food at your restaurant. |
| Prediction | Job title/position | <Sajang:CV_POSITION>nim, my child said he ate free food at your restaurant. |
Table 14. Examples of similar forms of false detection.
| Division | Named Entity | Dialog |
|----------|--------------|--------|
| Original | Name | Are you <Kimnusia:PS_NAME> on 11/17/88? |
| Prediction | Nickname | Are you <Kimnusia:PS_NICKNAME> on 11/17/88? |
| Original | Club/society | <Ding-ging:PS_NICKNAME>ah, did you <Do-that:OGG_CLUB> also apply for this audition? |
| Prediction | Nickname | <Ding-ging:PS_NICKNAME>ah, did you <Do-that:PS_NICKNAME> also apply for this audition? |
| Original | Place | Last year, you even performed at <Seoul Arts Center:LC_PLACE>. |
| Prediction | Club/society | Last year, you even performed at <Seoul Arts Center:OGG_CLUB>. |
| Original | Place | I called you because a payment I did not make was paid at <GladiolasNail:LC_PLACE> five minutes ago. |
| Prediction | Workplace | I called you because a payment I did not make was paid at <GladiolasNail:OG_WORKPLACE> five minutes ago. |
| Original | Workplace | Yes, this is <Chicken Syndrome Seoksan Branch:OG_WORKPLACE>. |
| Prediction | Place | Yes, this is <Chicken Syndrome Seoksan Branch:LC_PLACE>. |
| Original | Card number | No, because I wrote <KB 4906 2560 6232 1691:QT_CARD_NUMBER>. |
| Prediction | Account number | No, because I wrote <KB 4906 2560 6232 1691:QT_ACCOUNT_NUMBER>. |

Share and Cite

MDPI and ACS Style

Jang, S.; Cho, Y.; Seong, H.; Kim, T.; Woo, H. The Development of a Named Entity Recognizer for Detecting Personal Information Using a Korean Pretrained Language Model. Appl. Sci. 2024, 14, 5682. https://doi.org/10.3390/app14135682

