Article
Peer-Review Record

The Development of a Named Entity Recognizer for Detecting Personal Information Using a Korean Pretrained Language Model

Appl. Sci. 2024, 14(13), 5682; https://doi.org/10.3390/app14135682
by Sungsoon Jang 1, Yeseul Cho 1, Hyeonmin Seong 1, Taejong Kim 1 and Hosung Woo 2,*
Reviewer 3: Anonymous
Submission received: 30 April 2024 / Revised: 20 June 2024 / Accepted: 25 June 2024 / Published: 28 June 2024
(This article belongs to the Special Issue Natural Language Processing: Theory, Methods and Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The article is interesting and relevant, but it needs to be revised.

1. I recommend adding specific numerical estimates to the abstract that show the effectiveness of the proposed NER model.

2. Many of the references in the bibliography could be more scientific. The literature review should be expanded, and current scientific publications in this area should be critically analyzed.

3. It is necessary to describe the model mathematically and provide a conceptual model.

Comments on the Quality of English Language

Moderate editing of English language required

Author Response

Response to reviewer 1

 

The article is interesting and relevant, but it needs to be revised.

  1. I recommend adding specific numerical estimates to the abstract that show the effectiveness of the proposed NER model.

Response:

Thank you for your pertinent comments. We have added the relevant performance figures to the abstract.

 

  2. Many of the references in the bibliography could be more scientific. The literature review should be expanded, and current scientific publications in this area should be critically analyzed.

Response:

Thank you for your pertinent comments. This study identified limitations in existing research, including the focus of personal-information detection research on English, low accuracy, ambiguous definitions of personal information, time-consuming procedures, and the need for a model with excellent performance. The paper originally cited 33 studies; more references have been added in response to the reviewer's comment.

 

  3. It is necessary to describe the model mathematically and provide a conceptual model.

Response:

Thank you for your pertinent comments. This study did not develop a new model. As existing research has focused on English, we examined how the extant model could be applied in other languages. In addition to expanding the scope of personal information, we also focused on the process of collecting and preprocessing data and how to apply tagging. Because this study focused on process rather than performance improvement, the mathematical details of the model are explained by referring to the literature.

Reviewer 2 Report

Comments and Suggestions for Authors

1. The paper is not properly prepared, technically or in terms of content:

- The abstract is only 166 words and lacks a part that clearly states the scientific contribution of the paper.

- Section 1 (Introduction) should briefly place the study in a broad context, clearly highlight why it is important, and explain the paper's significance.

- Section 4 (Experimental results) is missing a part or subsection discussing the limitations of the proposed methodology, and a separate subsection comparing the proposed method with already existing methods.

See for example:

Zhou, Guodong & Su, Jian (2005). Machine learning-based named entity recognition via effective integration of various evidences. Natural Language Engineering, 11, 189-206. doi:10.1017/S1351324904003559.

Fenny, Syafariani & Yunanto, Rio (2021). Literature Review: Information Extraction using Named-Entity Recognition with Machine Learning Approach.

Seon, Choong-Nyoung; Ko, Youngjoong; Kim, Jeong-Seok & Seo, Jungyun (2001). Named Entity Recognition using Machine Learning Methods and Pattern-Selection Rules. 229-236.

- The Conclusion section does not clearly state the purpose and significance of the paper.

 

2. The subject matter is not presented in a comprehensive manner:

- The paper lacks a discussion of the limitations of the proposed solution and a comparison with other solutions of similar context.

 

3. The references used in the paper are appropriate, but the paper needs more references; a total of 33 references is not enough for publication in a journal as respected as Applied Sciences.

Author Response

Response to reviewer 2

  1. The paper is not properly prepared, technically or in terms of content:

 - The abstract is only 166 words and lacks a part that clearly states the scientific contribution of the paper.

Response:

Thank you for your pertinent comments. Content has been added to the abstract to reflect the reviewer's comment. “By specifying procedures for the development of Korean-based tag sets, data collection, and preprocessing, we formulated directions on the application of entity recognition research to non-English languages. Such research could significantly benefit artificial intelligence (AI)-based natural language processing globally.”

 

 

- Section 1 (Introduction) should briefly place the study in a broad context, clearly highlight why it is important, and explain the paper's significance.

Response:

Thank you for your pertinent comments. The significance of this study is explained on page 3. “Named entity recognition (NER) studies have been primarily conducted with English data. Owing to the predominance of English in publicly available datasets, working with languages other than English can present significant challenges [15]. Research on NER based on the Korean language is inadequate, although Korean is becoming globally used. In recent research on personal information detection using a Korean-named entity recognizer, attempts were made to use additional information, such as sentence intention or speaker information; however, the study faced limitations owing to insufficient data [16,17]. Therefore, defining a privacy named entity (PNE) tag set and collecting and processing data will positively impact AI-based natural language processing research outside the English-speaking world.”

 

- Section 4 (Experimental results) is missing a part or subsection discussing the limitations of the proposed methodology, and a separate subsection comparing the proposed method with already existing methods.

Response:

Thank you for your pertinent comments. In existing named entity recognition research, it is difficult to find experiments that specifically categorized personal information items. Therefore, in this study, we compared the performance of the BERT and ELECTRA models. We also wanted to help other researchers by specifically analyzing false-positive cases. Comments on the limitations of the study have been added to the conclusion (page 20).

 

See for example:

 

Zhou, Guodong & Su, Jian (2005). Machine learning-based named entity recognition via effective integration of various evidences. Natural Language Engineering, 11, 189-206. doi:10.1017/S1351324904003559.

Fenny, Syafariani & Yunanto, Rio (2021). Literature Review: Information Extraction using Named-Entity Recognition with Machine Learning Approach.

Seon, Choong-Nyoung; Ko, Youngjoong; Kim, Jeong-Seok & Seo, Jungyun (2001). Named Entity Recognition using Machine Learning Methods and Pattern-Selection Rules. 229-236.

 

- The Conclusion section does not clearly state the purpose and significance of the paper.

Response:
Thank you for your insightful comments. The research process and performance results have been added to the abstract (page 1) and conclusion (page 19).

 

  2. The subject matter is not presented in a comprehensive manner:

 

- The paper lacks a discussion of the limitations of the proposed solution and a comparison with other solutions of similar context.

Response:

Thank you for your helpful comments. Research to accurately detect personal information is still insufficient. Similar studies are summarized in Section 2; however, they have limitations in that datasets and personal information items are insufficient.

Therefore, this study specifically addressed the overall Korean-based entity name-tag set generation, data collection, preprocessing, AI learning data conversion, learning, and evaluation.

 

 

  3. The references used in the paper are appropriate, but the paper needs more references; a total of 33 references is not enough for publication in a journal as respected as Applied Sciences.

Response:

This study identified limitations in existing research, including the focus of personal-information detection research on English, low accuracy, ambiguous definitions of personal information, time-consuming procedures, and the need for a model with excellent performance. The paper originally cited 33 studies; more references have been added in response to the reviewer's comment.

Reviewer 3 Report

Comments and Suggestions for Authors

The objective of the research is to design a Korean named entity recognizer to identify personal details in Korean interactive text using a Korean pre-trained language model. The authors have developed a personal information tag set with 33 items, obtained and analyzed dialog data, and formed a JSON dataset for use with AI models such as BERT and ELECTRA. The proposed model exhibited better results than the conventional recognizers in identifying Personal Information (PI), thus providing a significant improvement in the protection of users' privacy in systems that communicate through interactive text.

 The work focuses on the problem of personal information leakage in interactive text data that includes social media and chatbots. Creating NER models for languages that are not English such as Korean is useful. The approach of developing a set of personal information tags and the process of data gathering and training transformer models is well thought out.

 Research Method

 The study has a few areas of weakness that could be addressed to strengthen its contributions:

 1. Despite the fact that the hypothesis that transformer models can enhance the identification of personal information against regular expressions is falsifiable, the authors do not declare their hypothesis or the way they will test it. Stating the hypothesis and the evaluation metrics before conducting the research would have enhanced the organization of the study.

 2. The general procedure of data collection and annotation has been explained, but there are more details that are required. How many annotators were engaged in the task? What do you think was the level of knowledge of the people involved? How were disagreements resolved? Reporting agreement statistics is crucial, as they help estimate the quality of the data. Also, the hyperparameters that were used in training the models should be stated to enhance reproducibility.

 3. The work under comparison is the proposed transformer models with “conventional recognizers” but no elaboration on these is provided. Thus, to show the efficacy of the proposed approach, comparisons should be made with strong baselines models identified in previous work on Korean personal information detection. Even a simple rule-based system as the control could also be useful to emphasize the benefits of the machine learning for this problem.

 4. However, there is no detailed error analysis of the results obtained, despite their effectiveness. Looking at the failure modes of the models could provide a better understanding of what could possibly be done differently. For instance, is there a tendency of the system producing many false positives and/or false negatives? Do some tag types prove to be more difficult than the others? These questions will help the reader determine the gaps in the proposed approach to the problem being solved.

 5. As for the proposed approach, it is Korean-specific; still, to what extent can the methodology be applied to other languages? Expanding on the language-specific difficulties and possibilities may enhance the study’s effectiveness.

 

Reproducibility

 The manuscript provides a decent level of detail in the methods section, but there are a few gaps that could hinder full reproducibility of the results:

 1. Although the authors describe data collection from KakaoTalk, Twitter, and Facebook, they did not elaborate on how the dialog data was obtained. The time of data collection, filtering that was done on the data, and the total number of dialogs collected are not given. Lack of this information would make it hard for any other researcher to follow and create a similar dataset.

 2. Despite the fact that the authors supply the annotation guidelines and tag set (Tables 7-8), the authors do not describe the number of annotators used or any measures of agreement between the annotators. Such details are pertinent in evaluating the quality and the uniformity of the annotations.

 3. The hyperparameters of the BERT and ELECTRA models that were employed for training are not stated in detail. The batch size used was 4 and 24, the maximum sequence length was 128 and 256 and the number of training epochs was 30 but the learning rate, optimizer and other crucial parameters are not stated. Specifying all hyperparameters is important in order to be able to reproduce a particular experiment.

 4. The authors present the precision, recall, and F1 scores of the models but do not define whether these are micro- or macro-averaged scores. It is crucial to define the key evaluation metrics in order to understand the findings of the study.

 5. The manuscript does not indicate whether the code for the models and the experiments will be made public. Sharing the code would greatly help replication of the results.

 On the positive side, the authors have included the BIO tagging scheme in Figure 1 and the rather extensive description of the tag set in Table 10 which would at least make replication possible. The employment of out-of-the-box KPF-BERT and KR-ELECTRA also helps in model reproducibility.

 To improve the reproducibility of the results, I recommend the following additions to the methods section:

 1. The specifics of the data collection process, the time frame, the criteria that were applied to the selection of the dialogs, and the total number of dialogs collected.

 2. The specifics of the annotation process such as the number of annotators and the inter-annotator reliability measures.

 3. All the hyperparameters that were used in training the models.

 4. Elucidation of the evaluation criteria (micro and macro averaged).

 5. A declaration about the availability of codes.

 

 References

 Most of the cited references relate to named entity recognition, personal information extraction, and the Korean language. However, the citations are not always up-to-date, which may be a drawback. Some of the cited works are fairly recent, for instance those that presented the BERT and ELECTRA models (29-31, 2017-2018). Others are somewhat dated, specifically from 2013 and 2016 (references 5-6, 14), in relation to Korean language education and personal information leakage. Although these give good background information, more recent literature on these topics could have been included. This would help indicate how the current research relates to the state of the art and what new elements have been introduced.

  

 Specific Comments:

Table 6: I think the tag set is quite inclusive. It would be useful to elaborate somewhat on the rationale for the 33 selected items.

Section 3.2: This is where the size of the collected dataset has to be stated. In what way was the quality of the data maintained?

Section 4.1: What hyperparameter values were used in the best model? Precision, recall, and F1 should be reported for each tag.

Figure 2 is rather helpful in this aspect. Examples of failure cases that can be used for the error analysis are almost as effective in proving the argument.

 

The ethics statement is acceptable for the study since the data was collected from the public through an app. It should also be noted that a data availability statement should be included even when the data cannot be shared due to factors such as privacy.

Comments on the Quality of English Language

 Minor editing of English language required

Author Response

Response to reviewer 3

The objective of the research is to design a Korean named entity recognizer to identify personal details in Korean interactive text using a Korean pre-trained language model. The authors have developed a personal information tag set with 33 items, obtained and analyzed dialog data, and formed a JSON dataset for use with AI models such as BERT and ELECTRA. The proposed model exhibited better results than the conventional recognizers in identifying Personal Information (PI), thus providing a significant improvement in the protection of users' privacy in systems that communicate through interactive text.

 

 The work focuses on the problem of personal information leakage in interactive text data that includes social media and chatbots. Creating NER models for languages that are not English such as Korean is useful. The approach of developing a set of personal information tags and the process of data gathering and training transformer models is well thought out.

 

Research Method

 

 The study has a few areas of weakness that could be addressed to strengthen its contributions:

 

  1. Despite the fact that the hypothesis that transformer models can enhance the identification of personal information against regular expressions is falsifiable, the authors do not declare their hypothesis or the way they will test it. Stating the hypothesis and the evaluation metrics before conducting the research would have enhanced the organization of the study.

Response:

Thank you for your pertinent comments. Named entity recognition research has generally been conducted based on English, with a tendency to focus on model performance while overlooking aspects such as data collection and preprocessing. In this paper, we specify procedures such as data collection, preprocessing, and conversion of training data to provide directions on how to apply entity recognition research to non-English languages.

 

  2. The general procedure of data collection and annotation has been explained, but there are more details that are required. How many annotators were engaged in the task? What do you think was the level of knowledge of the people involved? How were disagreements resolved? Reporting agreement statistics is crucial, as they help estimate the quality of the data. Also, the hyperparameters that were used in training the models should be stated to enhance reproducibility.

Response:

We agree that the description of the research procedures needs to be more detailed. Accordingly, we have added the following information on page 12.

“The annotation guidelines for PNE were defined based on the existing entity name annotation guidelines in discussion with language-related university professors and researchers who discussed the definition of PNE tag set. All 15 annotators have majored in linguistics; they completed all prior training on the guidelines and then worked on the annotations. Where the annotation definition was not clear, detailed inspection rules were established based on consensus, and the annotation guidelines were revised. Data annotations were inspected according to relevant criteria; five inspectors inspected data annotations, and uniformity of annotations was secured based on the feedback from the inspector. Quality was improved by deploying five management and support personnel, as well as annotators and inspectors.”
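For context, the agreement statistics the reviewer asks about are commonly reported as Cohen's kappa between annotator pairs; a minimal pure-Python sketch, using invented label sequences in the PNE tag style rather than the study's data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same token sequence."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["B-PS_NAME", "O", "O", "B-QT_PHONE", "O"]
b = ["B-PS_NAME", "O", "B-CV_SEX", "B-QT_PHONE", "O"]
print(round(cohens_kappa(a, b), 3))  # → 0.706
```

Kappa corrects raw percent agreement for agreement expected by chance, which matters when one label (here "O") dominates the data.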

 

  3. The work under comparison is the proposed transformer models with "conventional recognizers" but no elaboration on these is provided. Thus, to show the efficacy of the proposed approach, comparisons should be made with strong baseline models identified in previous work on Korean personal information detection. Even a simple rule-based system as the control could also be useful to emphasize the benefits of machine learning for this problem.

Response:

The existing named entity recognizers do not include items such as CV_SEX, TM_BLOOD_TYPE, OG_DEPARTMENT, QT_IP, and CV_MILITARY_CAMP; these form a newly created tag set. QT_REASTER_NUMBER, QT_ALIEN_NUMBER, QT_PASSPORT_NUMBER, QT_DRIVER_NUMBER, QT_CARD_NUMBER, QT_ACCER_NUMBER, and so on, are treated as QT_OTHES.

Please refer to the tag set in the table on page 10. Table 5 on page 8 compares regular expression detection and NER detection.
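As a rough illustration of why a regex-versus-NER comparison of this kind matters (the pattern and sentences below are our own invented examples, not from the paper): a regular expression can catch fixed-format items such as phone numbers, but has no surface pattern to match context-dependent items such as blood type.

```python
import re

# A regex handles fixed-format PII such as Korean mobile numbers.
PHONE = re.compile(r"01[016789]-\d{3,4}-\d{4}")

sents = [
    "연락처는 010-1234-5678 입니다.",  # "My contact is 010-1234-5678."
    "혈액형은 A형이에요.",            # "My blood type is A." -- no fixed surface form
]
hits = [PHONE.findall(s) for s in sents]
print(hits)  # → [['010-1234-5678'], []]: the second sentence needs NER, not a pattern
```

The empty match on the second sentence is the gap a trained recognizer is meant to close.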

 

  4. However, there is no detailed error analysis of the results obtained, despite their effectiveness. Looking at the failure modes of the models could provide a better understanding of what could possibly be done differently. For instance, is there a tendency of the system producing many false positives and/or false negatives? Do some tag types prove to be more difficult than the others? These questions will help the reader determine the gaps in the proposed approach to the problem being solved.

Response:

A detailed error analysis of the results has been provided in Section 4.2.2 and Table 14.

 

  5. As for the proposed approach, it is Korean-specific; still, to what extent can the methodology be applied to other languages? Expanding on the language-specific difficulties and possibilities may enhance the study's effectiveness.

Response:

Existing named entity recognition research tends to focus on performance based on English. This study focused on use in non-English languages; accordingly, it also explains how the data were collected, preprocessed, and tagged. We focused on the process rather than the research results. Methods for applying it to other languages are not directly presented; however, the process of processing and applying data is not much different in other languages. We anticipate that this study can also serve as a reference for personal-information detection research in non-English languages.

 

 

 

Reproducibility

 

 The manuscript provides a decent level of detail in the methods section, but there are a few gaps that could hinder full reproducibility of the results:

 

  1. Although the authors describe data collection from KakaoTalk, Twitter, and Facebook, they did not elaborate on how the dialog data was obtained. The time of data collection, filtering that was done on the data, and the total number of dialogs collected are not given. Lack of this information would make it hard for any other researcher to follow and create a similar dataset.

Response:

Data collection from KakaoTalk, Twitter, and Facebook was for preliminary investigation. Therefore, we have not provided the details as we considered them inappropriate for use. We compiled the data suitable for the experiment; the details of the data construction have been provided in Section 3.2. We have defined aspects such as the dialog topic, dialog turn, speakers, and tagging criteria. The total collected quantity is presented in Table 10.

 

  2. Despite the fact that the authors supply the annotation guidelines and tag set (Tables 7-8), the authors do not describe the number of annotators used or any measures of agreement between the annotators. Such details are pertinent in evaluating the quality and the uniformity of the annotations.

Response:

We have added the relevant information in Section 3.2.

 

  3. The hyperparameters of the BERT and ELECTRA models that were employed for training are not stated in detail. The batch size used was 4 and 24, the maximum sequence length was 128 and 256, and the number of training epochs was 30, but the learning rate, optimizer, and other crucial parameters are not stated. Specifying all hyperparameters is important to be able to reproduce a particular experiment.

Response:

The BERT and ELECTRA models employed for training were KPF-BERT and KR-ELECTRA; the learning rate was 5e-5, and the optimizer was AdamW. This information has been added to Section 3.4.2.
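For convenience, the training settings reported across the paper and this response could be gathered into a single configuration fragment; a sketch (the field names are our own, not the authors'):

```python
# Fine-tuning settings as reported; two (batch size, sequence length)
# configurations were used across the KPF-BERT and KR-ELECTRA runs.
config = {
    "pretrained_models": ["KPF-BERT", "KR-ELECTRA"],
    "batch_sizes": [4, 24],
    "max_seq_lengths": [128, 256],
    "epochs": 30,
    "learning_rate": 5e-5,
    "optimizer": "AdamW",
    "tagging_scheme": "BIO",
}
print(config["learning_rate"], config["optimizer"])
```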

 

  4. The authors present the precision, recall, and F1 scores of the models but do not define whether these are micro- or macro-averaged scores. It is crucial to define the key evaluation metrics in order to understand the findings of the study.

Response:

Section 4 presents the relevant information. With regard to NER, the F1-score is commonly used as the evaluation metric. Considering the significant differences in the quantities of 33 personal information items, a micro-average was deemed advantageous in calculating a balanced average.
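The difference between the two averages can be seen with two hypothetical tags of very different frequency (the counts below are invented for illustration, not the study's results):

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical per-tag (tp, fp, fn): one frequent tag, one rare tag.
counts = {"PS_NAME": (900, 50, 50), "QT_IP": (5, 5, 10)}

# Macro: mean of per-tag F1 -- the rare tag weighs as much as the frequent one.
macro = sum(f1(*c) for c in counts.values()) / len(counts)
# Micro: pool the counts first -- dominated by the frequent tag.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = f1(tp, fp, fn)
print(round(macro, 3), round(micro, 3))  # → 0.674 0.940
```

With 33 tags of very unequal frequency, the micro average reflects overall token-level behavior, which is the balance the response describes.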

 

  5. The manuscript does not indicate whether the code for the models and the experiments will be made public. Sharing the code would greatly help replication of the results.

Response:

This study is being considered for productization; therefore, it would be difficult to release the code. However, the authors have included the BIO tagging scheme in Figure 1 and an extensive description of the tag set in Table 10, which would make replication possible. The use of out-of-the-box KPF-BERT and KR-ELECTRA also helps in model reproducibility.
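For readers unfamiliar with it, the BIO scheme referenced here can be sketched in a few lines (the tokens, spans, and tag names below are invented examples in the PNE tag-set style, not taken from the paper):

```python
# BIO scheme: B-<TAG> opens an entity, I-<TAG> continues it, O marks the rest.
tokens = ["김", "철수", "의", "전화번호", "는", "010-1234-5678", "입니다"]
spans = [(0, 2, "PS_NAME"), (5, 6, "QT_PHONE")]  # (start, end, tag), end exclusive

tags = ["O"] * len(tokens)
for start, end, tag in spans:
    tags[start] = f"B-{tag}"
    for i in range(start + 1, end):
        tags[i] = f"I-{tag}"
print(list(zip(tokens, tags)))
```

Converting annotated spans to one label per token is what turns the inspected dialog data into training examples for a token-classification model.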

 

 To improve the reproducibility of the results, I recommend the following additions to the methods section:

 

  1. The specifics of the data collection process, the time frame, the criteria that were applied to the selection of the dialogs, and the total number of dialogs collected.

Response:

These details have been provided in Section 3.2.

 

  2. The specifics of the annotation process such as the number of annotators and the inter-annotator reliability measures.

Response:

We have added these details in Section 3.2.

 

  3. All the hyperparameters that were used in training the models.

Response:

These details have been added in Section 3.4.2.

 

  4. Elucidation of the evaluation criteria (micro and macro averaged).

Response:

This aspect is mentioned in Section 4. In NER, the F1-score is commonly used as the evaluation metric. Considering the significant differences in the quantities of 33 personal information items, a micro-average was deemed advantageous in calculating a balanced average.

 

  5. A declaration about the availability of codes.

Response:

This study is being considered for productization; therefore, releasing the code would be difficult.

 

 

 

 References

 

 Most of the cited references relate to named entity recognition, personal information extraction, and the Korean language. However, the citations are not always up-to-date, which may be a drawback. Some of the cited works are fairly recent, for instance those that presented the BERT and ELECTRA models (29-31, 2017-2018). Others are somewhat dated, specifically from 2013 and 2016 (references 5-6, 14), in relation to Korean language education and personal information leakage. Although these give good background information, more recent literature on these topics could have been included. This would help indicate how the current research relates to the state of the art and what new elements have been introduced.

Response:

Research on detecting personal information in non-English languages, including Korean, is scarce. Studies that explain methods for detecting personal information in non-English languages would have a positive impact both academically and socially. The references cited in this study may not appear recent; however, there has been a paucity of studies on personal-information detection in Korean and related content. Nonetheless, we have considered the reviewer's opinion and strived to add the latest references.

 

Specific Comments:

 

Table 6: I think the tag set is quite inclusive. It would be useful to elaborate somewhat on the rationale for the 33 selected items.

Response:

A relevant description has been included in Section 3.1. The 39 items, comprising the types of personal information defined by personal information portals and committees, are listed in Table 5. Following discussions involving ten experts in personal information detection and de-identification solutions, university professors, researchers, and AI and system integration development specialists, we selected 33 items from the initial 39 items targeted for personal information collection, as listed in Table 6. We selected the items based on the criteria for their collection and use in the interactive text.

 

Section 3.2: This is where the size of the collected dataset has to be stated. In what way was the quality of the data maintained?

Response:

Data collection from KakaoTalk, Twitter, and Facebook was for preliminary investigation; therefore, we did not provide the details, as we considered the data inappropriate for use. We compiled data suitable for the experiment; the details of the data construction have been included in Section 3.2. We have defined aspects such as the dialog topic, dialog turn, speakers, and tagging criteria. The collected quantity has been provided in Table 10.

 

Section 4.1: What hyperparameter values were used in the best model? Precision, recall, and F1 should be reported for each tag.

Response:

The hyperparameters used have been provided in Section 3.4.2. Precision, recall, and F1-score for each tag have been presented in Table 12.

 

Figure 2 is rather helpful in this aspect. Examples of failure cases that can be used for the error analysis are almost as effective in proving the argument.

Response:

Examples of failure cases have been presented in Table 14.

 

 

The ethics statement is acceptable for the study since the data was collected from the public through an app. It should also be noted that a data availability statement should be included even when the data cannot be shared due to factors such as privacy.

Response:

The data were not collected through the app, and all personal information included in the constructed data was randomly generated. The annotators and inspectors were thoroughly managed by management and support personnel while compiling the data.
