Article
Peer-Review Record

ConBERT: A Concatenation of Bidirectional Transformers for Standardization of Operative Reports from Electronic Medical Records

Appl. Sci. 2022, 12(21), 11250; https://doi.org/10.3390/app122111250
by Sangjee Park 1,†, Jun-Woo Bong 2,†, Inseo Park 1, Hwamin Lee 3, Jiyoun Choi 4, Pyoungjae Park 2, Yoon Kim 5,6,7, Hyun-Soo Choi 5,6,7,* and Sanghee Kang 2,*
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 12 September 2022 / Revised: 31 October 2022 / Accepted: 2 November 2022 / Published: 6 November 2022

Round 1

Reviewer 1 Report

The authors present a novel approach to extracting ICD-9 codes from the free text of operative reports using a concatenation of two BERT models.

However, the authors do not provide enough information about the implementation and training of the model: do they fine-tune the BERT embeddings? What is the architecture of the classification layer? Do they use a validation dataset?

The dataset is highly unbalanced (some ICD-9 codes are much more represented), but the authors do not apply any balancing technique (e.g., SMOTE, weight balancing) to counteract this.

Some references are incorrect: 18 and 19 do not refer to the results presented in Table 4.

Author Response

Dear Reviewer 1,

 

We would like to thank you and the reviewers of Applied Sciences for taking the time and effort to review our manuscript. We appreciate the many valuable and constructive points the reviewers raised. After considering the reviewers’ comments, we revised the manuscript and have indicated the corrections and changes with yellow highlighting in the manuscript.

 

Q1: However, the authors do not provide enough information about the implementation and training of the model: do they perform a fine-tuning of the BERT embeddings? which is the architecture of the classification layer? do they use a validation dataset?

A1: We greatly appreciate the Reviewer’s comments. We agree completely and have revised the Model Aggregation and Training Details sections as follows:

In Section 2.3.4, lines 171-174:
 “We add a classification layer and a sigmoid layer to the end of the pre-trained BERT model to fine-tune it for ICD-9 code classification. The classification layer is a fully connected layer whose input size is the sum of the output dimensions of each BERT, and whose output dimension is the number of multi-label classes, 353.”
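For illustration, the aggregation described in this passage can be sketched in PyTorch as follows. This is a minimal sketch rather than the authors' implementation; the encoder objects, the 768-dimensional pooled outputs, and the use of `pooler_output` are assumptions.

```python
import torch
import torch.nn as nn

class ConcatBertClassifier(nn.Module):
    """Sketch: concatenate the pooled outputs of two pre-trained BERT encoders
    and classify into 353 ICD-9 labels (multi-label)."""

    def __init__(self, word_bert, char_bert, hidden_size=768, num_labels=353):
        super().__init__()
        self.word_bert = word_bert   # e.g. a pre-trained word-level BERT (assumed)
        self.char_bert = char_bert   # e.g. a pre-trained Character BERT (assumed)
        # Input size is the sum of the two encoders' output dimensions.
        self.classifier = nn.Linear(2 * hidden_size, num_labels)
        self.sigmoid = nn.Sigmoid()  # one independent probability per ICD-9 code

    def forward(self, word_inputs, char_inputs):
        w = self.word_bert(**word_inputs).pooler_output   # (batch, hidden_size)
        c = self.char_bert(**char_inputs).pooler_output   # (batch, hidden_size)
        features = torch.cat([w, c], dim=-1)               # (batch, 2 * hidden_size)
        return self.sigmoid(self.classifier(features))     # (batch, num_labels)
```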

In Section 2.3.5, lines 185-190:
“The training and validation were conducted using 10-fold cross-validation, with the training and test data split at a ratio of 8:2. In addition, I trained 50 epochs for each fold, changing the training and validation sets. The evaluation metrics were then averaged for each model, and for the final model all data except the test dataset were used as training data, based on the model with the highest performance.”

Q2: The dataset is highly unbalanced (some ICD-9 codes are much more represented), but the authors do not apply any balancing technique (e.g., SMOTE, weight balancing) to counteract this.

A2: Thank you for the Reviewer’s helpful suggestion. We confirmed that entries with completely identical operative reports and ICD-9 codes redundantly account for 46% of the total data, which can bias the model toward such redundant data, so we under-sampled them by treating each group of duplicates as a single sample. We also experimented with per-class weights through data-level approaches, but found a trade-off between macro and micro performance. As the Reviewer suggests, SMOTE and weight balancing are reasonable approaches to handling unbalanced data, but some (SMOTE) are difficult to apply directly to text data and the others were ineffective in our experiments. However, we fully agree that such techniques could improve our model, so we added a description of this as a follow-up study in the Discussion, as below:

In Section 4, lines 299-303:
"Second, our dataset is highly imbalanced. We conducted training with per-class weights through data-level approaches, but found a trade-off in which macro performance increases while micro performance decreases. Further research on data imbalance is needed, using weight balancing techniques or more advanced training losses such as the focal loss [32] or the self-adjusting dice loss [33].”

Q3: Some references are incorrect: 18 and 19 do not refer to the results presented in Table 4.

A3: We greatly appreciate the Reviewer’s comments. We have modified Table 7 as follows:

[18], [19] have been modified to [23], [24].

 

Sincerely,

The Authors.

Reviewer 2 Report

This paper presents an automated process for standardizing operative reports by predicting ICD-9 codes in a multi-label classification context. The paper exploits the potential of concatenating a word-level BERT and a Character BERT in the classification framework. It addresses a realistic clinical problem with an impact on quality control for clinical record and document management.

Although the paper's novelty is rather limited, the work has been conducted well. The method suggested is sound and progresses logically from the existing literature. The data records are selected sensibly and cleansed using sensible solutions. The experiments are well conducted, and the evidence is clearly presented. The advantage of the proposed solution is quite well argued.

There are some minor points that still require the authors to address:

(1) In line 94, please specify how the ICD-9 code and name are provided. Were they labelled by domain experts, integrated from another data source separate from operative reports, or indicated already in the original operative reports? More details must be given as the ICD-9 codes will be used as the ground truth to measure performance metrics later.

(2) In lines 95 - 97, the paper says that data on 45,211 patients and 35,862 patients from the two hospitals are respectively collected. Are you suggesting that there is one operative report for each patient? Is this always the case? Could it be that one patient may have multiple operative reports (particularly when the duration for the data is 11 years)? It may be more accurate to use the word "patient case" or simply "case".

(3) Again on data, in lines 108 - 109, the final number of records, i.e. 45,853, seems to indicate that a large proportion of the original data collection has been removed after cleansing. This is a huge loss of data (43%), which may raise a question regarding the cleansing operations applied. Yet, in lines 264-265, the paper says that approximately 3% of the total data were removed. There seems to be an inconsistency somewhere. Please check and clarify this point.

(4) In the Method section, please add the feature lengths for the pre-trained BERT and that for pre-trained character BERT in the relevant boxes, as well as in the Concatenation box in Figure 2 to illustrate the total feature length for ease of understanding by readers.

(5) Is the Classification Layer in Figure 2 the fully connected layer? If so, are there any changes made? Please elaborate.

(6) On training details, are the hyperparameter settings determined empirically or taken as defaults? If the former, some discussion of the parameter tuning is then needed in the Discussion section.

(7) On the experiment side, is there any reason why k-fold cross-validation has not been used? After all, a single-split evaluation approach always has its limitations, particularly when the testing cases of minority labels are limited. If it is a computing-resource issue, then it should be made clear.

(8) I am surprised as to why the classification speed has not been measured and shown in the experimental results. After all, if the web app is to be used in a real-life setting, the time spent on each case is always of interest.

(9) Should there be discussions on the threshold setting of probabilities in the final outcomes?

There are also some places where the English just needs a bit of finishing touch.

Author Response

Dear Reviewer 2,

 

We would like to thank you and the reviewers of Applied Sciences for taking the time and effort to review our manuscript. We appreciate the many valuable and constructive points the reviewers raised. After considering the reviewers’ comments, we revised the manuscript and have indicated the corrections and changes with yellow highlighting in the manuscript.

 

Q1: In line 94, please specify how the ICD-9 code and name are provided. Were they labeled by domain experts, integrated from another data source separate from operative reports, or indicated already in the original operative reports? More details must be given as the ICD-9 codes will be used as the ground truth to measure performance metrics later.

A1: The Medical Record Information Team at Korea University Guro Hospital performed all the work of labeling and matching the original text information with the standardized data using their own dictionaries. We have added this explanation to the Methods section.

Q2: In lines 95 - 97, the paper says that data on 45,211 patients and 35,862 patients from the two hospitals are respectively collected. Are you suggesting that there is one operative report for each patient? Is this always the case? Could it be that one patient may have multiple operative reports (particularly when the duration for the data is 11 years)? It may be more accurate to use the word "patient case" or simply "case".

A2: We agree with the Reviewer’s suggestion. A single row in the dataset corresponds to one operative report, so two different operations on one patient appear as two different rows in the dataset. We have replaced the term “patients” with “cases”.

Q3: Again on data, in lines 108 - 109, the final number of records, i.e. 45,853, seems to indicate that a large proportion of the original data collection has been removed after cleansing. This is a huge loss of data (43%), which may raise a question regarding the cleansing operations applied. Yet, in lines 264-265, the paper says that approximately 3% of the total data were removed. There seems an inconsistency somewhere. Please check and clarify this point.

A3: Thank you for the Reviewer’s helpful suggestion. The word “remove” in the paper seems to have caused confusion. Data with completely identical operative reports and ICD-9 codes account for 46% of the total data, which can bias the model toward such redundant samples, so we trained the model by treating each group of duplicates as a single sample. For clarification, we have revised the Materials and Methods as follows:

In Section 2.2, lines 110-112:
 “About 46% of the dataset consists of records with exactly the same operative report and ICD-9 code. If these data are used as they are, the model may be biased toward specific values, so duplicate records were treated as a single sample.”
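As an illustration of this de-duplication step, a minimal pandas sketch is shown below; the file and column names are hypothetical, since the actual schema is not given in the paper.

```python
import pandas as pd

# Hypothetical file and column names; the actual schema is not given in the paper.
df = pd.read_csv("operative_reports.csv")

# Rows whose operative report AND ICD-9 codes are exactly identical
# (about 46% of the data) are collapsed into a single sample.
df = df.drop_duplicates(subset=["operative_report", "icd9_codes"])
```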

Q4: In the Method section, please add the feature lengths for the pre-trained BERT and that for the pre-trained character BERT in the relevant boxes, as well as in the Concatenation box in Figure 2 to illustrate the total feature length for ease of understanding by readers.

A4: We greatly appreciate the Reviewer’s comments. We have modified Figure 2.

Q5: Is the Classification Layer in Figure 2 the fully connected layer? If so, are there any changes made? Please elaborate.

A5: We greatly appreciate the Reviewer’s comments. We agree completely and have revised the Model Aggregation section as follows:

In Section 2.3.4, lines 171-174:
 “We add a classification layer and a sigmoid layer to the end of the pre-trained BERT model to fine-tune it for ICD-9 code classification. The classification layer is a fully connected layer whose input size is the sum of the output dimensions of each BERT, and whose output dimension is the number of multi-label classes, 353.”

Q6: On training details, are the hyperparameter settings determined empirically or taken as defaults? If the former, some discussion of the parameter tuning is then needed in the Discussion section.

A6: We appreciate the Reviewer’s constructive suggestion. We agree with the Reviewer’s comment and have revised the Training Details section as follows:

In Section 2.3.5, lines 183-185:
 “In addition, we found hyperparameters such as the learning rate through 10-fold validation. The learning rate was found using the learning rate finder of the Trainer module provided by PyTorch Lightning [22].”
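A rough sketch of how the Lightning learning-rate finder is typically invoked is shown below (API as in PyTorch Lightning 1.x; newer releases expose it through `lightning.pytorch.tuner.Tuner`). The `model` object and the `lr` hyperparameter name are assumptions, not the authors' code.

```python
import pytorch_lightning as pl

# `model` is assumed to be a LightningModule wrapping the ConBERT classifier.
trainer = pl.Trainer(max_epochs=50)
lr_finder = trainer.tuner.lr_find(model)      # sweep learning rates over a few batches
model.hparams.lr = lr_finder.suggestion()     # adopt the suggested learning rate
```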

Q7: On the Experiment side, is there any reason why k-fold cross-validation has not been used? After all, the single split evaluation approach always has its limitations, particularly when the testing cases of minority labels are limited. If it is a computing resource issue, then it should be made clear.

A7: We appreciate the Reviewer’s helpful suggestion. We agree with the Reviewer’s comment, revised the results by applying k-fold cross-validation, and modified the Training Details, Comparison of Pre-trained Models, and Comparison of Aggregated Models sections as follows:

In Section 2.3.5, lines 185-190:
 “The training and validation were conducted using 10-fold cross-validation, with the training and test data split at a ratio of 8:2. In addition, we trained 50 epochs for each fold, changing the training and validation sets. The evaluation metrics were then averaged for each model, and for the final model, all data except the test dataset were used as training data, based on the model with the highest performance.”
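To make the split described in this passage concrete, one plausible scikit-learn sketch is shown below; the variable names and random seed are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# 80% train+validation / 20% held-out test, then 10-fold CV on the 80% portion.
all_idx = np.arange(num_samples)              # num_samples is illustrative
trainval_idx, test_idx = train_test_split(all_idx, test_size=0.2, random_state=42)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (tr_pos, va_pos) in enumerate(kfold.split(trainval_idx)):
    train_idx, val_idx = trainval_idx[tr_pos], trainval_idx[va_pos]
    # train for 50 epochs on train_idx, validate on val_idx,
    # and average the evaluation metrics across the 10 folds
    ...
```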

In Section 3.2, lines 217-219:
“The model with the highest performance is the UMLS BERT model, with a very slight difference from the Bio BERT model, and it shows an AP score of 0.7872, an F1 score of 0.7603, and an AUC of 0.9863 under the micro-average setting.”
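For reference, micro-averaged versions of these metrics can be computed with scikit-learn as sketched below; `y_true` and `y_prob` are assumed arrays, not values from the paper.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# y_true: (n_samples, 353) binary label matrix; y_prob: predicted probabilities.
ap  = average_precision_score(y_true, y_prob, average="micro")
auc = roc_auc_score(y_true, y_prob, average="micro")
f1  = f1_score(y_true, (y_prob >= 0.5).astype(int), average="micro")
```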

In section 3.3, lines 229-230,
“Therefore, we proposed a combination of UMLS BERT and Medical Character BERT, which exhibits the best macro F1 performance.”

 

Q8: I am surprised as to why classification speed has not been measured and shown in the experimental results. After all, if the web app is to be used in a real-life setting, time spent to deal with each case is always of interest.

A8: We greatly appreciate the Reviewer’s comments. We agree completely and have revised the Web-based Application section as follows:

In Section 3.4, lines 249-250:
 “The model processes 34 samples per second, and the web application is capable of both single prediction and multi-prediction.”
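A simple way to measure such throughput is sketched below; the data loader, batch format, and tokenizer-style inputs are assumptions rather than the authors' code.

```python
import time
import torch

model.eval()
n_samples, start = 0, time.perf_counter()
with torch.no_grad():
    for word_inputs, char_inputs in test_loader:        # illustrative DataLoader
        model(word_inputs, char_inputs)
        n_samples += word_inputs["input_ids"].size(0)   # assumes tokenizer-style batches
elapsed = time.perf_counter() - start
print(f"{n_samples / elapsed:.1f} samples per second")
```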

 

Q9: Should there be discussions on the threshold setting of probabilities in the final outcomes?
There are also some places where the English just needs a bit of finishing touch.

A9: We appreciate the Reviewer’s helpful suggestion. The performance was evaluated using the AP score, which calculates the area under the precision-recall curve and therefore captures how precision and recall change with the confidence threshold, but a fixed threshold of 0.5 was used for the web-based application. In subsequent studies, effort will be needed to find the optimal threshold according to the characteristics of the field, and we have added this as further work in the Discussion section as follows:

In Section 4, lines 308-312:
“Fifth, a commonly used threshold of 0.5 was used in this paper. Performance was evaluated using the AP score, which measures how precision and recall change with the confidence threshold, but the threshold of 0.5 was used in the web-based application. The number of occurrences of ICD-9 codes may vary depending on surgical characteristics, so a follow-up study is needed to find the optimal threshold.”
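One way such an optimal threshold could be searched for in follow-up work is sketched below, choosing a per-code threshold that maximizes F1 along the precision-recall curve; the variable names are illustrative and this is not the procedure used in the paper.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: (n_samples, 353) binary label matrix; y_prob: predicted probabilities.
thresholds = []
for k in range(y_prob.shape[1]):                 # one threshold per ICD-9 code
    prec, rec, thr = precision_recall_curve(y_true[:, k], y_prob[:, k])
    f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
    best = thr[np.argmax(f1[:-1])] if len(thr) else 0.5
    thresholds.append(best)
```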

The manuscript has also been edited for English by Editage, and we will revise it again before submitting the final paper. The corresponding Editage certificate is attached.

 

Sincerely,

The Authors.

Round 2

Reviewer 1 Report

The revised version of the paper improves the original submission on all of the reviewer's concerns. The authors clearly present the proposed architecture for classifying ICD-9 codes from free text (from both operative reports and diagnoses) using a novel concatenation of pre-trained transformer models. The proposed model improves on the SOTA and can be used as a web-based application.

As minor corrections: on line 195, I suggest changing "I trained" to "We trained". On line 199, the title "2.3.6 Evaluation" should be separated by a line break. The equations defining precision, recall, F1, TPR, and FPR are quite obvious and can be skipped.

Author Response

As minor corrections: on line 195, I suggest changing "I trained" to "We trained". On line 199, the title "2.3.6 Evaluation" should be separated by a line break. The equations defining precision, recall, F1, TPR, and FPR are quite obvious and can be skipped.

=> Thank you for your comments; we have corrected all the points the reviewer mentioned:

1) We trained

2) "2.3.6 Evaluation" separated by a line break

3) deletion of the definitions (precision, recall, F1, TPR and FPR)
