Article

Performance and Agreement When Annotating Chest X-ray Text Reports—A Preliminary Step in the Development of a Deep Learning-Based Prioritization and Detection System

by Dana Li 1,2,*, Lea Marie Pehrson 1,3, Rasmus Bonnevie 4, Marco Fraccaro 4, Jakob Thrane 4, Lea Tøttrup 4, Carsten Ammitzbøl Lauridsen 1,5, Sedrah Butt Balaganeshan 6, Jelena Jankovic 1, Tobias Thostrup Andersen 1, Alyas Mayar 7, Kristoffer Lindskov Hansen 1,2, Jonathan Frederik Carlsen 1,2, Sune Darkner 3 and Michael Bachmann Nielsen 1,2
1 Department of Diagnostic Radiology, Copenhagen University Hospital, Rigshospitalet, 2100 Copenhagen, Denmark
2 Department of Clinical Medicine, University of Copenhagen, 2100 Copenhagen, Denmark
3 Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark
4 Unumed Aps, 1055 Copenhagen, Denmark
5 Radiography Education, University College Copenhagen, 2200 Copenhagen, Denmark
6 Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2100 Copenhagen, Denmark
7 Department of Health Sciences, Panum Institute, University of Copenhagen, 2100 Copenhagen, Denmark
* Author to whom correspondence should be addressed.
Diagnostics 2023, 13(6), 1070; https://doi.org/10.3390/diagnostics13061070
Submission received: 22 February 2023 / Revised: 6 March 2023 / Accepted: 8 March 2023 / Published: 11 March 2023

Abstract

A chest X-ray report is a communicative tool and can also serve as data for developing artificial intelligence-based decision support systems. For both purposes, consistent understanding and labeling are important. Our aim was to investigate how readers would comprehend and annotate 200 chest X-ray reports. Reports written between 1 January 2015 and 11 March 2022 were selected based on search words. Annotators included three board-certified radiologists, two trained radiologists (physicians), two radiographers (radiological technicians), a non-radiological physician, and a medical student. Consensus labels by two or more of the experienced radiologists were considered the “gold standard”. The Matthews correlation coefficient (MCC) was calculated to assess annotation performance, and descriptive statistics were used to assess agreement between individual annotators and labels. The intermediate radiologist had the best correlation with the “gold standard” (MCC 0.77). This was followed by the novice radiologist and the medical student (MCC 0.71 for both), the novice radiographer (MCC 0.65), the non-radiological physician (MCC 0.64), and the experienced radiographer (MCC 0.57). Our findings showed that, for developing an artificial intelligence-based support system, if trained radiologists are not available, annotations from non-radiological annotators with basic and general knowledge may be more aligned with radiologists than annotations from sub-specialized medical staff whose sub-specialization lies outside diagnostic radiology.

1. Introduction

Chest X-rays (CXRs) are the most commonly performed diagnostic image modality [1]. Recent technological advancements have made it possible to create systems that support and increase radiologists’ efficiency and accuracy when analyzing CXR images [2]. Thus, interest in developing artificial intelligence-based systems for detection and prioritization of CXR findings has increased, including how to efficiently gather training data [3].
For training, validating, and testing a deep learning algorithm, labeled data are required [4]. Previous ontological schemes have been developed to ensure consistent labeling. Labeling schemes vary from hierarchical systems with 180+ unique labels [5] to a few selected labels [6,7]. Label creation for deep learning development may be unique to each project, since labels depend on factors such as imaging modality, body part, and algorithm type [4]. In a previous study, we developed a labeling scheme for annotating findings in CXRs to obtain consistent labeling [8]. That labeling scheme was tested for inter- and intra-observer agreement when used to annotate CXR images [8], and iterations have been ongoing to increase the consistency of label use when annotating CXR images and text reports.
Optimally, CXR training data should consist of manually labeled findings on the radiographic images, marked with e.g., bounding boxes for location, and radiologists are often needed to perform such a task to ensure the most accurate labeling [9]. Gathering data for training an algorithm may therefore be time-consuming and expensive. Several systems for automatic extraction of labels from CXR text reports have therefore been developed, including natural language processing models based on either feature engineering [6,10] or deep learning technology [11]. Labels that are extracted this way can then be linked to the corresponding CXR image to provide large, labeled image datasets using minimal time and cost [5].
To fully automate the labeling process, researchers have attempted to develop unsupervised machine learning methods to extract labels [12]. However, these methods still seem inferior to solutions with a supervised component [13,14]. Therefore, just as with images, text labeling algorithms still need manually labeled data for training.
Labeling of text for training a deep learning algorithm needs to be consistent [15]. However, unlike images, labeling and annotation of text may not require specialized radiologists, since radiological reports are used for communication with other specialty fields in health care and therefore should be understood by a much more diverse group of people than just radiologists [16]. Only a few studies have examined reading comprehension and understanding of findings in radiological text reports when the readers are health care workers with different levels of radiological experience [17]. Understanding how variability in radiological knowledge impacts reading comprehension of a radiological text report could not only be beneficial in the development of a deep learning algorithm but could also give insight into the pitfalls of a radiological text report as a communicative tool between medical staff [18].
In this study, we aimed to investigate how different levels of radiological task experience impact reading comprehension and labeling performance on CXR text reports. We also field-tested the text report labeling scheme by measuring label-specific agreement between predicted and actual labels, so as to decrease any potential bias to reading comprehension created by the labeling process itself.

2. Materials and Methods

Ethical approval was obtained on 11 May 2022 by the Regional Council for Region Hovedstaden (R-22017450). Approval for data retrieval and storage was obtained on 19 May 2022 by the Knowledge Center on Data Protection Compliance (P-2022-231).

2.1. Diagnostic Labeling Scheme for Text Annotations

The initial structure and development of the labeling scheme have previously been described [8]. In summary, the labels were generated to match existing CXR ontologies, such as the Fleischner criteria and definitions [19], and other machine learning labeling schemes [5,6,7]. Labels were ordered hierarchically, where a high-level class such as “decreased translucency” was divided into lower-level classes of increasing specificity. The labeling scheme was previously tested for inter- and intra-observer agreement in CXR image annotation [8]. Iterations have since been made to increase agreement: (1) labels were made as descriptive as possible, and (2) interpretive labels were added under the category “Differential diagnosis”, because chest X-ray text reports contain more detailed information than chest X-ray images (Figure 1).
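As an illustration only, such a hierarchy can be represented as a parent-to-children mapping. The fragment below is a sketch assumed for demonstration: the label names appear in the tables of this article, but the grouping is not a verbatim copy of the scheme in Figure 1.

```python
# Assumed, partial fragment of a hierarchical labeling scheme; the
# authoritative hierarchy is the one shown in Figure 1 of the article.
LABEL_TREE = {
    "decreased translucency": ["infiltrate"],
    "infiltrate": ["consolidation", "diffuse infiltrate"],
    "cardiomediastinum": ["cardiomegaly", "widening of mediastinum",
                          "lymph node pathology", "vascular changes"],
}

def descendants(label, tree=LABEL_TREE):
    """Return all lower-level (more specific) labels reachable from `label`."""
    found = []
    for child in tree.get(label, []):
        found.append(child)
        found.extend(descendants(child, tree))
    return found

print(descendants("decreased translucency"))
# -> ['infiltrate', 'consolidation', 'diffuse infiltrate']
```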

2.2. Dataset

A total of 200 anonymized CXR reports from 1 January 2015 to 11 March 2022 were collected at the Department of Diagnostic Radiology at Rigshospitalet through the PACS system (AGFA Impax Client 6, Mortsel, Belgium). The CXR reports were retrieved through two methods:
Firstly, a computerized search algorithm selected CXR reports using search words found in the text. A minimum of six CXR reports were required for each of the following search words: pneumothorax, cysts/bullae, emphysema, infiltrate, consolidation, diffuse infiltrate, pleural effusion, atelectasis, lung surgery, chronic lung changes, pneumonia infection, tuberculosis, abscess, and stasis/edema. This method resulted in 84 reports.
Secondly, for the remaining 116 reports, a computerized search algorithm was used to find and distribute an equal number of cases (29 each) between the following criteria:
(1) Truly randomly selected cases.
(2) Randomly selected cases containing any abnormal findings.
(3) Randomly selected cases within the top 10% of all cases that had the greatest number of associated labels per case relative to the length of the report.
(4) Randomly selected cases within the bottom 10% of cases that had the least number of associated labels per case relative to the length of the report.
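A minimal sketch of the search-word step above is given below; the `reports` structure (a list of dicts with "id" and "text" keys) and the helper function are hypothetical illustrations, not the actual retrieval code used with the PACS system.

```python
# Hypothetical sketch of search-word-based report selection; the report
# structure and helper are assumptions, only the search-word list is from the text.
SEARCH_WORDS = [
    "pneumothorax", "cysts/bullae", "emphysema", "infiltrate", "consolidation",
    "diffuse infiltrate", "pleural effusion", "atelectasis", "lung surgery",
    "chronic lung changes", "pneumonia infection", "tuberculosis", "abscess",
    "stasis/edema",
]

def reports_per_search_word(reports, minimum=6):
    """Group report ids by the search words found in their text and flag
    search words that fall below the required minimum of six reports."""
    hits = {word: [] for word in SEARCH_WORDS}
    for report in reports:
        text = report["text"].lower()
        for word in SEARCH_WORDS:
            if word in text:
                hits[word].append(report["id"])
    too_few = [word for word, ids in hits.items() if len(ids) < minimum]
    return hits, too_few
```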

2.3. Participants and Annotation Process

A total of three board-certified radiologists were included as annotators to determine labels for the cases in the text annotation set and thereby form the “gold standard” labels (actual labels). All three radiologists had specialized training ranging from 14 to 30+ years. Six annotators with varying degrees of radiological experience were included to annotate the 200 text reports with labels from the labeling scheme (Figure 1). The annotators were: an intermediate radiologist (physician with 6 years of radiological experience), a novice radiologist (physician with 2 years of radiological experience), an experienced radiographer (radiological technician with 15 years of radiographer experience), a novice radiographer (radiological technician with 3 years of radiographer experience), a non-radiological physician (7 years of other specialized clinical experience post-graduation), and a senior medical student (planning to graduate from university within 6 months).
The annotation process began on 25 August 2022 and ended on 25 October 2022. All 200 text reports were imported into proprietary annotation software developed by Unumed Aps (Copenhagen, Denmark). Annotators were instructed to find and label each piece of text describing a positive or negative finding (Figure 2). Annotators were blinded to the X-ray images and to the other annotators’ annotations.

2.4. Presentation of Data and Statistical Analysis

“Gold standard” labels were defined as consensus on a label in a text report between two or more of the three board-certified radiologists. “Majority” vote labels were defined as consensus on a label between four or more of the six annotators, and “majority excl. intermediate radiologist” labels were defined as consensus on a label between three or more of the remaining annotators after removing the intermediate radiologist. Frequency counts reflected the total cumulative counts of a label’s use across all text reports in the annotation set. Time spent on annotation was measured as the average time per text report from opening the report to completing its annotation.
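The consensus rules can be written out as a short helper; the sketch below is an assumed illustration (the per-annotator label sets are hypothetical), not the software used in the study.

```python
# Assumed sketch of the consensus rules described above: "gold standard" labels
# require agreement of >=2 of the 3 board-certified radiologists, and
# "majority" labels require agreement of >=4 of the 6 annotators.
from collections import Counter

def consensus(label_sets, threshold):
    """Return the labels chosen by at least `threshold` of the annotators."""
    counts = Counter(label for labels in label_sets for label in set(labels))
    return {label for label, votes in counts.items() if votes >= threshold}

# Hypothetical labels from the three senior radiologists for one report
radiologist_labels = [
    {"infiltrate", "pleural effusion"},
    {"infiltrate"},
    {"infiltrate", "cardiomegaly"},
]
print(consensus(radiologist_labels, threshold=2))  # -> {'infiltrate'}
```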
The Matthews correlation coefficient (MCC) [20] was used to compare annotator performance to the “gold standard” labeling and to compare annotators’ performance to each other. The MCC was based on values from a 2 × 2 confusion matrix (Table 1), where true positives (TP) were labels used by the annotator that matched “gold standard” labels, counted separately for positive and negative findings. True negatives (TN) were labels not used by the annotator that were also not used in the “gold standard”, again counted separately for positive and negative findings. False positives (FP) were labels that the annotator used but the “gold standard” did not, and false negatives (FN) were labels that the “gold standard” used but the annotator did not.
MCC was then defined by the following equation [20]:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
To achieve this, MCC was calculated using Python 3.8.10 (https://www.python.org/) with the Pandas [21] and Numpy [22] libraries for each label and then micro-averaged [23] to give an overall coefficient for all positive and negative labels. MCC ranges between −1 and 1, where 1 represents perfect positive correlation, 0 represents correlation not better than random, and −1 represents total disagreement between labels of the “gold standard” set (actual) and the set of labels chosen by the annotator (predicted) [20].
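To make the calculation concrete, a minimal sketch of a micro-averaged MCC with pandas/NumPy is shown below. The DataFrames `gold` and `pred` (one row per report, one 0/1 column per label) are hypothetical, and the sketch is not the study’s own code; micro-averaging here pools TP, TN, FP, and FN over all labels before computing a single MCC.

```python
# Minimal sketch (assumed, not the authors' implementation) of a
# micro-averaged Matthews correlation coefficient over multi-label data.
import numpy as np
import pandas as pd

def micro_mcc(gold: pd.DataFrame, pred: pd.DataFrame) -> float:
    """Pool TP/TN/FP/FN over all labels, then compute a single MCC."""
    g = gold.to_numpy().ravel()
    p = pred.to_numpy().ravel()
    tp = np.sum((p == 1) & (g == 1))
    tn = np.sum((p == 0) & (g == 0))
    fp = np.sum((p == 1) & (g == 0))
    fn = np.sum((p == 0) & (g == 1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return float(tp * tn - fp * fn) / denom if denom > 0 else 0.0

# Hypothetical 0/1 label matrices for three reports and two labels
gold = pd.DataFrame({"infiltrate": [1, 0, 1], "pleural effusion": [0, 1, 1]})
pred = pd.DataFrame({"infiltrate": [1, 0, 0], "pleural effusion": [0, 1, 1]})
print(round(micro_mcc(gold, pred), 2))  # -> 0.71
```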
One weakness of MCC and other standard agreement statistics is that they fail to take partial agreement into account in structured and taxonomic annotation tasks like ours. In addition, they do not clearly identify tendencies towards over- or under-annotation by any single annotator. To this end, we performed a separate analysis for each pair of annotators. An annotator here means either an individual human annotator or a constructed annotator such as the “gold standard” or any of the “majority” categories. For each annotator pair, we ran a maximum weight matching algorithm on a graph constructed from their individual annotations, pairing the labels from the two annotators as well as possible. We used the implementation available in the Python library networkx (version 2.8.8) [24].
We employed a weighting that enforced the following criteria in descending order:
(1) Match with the exact same label, or
(2) Match with an ancestral or descendant node (e.g., for “vascular changes” it could be either “aneurysm” or “widening of mediastinum”, etc. (Figure 1)).
The hierarchical order in which the labels are placed categorizes labels into similar groups, and findings of similar characterization become more distinguishable from each other with each branch division. This is done to reduce the number of unusable labels caused by inter-reader variability [25], as disagreement on a label in a branched division could still share common ascending nodes. Annotators did not manually link a piece of text to a label, so to maximize data, we post-processed by discarding matched pairs of labels that did not belong to the same branch, since we operated on the assumption that the same piece of text/finding should not lead to annotation with labels outside the same category. The matching algorithm would pair up any remaining annotations at random after all matches with positive weight had been made. If the annotators made an unequal number of annotations, such that it was impossible to pair all annotations, or if matched labels did not belong to the same branch or were not in a direct descending/ascending line, we denoted the remaining annotations as unmatched.
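A schematic of the pairing step might look as follows. The weights, the small parent map, and the function names are assumptions made for illustration; only the use of networkx’s maximum weight matching reflects the description above, and the post-processing of cross-branch pairs is omitted.

```python
# Assumed sketch of pairing one annotator's labels with another's via
# maximum-weight matching, preferring exact matches over ancestor/descendant
# matches; the PARENT map is an illustrative fragment, not the full scheme.
import networkx as nx

PARENT = {"infiltrate": "decreased translucency", "consolidation": "infiltrate"}

def same_branch(a, b):
    """True if labels a and b lie on a direct ancestor/descendant line."""
    def ancestors(x):
        chain = set()
        while x in PARENT:
            x = PARENT[x]
            chain.add(x)
        return chain
    return a in ancestors(b) or b in ancestors(a)

def match_annotations(labels_a, labels_b):
    G = nx.Graph()
    for i, a in enumerate(labels_a):
        for j, b in enumerate(labels_b):
            if a == b:
                G.add_edge(("a", i), ("b", j), weight=2)  # exact match preferred
            elif same_branch(a, b):
                G.add_edge(("a", i), ("b", j), weight=1)  # partial (same branch)
    return nx.max_weight_matching(G)  # set of matched node pairs

print(match_annotations(["infiltrate", "pneumothorax"], ["consolidation"]))
# -> one matched pair, e.g. {(('a', 0), ('b', 0))}; "pneumothorax" stays unmatched
```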
Descriptive statistics were then calculated to investigate specific agreements by comparing counts of “matched” and “unmatched” labels between annotators and the “gold standard”. In addition to presenting matched and unmatched labels as a representation of agreement between individual annotators, matched and unmatched counts were also presented for each label.

3. Results

A total of 63 positive labels and 62 negative labels were available for annotation (Figure 1). A Pareto chart showed that 25 labels covered 80% of all labeled positive findings, and four labels covered 80% of all negative findings. The five most used labels for positive findings were “infiltrate”, “pleural effusion”, “cardiomegaly”, “atelectasis”, and “stasis/edema”. The five most used labels for negative findings were “pleural effusion”, “infiltrate”, “stasis/edema”, “cardiomegaly”, and “pneumothorax” (Figure 3a,b).
For labels that represented positive findings, the novice radiographer had more annotations for “bone” (16 cases vs. 0–8 cases) and “decreased translucency” (29 cases vs. 0–10 cases) compared to other annotators. The novice radiologist had more annotations for “other non-pathological” compared to other annotators (18 cases vs. 0–2 cases), and the senior medical student had more annotations on “diffuse infiltrate” compared to other annotators (22 cases vs. 0–5 cases) (Table A1 in Appendix A).
For negative findings, the experienced radiographer had more annotations on “consolidation” (23 cases vs. 0–4 cases) and “pleural changes” (20 cases vs. 0–6 cases) compared to the other annotators. The non-radiological physician had more annotations on “cardiomediastinum” than other annotators (21 cases vs. 0–7 cases) (Table A2 in Appendix A).
The average time spent on annotating a text report was: 98.1 s for the intermediate radiologist, 76.2 s for the novice radiologist, 232.1 s for the experienced radiographer, 135 s for the novice radiographer, 99.4 s for the non-radiological physician, 145.8 s for the senior medical student, and each “gold standard” annotator took on average 135.2 s per text report.

3.1. Annotator Performance and Agreement

Table 2a,b showed the MCC values for each annotator for positive and negative findings, respectively. The intermediate radiologist had the best MCC compared to other annotators, both for labels representing positive findings and negative findings (MCC 0.77 and MCC 0.92). The senior medical student had comparable MCC values to the novice radiologist for both negative and positive findings (Table 2a,b).
For both positive and negative findings, the senior medical student achieved better MCC than the non-radiological physician (0.71 vs. 0.64 for positive findings and 0.88 vs. 0.77 for negative findings). This tendency was also present for the radiographers. The novice radiographer achieved better MCC for both positive and negative findings compared to the experienced radiographer (0.65 vs. 0.57 for positive findings and 0.88 vs. 0.64 for negative findings).
All annotators achieved higher MCC for negative findings compared to their own MCC for positive findings (Table 2a,b).
The numbers of matched (Table 3) and unmatched (Table A3) labels between different pairs of annotators were used to represent the degree of agreement between annotators.
Table 3 showed the number of matched labels between each pair of annotators for both positive and negative findings. The intermediate radiologist, novice radiologist, and senior medical student had the most label matches with each other. The novice radiographer had more matches with the “gold standard” (710 labels matched) than the experienced radiographer (589 labels matched). The senior medical student had more matches with the “gold standard” (741 labels matched) than the non-radiological physician (665 labels matched).
Table A3 in Appendix A showed the number of unmatched labels left after subtracting the number of matched labels from each annotator’s total label use. The intermediate radiologist had the fewest unmatched labels against the “gold standard” (201), although the other annotators followed closely (203–234). The “majority” vote achieved the lowest number of unmatched labels against “gold standard” annotations compared with any individual annotator (122). The “gold standard” generally used fewer labels per text report than any annotator (e.g., 32 unmatched labels left over for the “gold standard” when matched to the intermediate radiologist vs. 201 unmatched labels left over for the intermediate radiologist when matched to the “gold standard”).
The “majority excl. intermediate radiologist” vote had more labels that matched the “gold standard” (723) than the “majority” vote that included the intermediate radiologist (702) (Table 3). Even though the number of unmatched labels increased when the intermediate radiologist was excluded from the majority vote (162 vs. 122), there were still fewer unmatched labels than for any individual annotator (Table A3).

3.2. Label Specific Agreement

Table 4 and Table 5 showed the cumulative cases of matches on a specific label for labels in the “lung tissue findings” category and “cardiomediastinum” category, respectively. “Atelectasis”, “infiltrate”, and “pleural effusion” were lung tissue related labels with the most matches (219, 687, and 743, respectively) (Table 4), while “cardiomegaly” (472) was the label with the most matches in the “cardiomediastinum” category (Table 5), and “medical device, correct placement” (115), and “stasis/edema” (576) were the labels with the most matches in the rest of the labeling scheme (Table A4).
For the label “infiltrate”, the annotators had a greater spread across different labels compared to the “gold standard”. When the “gold standard” used the label “infiltrate”, annotators matched with six labels other than “infiltrate”. Four of these labels were more specific, i.e., descendants of “infiltrate”, and two were less specific, i.e., ancestors of “infiltrate” (Figure 1 and Table 4). For comparison, the “gold standard” matched with only two descendant labels and one ancestral label (Table 4).
The opposite tendency was seen in the labels “decreased translucency”, “pleural changes”, and “atelectasis”—“gold standard” had greater spread and used more specific labels compared to annotators (Table 4).
Table 5. Number of matched cases (accumulated) on specific labels in the labeling scheme related to “cardiomediastinal findings”. * Rows and columns not belonging to the parent node “cardiomediastinal findings” and that did not have any label disagreements have been pruned and thus number of rows does not match number of columns.
Gold Standard *
Annotators * CardiomediastinumCardiomegalyWidening of MediastinumLymph Node PathologyOther CardiomediastinumVascular Changes
Cardiomediastinum116811
Cardiomegaly3 472
Widening of Mediastinum1 361
Lymph node pathology 9
Mediastinal tumor 1
Other cardiomediastinum 4
Vascular changes 2 32
Color scale: 0 labels matched to 100+ labels matched.
When annotators used “cardiomediastinum”, it was most often matched by the “gold standard” with more specific, descendant nodes such as “cardiomegaly”, “widening of mediastinum”, and “lymph node pathology” (Table 5). Annotators were also less specific when the “gold standard” used “lymph node pathology”, since annotators matched only with ancestral nodes besides the label itself (Table 5).
For the rest of the labeling scheme “gold standard” also used more specific labels compared to annotators (Table A4).
For unmatched labels, annotators had more different types of unmatched labels compared to “gold standard” (60 different types of labels vs. 41). Annotators had labeled 760 findings that were unmatched with “gold standard” labels, while “gold standard” only had 131 findings that did not find a match within the annotators’ labels.

4. Discussion

There were three main findings in our study: (1) for radiologists, annotation performance on CXR text reports increased with radiological experience, (2) annotators performed better when annotating negative findings than positive findings, and (3) annotators with less radiological experience tended to use a greater number of less specific labels compared to experienced radiologists.

4.1. Performance of Annotators

Generally, all annotators showed high correlation [20] with the “gold standard” annotations of CXR text reports (Table 2a,b). This finding was comparable to a previous study, which showed a similar level of agreement between radiologists, non-radiological physicians, and medical students when reading and comprehending radiology reports [26]. However, disagreements in reading and reporting radiological findings exist even between readers of the same specialty [27]. Previous studies suggested that the free-form structure of a radiological text report permits the use of sentences that are ambiguous and inconsistent [28]. Variability in the use of such phrases could contribute to the annotation variability observed between the annotators. The intermediate radiologist’s specialized experience may have enabled them to be better aligned with the “gold standard” annotators when interpreting whether an ambiguously worded sentence suggested that a finding was relevant and/or important enough to be annotated [26,29].
Our study also showed that the senior medical student and the novice radiographer performed better in annotation than the non-radiological physician and the experienced radiographer, respectively (Table 2a,b). Previous studies have demonstrated the difference between adaptive and routine expertise [30]. Experienced medical staff are encouraged to increase their specialization over time, thus narrowing but deepening their field of knowledge, and therefore do not often engage in unknown situations [31,32], contrary to younger medical staff in active training. The novice radiographer and the medical student may have been more receptive to the change in their usual tasks, making them quicker to adapt to the annotation process itself [33,34]. The inherent routine expertise of the experienced radiographer and the non-radiological physician may lead them to value efficiency over thoroughness [35,36], and to annotate only findings that they would usually find relevant while disregarding other findings [26,37]. A previous study aligned with our findings and showed that radiologists in training performed slightly better than sub-specialist radiologists when reading and understanding reports outside the latter’s sub-specialty [38]. Other studies showed that clinicians extract information from a radiological report based on their clinical bias [39,40], which may also contribute to the lower correlation with “gold standard” annotations for the non-radiological physician compared to, e.g., the senior medical student.
We found that labeling negative findings, or separating normal from abnormal cases, may result in more consistent data for training a decision support system. Our findings were congruent with previous work demonstrating that negative findings are described more unambiguously in text reports, which may make them easier to read and comprehend than positive findings [27]. Negations may be a useful resource in the development of artificial intelligence-based algorithms for radiological decision support systems, and studies [10,41,42] have shown that they are just as crucial to identify in a text as positive findings [43].

4.2. Majority Vote Labeling

The results of our research indicated that there could be a reduction in false positive labels when using majority labeling compared to the labels used by an individual annotator (Table A3). Recent efforts have been made to outsource labeling to larger numbers of annotators with less specialized experience as a way to reduce the time and cost of data gathering compared to sourcing and reimbursing field experts for the same tasks [44]. Several methods have been proposed to clean data labeled by multiple, less experienced annotators to obtain high-quality datasets efficiently, including majority-vote labeling [45,46,47]. Less experienced annotators may tend to overinterpret and overuse labels due to lack of training [48] or fear of missing findings [49]. Our study suggested that using majority labeling instead of labels from individual annotators may eliminate some of the noisy and dispensable labels created by inexperienced annotators. Even when we excluded the most experienced annotator (the intermediate radiologist) from the majority voting, there was still a reduction in false positive labels compared to any individual annotator (Table A3).

4.3. The Labeling Scheme

“Atelectasis”, “infiltrate”, “pleural effusion”, “cardiomegaly”, “correctly placed medical device”, and “stasis/edema” were the labels from our labeling scheme that were most frequently agreed upon (Table 4, Table 5 and Table A4 in Appendix A). While some labeling taxonomies are highly detailed with more labels than our labeling scheme [5], our labels were comparable to previously used annotation taxonomies that relied on text mining methods to extract labels [6,50]. An increased number of labels may introduce noise in data gathering [51], a risk that is particularly high when interpreting CXR and thoracic findings [52]. Fewer and broader labels may therefore be more desirable, since they may enable higher agreement on a label among different readers.
Although “infiltrate” was one of the most agreed-upon labels, the differential diagnosis “pneumonia/infection” was not, despite it being one of the most common referral reasons for a CXR [53]. The “pneumonia/infection” diagnosis is usually based on a combination of clinical and paraclinical findings [54]. Radiologists are aware of this and may often be inconclusive in their reports, thus introducing larger uncertainty to words associated with “pneumonia” compared to “infiltrate” [52]. Comparable with previous results from labeling CXR images [8], our study suggested that descriptive labels may be preferred over interpretive diagnostic labels. When annotating CXR reports, the radiologist’s uncertainty in making diagnostic conclusions may introduce increased annotation bias in text reports.

4.4. Bias, Limitations and Future Studies

Due to time constraints, only a limited number of CXR text reports were included in our study. Previous studies have mentioned the limitations of using Cohen’s kappa on imbalanced datasets, specifically when the distribution of true positives and true negatives is highly skewed [55]. These limitations have been shown to be most prevalent when readers show negative or no correlation [56]. In anticipation of label imbalance in our dataset and a risk of no or negative correlation between an annotator and the “gold standard”, we used the Matthews correlation coefficient rather than Cohen’s kappa. However, as shown by Chicco et al. [56], MCC and Cohen’s kappa are closely related, especially when readers show positive correlation. In our study, all readers had positive correlation coefficients with the “gold standard”, and the interpretation of the results would therefore likely not have changed if we had used Cohen’s kappa instead of MCC.
The number of annotators included in our study was limited by a combination of time constraints and participant availability. We recognize that, as with the “gold standard” labels, each level of annotator experience should ideally consist of a consensus vote of multiple annotators. However, we found it relevant that our study reflected the real-world obstacles of data gathering for deep learning development projects, since recruitment of human annotators is already a well-known problem. We presented the “majority” voting categories as a solution not only to the limited number of annotators in our study, but also to the lack of annotators in deep learning development projects in general.
Annotations by the board-certified experienced radiologists may not reflect true labels, since factors such as the annotation software and subjective opinions may influence a radiologist’s annotations. We attempted to reduce these elements of reader bias through consensus between the experienced radiologists by majority voting [57]. Furthermore, since annotators did not manually link each specific text piece to a label, we could not guarantee that annotators labeled the exact same findings with the same labels. We used an algorithm for matching labels in this study, since that algorithm would also be used for developing the final artificial intelligence-based support system.
Our study did not investigate whether an artificial intelligence-based algorithm would perform better when trained on annotations from less experienced medical staff compared to experienced radiologists. The assumption behind our study was that radiologists could provide annotations of the highest quality to train an algorithm, and that annotators with higher correlation to those annotations would produce high quality data [9]. Further studies are needed to investigate the differences in algorithm performance based on training data annotated by experienced radiologists compared to other medical staff. We did not investigate whether our annotators’ text report labels corresponded to the CXR image, since this was not within the scope of our study but could be a topic of interest for future studies.

5. Conclusions

Trained radiologists were most aligned with experienced radiologists in understanding a chest X-ray report. For the purpose of labeling text reports for the development of an artificial intelligence-based decision support system, performance increased with radiological experience for trained radiologists. However, as annotators, medical staff with general and basic knowledge may be preferable to experienced medical staff whose sub-specialized routine experience lies in domains other than diagnosing thoracic radiological findings.

Author Contributions

Conceptualization, D.L., J.F.C., K.L.H., R.B., M.F., J.T., L.T., S.D. and M.B.N.; methodology, D.L., J.F.C., M.F., J.T., R.B. and M.B.N.; software, R.B., J.T., L.T. and M.F.; formal analysis, D.L. and R.B.; investigation, D.L., L.M.P., C.A.L., J.J., T.T.A., S.B.B., A.M. and R.B.; resources, S.D. and M.B.N.; data curation, J.T., R.B. and M.F.; writing—original draft preparation, D.L.; writing—review and editing, D.L., L.M.P., C.A.L., J.J., T.T.A., S.B.B., A.M., J.T., R.B., M.F., K.L.H., J.F.C., S.D. and M.B.N.; visualization, D.L., J.T. and R.B.; supervision, K.L.H., J.F.C., S.D. and M.B.N.; project administration, D.L.; funding acquisition, S.D. and M.B.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Innovation Fund Denmark (IFD) with grant no. 0176-00013B for the AI4Xray project.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Regional Council for Region Hovedstaden (R-22017450, 11 May 2022) and Knowledge Center on Data Protection Compliance (P-2022-231, 19 May 2022).

Informed Consent Statement

Informed consent was obtained from all readers/annotators involved in the study. Informed consent from patients was waived by the Regional Council for Region Hovedstaden.

Data Availability Statement

Not applicable.

Acknowledgments

We acknowledge and are grateful for any support given by the Section of Biostatistics at the Department of Public Health at Copenhagen University.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Frequency counts of labels used by each annotator for positive findings.
Annotators
Radiologist, Inter-
Mediate
Radiologist, NoviceRadiographer, ExperiencedRadiographer, NovicePhysician, Non-RadiologistSenior Medical StudentSenior Radiologist 3Senior Radiologist 2Senior Radiologist 1All
LabelsAbnormal0010000001
Abscess65535556141
Asbestosis0110100014
Atelectasis333131323333353523286
Bone060165381039
Cardiomediastinum00019011012
Cardiomegaly393738383436363933330
Cavitary lesion22863420431
Chronic lung changes33331728172921267211
Consolidation252933512251275153
Correct placement32421129347272432238
Cysts/bullae42744344436
Decreased translucency3002910343052
Diffuse infiltrate350022305038
Elevated (hemi)diaphragm77766666253
Emphysema161518241918101017147
Enlarged mediastinum22220222216
Fibrosis121011810774271
Flattened (hemi)diaphragm8131413121315102100
Foreign object0001000001
Fracture71401434731
Hiatal hernia55433545438
Increased interstitial....610515163111471
Increased translucency101213101019
Infiltrate524553643962603164470
Interlobar septal thickening0110000002
Lung0000120003
Lung surgery51271715195147101
Lymph node pathology23220223117
Malignant/cancer33457727038
Mediastinal shift23100112111
Mediastinal tumor0120010015
Nodule, tumor or mass6834106831664
Not abnormal9231321610126495
Non-correct placement22341136123
Operation/implants2310816317101694
Artifact07200001010
Other bone pathology119751211512678
Other cardiomediastinum1100131119
Other1000000304
Other foreign object0001001002
Other decreased translucency1000100103
Other increased translucency0000100001
Other non-pathological118200002023
Other pathological811102311027
Other soft tissue00113511315
Pericardial effusion1001113108
Pleural calcification88128758754
Pleural changes116131311874477
Pleural contraction0010000001
Pleural effusion414342384147494442387
Pleural thickening101005867101268
Pneumomediastinum0010000001
Pneumonia3232191829140302176
Pneumothorax1010101013101010891
Sarcoidosis1111110118
Soft tissue010811040024
Stasis/edema303123233226292927250
Subcutaneous emphysema66405525337
Support devices1003440121211173162
Tuberculosis88368866856
Vascular changes1501511001611088
Table A2. Frequency counts of labels used by each annotator for negative findings.
Annotators
Radiologist, Inter-MediateRadiologist, NoviceRadiographer, ExperiencedRadiographer, NovicePhysician, Non-RadiologistSenior Medical StudentSenior Radiologist 3Senior Radiologist 2Senior Radiologist 1All
LabelsAbscess1111111108
Atelectasis988106858769
Bone33031242018
Cardiomediastinum071021555044
Cardiomegaly524350533148434255417
Cavitary lesion0032011018
Consolidation212310411235
Correct placement0010000001
Cysts/bullae0030000003
Differential diagnosis0010000001
Decreased translucency0001300105
Diffuse infiltrate0020100104
Emphysema0001001002
Enlarged mediastinum18928061111166
Fracture34313333326
Increased interstitial1011011117
Increased translucency0001100002
Infiltrate867360846588817782696
Lung2100050008
Lung surgery0000000101
Lymph node pathology0000001001
Malignant/cancer33534222024
Mediastinal shift1110000003
Nodule, tumor or mass19112211422
Other bone pathology0001100002
Other soft tissue0001000001
Pericardial effusion0021002005
Pleural changes122006321439
Pleural effusion102102449481102999794815
Pleural thickening0010000001
Pneumonia87301107027
Pneumothorax303128292630303025259
Sarcoidosis1100110116
Soft tissue0000001001
Stasis/edema828151806377767872660
Subcutaneous emphysema22202212215
Support devices0011000204
Tuberculosis22001102210
Vascular changes300201040019
Table A3. Number of unmatched labels of both positive and negative findings after subtraction of matched labels by individual annotators, majority of annotators, and gold standard annotations.
Number of Unmatched Labels (by Annotator)
Radiologist, IntermediateRadiologist, NoviceRadiogra-pher, ExperiencedRadiogra-pher, NovicePhysician, Non-RadiologistSenior Medical StudentMajorityMajority excl. Intermed. RadiologistGold Standard
Compared to annotatorRadiologist, intermediate 101144128121114307532
Radiologist, novice118 169150130135457058
Radiographer, experienced288296 271277277180205209
Radiographer, novice182187181 164155718488
Physician, non-radiologist214206226203 205110139133
Senior medical student135139154122133 416257
Majority173171179160160163 6196
Majority excl. Intermed. Radiologist1571351431121281230 75
Gold Standard201210234203209205122162
Color scale: fewest unmatched (best) to most unmatched (worst); midpoint at the 50% fractile.
Table A4. Number of matched cases (accumulated) on specific labels in the labeling scheme for all labels except labels in the “lung tissue findings” category and the “cardiomediastinum” category. * Rows and columns belonging to the parent nodes “lung tissue finding” or “cardiomediastinal findings” and that did not have any label disagreements have been pruned and thus number of rows does not match number of columns.
Gold Standard *
Annotators * BoneCorrect PlacementFractureNon-Correct PlacementOperation and ImplantsOther Bone PathologyStasis/EdemaSubcutaneous EmphysemaSupport Devices
Bone11 10 10
Correct placement 115 13
Differential diagnosis 1
Foreign object 1
Fracture2 29
Non-correct placement 6 1
Operation and implants 38
Other bone pathology3 38
Soft tissue 1
Stasis/edema 576
Subcutaneous emphysema 28
Support devices 48 24
Color scale: 0 labels matched to 100+ labels matched.

References

  1. Performance Analysis Team. Diagnostic Imaging Dataset Statistical Release; NHS: London, UK, 2022/2023. Available online: https://www.england.nhs.uk/statistics/statistical-work-areas/diagnostic-imaging-dataset/diagnostic-imaging-dataset-2022-23-data/ (accessed on 7 February 2022).
  2. Li, D.; Pehrson, L.M.; Lauridsen, C.A.; Tottrup, L.; Fraccaro, M.; Elliott, D.; Zajac, H.D.; Darkner, S.; Carlsen, J.F.; Nielsen, M.B. The Added Effect of Artificial Intelligence on Physicians’ Performance in Detecting Thoracic Pathologies on CT and Chest X-ray: A Systematic Review. Diagnostics 2021, 11, 2206. [Google Scholar] [CrossRef]
  3. Kim, T.S.; Jang, G.; Lee, S.; Kooi, T. Did You Get What You Paid For? Rethinking Annotation Cost of Deep Learning Based Computer Aided Detection in Chest Radiographs. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 261–270. [Google Scholar]
  4. Willemink, M.J.; Koszek, W.A.; Hardell, C.; Wu, J.; Fleischmann, D.; Harvey, H.; Folio, L.R.; Summers, R.M.; Rubin, D.L.; Lungren, M.P. Preparing medical imaging data for machine learning. Radiology 2020, 295, 4–15. [Google Scholar] [CrossRef]
  5. Bustos, A.; Pertusa, A.; Salinas, J.-M.; de la Iglesia-Vayá, M. Padchest: A large chest x-ray image dataset with multi-label annotated reports. Med. Image Anal. 2020, 66, 101797. [Google Scholar] [CrossRef]
  6. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 590–597. [Google Scholar]
  7. Putha, P.; Tadepalli, M.; Reddy, B.; Raj, T.; Chiramal, J.A.; Govil, S.; Sinha, N.; KS, M.; Reddivari, S.; Jagirdar, A. Can artificial intelligence reliably report chest X-rays?: Radiologist validation of an algorithm trained on 2.3 million X-rays. arXiv 2018, arXiv:1807.07455. [Google Scholar]
  8. Li, D.; Pehrson, L.M.; Tottrup, L.; Fraccaro, M.; Bonnevie, R.; Thrane, J.; Sorensen, P.J.; Rykkje, A.; Andersen, T.T.; Steglich-Arnholm, H.; et al. Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays-An Early Step in the Development of a Deep Learning-Based Decision Support System. Diagnostics 2022, 12, 3112. [Google Scholar] [CrossRef]
  9. Mehrotra, P.; Bosemani, V.; Cox, J. Do radiologists still need to report chest x rays? Postgrad. Med. J. 2009, 85, 339. [Google Scholar] [CrossRef]
  10. Peng, Y.; Wang, X.; Lu, L.; Bagheri, M.; Summers, R.; Lu, Z. NegBio: A high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits Transl. Sci. Proc. 2018, 2018, 188. [Google Scholar]
  11. McDermott, M.B.; Hsu, T.M.H.; Weng, W.-H.; Ghassemi, M.; Szolovits, P. Chexpert++: Approximating the chexpert labeler for speed, differentiability, and probabilistic output. In Proceedings of the Machine Learning for Healthcare Conference, Durham, NC, USA, 7–8 August 2020; pp. 913–927. [Google Scholar]
  12. Wang, S.; Cai, J.; Lin, Q.; Guo, W. An Overview of Unsupervised Deep Feature Representation for Text Categorization. IEEE Trans. Comput. Soc. Syst. 2019, 6, 504–517. [Google Scholar] [CrossRef]
  13. Thangaraj, M.; Sivakami, M. Text classification techniques: A literature review. Interdiscip. J. Inf. Knowl. Manag. 2018, 13, 117. [Google Scholar] [CrossRef] [Green Version]
  14. Calderon-Ramirez, S.; Giri, R.; Yang, S.; Moemeni, A.; Umaña, M.; Elizondo, D.; Torrents-Barrena, J.; Molina-Cabello, M.A. Dealing with Scarce Labelled Data: Semi-supervised Deep Learning with Mix Match for Covid-19 Detection Using Chest X-ray Images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5294–5301. [Google Scholar]
  15. Munappy, A.; Bosch, J.; Olsson, H.H.; Arpteg, A.; Brinne, B. Data Management Challenges for Deep Learning. In Proceedings of the 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Kallithea-Chalkidiki, Greece, 28–30 August 2019; pp. 140–147. [Google Scholar]
  16. Brady, A.P. Radiology reporting-from Hemingway to HAL? Insights Imaging 2018, 9, 237–246. [Google Scholar] [CrossRef] [Green Version]
  17. Ogawa, M.; Lee, C.H.; Friedman, B. Multicenter survey clarifying phrases in emergency radiology reports. Emerg. Radiol. 2022, 29, 855–862. [Google Scholar] [CrossRef]
  18. Klobuka, A.J.; Lee, J.; Buranosky, R.; Heller, M. When the Reading Room Meets the Team Room: Resident Perspectives From Radiology and Internal Medicine on the Effect of Personal Communication After Implementing a Resident-Led Radiology Rounds. Curr. Probl. Diagn. Radiol. 2019, 48, 312–322. [Google Scholar] [CrossRef]
  19. Hansell, D.M.; Bankier, A.A.; MacMahon, H.; McLoud, T.C.; Muller, N.L.; Remy, J. Fleischner Society: Glossary of terms for thoracic imaging. Radiology 2008, 246, 697–722. [Google Scholar] [CrossRef] [Green Version]
  20. Chicco, D.; Jurman, G. The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Min. 2023, 16, 4. [Google Scholar] [CrossRef]
  21. McKinney, W. Data Structures for Statistical Computing in Python. 2010, pp. 56–61. Available online: https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf (accessed on 7 February 2022).
  22. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef]
  23. Asch, V.V. Macro-and Micro-Averaged Evaluation Measures [BASIC DRAFT]. 2013. Available online: https://cupdf.com/document/macro-and-micro-averaged-evaluation-measures-basic-draft.html?page=1 (accessed on 7 February 2022).
  24. Hagberg, A.A.; Schult, D.A.; Swart, P.J. Exploring Network Structure, Dynamics, and Function Using NetworkX. In Proceedings of the 7th Python in Science Conference, Pasadena, CA, USA, 19–24 August 2008; pp. 11–15. [Google Scholar]
  25. Wigness, M.; Draper, B.A.; Ross Beveridge, J. Efficient label collection for unlabeled image datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 4594–4602. [Google Scholar]
  26. Lee, B.; Whitehead, M.T. Radiology Reports: What YOU Think You’re Saying and What THEY Think You’re Saying. Curr. Probl. Diagn. Radiol. 2017, 46, 186–195. [Google Scholar] [CrossRef]
  27. Lacson, R.; Odigie, E.; Wang, A.; Kapoor, N.; Shinagare, A.; Boland, G.; Khorasani, R. Multivariate Analysis of Radiologists’ Usage of Phrases that Convey Diagnostic Certainty. Acad. Radiol. 2019, 26, 1229–1234. [Google Scholar] [CrossRef]
  28. Shinagare, A.B.; Lacson, R.; Boland, G.W.; Wang, A.; Silverman, S.G.; Mayo-Smith, W.W.; Khorasani, R. Radiologist Preferences, Agreement, and Variability in Phrases Used to Convey Diagnostic Certainty in Radiology Reports. J. Am. Coll. Radiol. 2019, 16, 458–464. [Google Scholar] [CrossRef]
  29. Berlin, L. Medicolegal: Malpractice and ethical issues in radiology. Proofreading radiology reports. AJR Am. J. Roentgenol. 2013, 200, W691–W692. [Google Scholar] [CrossRef]
  30. Mylopoulos, M.; Woods, N.N. Having our cake and eating it too: Seeking the best of both worlds in expertise research. Med. Educ. 2009, 43, 406–413. [Google Scholar] [CrossRef]
  31. Winder, M.; Owczarek, A.J.; Chudek, J.; Pilch-Kowalczyk, J.; Baron, J. Are We Overdoing It? Changes in Diagnostic Imaging Workload during the Years 2010-2020 including the Impact of the SARS-CoV-2 Pandemic. Healthcare 2021, 9, 1557. [Google Scholar] [CrossRef]
  32. Sriram, V.; Bennett, S. Strengthening medical specialisation policy in low-income and middle-income countries. BMJ Glob. Health 2020, 5, e002053. [Google Scholar] [CrossRef] [Green Version]
  33. Mylopoulos, M.; Regehr, G.; Ginsburg, S. Exploring residents’ perceptions of expertise and expert development. Acad. Med. 2011, 86, S46–S49. [Google Scholar] [CrossRef]
  34. Farooq, F.; Mahboob, U.; Ashraf, R.; Arshad, S. Measuring Adaptive Expertise in Radiology Residents: A Multicenter Study. Health Prof. Educ. J. 2022, 5, 9–14. [Google Scholar] [CrossRef]
  35. Grant, S.; Guthrie, B. Efficiency and thoroughness trade-offs in high-volume organisational routines: An ethnographic study of prescribing safety in primary care. BMJ Qual. Saf. 2018, 27, 199–206. [Google Scholar] [CrossRef] [Green Version]
  36. Croskerry, P. Adaptive expertise in medical decision making. Med. Teach. 2018, 40, 803–808. [Google Scholar] [CrossRef]
  37. Lafortune, M.; Breton, G.; Baudouin, J.L. The radiological report: What is useful for the referring physician? Can. Assoc. Radiol. J. 1988, 39, 140–143. [Google Scholar]
  38. Branstetter, B.F.t.; Morgan, M.B.; Nesbit, C.E.; Phillips, J.A.; Lionetti, D.M.; Chang, P.J.; Towers, J.D. Preliminary reports in the emergency department: Is a subspecialist radiologist more accurate than a radiology resident? Acad. Radiol. 2007, 14, 201–206. [Google Scholar] [CrossRef]
  39. Clinger, N.J.; Hunter, T.B.; Hillman, B.J. Radiology reporting: Attitudes of referring physicians. Radiology 1988, 169, 825–826. [Google Scholar] [CrossRef]
  40. Kruger, P.; Lynskey, S.; Sutherland, A. Are orthopaedic surgeons reading radiology reports? A Trans-Tasman Survey. J. Med. Imaging Radiat. Oncol. 2019, 63, 324–328. [Google Scholar] [CrossRef]
  41. Lin, C.; Bethard, S.; Dligach, D.; Sadeque, F.; Savova, G.; Miller, T.A. Does BERT need domain adaptation for clinical negation detection? J. Am. Med. Inf. Assoc. 2020, 27, 584–591. [Google Scholar] [CrossRef]
  42. van Es, B.; Reteig, L.C.; Tan, S.C.; Schraagen, M.; Hemker, M.M.; Arends, S.R.S.; Rios, M.A.R.; Haitjema, S. Negation detection in Dutch clinical texts: An evaluation of rule-based and machine learning methods. BMC Bioinform. 2023, 24, 10. [Google Scholar] [CrossRef]
  43. Rokach, L.; Romano, R.; Maimon, O. Negation recognition in medical narrative reports. Inf. Retr. 2008, 11, 499–538. [Google Scholar] [CrossRef]
  44. Zhang, J. Knowledge Learning With Crowdsourcing: A Brief Review and Systematic Perspective. IEEE/CAA J. Autom. Sin. 2022, 9, 749–762. [Google Scholar] [CrossRef]
  45. Li, J.; Zhang, R.; Mensah, S.; Qin, W.; Hu, C. Classification-oriented dawid skene model for transferring intelligence from crowds to machines. Front. Comput. Sci. 2023, 17, 175332. [Google Scholar] [CrossRef]
  46. Whitehill, J.; Ruvolo, P.; Wu, T.; Bergsma, J.; Movellan, J. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Proceedings of the Advances in Neural Information Processing Systems 22-Proceedings of the 2009 Conference, Vancouver, BC, Canada, 7–9 December 2009; pp. 2035–2043. [Google Scholar]
  47. Sheng, V.S.; Zhang, J.; Gu, B.; Wu, X. Majority Voting and Pairing with Multiple Noisy Labeling. IEEE Trans. Knowl. Data Eng. 2019, 31, 1355–1368. [Google Scholar] [CrossRef]
  48. Schmidt, H.G.; Boshuizen, H.P.A. On acquiring expertise in medicine. Educ. Psychol. Rev. 1993, 5, 205–221. [Google Scholar] [CrossRef] [Green Version]
  49. Yavas, U.S.; Calisir, C.; Ozkan, I.R. The Interobserver Agreement between Residents and Experienced Radiologists for Detecting Pulmonary Embolism and DVT with Using CT Pulmonary Angiography and Indirect CT Venography. Korean J. Radiol. 2008, 9, 498–502. [Google Scholar] [CrossRef] [Green Version]
  50. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R. ChestX-ray14: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  51. Frénay, B.; Verleysen, M. Classification in the Presence of Label Noise: A Survey. Neural Netw. Learn. Syst. IEEE Trans. 2014, 25, 845–869. [Google Scholar] [CrossRef]
  52. Callen, A.L.; Dupont, S.M.; Price, A.; Laguna, B.; McCoy, D.; Do, B.; Talbott, J.; Kohli, M.; Narvid, J. Between Always and Never: Evaluating Uncertainty in Radiology Reports Using Natural Language Processing. J. Digit. Imaging 2020, 33, 1194–1201. [Google Scholar] [CrossRef]
  53. Wootton, D.; Feldman, C. The diagnosis of pneumonia requires a chest radiograph (X-ray)-yes, no or sometimes? Pneumonia 2014, 5, 1–7. [Google Scholar] [CrossRef] [Green Version]
  54. Loeb, M.B.; Carusone, S.B.; Marrie, T.J.; Brazil, K.; Krueger, P.; Lohfeld, L.; Simor, A.E.; Walter, S.D. Interobserver reliability of radiologists’ interpretations of mobile chest radiographs for nursing home-acquired pneumonia. J. Am. Med. Dir. Assoc. 2006, 7, 416–419. [Google Scholar] [CrossRef]
  55. Byrt, T.; Bishop, J.; Carlin, J.B. Bias, prevalence and kappa. J. Clin. Epidemiol. 1993, 46, 423–429. [Google Scholar] [CrossRef]
  56. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef] [Green Version]
  57. Hight, S.L.; Petersen, D.P. Dissent in a Majority Voting System. IEEE Trans. Comput. 1973, 100, 168–171. [Google Scholar] [CrossRef]
Figure 1. Labeling hierarchy for chest X-ray text report annotation.
Figure 2. Annotation software for text report annotations. The full-text report is displayed on the right side and labels in the labeling hierarchy are displayed on the left. On the top left, selected labels are showcased; red labels for negative findings and blue labels for positive findings.
Figure 3. Pareto chart of all annotators accumulated use of labels for (a) positive findings and (b) negative findings.
Table 1. An example of a 2 × 2 confusion matrix for the calculation of the Matthews correlation coefficient. TP, true positive; FP, false positive; FN, false negative; TN, true negative.
Gold Standard
Annotator(s) | Labels used | Labels NOT used
Labels used | TP | FP
Labels NOT used | FN | TN
Table 2. Matthews correlation coefficients (MCC) for annotators’ performance in annotating chest X-ray text reports compared to the gold standard annotation set for (a) positive findings and (b) negative findings.
Annotator | Radiologist, intermediate | Radiologist, novice | Radiographer, experienced | Radiographer, novice | Physician, non-radiologist | Senior medical student
MCC (a: positive findings) | 0.77 | 0.71 | 0.57 | 0.65 | 0.64 | 0.71
MCC (b: negative findings) | 0.92 | 0.88 | 0.64 | 0.88 | 0.77 | 0.88
Table 3. Number of matched labels of both positive and negative findings for each annotator, majority of annotators, and gold standard.
Annotator | Radiologist, intermediate | Radiologist, novice | Radiographer, experienced | Radiographer, novice | Physician, non-radiologist | Senior medical student | Majority | Majority excl. intermed. radiologist | Gold standard
Radiologist, intermediate | - | 849 | 679 | 785 | 753 | 832 | 794 | 810 | 766
Radiologist, novice | 849 | - | 654 | 763 | 744 | 811 | 779 | 815 | 740
Radiographer, experienced | 679 | 654 | - | 642 | 597 | 669 | 664 | 680 | 589
Radiographer, novice | 785 | 763 | 642 | - | 710 | 791 | 753 | 801 | 710
Physician, non-radiologist | 753 | 744 | 597 | 710 | - | 741 | 714 | 746 | 665
Senior medical student | 832 | 811 | 669 | 791 | 741 | - | 783 | 823 | 741
Majority | 794 | 779 | 664 | 753 | 714 | 783 | - | 824 | 702
Majority excl. intermed. radiologist | 810 | 815 | 680 | 801 | 746 | 823 | 824 | - | 723
Gold standard | 766 | 740 | 589 | 710 | 665 | 741 | 702 | 723 | -
Color scale: fewest matched (worst) to most matched (best); midpoint at the 50% fractile.
Table 4. Number of matched cases (accumulated) on specific labels in the labeling scheme related to “lung tissue findings”. * Rows and columns not belonging to the parent node “lung tissue findings” and that did not have any label disagreements have been pruned and thus number of rows does not match number of columns.
Gold Standard *
Annotators * AtelectasisConsolidationCysts/BullaeIncreased InterstitialInfiltrateDecreased TranslucencyNodule, Tumor or MassPleural CalcificationPleural ChangesPleural EffusionPleural Thickening
Atelectasis 219
Cavitary lesion 8
Consolidation 22 533
Cysts/bullae 21
Diffuse infiltrate 30
Increased interstitial… 20
Infiltrate 4 687 110
Lung1 1 1 2
Decreased translucency13 15522191
Nodule, tumor or mass 8 28
Pleural calcification 32
Pleural changes 1010 13
Pleural effusion 743
Pleural thickening 30
Color scale: 0 labels matched to 100+ labels matched.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
