1. Introduction
Due to their low cost, widespread availability, and low radiation burden, conventional radiographs (X-rays) are the most frequently used imaging modality in radiology, with chest radiographs (CXR) accounting for nearly half of the performed examinations [
1]. The widespread shortage of radiologists, coupled with the ever-increasing demand for imaging, makes it challenging to maintain short turnaround times while upholding the quality of radiological reports [
2]. In some settings, it may not be feasible to report all acquired X-ray examinations in a timely manner, leading to large backlogs of unreported examinations [
3].
Artificial intelligence (AI) has the potential to transform radiology in various ways, from faster medical image assessment to diagnostic and operational support [
4]. One could hypothesize that AI could support the prioritization of normal vs. abnormal cases, allowing more complex cases to be assessed by radiologists while simpler cases are handled by AI alone or in combination with reporting radiographers. This may lead to shorter turnaround times and faster responses, benefiting both the patient and the department. This is further supported by Woznitza et al., who assessed the agreement of radiological text reports for CXR between thoracic radiologists, consultant radiologists, and trained radiographers in clinical practice and concluded that reports from trained radiographers were indistinguishable from those of consultant radiologists and expert thoracic radiologists [
5]. A prior study from our group evaluated the agreement levels (fair to excellent) among six clinicians on pulmonary findings in CXR after developing a diagnostic labelling scheme for consistent labelling of findings [
6].
To achieve such outcomes, it is important to consider how CXR reports are constructed. To date, while the structure of the reports is generally consistent, the language used is chosen by the radiologist, allowing for the expression of subtleties and uncertainty, but also potentially introducing bias and variability [
7]. Such variability in radiological terminology can influence the reliability of AI models when these reports are used to define the ground truth for training or testing algorithms. Previous studies have also suggested that factors such as medical experience, terminology, bias, local disease prevalence, and geographic location may impact the interpretation and naming of CXR findings by clinicians [
8,
9]. This inconsistency in terminology may lead to poor or inconsistent AI performance, as well as systematic biases. Reports in the literature suggest circumventing this variability by utilizing ontological systems for annotation [
10]. The use of structured reports could also facilitate and speed up information retrieval for clinicians [
7]. A plethora of labelling schemes for CXR annotations have been developed, with differing numbers of labels as well as different degrees of variability in terminology. For instance, CheXpert and MIMIC-CXR each include 14 labels, whereas PadChest comprises more than 180 unique labels [
11,
12,
13].
The most commonly used database for the development of machine learning models for CXR interpretation is ‘Chest X-ray14’, which includes only the most common pulmonary findings [
14,
15,
16]. As a result, many studies focus solely on pulmonary findings; however, extrapulmonary radiographic findings are also important. Extrapulmonary labels encompass the mediastinal contour (e.g., enlarged cardiomediastinum, cardiomegaly), bone pathologies (e.g., fractures), soft tissue anomalies (e.g., subcutaneous emphysema), foreign objects, medical devices and their positioning, as well as other findings (non-pathological or pathological) (
Figure 1). Failure to detect extrapulmonary findings may lead to diagnostic delays, missed secondary diagnoses, or overlooked comorbidities impacting patient management [
17]. This is underlined by the study of Nguyen et al. (2017), which measured the prevalence of clinically significant extrapulmonary findings on chest CT for lung cancer screening in the National Lung Screening Trial: 58.7% of the screened population (
n = 17,309) had extrapulmonary findings, and approximately 20% of them were classified as significant [
18]. Similarly, in the setting of tuberculosis (TB) screening programs, a systematic review comparing human readers with computer-aided detection (CAD) software for TB (CAD4TB v6, Delft Imaging; Lunit Insight CXR, Lunit; and qXR v2, Qure.ai) demonstrated a substantial overlap between the two, to the point that the World Health Organization (WHO) guideline development group considers such CAD software accurate and scalable and states that its use can increase access to CXR and mitigate the scarcity of radiologists. However, they highlighted that a drawback of using CAD interpretation in place of human readers is that the software cannot identify lung pathologies other than TB [
19,
20].
As reported by Yang et al. (2023), no studies so far have assessed the agreement between multiple readers with different experience levels for extrapulmonary findings in a structured manner in the broader population [
21]. With this in mind, this study aims to evaluate the reliability of a diagnostic labelling scheme for extrapulmonary radiographic findings and to assess the level of expertise required for accurate annotation in the development of CXR algorithms.
3. Results
All clinicians demonstrated almost perfect agreement in their annotations (RK = 0.87 [0.82–0.93]). Agreement was highest for negative findings (PNA = 0.97 [0.95–0.98]), indicating strong consistency in ruling out findings. However, agreement for positive findings was moderate to low (PPA = 0.25 [0.18–0.54]), suggesting greater variability when identifying present findings.
Across experience levels, annotation patterns were largely consistent, with no significant differences in overall agreement. While experienced clinicians showed a slight trend toward using fewer labels, this was not statistically significant. However, pairwise comparisons identified some significant differences in label use between novice and experienced readers, suggesting that clinical expertise may subtly influence annotation decisions. Despite these variations, almost perfect agreement was observed across all groups (PABAK: 0.86–0.91).
Intra-reader agreement remained stable over both rounds (3-week wash-out period), with clinicians maintaining almost perfect agreement between rounds (PABAK = 0.94 [0.93–0.95]). The ability to consistently reproduce annotations reinforces the reliability of individual interpretations. However, the lower agreement for positive findings points to potential challenges and indicates that further refinement of the training and labelling protocol is warranted.
Overall, these findings highlight the strength of the annotation process while identifying areas where variability exists, particularly in the detection of positive findings and subtle differences between experience levels. The following sections will provide a detailed breakdown of statistical results and subgroup comparisons to further explore these patterns—frequency of label use (
Section 3.1), Randolph’s free multirater kappa (
Section 3.2), specific agreement (
Section 3.3), PABAK and Kruskal–Wallis test (
Section 3.4), and intra-reader agreement (
Section 3.5).
3.1. Frequency of Label Use
The label frequency data (
Table A1) were initially assessed to understand annotation tendencies. The variance of the frequency counts was notably greater than the mean, indicating overdispersion, which warranted the use of a negative binomial GLM to model the relationship between experience level and annotation frequency (
Table A2). Afterwards, Bonferroni-corrected pairwise comparisons were performed to determine differences across specific groups while controlling for multiple testing. The negative binomial GLM indicated that the intercept estimate for the novice group was 2.49 on the log scale, which approximates 12 annotations for novices when exponentiated. The coefficient for the intermediate group was close to zero (−0.0076), suggesting no statistically significant difference in annotation counts compared to the novice group (
p = 0.98). The largest difference was between the novice and experienced groups, with an estimated coefficient (−0.4449) suggesting less frequent use of annotations for the experienced group relative to the novice group, but it was not statistically significant (
p = 0.28) (
Figure 2).
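For illustration, the sketch below shows how such an analysis step could be reproduced in Python with statsmodels; the data frame, column names, and count values are hypothetical and do not correspond to the study's actual data or analysis code.

```python
# Illustrative sketch only: a negative binomial GLM of annotation counts on
# experience level, with 'novice' as the reference category. All values are
# hypothetical; this is not the study's analysis code.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical per-label annotation counts for each experience group.
df = pd.DataFrame({
    "label_count": [14, 9, 12, 15, 10, 13, 11, 12,   # novice
                    13, 10, 11, 14, 12, 12, 10, 13,   # intermediate
                    9, 7, 8, 10, 7, 9, 6, 8],         # experienced
    "experience": ["novice"] * 8 + ["intermediate"] * 8 + ["experienced"] * 8,
})

# Negative binomial GLM (fixed dispersion parameter for simplicity).
model = smf.glm(
    "label_count ~ C(experience, Treatment(reference='novice'))",
    data=df,
    family=sm.families.NegativeBinomial(alpha=1.0),
).fit()
print(model.summary())

# Crude Bonferroni adjustment of the two treatment contrasts against the
# novice reference (a full set of pairwise contrasts would require refitting
# with a different reference level or using marginal-means tooling).
adjusted_p = (model.pvalues.drop("Intercept") * 3).clip(upper=1.0)
print(adjusted_p)
```

In such a model, the exponentiated intercept corresponds to the expected annotation count in the reference (novice) group, matching the interpretation of the 2.49 log-scale estimate reported above.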
Bonferroni-corrected pairwise comparisons similarly showed no significant differences in annotation counts between experience levels (
Table A3). The estimated difference in log counts between novice and intermediate groups was approximately zero (0.007,
p = 1.00), and between novice and experienced 0.44 (
p = 0.86), again indicating no significant difference. Additionally, there was no significant difference between the intermediate and experienced groups, with an estimated log difference of 0.43 (
p = 0.88).
3.2. Randolph’s Free Multirater Kappa
3.2.1. Group Level Agreement Among Novice, Intermediate, and Experienced Readers
Inter-reader reliability (RK with 95% CI) was assessed at the group level (novice, intermediate, experienced), with bootstrapping used to estimate the SE of between-group differences in RK.
No significant differences in RK among the grouped clinicians were found after adjusting the
p-values for multiple comparisons (Holm, Bonferroni), indicating that agreement was consistent across all groups and labels. An increase in RK as a function of experience was observed, though it was not statistically significant (
Figure 3). Visual distribution of all RK values for each group is available in
Figure A1.
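As a rough sketch of how RK and its bootstrapped uncertainty could be computed, the snippet below implements a generic free-marginal multirater kappa with a percentile bootstrap over cases; the function names and the example data are hypothetical and are not the study's code.

```python
# Minimal sketch of Randolph's free-marginal multirater kappa with a
# percentile bootstrap over cases; illustrative only, not the study's code.
import numpy as np

def randolph_kappa(ratings: np.ndarray, n_categories: int = 2) -> float:
    """ratings: (n_cases, n_raters) array of category codes per case."""
    n_cases, n_raters = ratings.shape
    # Per-case number of raters assigning each category.
    counts = np.stack([(ratings == c).sum(axis=1) for c in range(n_categories)], axis=1)
    # Observed agreement: proportion of agreeing rater pairs, averaged over cases.
    p_obs = ((counts * (counts - 1)).sum(axis=1) / (n_raters * (n_raters - 1))).mean()
    p_exp = 1.0 / n_categories  # free-marginal chance agreement
    return (p_obs - p_exp) / (1.0 - p_exp)

def bootstrap_ci(ratings: np.ndarray, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for RK, resampling cases with replacement."""
    rng = np.random.default_rng(seed)
    n_cases = ratings.shape[0]
    stats = [randolph_kappa(ratings[rng.integers(0, n_cases, n_cases)]) for _ in range(n_boot)]
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Example: 6 raters annotating one binary label on 100 hypothetical cases.
rng = np.random.default_rng(1)
example = (rng.random((100, 6)) < 0.1).astype(int)
print(randolph_kappa(example), bootstrap_ci(example))
```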
3.2.2. Overall Randolph Kappa Agreement Among All Clinicians
All six clinicians achieved a mean RK value that indicated almost perfect agreement (RK = 0.87 (0.82–0.93)) (
Table 1). Wilcoxon pairwise comparisons across all pathology labels showed no significant differences (Bonferroni, Holm). For 18 of the 23 labels, the readers achieved almost perfect agreement, whereas for the remaining five labels, “normal”, “support device”, “correct placement”, “former non-pulmonary OP implants”, and “other pathological”, substantial RK values were obtained. Three of these five labels belong to the same parent label (foreign object), which accordingly showed the lowest RK agreement of all parent labels.
3.3. Specific Agreement
3.3.1. Overall Proportion of Positive and Negative Agreement Among All Clinicians
The clinicians obtained an average PPA indicating moderate-to-low positive agreement (PPA = 0.25 (0.18–0.54)). The average PNA indicated almost perfect negative agreement (PNA = 0.97 (0.95–0.98)), reflecting a high level of consistency when readers identified the absence of findings (
Table 1).
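For reference, for a pair of readers and a binary label, with a denoting the number of cases both readers mark positive, d the number both mark negative, and b and c the discordant counts, specific agreement is commonly defined as follows (a standard formulation; the exact multi-reader averaging used in this study may differ):

```latex
\mathrm{PPA} = \frac{2a}{2a + b + c}, \qquad \mathrm{PNA} = \frac{2d}{2d + b + c}
```

Unlike kappa statistics, these indices are not chance-corrected, which is why they are reported alongside RK and PABAK.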
3.3.2. Group-Wise Proportion of Positive and Negative Agreement Among All Clinicians
The results of the pairwise analysis across all groups (novice, intermediate, experienced), performed to determine whether there were significant differences in specific agreement levels (PPA and PNA, Bonferroni adjusted), are shown in
Table A3. Significant differences were observed in the specific agreement values between the novice and experienced groups (
n = 8) (
Figure 4). Additionally, the label ‘foreign object’ and all its sub-labels consistently showed significant differences in all comparisons for both PPA and PNA. An overview of all PPA and PNA levels for all readers is provided in
Table A4.
3.4. PABAK and Kruskal–Wallis Test
Group
All groups of clinicians (novice, intermediate, experienced) obtained mean PABAK values indicating almost perfect agreement across all radiological labels (novice: 0.86 (0.73–0.92), intermediate: 0.90 (0.78–0.95), experienced: 0.91 (0.80–0.95)) (
Table 2), indicating high consistency in their evaluations. To assess group-level PABAK differences, the Kruskal–Wallis test was conducted; it showed no significant difference between the groups (novice, intermediate, experienced), demonstrating comparable consistency in agreement (
Table A5).
To confirm the absence of substantial differences between the groups, pairwise comparisons using the Wilcoxon test were conducted and showed no significant differences (Bonferroni adjusted). Together with the Kruskal–Wallis result, this indicates that agreement levels among novice, intermediate, and experienced readers were statistically similar, supporting uniformity in their evaluations.
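A minimal sketch of these two steps, pairwise PABAK for binary labels and a Kruskal–Wallis test on per-label PABAK values by group, is given below; the numeric values are invented for illustration and are not the study's data.

```python
# Illustrative sketch: pairwise PABAK for binary labels and a Kruskal-Wallis
# test across experience groups. Values are hypothetical, not the study's data.
import numpy as np
from scipy.stats import kruskal

def pabak(reader_a: np.ndarray, reader_b: np.ndarray) -> float:
    """Prevalence- and bias-adjusted kappa for two binary ratings: 2*Po - 1."""
    p_obs = np.mean(reader_a == reader_b)  # observed proportion of agreement
    return 2.0 * p_obs - 1.0

# Hypothetical per-label PABAK values for each experience group.
novice = np.array([0.84, 0.88, 0.73, 0.92, 0.86, 0.90])
intermediate = np.array([0.90, 0.87, 0.78, 0.95, 0.91, 0.92])
experienced = np.array([0.91, 0.89, 0.80, 0.95, 0.93, 0.92])

stat, p_value = kruskal(novice, intermediate, experienced)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p_value:.3f}")
```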
3.5. Intra-Reader Agreement
Agreement for Each Clinician Between Rounds 1 and 2 of Annotation
To assess intra-reader agreement between the two rounds of annotation, PABAK, PPA, PNA, and their CIs were calculated for each clinician and as an average across all clinicians. McNemar's test was conducted to determine whether any significant changes occurred in the paired nominal data between rounds (Bonferroni adjusted). All clinicians, regardless of experience level, showed high mean PABAK values (
Table 3), with a mean PABAK indicating almost perfect agreement between rounds of annotations for each reader (PABAK = 0.94 (0.93–0.95)) (PABAK scores visualised in heatmap,
Figure A2).
Mean PPA values showed more variation, ranging from 0.32 to 0.54, with an average PPA indicating moderate-to-low positive agreement (PPA = 0.38 (0.32–0.44)) (PPA scores visualised in heatmap,
Figure A3). PNA values were consistently high, indicating almost perfect agreement for negative findings (mean PNA = 0.98 (0.98–0.99)) (PNA scores visualised in heatmap,
Figure A4). McNemar's test (Bonferroni adjusted) showed no significant differences in PABAK, PPA, or PNA for any clinician between rounds, indicating consistent annotations for all groups and readers.
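For completeness, a small sketch of the round-to-round comparison for a single reader and label is shown below, using McNemar's exact test from statsmodels; the annotations are hypothetical, and in the study the resulting p-values would additionally be Bonferroni adjusted across labels and readers.

```python
# Illustrative sketch: McNemar's test on one reader's round-1 vs round-2
# binary annotations for a single label. Data are hypothetical.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def round_comparison(round1: np.ndarray, round2: np.ndarray):
    """Build the 2x2 paired table (positive/negative in each round) and test it."""
    table = np.array([
        [np.sum((round1 == 1) & (round2 == 1)), np.sum((round1 == 1) & (round2 == 0))],
        [np.sum((round1 == 0) & (round2 == 1)), np.sum((round1 == 0) & (round2 == 0))],
    ])
    result = mcnemar(table, exact=True)  # exact test for small discordant counts
    return table, result.pvalue

# Hypothetical annotations for 20 cases in rounds 1 and 2.
r1 = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
r2 = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0])
table, p = round_comparison(r1, r2)
print(table)
print(f"McNemar p = {p:.3f}")
```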
4. Discussion
This study demonstrates that clinicians across all experience levels show consistent label use, with mean RK values indicating almost perfect agreement (novice: 0.85 [95% CI: 0.72–0.91], intermediate: 0.89 [95% CI: 0.85–0.93], experienced: 0.90 [95% CI: 0.84–0.96]). Notably, the overall mean RK value across all six clinicians (0.87 [95% CI: 0.82–0.93]) further supports the robustness of inter-rater reliability in assessing extrapulmonary findings in CXR images (
Table 1). More experienced clinicians use labels less frequently and achieve slightly higher RK values, though these differences are not statistically significant. Mean RK values across all clinicians and groups remain almost perfect. Novice clinicians exhibit higher RK agreement for broader parent labels, while experienced clinicians maintain consistently high agreement across all annotations. No significant differences are observed between the novice and experienced groups, indicating uniformity in annotation performance. Specific agreement analysis reveals moderate-to-low PPA, suggesting variability in identifying positive findings, while PNA is almost perfect, reflecting strong consistency in identifying the absence of findings. These findings highlight a disparity in performance between positive and negative case identification. PABAK values indicate almost perfect agreement across groups, with no significant differences detected in the Kruskal–Wallis or Wilcoxon analyses, confirming consistent agreement among groups and readers. Intra-reader reliability is similarly high, with all clinicians achieving almost perfect PABAK scores between rounds. While intra-reader PPA remains moderate to low and more variable, PNA remains consistently high, demonstrating strong reliability in ruling out findings.
Prior to this study, Li D et al. examined the labels assigned by the same six clinicians, focusing specifically on identifying labels for pulmonary findings [
6]. Building on their work, this study expands the analysis to compare both pulmonary and extrapulmonary findings (
Table A6). A difference in agreement was observed, both overall and when grouped by experience level. Clinicians generally agreed more on extrapulmonary findings, resulting in higher RK and PABAK values. Additionally, the specific agreement measures differed: across all six clinicians, higher PPA and lower PNA were observed for pulmonary findings compared to extrapulmonary radiographic findings. This discrepancy could potentially be explained by the more granular taxonomy and the more frequent use of labels for pulmonary findings, which increases the specificity of annotation options and could lead to greater variability in clinician interpretations.
The observed disparity between PPA and PNA reflects variability in identifying positive findings alongside strong reliability in ruling out conditions. The moderate-to-low PPA suggests potential challenges in consistently identifying positive findings, which may contribute to variability in recognizing critical conditions and, in turn, lead to inadequate treatment decisions, compromise patient safety, and affect disease monitoring and prognosis. Conversely, the almost perfect PNA demonstrates consistent agreement in negative case identification, indicating reliability in ruling out findings and minimizing false positives. This disparity highlights the need for focused efforts to improve consistency in positive case identification while preserving the high reliability of negative assessments [
32].
Several studies have demonstrated that an increased level of training is significantly associated with improved accuracy when evaluating and interpreting CXR [
6,
38,
39,
40,
41]. Specifically, Eisen et al. showed that medical doctors specializing in radiology performed better than novices [
42]. This finding is corroborated by Fabre et al., who indicated that the number of years in residency significantly enhanced detection capabilities, and that attendance at CXR training courses was linked to improved interpretative performance [
40]. Sverzellati et al. aimed to evaluate the interobserver agreement among four radiologists, divided into two groups based on their experience levels: experienced and less experienced [
43]. Their findings indicated significantly better interobserver agreement among the more experienced radiologists compared with their less experienced colleagues when distinguishing between normal and abnormal CXR. In the present study, the specific positive agreement for the label ‘normal’ likewise showed that clinicians with more experience achieved significantly higher PPA values (novice: 0.27; intermediate: 0.67; experienced: 0.65) (
Table A4).
In clinical practice, clinicians from specialties other than radiology frequently look at CXRs before the radiological report is available and start making decisions independently of the radiologist’s opinion. Moreover, in some clinical settings, reporting radiographers report CXR autonomously. This emphasizes that experience is key in the assessment of CXR examinations and that formal radiological training is not always required. In the setting of annotating extensive datasets for training or testing AI algorithms, these results suggest that clinicians with lower levels of experience might perform adequately. As the annotation of extensive datasets is typically time-consuming, the knowledge that distributing annotations to experienced clinicians, who are not necessarily formally trained radiologists, would not affect data quality could alleviate part of this burden. Future work is required to establish whether this could be extended to radiographers.
The findings of this study can also serve as a first step towards a more rational workflow, in which the reduced response time presumed with AI assistance would likely improve clinical decision-making and treatment response times and alleviate psychological stress for the patient. Additionally, this approach could free up radiologists’ time for more complex cases and/or imaging techniques, improving workflow efficiency, enhancing time management, and optimizing resource allocation within radiology departments.
High-quality datasets play an essential role in AI development. The principle of “garbage in, garbage out” is widely acknowledged in both machine learning and computer science, emphasizing the criticality of reliable data input [
44]. A well-constructed dataset intended for training or validating AI software should feature expertly annotated data that covers the full spectrum of the target disease and reflects the diversity of the intended population [
45]. Such datasets can be fundamental to ensuring the reliability and accuracy of AI models [
45,
46,
47]. This study addresses the importance of consistent and reliable annotations, which are crucial for creating high-quality datasets that support accurate AI model development.
The use of an ontological labelling scheme for annotations differs from the clinicians’ typical free-text reporting, which could introduce bias in label selection [
7]. To minimize this potential bias, no additional case information was provided. However, this differs from the usual clinical practice where clinicians have access to patient referrals, IDs, and medical histories. Existing studies have not reached a consensus on whether the availability of such clinical information affects radiologists’ interpretive performance on CXR [
38,
39].
This study was limited by the relatively small number of participating clinicians and included cases. The cases were selected to ensure that each label occurred at least twice in the original free-text reports, allowing for statistical analysis. However, this selection process also affected the prevalence and distribution of labels in the dataset relative to the typical prevalence patterns found in the general population. Kappa statistics are affected by prevalence, as they measure agreement relative to what would be expected by chance; when a label is either highly prevalent or highly rare in the dataset, kappa values are often lower. To account for this, RK and PABAK were used in this setting to provide a comprehensive overview while adjusting for prevalence and bias. Additionally, in a supervised learning setting, deep learning algorithms cannot detect findings they have not been trained on; they must be trained on positively labelled data. This is why the specific agreement measures, PPA and PNA, were also included. While it is still possible to achieve a high PPA when prevalence is low, the likelihood of this is small. Reporting both specific agreement and chance-adjusted agreement therefore provides a more comprehensive evaluation.
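To make this prevalence dependence concrete (a standard two-reader formulation, not specific to this study), with observed agreement P_o and chance agreement P_e derived from the readers' marginal label prevalences:

```latex
\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad \mathrm{PABAK} = 2P_o - 1
```

For a very rare or very common binary label, P_e approaches 1, so kappa can be low even when P_o is high; PABAK instead fixes the chance term at 0.5, removing this prevalence dependence.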
The use of a predefined labelling scheme may not fully capture the complexity of extrapulmonary findings. While the selection of specific labels could have influenced the interpretation of the findings, potentially restricting the generalizability of the results to other annotation frameworks, it must be kept in mind that CAD software may be used. Furthermore, inter-reader variability remains an inherent challenge, as image interpretation is subjective, and differences in individual diagnostic approaches may have influenced agreement levels. While efforts were made to standardize the annotation process through the provision of trial cases and guidance from the research team, individual differences in perception could still affect labelling consistency. A limitation of this study is that the dataset was not used to assess differences in algorithmic performance, either across the entire sample or within individual groups. The primary focus was on examining the agreement levels among clinicians using a labelling scheme to annotate CXR findings.