Three-Stage Framework for Accurate Pediatric Chest X-ray Diagnosis Using Self-Supervision and Transfer Learning on Small Datasets

Zhang, Yufeng; Kohne, Joseph; Wittrup, Emily; Najarian, Kayvan

doi:10.3390/diagnostics14151634


¹ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA

² Department of Pediatrics, University of Michigan, Ann Arbor, MI 48103, USA

³ Michigan Institute for Data Science (MIDAS), University of Michigan, Ann Arbor, MI 48109, USA

⁴ Department of Emergency Medicine, University of Michigan, Ann Arbor, MI 48109, USA

⁵ Max Harry Weil Institute for Critical Care Research and Innovation, University of Michigan, Ann Arbor, MI 48109, USA

* Author to whom correspondence should be addressed.

Diagnostics 2024, 14(15), 1634; https://doi.org/10.3390/diagnostics14151634

Submission received: 22 June 2024 / Revised: 19 July 2024 / Accepted: 25 July 2024 / Published: 29 July 2024

(This article belongs to the Special Issue Application of Artificial Intelligence in Radiological Imaging Analysis and Diagnosis)


Abstract

Pediatric respiratory disease diagnosis and subsequent treatment require accurate and interpretable analysis. A chest X-ray is the most cost-effective and rapid method for identifying and monitoring various thoracic diseases in children. Recent developments in self-supervised and transfer learning have shown their potential in medical imaging, including chest X-ray analysis. In this article, we propose a three-stage framework with knowledge transfer from adult chest X-rays to aid the diagnosis and interpretation of pediatric thorax diseases. We conducted comprehensive experiments with different pre-training and fine-tuning strategies to develop transformer or convolutional neural network models and then evaluated them qualitatively and quantitatively. The ViT-Base/16 model, fine-tuned with the CheXpert dataset, a large chest X-ray dataset, emerged as the most effective, achieving a mean AUC of 0.761 (95% CI: 0.759–0.763) across six disease categories, with an average sensitivity of 0.639 and an average specificity of 0.683. The baseline models, ViT-Small/16 and ViT-Base/16, when directly trained on the Pediatric CXR dataset, only achieved mean AUC scores of 0.646 (95% CI: 0.641–0.651) and 0.654 (95% CI: 0.648–0.660), respectively. Qualitatively, our model excels in localizing diseased regions, outperforming models pre-trained on ImageNet and other fine-tuning approaches, thus providing superior explanations. The source code is available online and the data can be obtained from PhysioNet.

Keywords:

chest X-ray; medical image analysis; self-supervised learning; transfer learning; model interpretability

1. Introduction

Pediatric respiratory diseases, such as pneumonia, are the leading cause of hospitalizations and mortality in young children and infants in the United States and worldwide [1], with an estimated annual mortality of more than 0.7 million children under the age of five globally [2]. Early diagnosis and treatment can greatly reduce the risk of severe outcomes and mortality. Chest X-rays (CXRs) are the most commonly performed imaging examination in pediatrics and support non-invasive, accurate diagnosis and timely intervention in pediatric respiratory disease cases. Therefore, early thorax disease detection using a CXR is essential in pediatric medical research and practice.

Recent advances in deep learning (DL) have revolutionized the analysis of CXRs, with convolutional neural networks (CNNs) showcasing exceptional capabilities in diagnosing diseases and identifying affected regions [3,4,5]. Building on this success, newer architectures such as transformer-based and graph neural network-based models are gaining traction [6,7,8,9]. These models not only maintain a high diagnostic performance but also enhance the interpretability of the results, which is crucial for clinical acceptance. These advancements promise to sustain momentum towards more sophisticated, automated medical imaging diagnostics which will transform patient outcomes through technology.

Despite the advantages of CXRs and the significant advancement in applying deep learning to building decision support systems using CXRs, the development of accurate and efficient models hinges on the availability of precise annotations. However, accurately annotating CXRs remains a challenging task, even for skilled radiologists, due to the high costs and extensive time required for obtaining annotations. Furthermore, inconsistencies are common among physicians. To address these challenges, researchers are turning to self-supervised learning strategies such as BYOL [10], DINO [11], and Masked Autoencoders (MAE) [12]. These techniques enable the pre-training of models on large unlabeled datasets, followed by fine-tuning using smaller, task-specific labeled datasets. As demonstrated in recent studies [13,14,15], not only are these methods resource-efficient in terms of CXR label usage, but they also excel in developing robust image representations that often match or even surpass the performance of fully supervised approaches. A summary of self-supervised methods for CXR pre-training is presented in Table 1.

However, a notable limitation of self-supervised learning is its dependency on large datasets. The majority of existing large-scale CXR datasets primarily comprise adult cases [21,22], and there is a moderate difference in clinical characteristics between adult and pediatric patients [23]. Pediatric CXR datasets, particularly publicly available ones, are scarce and typically only contain samples for one specific disease [24]. The most comprehensive public pediatric CXR dataset to date has recently been released, containing only a few thousand images [25]. This data scarcity poses a challenge in developing accurate and reliable models for pediatrics despite ongoing research efforts.

In this study, we aim to create an accurate, automated system for diagnosing pediatric respiratory diseases from CXRs, addressing the challenge of limited pediatric CXR data. To achieve this, we developed a three-stage transfer learning approach specifically for classifying pediatric CXR diagnoses, as shown in Figure 1. This approach involves (1) initially pre-training the system using a masked autoencoder, a self-supervised learning framework, on large-scale datasets of publicly available adult CXRs or natural images, (2) subsequently fine-tuning the model with additional adult CXR data, and (3) further fine-tuning with pediatric CXR data to enhance model accuracy and effectiveness in this specialized domain. We investigate the impact of pre-training the model on natural images or adult CXRs and the effects of fine-tuning it on various CXR datasets. This work matters because few studies focus on pediatric CXRs, owing to limited data availability and the heterogeneity between adult and pediatric CXRs. In summary, the main contributions of our work include the following:

We developed and evaluated a three-stage training framework specifically for the pediatric chest X-ray dataset.
Our study compares this three-stage training framework to other approaches, demonstrating that the ViT-Base/16 model, pre-trained on CXR, fine-tuned on CheXpert, and further fine-tuned on PediCXR, outperforms the others.
We examined the top-performing model’s ability to detect common diseases accurately.

2. Materials and Methods

2.1. Dataset

2.1.1. Adult CXR Dataset

Four public adult CXR datasets were utilized in the three-stage approach for (a) pre-training and (b) fine-tuning. The MIMIC-CXR dataset provides 243,334 image–text paired chest X-rays across fourteen classes [21]. The CheXpert dataset is a large collection of 191,028 frontal-view X-rays, focusing on five common lung diseases [22]. The COVIDx dataset includes over 300,000 images, categorized into four disease classes [26]. Lastly, the ChestX-ray14 dataset comprises 112,120 frontal view X-rays, spanning fourteen disease classes [27].

2.1.2. Pediatric CXR Dataset

In this study, we utilized the recently released and currently largest pediatric CXR dataset, PediCXR [25,28], for pediatric thoracic disease diagnosis. This dataset comprises a training set of 7728 images and a test set of 1397 images. Each X-ray image was manually annotated by experienced, board-certified radiologists. The entire PediCXR dataset encompasses 15 diagnoses, but the test set includes only 11. Following the strategy of [29], we grouped less common diagnoses into a category labeled ‘other disease’, resulting in six disease categories: broncho-pneumonia, bronchiolitis, bronchitis, pneumonia, no findings, and other diseases. Detailed information about the dataset is shown in Table 2. The images, initially in DICOM format, were converted into JPG format with the Python package Pydicom version 2.4 for further processing. Each original image, with dimensions of 1692 by 1255 pixels, was resized to 224 by 224 pixels. After resizing, the images were standardized using mean values of (0.5056, 0.5056, 0.5056) and standard deviations of (0.252, 0.252, 0.252) across the three channels [30].
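The standardization step described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: it assumes pixel intensities have already been scaled to [0, 1] before the quoted mean and standard deviation are applied (the text does not state the scaling), and the function names `standardize` and `standardize_image` are hypothetical.

```python
# Per-channel statistics quoted in the text (identical for all three channels).
MEAN = (0.5056, 0.5056, 0.5056)
STD = (0.252, 0.252, 0.252)


def standardize(pixel):
    """Standardize one RGB pixel whose values are assumed to lie in [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(pixel, MEAN, STD))


def standardize_image(image):
    """Apply `standardize` to every pixel of an image stored as rows of RGB tuples."""
    return [[standardize(pixel) for pixel in row] for row in image]
```

A pixel equal to the mean maps to zero, and a pixel one standard deviation above the mean maps to one, which is the usual input range expected by pre-trained encoders.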

Figure 1. The overall workflow of the PediCXR classification task. It consists of three stages. (a) Pre-training stage: self-supervised learning is performed using MAE on natural images or adult CXRs. (b) Adult CXR fine-tuning stage: the trained encoder undergoes supervised learning with the adult CXR dataset. (c) Knowledge-transferring stage: the trained encoder is further linear-probed/fine-tuned on the PediCXR dataset for specific knowledge acquisition.

2.2. Vision Transformer, Masked Autoencoder and Transfer Learning

The transformer, based on the multi-headed self-attention mechanism and positional encoding, was first introduced in [31] for natural language processing and has shown its great power in multiple tasks. ViT [32] extends its application to the computer vision area. The images are split into patches and then fed into the transformer encoder to produce a representation for downstream classification. ViT outperformed ResNets by a significant margin on some tasks [33,34], showing its potential in CXR-related tasks.

The Masked Autoencoder (MAE) was first introduced in [12]. This approach incorporates both an encoder and a decoder. Random patches are masked based on a uniform distribution, and the remaining unmasked patches are processed and embedded using a standard ViT. Subsequently, the encoded patches and mask tokens are input into the decoder, which is tasked with predicting the masked information. Notably, the decoder’s role is confined to pre-training, enabling the encoder to learn image representations. The primary goal of MAE is to reconstruct the original image from the visible portions by predicting pixel values for the masked patches, trained with a mean squared error loss between the reconstructed and original pixels. This masked-prediction objective is analogous to that of BERT and lets the encoder learn the image’s representation. The effectiveness of MAE has been showcased across various image datasets [12,35] as well as in medical imaging datasets [30,36,37].
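The uniform random masking described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `random_masking` is hypothetical, and the default `mask_ratio=0.75` follows the setting used in the original MAE paper [12], since the ratio used in this study is not stated in the text.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling.

    mask_ratio=0.75 is an assumed default; the text does not state the ratio.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # every patch is equally likely to be kept
    num_visible = int(num_patches * (1.0 - mask_ratio))
    visible = sorted(indices[:num_visible])  # fed to the encoder
    masked = sorted(indices[num_visible:])   # reconstructed by the decoder
    return visible, masked
```

Only the visible patches pass through the encoder, which is why MAE pre-training is comparatively cheap even with a large backbone.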

The encoder, pre-trained with MAE, can serve as a starting point for a related task via transfer learning. Transfer learning is a technique that transfers knowledge across different but related tasks, improving model performance and potentially reducing training time or training dataset size. In medical research, where data can be scarce, models are frequently pre-trained on one task and then fine-tuned for another related task. Using self-supervised pre-trained models as base encoders and transferring their learned weights to a downstream task is now a popular strategy that has proven effective [38,39,40].

2.3. Training Strategy and Details

2.3.1. Pre-Training Stage and Fine-Tuning Stage on Adult CXR

The MAE framework comprises an encoder and a simpler decoder. Two variants of the Vision Transformer (ViT) serve as encoders: ViT-Small/16 and ViT-Base/16. These encoders break down images into non-overlapping patches of 16 × 16 pixels and then embed them into a lower-dimensional space. The ViT-Small/16 has a smaller hidden dimension (dim = 384) and fewer heads in its attention layers (n = 6) than the ViT-Base/16 (dim = 768, n = 12). Each variant was pre-trained with MAE on either ImageNet or a combined CXR dataset. The combined CXR dataset includes CheXpert (191,028 images), MIMIC-CXR (243,334 images), and NIH ChestX-ray14 (75,312 images). Pre-training aims to learn the general image/CXR patterns. To further improve the model performance on the downstream tasks and learn the more detailed distribution of diseases with label guidance, each CXR pre-trained model was then fine-tuned separately on a labeled adult CXR dataset (CheXpert, COVIDx, or ChestX-ray14; see Section 3.3). This approach follows the methodology detailed in [30], using their reported pre-trained and fine-tuned model weights. These two foundational training stages equip the encoder with skills for interpreting X-ray data and the ability to discern more specific patterns.
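To make the patch arithmetic concrete: with the 224 by 224 inputs described in Section 2.1.2 and 16 × 16 patches, each image yields a 14 × 14 grid of 196 patches. The sketch below illustrates the non-overlapping split on a single-channel image stored as a list of rows; the helper names `patch_grid` and `patchify` are hypothetical and do not come from the authors' code.

```python
def patch_grid(image_size=224, patch_size=16):
    """Return (patches_per_side, total_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def patchify(image, patch_size):
    """Split a square 2D image (list of rows) into flattened non-overlapping patches."""
    size = len(image)
    patch_grid(size, patch_size)  # validates divisibility
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches
```

In a ViT, each flattened patch is then linearly projected to the hidden dimension (384 or 768 for the two variants above) before positional information is added.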

2.3.2. Knowledge Transfer from Adult to Pediatric CXR

To better learn the domain knowledge and adapt the previously learned adult CXR distribution to pediatric CXRs, a further transfer-learning phase specifically for pediatric CXR images was performed. This final stage involves training the already pre-trained and fine-tuned ViT encoder using the PediCXR dataset’s training set, followed by its evaluation using the test set. We evaluated the proposed approach under two scenarios: (a) a linear classification setting, where the pre-trained encoder weights were frozen but the linear classification head on top was trainable for the task, and (b) a fine-tuning setting, where both the encoder and the linear head were fine-tuned. For setting (a), following a protocol adapted from [12], a LARS optimizer was applied with momentum = 0.9 and weight decay = 0.05. The linear head was trained for 100 epochs with a learning rate of 0.1 and a batch size of 128. For setting (b), we utilized the AdamW optimizer (β₁ = 0.9, β₂ = 0.95, weight decay = 0.05) with a learning rate of 2.5e-4 and a batch size of 128 for a total of 75 epochs. All experiments were run three times with different random seeds.
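The two transfer settings can be summarized as a small configuration sketch. The hyperparameter values are taken from the text; the `TransferConfig` class, the `is_trainable` helper, and the `"head."` parameter-name prefix are hypothetical illustrations, not part of the authors' released code. The LARS momentum (0.9) and the AdamW betas (0.9, 0.95) are omitted from the dataclass for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferConfig:
    optimizer: str
    learning_rate: float
    weight_decay: float
    batch_size: int
    epochs: int
    freeze_encoder: bool


# (a) Linear classification: frozen encoder, trainable linear head.
LINEAR_PROBE = TransferConfig("LARS", 0.1, 0.05, 128, 100, True)
# (b) Fine-tuning: encoder and linear head are both updated.
FINE_TUNE = TransferConfig("AdamW", 2.5e-4, 0.05, 128, 75, False)


def is_trainable(param_name, config):
    """Decide whether a parameter is updated under the given setting.

    The 'head.' prefix is an assumed naming scheme for the classification head.
    """
    if param_name.startswith("head."):
        return True  # the linear head is trained in both settings
    return not config.freeze_encoder
```

In a deep learning framework, the same decision would typically be applied by disabling gradient tracking for every parameter for which `is_trainable` returns `False`.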

2.3.3. Model Evaluation

For model comparison, our primary focus was on the mean area under the receiver operating characteristic curve (AUC) across all classes (mAUC), as well as the AUC for each of the six individual classes. In Section 3.6, we utilized specificity, sensitivity, and F1 score to evaluate the models’ discriminative ability for positive and negative classes of each disease label. The metrics are defined as follows:

\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{F1 Score} &= \frac{2 \times \text{Sensitivity} \times \text{Precision}}{\text{Sensitivity} + \text{Precision}}
\end{aligned}
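The metrics above, together with the per-class AUC and its mean across classes (mAUC), can be sketched in plain Python as follows. This is an illustrative implementation, not the authors' evaluation code: the function names are hypothetical, zero denominators are mapped to 0.0 by assumption, and the AUC is computed with the pairwise-ranking definition (ties count as one half).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, sensitivity, specificity, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = sensitivity + precision
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(per_class_labels, per_class_scores):
    """Average the one-vs-rest AUC over all classes (mAUC)."""
    aucs = [auc(y, s) for y, s in zip(per_class_labels, per_class_scores)]
    return sum(aucs) / len(aucs)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a test set of this size; a sort-based implementation would be preferable for much larger datasets.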

The effectiveness of clustering image embeddings was evaluated using the Davies–Bouldin Index (DBI). Lower DBI values suggest that similar images are more effectively clustered. The formula is:

DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_{i} + s_{j}}{d_{ij}} \right)

where k is the number of clusters, s_{i} is the average distance of all points in cluster i to the centroid of cluster i (intra-cluster distance), and d_{ij} is the distance between the centroids of clusters i and j (inter-cluster distance).
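The DBI formula can be written out directly in a few lines. The sketch below is a plain-Python illustration assuming Euclidean distance and distinct cluster centroids; the function name is hypothetical, and the text does not state whether the index was computed on the raw embeddings or on their t-SNE projections.

```python
import math


def davies_bouldin_index(clusters):
    """Davies-Bouldin Index for a list of clusters.

    Each cluster is a non-empty list of equal-length coordinate tuples.
    Assumes Euclidean distance and that no two centroids coincide.
    """
    k = len(clusters)
    if k < 2:
        raise ValueError("DBI needs at least two clusters")
    # Centroid of each cluster (coordinate-wise mean).
    centroids = [tuple(sum(coord) / len(points) for coord in zip(*points))
                 for points in clusters]
    # s_i: mean distance of the points in cluster i to its centroid.
    scatters = [sum(math.dist(p, c) for p in points) / len(points)
                for points, c in zip(clusters, centroids)]
    total = 0.0
    for i in range(k):
        total += max((scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k
```

For example, two clusters that each have an intra-cluster distance of 1 and whose centroids are 10 apart give a DBI of 0.2; tighter or better-separated clusters push the value lower.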

3. Results

3.1. Model Performance with Supervised Learning

We first trained the baseline models directly on the pediatric CXR dataset using supervised learning with random initialization. The results are summarized in Table 3. The table shows that DenseNet121 performed the best, outperforming ResNet50, ViT-Small/16, and ViT-Base/16. DenseNet121 achieved an average mAUC of 0.714 (95% CI: 0.709–0.719), and the AUCs for broncho-pneumonia, bronchiolitis, bronchitis, no finding, pneumonia, and other diseases were 0.781 (95% CI: 0.780–0.782), 0.710 (95% CI: 0.703–0.717), 0.698 (95% CI: 0.696–0.700), 0.726 (95% CI: 0.721–0.731), 0.744 (95% CI: 0.732–0.756), and 0.625 (95% CI: 0.613–0.637), respectively. ViTs do not perform as well as CNNs. This is possibly because ViTs generally require much larger datasets to achieve optimal performance compared to CNNs. CNNs, like DenseNet, have inherent inductive biases such as translation invariance and local connectivity, allowing for better generalizability on small datasets like PediCXR. We also report the previously published results using the same dataset [41] for reference.

3.2. Transfer Learning via Linear Evaluation

The results of the linear evaluation for both ViT-Small/16 and ViT-Base/16 are summarized in Table 4. The three-stage approach increased the model’s performance remarkably compared to its supervised training counterpart. In the case of ViT-Small/16, the most effective model was the one fine-tuned using the CheXpert dataset, which resulted in a 2.5% improvement in mAUC. Similarly, for ViT-Base/16, fine-tuning with the CheXpert dataset yielded the best model, achieving a 5.5% increase in the mAUC. Another observation is that the larger models (ViT-Base/16) always perform better than the smaller ones (ViT-Small/16).

3.3. Transfer Learning via Fine-Tuning

In contrast with linear evaluation, the fine-tuning setting more faithfully mirrors the real-world application of pre-trained encoders. In this context, we provide a more comprehensive analysis of fine-tuning evaluation results, which are detailed in Table 5.

Compared to the results in the previous two sections, the major finding for ViT models is that fine-tuned models perform better than their linearly evaluated counterparts and those trained from scratch with supervision. From Table 5 alone, it is evident that the ViT-Base/16 encoders outperform the ViT-Small/16 encoders, though the smaller ones also deliver acceptable results. The ViT-Base/16 fine-tuned with CheXpert shows the best performance on the PediCXR dataset, with an mAUC of 0.761. This surpasses the models that were only pre-trained with X-ray data or fine-tuned with COVIDx or ChestX-ray14 data by 2.4%, 1.7%, and 0.4%, respectively. Additionally, this model yields the highest performance in diagnosing bronchitis (0.7413), pneumonia (0.8350), and other rare diseases (0.6833). The ViT-Small/16 encoder pre-trained on the ImageNet dataset performed the worst, with an mAUC of only 0.719, and the encoder pre-trained on the CXR data improved on it by only a small margin. The performance of the ViT-Small/16 encoder improved significantly when it underwent two-stage training followed by additional fine-tuning on the pediatric CXR dataset (2.6%, 2.3%, and 2.6% when fine-tuned with CheXpert, COVIDx, and ChestX-ray14, respectively). Interestingly, fine-tuning with the COVIDx dataset yielded the smallest improvements of the three fine-tuning datasets. This may be attributed to the distinct disease distribution in COVIDx compared to that in PediCXR, which constrains the model’s capacity to generalize across different datasets effectively.

One key insight about ViT models emerges from this study. ViT models pre-trained on CXRs, then fine-tuned on adult CXR datasets, and further fine-tuned on pediatric CXR data outperform those only pre-trained on general X-rays and directly fine-tuned on pediatric CXR data. This enhanced performance can be attributed to the broad knowledge acquired from supervised learning using adult CXR, which is beneficial in adapting to the nuances of pediatric CXR data, especially given the limited size of the PediCXR dataset.

DenseNet121’s performance on PediCXR tells a different story. The model achieves a higher mAUC of 0.749 without fine-tuning on adult CXR data, compared to a lower mAUC of 0.713 when fine-tuning is applied. This indicates that fine-tuning DenseNet121 on adult CXR might lead to overfitting to adult-specific features, consequently diminishing its performance on pediatric CXR images.

3.4. Model Interpretation

Grad-CAM was utilized to visualize the CXRs, specifically to highlight the areas indicative of disease through bright colors. Figure 2 shows saliency maps of the sampled images from the PediCXR test set for all models presented in Section 3.3. Observations from Figure 2 reveal that models pre-trained on the ImageNet dataset fail to localize the pertinent regions accurately, whereas those pre-trained on CXR datasets achieve slightly better localization. In comparison, models pre-trained on CXRs and then infused with domain-specific knowledge through adult CXR fine-tuning can pinpoint potential areas of interest. Nonetheless, models based on the ViT-Small/16 architecture fail to delineate the affected regions precisely. For instance, as depicted in Figure 2c, the model fine-tuned with CheXpert data erroneously highlights the spinal area, whereas the one fine-tuned with COVIDx data emphasizes both the left and right lung areas, deviating from the actual ground truth. Conversely, models employing the ViT-Base/16 architecture show enhanced performance, with the region identified by the model fine-tuned using CheXpert data aligning closely with the ground truth.
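For readers unfamiliar with Grad-CAM, its core computation is small: each feature channel is weighted by the spatial average of the class-score gradient for that channel, the weighted channels are summed, and negative values are clipped. The sketch below shows only this arithmetic on nested lists. It is not the authors' implementation: the function name is hypothetical, the layer from which activations and gradients were taken is not stated in the text, and for a ViT the patch tokens would first need to be reshaped into a spatial grid.

```python
def grad_cam(activations, gradients):
    """Core Grad-CAM arithmetic.

    activations, gradients: nested lists indexed [channel][row][col], holding a
    layer's feature maps and the gradient of one class score with respect to them.
    Returns a [row][col] saliency map (before any upsampling or normalization).
    """
    height = len(activations[0])
    width = len(activations[0][0])
    # Channel weights: global average of the gradient over spatial positions.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for weight, act in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]
    # ReLU keeps only regions that support the class.
    return [[max(value, 0.0) for value in row] for row in cam]
```

The resulting map is usually upsampled to the input resolution and overlaid on the CXR, which produces the bright regions discussed above.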

3.5. Embedding Visualization

To better understand how various initialization methods differentiate between disease categories in the embedding space, we present t-distributed stochastic neighbor embedding (t-SNE) visualizations of PediCXR training samples for four representative ViT encoders in Figure 3, along with the Davies–Bouldin index noted in each subplot title. Notably, compared to natural image classification, this task is more complex due to the high inter-class similarity in medical imaging data and the inherently noisy nature of the labels. As the figure suggests, the encoder fine-tuned with CheXpert data demonstrates marginally superior performance to the others, as evidenced by the more cohesive clustering of data from the same class and the smallest Davies–Bouldin index.

3.6. Error Analysis with the Best-Performance Model

Table 6 presents the comprehensive evaluation metrics for each disease category using the top-performing ViT model. With an optimized threshold setting, the model attains an average sensitivity of 0.639, a specificity of 0.683, and an F1 score of 0.376 on the test dataset. In the case of ‘Bronchitis’, the model successfully identifies most affected young patients (72.4%), and it shows similar sensitivity for broncho-pneumonia and pneumonia. However, its performance is worse for bronchiolitis and other diseases. This could be attributed to the diagnostic challenges associated with bronchiolitis, as discussed in [42], and the significant variability within the ‘Other disease’ category, which complicates the model’s ability to discern specific patterns necessary for accurate decision-making. Additionally, our model shows relatively high specificity across all disease categories (average 0.683), suggesting that the model can correctly recognize many disease-free patients.
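The text does not state how the decision threshold was optimized. One common choice, shown here purely as an assumption and not as the authors' procedure, is to pick for each disease label the threshold that maximizes Youden's J statistic (sensitivity + specificity - 1). The function name `youden_threshold` is hypothetical.

```python
def youden_threshold(labels, scores):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    A sample is predicted positive when its score is >= threshold.
    This selection rule is an assumption; the source does not specify one.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:  # ties keep the lowest threshold
            best = (j, threshold, sensitivity, specificity)
    return best[1:]
```

Whatever rule is used, the threshold should be chosen on training or validation data rather than on the test set, so that the reported sensitivity and specificity are not optimistically biased.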

4. Discussion

Here, we present a three-stage transfer learning system designed to accurately diagnose pediatric chest X-rays (CXRs) and classify them into subtypes. Our approach involves leveraging models pre-trained and fine-tuned on adult CXRs, followed by further fine-tuning on pediatric CXRs to enhance overall performance. The model’s performance on the testing dataset demonstrates several key points:

Introducing prior knowledge through masked autoencoder (MAE) pre-training on adult CXRs significantly boosts model performance on pediatric CXRs.
Fine-tuning on adult CXRs further enhances the model’s ability to learn the intricate characteristics of thoracic disease distributions.
Larger vision transformer models fine-tuned on the CheXpert dataset exhibit the best performance, both quantitatively and qualitatively.

The proposed three-stage system (masked autoencoder pre-training on adult X-ray images, fine-tuning on the CheXpert dataset, and final fine-tuning on a pediatric dataset) demonstrates a superior performance by progressively refining the model’s feature representations. This approach outperforms directly supervised learning and a two-stage fine-tuning system by leveraging robust feature learning, effective domain adaptation, and targeted specialization. The initial pre-training stage establishes a strong foundation, the CheXpert fine-tuning enhances the model’s understanding of thoracic diseases, and the final stage ensures the model is finely tuned to the unique characteristics of pediatric X-rays, resulting in a more accurate and generalizable model. Pediatric chest radiographs capture the same body segment with the same anatomical structures as adult chest radiographs, so pre-training with larger adult datasets is a promising approach when developing models with limited pediatric training data [43,44]. However, the differing body size, the relative size of the structures (for example, the cardiac silhouette is proportionally larger relative to the thorax in children), and the differing disease processes that affect children and adults mean that training and validation on pediatric images remain critical.

A three-stage system using ViT-Base/16 as the backbone demonstrates the best performance on the PediCXR dataset. ViT-Base generally outperforms ViT-Small, likely due to its larger number of parameters, which enable it to capture more intricate patterns and nuances in the data. Fine-tuning on the CheXpert dataset yields better performance than fine-tuning on the other datasets, probably because CheXpert encompasses a wider variety of thoracic disease conditions. This variety allows models to learn more complex features and patterns compared to the more homogeneous COVIDx dataset.

Additionally, our error analysis identifies specific diseases where the system excels, achieving high sensitivity and F1 scores. Notably, the model shows a strong performance in identifying pneumonia, one of the most clinically relevant respiratory diseases in children [45]. Consequently, our model demonstrates a robust overall performance, particularly in diagnosing pneumonia, and holds potential for further improvement in less severe diseases.

This work has several limitations. The diagnostic labels may not fully reflect clinical significance. The current dataset was collected from two major hospitals in Vietnam and annotated by their seventeen radiologists [25]. Therefore, the clinical definitions might vary slightly among annotators and may also differ from those used in settings where the model could be deployed. Bronchiolitis is a clinical diagnosis and cannot be determined solely based on a CXR [46]. Furthermore, bronchitis is generally not a diagnosis in children [47]. Therefore, the clinical relevance and actionability of these labels are likely limited. Additionally, the annotations in the PediCXR dataset are subjective, with discrepancies among different annotators. Increasing the number of annotators could enhance the quality of CXR labeling and validate model performance. To improve the model’s generalizability and practicality, future efforts could include (1) establishing labels with high clinical relevance, (2) collecting more annotated data, (3) utilizing multiple annotators and incorporating label uncertainty into the network to reduce labeling bias and improve model robustness, and (4) validating model performance on external datasets.

In summary, the main contributions of our work include the following:

We developed a three-stage training system and assessed its effectiveness in classifying thoracic diseases using the most recent and extensive publicly accessible pediatric CXR dataset, PediCXR.
Our study involved extensive quantitative experiments including (a) comparisons with direct training with a supervised learning strategy; (b) the use of CXR or non-medical images for pre-training; (c) the utilization of various adult CXR datasets for fine-tuning; and (d) an evaluation with linear and fine-tuning settings. Our findings demonstrate quantitatively and qualitatively that the top-performing model is the ViT-Base/16. This model was pre-trained on CXR, fine-tuned on CheXpert, and then further fully fine-tuned on PediCXR.
We performed a detailed error analysis on PediCXR using the best-performing model, thoroughly examining its performance in identifying common diseases.

Although the proposed methods are preliminary and not ready for immediate deployment, these early results support the potential for using these three-stage decision support systems for pediatric thoracic disease diagnosis. Compared to manually labeling the CXRs and gathering a board of certified radiologists, this system can provide a more cost-effective method for use in therapeutic trials, research, and clinical practice. The key to this method lies in (1) using hundreds of thousands of public adult CXRs for pre-training to improve the model’s capability of learning the CXRs and their relevant disease distribution and (2) shifting the distribution learned on adults toward children. In addition, other clinical features, including the electronic health records data that are routinely collected, should also be considered to gain a deeper understanding of the disease’s progression and mechanism, thereby allowing for better diagnosis and treatment [48]. There are works utilizing multi-modal learning for adult thoracic disease early detection or diagnosis [49,50], and these works can also be extended to children.

5. Conclusions

This study introduces a three-stage automated clinical support system for precise diagnostics in pediatric CXRs. It employs a masked autoencoder pre-trained on large-scale adult CXR datasets, followed by knowledge transfer to adapt to pediatric cases. The experiments show that models pre-trained on CXRs and subsequently fine-tuned significantly outperform other models in overall performance and diseased region localization. Notably, models pre-trained on CXRs and fine-tuned using the CheXpert dataset demonstrated the best performance, with disease localization closely aligning with the ground truth. In the future, we will explore other pre-training schemes, including BYOL [10] and DINO [11], and use bounding box coordinates to improve model performance and enhance disease region localization on external datasets.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; validation, Y.Z. and K.N.; formal analysis, Y.Z.; investigation, Y.Z. and K.N.; resources, K.N.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z., J.K., E.W. and K.N.; visualization, Y.Z.; supervision, K.N.; project administration, E.W.; funding acquisition, K.N. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is based upon work supported by the National Science Foundation under grant no. 2014003 and National Institutes of Health funding K12TR004374.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code is available at https://github.com/kayvanlabs/PediCXR-MAE (accessed on 21 June 2024) and the data can be obtained from https://physionet.org/content/vindr-cxr/1.0.0/ (accessed on 21 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Reyes, M.A.; Etinger, V.; Hronek, C.; Hall, M.; Davidson, A.; Mangione-Smith, R.; Kaiser, S.V.; Parikh, K. Pediatric respiratory illnesses: An update on achievable benchmarks of care. Pediatrics 2023, 152, e2022058389.
2. World Health Organization. Stakeholder Consultative Meeting on Prevention and Management of Childhood Pneumonia and Diarrhoea: Report, 12–14 October 2021; World Health Organization: Geneva, Switzerland, 2022.
3. Rahman, T.; Chowdhury, M.E.; Khandakar, A.; Islam, K.R.; Islam, K.F.; Mahbub, Z.B.; Kadir, M.A.; Kashem, S. Transfer learning with deep convolutional neural network (CNN) for pneumonia detection using chest X-ray. Appl. Sci. 2020, 10, 3233.
4. Banerjee, A.; Sarkar, A.; Roy, S.; Singh, P.K.; Sarkar, R. COVID-19 chest X-ray detection through blending ensemble of CNN snapshots. Biomed. Signal Process. Control 2022, 78, 104000.
5. Chen, S.; Ren, S.; Wang, G.; Huang, M.; Xue, C. Interpretable CNN-multilevel attention transformer for rapid recognition of pneumonia from chest X-ray images. IEEE J. Biomed. Health Inform. 2023, 28, 753–764.
6. Wollek, A.; Graf, R.; Čečatka, S.; Fink, N.; Willem, T.; Sabel, B.O.; Lasser, T. Attention-based saliency maps improve interpretability of pneumothorax classification. Radiol. Artif. Intell. 2022, 5, e220187.
7. Chetoui, M.; Akhloufi, M.A. Explainable vision transformers and radiomics for COVID-19 detection in chest X-rays. J. Clin. Med. 2022, 11, 3013.
8. Wang, B.; Pan, H.; Aboah, A.; Zhang, Z.; Keles, E.; Torigian, D.; Turkbey, B.; Krupinski, E.; Udupa, J.; Bagci, U. GazeGNN: A Gaze-Guided Graph Neural Network for Chest X-Ray Classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 2194–2203.
9. Mahapatra, D.; Bozorgtabar, B.; Ge, Z.; Reyes, M. GANDALF: Graph-based transformer and Data Augmentation Active Learning Framework with interpretable features for multi-label chest Xray classification. Med. Image Anal. 2024, 93, 103075.
10. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284.
11. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660.
12. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009.
13. Gazda, M.; Plavka, J.; Gazda, J.; Drotar, P. Self-supervised deep convolutional neural network for chest X-ray classification. IEEE Access 2021, 9, 151972–151982.
14. Tiu, E.; Talius, E.; Patel, P.; Langlotz, C.P.; Ng, A.Y.; Rajpurkar, P. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 2022, 6, 1399–1406.
15. VanBerlo, B.; Hoey, J.; Wong, A. A survey of the impact of self-supervised pretraining for diagnostic tasks in medical X-ray, CT, MRI, and ultrasound. BMC Med. Imaging 2024, 24, 79.
16. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv 2018, arXiv:1803.07728.
17. Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430.
18. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 139–144.
20. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event, 13–18 July 2020; pp. 1597–1607.
21. Johnson, A.E.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317.
22. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 590–597.
23. Padash, S.; Mohebbian, M.R.; Adams, S.J.; Henderson, R.D.; Babyn, P. Pediatric chest radiograph interpretation: How far has artificial intelligence come? A systematic literature review. Pediatr. Radiol. 2022, 52, 1568–1580.
24. Ravi, V.; Narasimhan, H.; Pham, T.D. A cost-sensitive deep learning-based meta-classifier for pediatric pneumonia classification using chest X-rays. Expert Syst. 2022, 39, e12966.
25. Pham, H.H.; Tran, T.T.; Nguyen, H.Q. VinDr-PCXR: An open, large-scale pediatric chest X-ray dataset for interpretation of common thoracic diseases. PhysioNet 2022.
26. Wang, L.; Lin, Z.Q.; Wong, A. COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci. Rep. 2020, 10, 19549.
27. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2097–2106.
28. Goldberger, A.L.; Amaral, L.A.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 2000, 101, e215–e220.
29. Wu, C.; Zhang, X.; Wang, Y.; Zhang, Y.; Xie, W. K-Diag: Knowledge-enhanced Disease Diagnosis in Radiographic Imaging. arXiv 2023, arXiv:2302.11557.
30. Xiao, J.; Bai, Y.; Yuille, A.; Zhou, Z. Delving into masked autoencoders for multi-label thorax disease classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 3588–3600.
31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
34. Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Adv. Neural Inf. Process. Syst. 2021, 34, 12116–12128.
35. Reed, C.J.; Gupta, R.; Li, S.; Brockman, S.; Funk, C.; Clipp, B.; Keutzer, K.; Candido, S.; Uyttendaele, M.; Darrell, T. Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4088–4099.
36. Zhou, L.; Liu, H.; Bae, J.; He, J.; Samaras, D.; Prasanna, P. Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation. arXiv 2022, arXiv:2203.05573.
37. Almalki, A.; Latecki, L.J. Self-Supervised Learning With Masked Autoencoders for Teeth Segmentation From Intra-Oral 3D Scans. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 7820–7830.
38. Ericsson, L.; Gouk, H.; Hospedales, T.M. How well do self-supervised models transfer? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5414–5423.
39. Chen, H.; Lundberg, S.M.; Erion, G.; Kim, J.H.; Lee, S.I. Forecasting adverse surgical events using self-supervised transfer learning for physiological signals. NPJ Digit. Med. 2021, 4, 167.
40. Truong, T.; Mohammadi, S.; Lenga, M. How transferable are self-supervised features in medical image classification tasks? Proc. Mach. Learn. Health PMLR 2021, 158, 54–74.
41. Tran, T.T.; Pham, H.H.; Nguyen, T.V.; Le, T.T.; Nguyen, H.T.; Nguyen, H.Q. Learning to automatically diagnose multiple diseases in pediatric chest radiographs using deep convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3314–3323.
42. Kirolos, A.; Manti, S.; Blacow, R.; Tse, G.; Wilson, T.; Lister, M.; Cunningham, S.; Campbell, A.; Nair, H.; Reeves, R.M.; et al. A systematic review of clinical practice guidelines for the diagnosis and management of bronchiolitis. J. Infect. Dis. 2020, 222, S672–S679.
43. Shin, H.J.; Son, N.H.; Kim, M.J.; Kim, E.K. Diagnostic performance of artificial intelligence approved for adults for the interpretation of pediatric chest radiographs. Sci. Rep. 2022, 12, 10215.
44. Kohne, J.G.; Farzaneh, N.; Barbaro, R.P.; Mahani, M.G.; Ansari, S.; Sjoding, M.W. Deep learning model performance for identifying pediatric acute respiratory distress syndrome on chest radiographs. Intensive Care-Med. Paediatr. Neonatal 2024, 2, 5.
45. Kjærgaard, J.; Anastasaki, M.; Stubbe Østergaard, M.; Isaeva, E.; Akylbekov, A.; Nguyen, N.Q.; Reventlow, S.; Lionis, C.; Sooronbaev, T.; Pham, L.A.; et al. Diagnosis and treatment of acute respiratory illness in children under five in primary care in low-, middle-, and high-income countries: A descriptive FRESH AIR study. PLoS ONE 2019, 14, e0221389.
46. Friedman, J.N.; Rieder, M.J.; Walton, J.M.; Society, C.P.; Committee, A.C.; Therapy, D.; Committee, H.S. Bronchiolitis: Recommendations for diagnosis, monitoring and management of children one to 24 months of age. Paediatr. Child Health 2014, 19, 485–491.
47. Taussig, L.M.; Smith, S.M.; Blumenfeld, R. Chronic bronchitis in childhood: What is it? Pediatrics 1981, 67, 1–5.
48. Cui, C.; Yang, H.; Wang, Y.; Zhao, S.; Asad, Z.; Coburn, L.A.; Wilson, K.T.; Landman, B.A.; Huo, Y. Deep multimodal fusion of image and non-image data in disease diagnosis and prognosis: A review. Prog. Biomed. Eng. 2023, 5, 022001.
49. Jabbour, S.; Fouhey, D.; Kazerooni, E.; Wiens, J.; Sjoding, M.W. Combining chest X-rays and electronic health record (EHR) data using machine learning to diagnose acute respiratory failure. J. Am. Med. Inform. Assoc. 2022, 29, 1060–1068.
50. Wang, Y.; Yin, C.; Zhang, P. Multimodal risk prediction with physiological signals, medical images and clinical notes. Heliyon 2024, 10, e26772.

Figure 2. Grad-CAM visualizations on four pediatric CXR samples. The first column on the left shows the original CXRs of four randomly drawn diseased samples (a–d), with the ground-truth diseased areas highlighted in red boxes. The subsequent columns show saliency maps created with various initializations, overlaid on the original X-ray images. Bright colors mark the areas most relevant to the model’s predictions.

Figure 3. t-SNE comparison of image representations from ViT-Base/16 models (the DBI is shown in each panel title): (a) supervised training with random initialization; (b) pre-trained on ImageNet using MAE; (c) pre-trained on adult CXR using MAE; (d) pre-trained on adult CXR with MAE and subsequently fine-tuned using CheXpert data.
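
The Figure 3 caption does not expand "DBI". Assuming it denotes the Davies–Bouldin index, a cluster-separation score for which lower values mean tighter and better-separated clusters, the following is a minimal pure-Python sketch of how such a score could be computed from labelled embeddings. The function names and the toy 2-D points are illustrative only and are not taken from the paper or its code.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    For every cluster, take the worst (largest) ratio of summed within-cluster
    scatter to between-centroid distance, then average over clusters.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: mean Euclidean distance of a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    worst = []
    for i in range(k):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / k

# Toy embeddings (hypothetical): two well-separated clusters vs. two close ones.
separated = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
close = [[(0.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (1.0, 1.0)]]

dbi_separated = davies_bouldin(separated)
dbi_close = davies_bouldin(close)
print(f"separated: {dbi_separated:.2f}, close: {dbi_close:.2f}")
```

Under this reading, a panel with a lower DBI would correspond to class embeddings that are more compact and further apart.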

Table 1. Brief summary of different approaches for image pre-training.

Strategies	Methods (Ref.)
Innate relationship	Image rotation prediction [16]; Image context prediction [17]
Generative	Autoencoder [18]; GAN [19]
Contrastive	SimCLR [20]; BYOL [10]; DINO [11]
Self-prediction	MAE [12]

Table 2. Numbers of chest X-rays in the training and test sets of PediCXR.

Dataset	Samples (n)	Broncho-Pneumonia (n)	Bronchiolitis (n)	Bronchitis (n)	Pneumonia (n)	No Finding (n)	Other Diseases (n)
Training	7728	545	497	842	392	5143	463
Test	1387	84	90	174	89	907	81
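
To make the class imbalance in Table 2 concrete, the short sketch below converts the counts into per-class prevalence. The numbers are copied from the table; the variable and function names are illustrative. The per-class counts sum to more than the number of samples in each split, which suggests that a single image can carry more than one label.

```python
# Counts copied from Table 2.
train_total, test_total = 7728, 1387
train_counts = {
    "Broncho-Pneumonia": 545,
    "Bronchiolitis": 497,
    "Bronchitis": 842,
    "Pneumonia": 392,
    "No Finding": 5143,
    "Other Diseases": 463,
}
test_counts = {
    "Broncho-Pneumonia": 84,
    "Bronchiolitis": 90,
    "Bronchitis": 174,
    "Pneumonia": 89,
    "No Finding": 907,
    "Other Diseases": 81,
}

def prevalence(counts, total):
    """Fraction of images in a split that carry each label."""
    return {label: n / total for label, n in counts.items()}

train_prev = prevalence(train_counts, train_total)
test_prev = prevalence(test_counts, test_total)

for label in train_counts:
    print(f"{label:18s} train {train_prev[label]:.3f}  test {test_prev[label]:.3f}")
```

Roughly two thirds of the images in each split are labelled No Finding, while every disease class covers well under a fifth of the images.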

Table 3. Supervised training on PediCXR with random initialization: mean [95% CI]. The highest mAUC scores are bold. The columns from Broncho-Pneumonia to Others give the AUC for each class. The p-value indicates the statistical significance of the mAUC achieved by direct supervised learning methods compared to the best-performing model using the proposed three-stage system. †: The results with the same dataset are reported in [41].

Encoder	mAUC	Broncho-Pneumonia	Bronchiolitis	Bronchitis	No Finding	Pneumonia	Others	p-Value
DenseNet121 †	0.709	0.696	0.638	0.691	0.776	0.802	0.703	<0.0001
DenseNet121	0.714 [0.709–0.719]	0.781 [0.780–0.782]	0.710 [0.703–0.717]	0.698 [0.696–0.700]	0.726 [0.721–0.731]	0.744 [0.732–0.756]	0.625 [0.613–0.637]	<0.0001
ResNet50	0.685 [0.679–0.691]	0.736 [0.727–0.745]	0.690 [0.676–0.704]	0.686 [0.685–0.687]	0.696 [0.691–0.701]	0.709 [0.693–0.725]	0.592 [0.577–0.607]	<0.0001
ViT-Small/16	0.646 [0.641–0.652]	0.698 [0.696–0.699]	0.627 [0.621–0.632]	0.646 [0.644–0.649]	0.655 [0.648–0.662]	0.652 [0.643–0.661]	0.600 [0.580–0.621]	<0.0001
ViT-Base/16	0.654 [0.648–0.659]	0.704 [0.694–0.715]	0.649 [0.632–0.665]	0.648 [0.643–0.654]	0.657 [0.655–0.658]	0.655 [0.636–0.674]	0.609 [0.607–0.612]	<0.0001

Table 4. Linear evaluation on PediCXR: mean [95% CI]. The highest mAUC scores are bold. The columns from Broncho-Pneumonia to Others give the AUC for each class. The p-value indicates the statistical significance of the mAUC achieved by the methods compared to the best-performing model using the proposed three-stage system.

Encoder	Pretrained	Finetuned	mAUC	Broncho-Pneumonia	Bronchiolitis	Bronchitis	No Finding	Pneumonia	Others	p-Value
ViT-Small/16	X-ray	COVIDx	0.643 [0.631–0.656]	0.670 [0.645–0.695]	0.611 [0.557–0.665]	0.679 [0.676–0.682]	0.695 [0.690–0.701]	0.667 [0.605–0.729]	0.538 [0.512–0.565]	<0.0001
		X-ray	0.653 [0.648–0.658]	0.700 [0.689–0.710]	0.637 [0.615–0.659]	0.647 [0.641–0.653]	0.696 [0.689–0.703]	0.695 [0.664–0.727]	0.545 [0.518–0.572]	<0.0001
		CheXpert	0.662 [0.662–0.662]	0.707 [0.675–0.740]	0.632 [0.625–0.640]	0.655 [0.647–0.663]	0.690 [0.683–0.697]	0.734 [0.723–0.744]	0.554 [0.518–0.590]	<0.0001
ViT-Base/16	X-ray	COVIDx	0.675 [0.662–0.689]	0.710 [0.682–0.739]	0.657 [0.616–0.698]	0.674 [0.649–0.698]	0.704 [0.699–0.708]	0.711 [0.674–0.747]	0.597 [0.551–0.643]	<0.0001
		X-ray	0.678 [0.673–0.682]	0.739 [0.723–0.755]	0.658 [0.648–0.669]	0.646 [0.628–0.664]	0.702 [0.698–0.705]	0.735 [0.727–0.744]	0.586 [0.553–0.620]	<0.0001
		CheXpert	0.690 [0.682–0.698]	0.745 [0.728–0.762]	0.677 [0.642–0.712]	0.670 [0.654–0.686]	0.708 [0.703–0.712]	0.737 [0.722–0.752]	0.604 [0.578–0.631]	<0.0001

Table 5. Fine-tuning evaluation on PediCXR: mean [95% CI]. The highest mAUC scores are bold. The columns from Broncho-Pneumonia to Others give the AUC for each class. The p-value indicates the statistical significance of the mAUC achieved by the best-performing model, ViT-Base/16 fine-tuned on CheXpert, in comparison to other methods.

Encoder	Pretrained	Finetuned	mAUC	Broncho-Pneumonia	Bronchiolitis	Bronchitis	No Finding	Pneumonia	Others	p-Value
DenseNet121	X-ray	–	0.749 [0.747–0.752]	0.823 [0.817–0.830]	0.732 [0.729–0.734]	0.722 [0.720–0.724]	0.776 [0.770–0.783]	0.827 [0.818–0.836]	0.615 [0.610–0.620]	<0.0001
		X-ray	0.713 [0.708–0.717]	0.779 [0.772–0.786]	0.709 [0.701–0.717]	0.700 [0.695–0.704]	0.725 [0.724–0.725]	0.744 [0.735–0.754]	0.621 [0.602–0.639]	<0.0001
ViT-Small/16	ImageNet	–	0.719 [0.716–0.723]	0.787 [0.780–0.794]	0.709 [0.709–0.710]	0.711 [0.707–0.715]	0.729 [0.725–0.732]	0.761 [0.758–0.765]	0.618 [0.613–0.624]	<0.0001
		–	0.729 [0.727–0.732]	0.808 [0.806–0.810]	0.721 [0.717–0.725]	0.708 [0.707–0.709]	0.744 [0.740–0.748]	0.770 [0.764–0.776]	0.626 [0.617–0.636]	<0.0001
	X-ray	COVIDx	0.746 [0.741–0.751]	0.825 [0.818–0.833]	0.725 [0.716–0.733]	0.729 [0.726–0.732]	0.760 [0.756–0.764]	0.797 [0.779–0.815]	0.639 [0.632–0.645]	<0.005
		X-ray	0.748 [0.747–0.748]	0.824 [0.820–0.827]	0.725 [0.719–0.731]	0.737 [0.732–0.741]	0.761 [0.760–0.762]	0.798 [0.797–0.800]	0.642 [0.640–0.643]	<0.0001
		CheXpert	0.748 [0.748–0.749]	0.818 [0.817–0.819]	0.719 [0.718–0.720]	0.740 [0.737–0.742]	0.758 [0.757–0.759]	0.805 [0.800–0.810]	0.650 [0.645–0.654]	<0.0001
ViT-Base/16	ImageNet	–	0.746 [0.745–0.747]	0.824 [0.823–0.826]	0.729 [0.727–0.732]	0.722 [0.720–0.725]	0.759 [0.756–0.761]	0.809 [0.807–0.810]	0.633 [0.631–0.634]	<0.005
	X-ray	–	0.743 [0.740–0.746]	0.818 [0.812–0.824]	0.728 [0.723–0.733]	0.722 [0.721–0.724]	0.757 [0.754–0.759]	0.800 [0.792–0.808]	0.633 [0.628–0.639]	0.2
		COVIDx	0.750 [0.749–0.751]	0.833 [0.830–0.835]	0.732 [0.730–0.734]	0.725 [0.720–0.730]	0.762 [0.760–0.765]	0.814 [0.812–0.816]	0.634 [0.628–0.641]	<0.005
		X-ray	0.760 [0.758–0.761]	0.825 [0.824–0.827]	0.733 [0.729–0.736]	0.726 [0.721–0.730]	0.767 [0.765–0.769]	0.831 [0.827–0.835]	0.678 [0.673–0.683]	<0.0001
		CheXpert	0.761 [0.760–0.763]	0.831 [0.829–0.833]	0.711 [0.709–0.714]	0.741 [0.737–0.745]	0.766 [0.764–0.767]	0.835 [0.833–0.837]	0.683 [0.672–0.695]
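
The mAUC column is the unweighted (macro) average of the six per-class AUCs. As a minimal sketch, the snippet below recomputes it for the last row of Table 5 (ViT-Base/16 fine-tuned on CheXpert) and also shows a simple rank-based AUC for a single class. The helper names and the toy labels and scores are illustrative assumptions, not the authors' evaluation code.

```python
from statistics import mean

def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(per_class_auc):
    """Unweighted mean of per-class AUC values."""
    return mean(per_class_auc.values())

# Per-class AUCs copied from the last row of Table 5.
best_row = {
    "Broncho-Pneumonia": 0.831,
    "Bronchiolitis": 0.711,
    "Bronchitis": 0.741,
    "No Finding": 0.766,
    "Pneumonia": 0.835,
    "Others": 0.683,
}
m_auc = macro_auc(best_row)
print(f"mAUC = {m_auc:.3f}")

# Toy single-class example (hypothetical labels and scores).
toy_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3])
print(f"toy AUC = {toy_auc:.2f}")
```

The recomputed average rounds to the 0.761 reported in the table. Because the average is unweighted, rare classes such as Others influence mAUC as much as the dominant No Finding class.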

Table 6. Detailed model performance on PediCXR test dataset with ViT-Base/16 model fine-tuned on CheXpert.

Label	Accuracy	Sensitivity	Precision	Specificity	F1 Score
Broncho-pneumonia	0.801	0.691	0.187	0.801	0.294
Bronchiolitis	0.765	0.500	0.137	0.784	0.215
Bronchitis	0.632	0.724	0.213	0.619	0.329
Pneumonia	0.894	0.618	0.325	0.913	0.426
Other diseases	0.852	0.309	0.142	0.885	0.195
mean	0.770	0.639	0.279	0.683	0.376
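
The F1 scores in Table 6 follow from the precision and sensitivity columns as their harmonic mean. The sketch below recomputes F1 for each label from the rounded values in the table. The function and variable names are illustrative, and the mean row is left out because it is an average across labels and not a single operating point.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)

# (sensitivity, precision, reported F1) copied from Table 6.
table6 = {
    "Broncho-pneumonia": (0.691, 0.187, 0.294),
    "Bronchiolitis": (0.500, 0.137, 0.215),
    "Bronchitis": (0.724, 0.213, 0.329),
    "Pneumonia": (0.618, 0.325, 0.426),
    "Other diseases": (0.309, 0.142, 0.195),
}

recomputed = {
    label: f1_score(precision, sensitivity)
    for label, (sensitivity, precision, _) in table6.items()
}

for label, (_, _, reported) in table6.items():
    print(f"{label:18s} reported {reported:.3f}  recomputed {recomputed[label]:.3f}")
```

The recomputed values agree with the reported ones to within rounding. The low F1 scores are driven by low precision, which is expected when each disease class is rare relative to No Finding.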

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
