Article

PreRadE: Pretraining Tasks on Radiology Images and Reports Evaluation Framework

Matthew Coleman, Joanna F. Dipnall, Myong Chol Jung and Lan Du
1 Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
2 School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
3 Institute for Mental and Physical Health and Clinical Translation, School of Medicine, Deakin University, Geelong, VIC 3220, Australia
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(24), 4661; https://doi.org/10.3390/math10244661
Submission received: 7 October 2022 / Revised: 23 November 2022 / Accepted: 1 December 2022 / Published: 8 December 2022

Abstract

Recently, self-supervised pretraining of transformers has gained considerable attention in analyzing electronic medical records. However, a systematic evaluation of different pretraining tasks in radiology applications using both images and radiology reports is still lacking. We propose PreRadE, a simple proof-of-concept framework that enables novel evaluation of pretraining tasks in a controlled environment. We investigated the three most commonly used pretraining tasks (MLM—Masked Language Modelling, MFR—Masked Feature Regression, and ITM—Image to Text Matching) and their combinations against downstream radiology classification on MIMIC-CXR, a medical chest X-ray imaging and radiology text report dataset. Our experiments in the multimodal setting show that (1) pretraining with MLM yields the greatest benefit to classification performance, largely due to the task-relevant information learned from the radiology reports, and (2) pretraining with only a single task can introduce variation in classification performance across different fine-tuning episodes, suggesting that composite task objectives incorporating both image and text modalities are better suited to generating reliably performant models.

1. Introduction

The development of self-supervised pretraining and transformer architectures has produced predictive models that surpass their supervised counterparts on many image and language benchmarks [1,2]. With the availability of large collections of paired radiology X-ray images and text reports, recent attention has been directed towards developing novel implementations of these methods suited to multimodal radiology data [3,4,5]. The field has grown rapidly and early results show promise. However, there exists no known domain-specific study of pretraining (pretext) tasks in isolation from the model architecture and training protocol, leading to uncertainty about task suitability.
Evaluation of pretraining methods typically involves evaluating the resulting model on downstream tasks and datasets relevant to the domain and/or application [1,6]. Within radiology, common evaluation tasks include multi-label pathoanatomical classification [5], medical visual question answering [4], and medical report generation [3,5]. While domain-agnostic evaluation frameworks exist [6,7] and offer guidance on isolating components in a controlled manner, their focus is on performance across domains at the expense of specificity.
The primary aim of this study is to examine the effect of varying the pretext task used in pretraining on downstream radiology classification tasks. We implemented several common baseline tasks in a controlled evaluation framework using a consistent model architecture and training protocol, and empirically measured their impact on downstream classification performance. Thus, this study:
  • Introduced an empirical evaluation framework for model-agnostic evaluation of vision and language pretext tasks on downstream classification tasks. (The PyTorch Lightning implementation of our framework, along with data processing and analysis notebooks, is available at https://github.com/mwcoleman/prerade/).
  • Considered, for the first time, the impact of pretext tasks on multi-label pathoanatomical classification with multimodal radiology data.
  • Performed controlled studies on Masked Language Modelling (MLM), Masked Feature Regression (MFR), Image to Text Matching (ITM), and their combinations.

2. Materials and Methods

This section first provides a brief overview of the evaluation framework, followed by an introductory background to self-supervised pretraining and a description of the three pretraining tasks (MLM, MFR and ITM) that were selected for evaluation in this study.

2.1. Framework Overview

This study evaluated combinations of common pretraining tasks in a controlled setting by using a consistent model architecture, datasets and training protocols. Figure 1 outlines the framework and process used for this study. Experimental details covering the data processing steps, model components, training protocols and evaluation protocols are provided in Section 2.5, Section 2.6, Section 2.7 and Section 2.8. Beginning with a single-stream multimodal transformer (refer to Section 2.6), eight instances were pretrained using combinations of three self-supervised learning pretext tasks (refer to Section 2.2) on a large multimodal radiology dataset (refer to Section 2.5). These instances were fine-tuned on a smaller labelled dataset from the same data source, before being evaluated on a held-out test set from the same source. The fine-tuned instances were also evaluated on a dataset from a different source that was not seen during training. The fine-tuning and evaluation process was repeated 20 times, varying the random seed, to obtain uncertainty estimates of performance.
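The protocol can be summarized in pseudocode. In the sketch below, the pretrain, finetune and evaluate functions are hypothetical stubs returning dummy values; they are not the released framework's API, and only the experimental structure (eight pretrained instances plus two baselines, each fine-tuned and evaluated over 20 seeds) follows the description above.

# Runnable sketch of the evaluation protocol; the three stubs are
# hypothetical placeholders for the framework's actual routines.
import random

SCENARIOS = ["MLM", "MFR", "ITM", "MLM, MFR", "MLM, ITM", "MFR, ITM",
             "MLM, MFR, ITM", "MLM (text)"]      # eight pretrained instances (Section 2.7)

def pretrain(scenario):                # placeholder for 200,000-step self-supervised pretraining
    return {"scenario": scenario}

def finetune(encoder, seed):           # placeholder for six-epoch supervised fine-tuning
    random.seed(seed)
    return {"encoder": encoder, "seed": seed}

def evaluate(model):                   # placeholder for AUC on the MIMIC-CXR and OpenI test sets
    return random.random()

results = {}
for scenario in SCENARIOS:             # the two baselines follow the same loop but skip pretraining
    encoder = pretrain(scenario)
    results[scenario] = [evaluate(finetune(encoder, seed)) for seed in range(20)]
# The mean and standard deviation of the 20 episodes per scenario are what Table 3 reports.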

2.2. Self-Supervised Pretraining

Self-supervised image and text pretraining aims to learn an encoder function $f_\theta$ that maps paired inputs (w, v), where w is a text sequence w = {w_1, ..., w_m} of length m and v is the accompanying set of image features v = {v_1, ..., v_n} of length n, to a representative fixed-dimension vector useful for downstream tasks (refer to Table 1 for the notation used).
A variety of pretext tasks have been developed for multimodal self-supervised pretraining implemented with the transformer architecture (commonly referred to as Bidirectional Encoder Representations from Transformers (BERT)). In this study we investigated Masked Language Modeling (MLM), Masked Feature Regression (MFR), and Image to Text Matching (ITM), tasks common to training modern and performant implementations such as UNiversal Image-TExt Representation (UNITER) [8], VisualBERT [9], Learning Cross-Modality Encoder Representations from Transformers (LXMERT) [10], Vision-and-Language BERT (ViLBERT) [11] and VL-BERT [12]. These state-of-the-art implementations all use individual modality (text/image) embedding layers combined with a transformer (self-attention) encoding base. However, they differ in the architecture of the attention mechanism (Figure 2). Some propose a single-stream joint encoder with self-attention [8,9,12], while others introduce an inductive bias towards inter-modal attention by using two streams and cross-modal attention [10,11].

2.2.1. Masked Language Modeling (MLM)

Masked Language Modeling is a unimodal (text) pretraining task [13]. The process randomly masks out input tokens, replacing them with a generic [MASK] token. The encoder learns representations by maximizing the likelihood of predicting a masked token's true label conditioned on the remaining token embeddings. The supervisory signal (loss) is implemented with cross entropy. In the joint image and text setting, the loss for one training pair is:
$$\mathcal{L}_\theta = -\sum_{\bar{w} \in m(w)} \log P_\theta\left(\bar{w} \mid w \setminus \hat{w},\, v\right)$$
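A minimal PyTorch sketch of the masking step and loss follows. The 0.15 masking rate matches Section 2.7; the use of a -100 ignore index for unmasked positions follows a common BERT-implementation convention and is our assumption rather than a detail taken from the PreRadE code (special tokens are not excluded from masking, for brevity).

import torch
import torch.nn.functional as F

def mask_tokens(token_ids: torch.Tensor, mask_id: int, m_rate: float = 0.15):
    # Randomly replace a proportion m_rate of tokens with [MASK]; keep the
    # original ids as labels and mark unmasked positions with -100 (ignored).
    mask = torch.rand(token_ids.shape) < m_rate
    labels = token_ids.masked_fill(~mask, -100)
    masked_ids = token_ids.masked_fill(mask, mask_id)
    return masked_ids, labels

def mlm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Cross entropy over the masked token positions only; logits has shape
    # (batch, seq_len, vocab_size) and labels (batch, seq_len).
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=-100)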

2.2.2. Masked Feature Regression (MFR)

Masked Feature Regression randomly selects one or more image features and masks them by replacing them with zero-padded vectors of equivalent length [8]. The encoder learns representations by reconstructing the original feature vectors. The loss for one training pair is implemented as the sum, over the masked vector indices, of squared L2 distances between each reconstructed vector and its original:
$$\mathcal{L}_\theta = \sum_{\bar{v} \in m(v)} \left\lVert f_\theta\left(\hat{v} \mid w,\, v \setminus \bar{v}\right) - \hat{v} \right\rVert_2^2$$
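A corresponding sketch of the regression objective (variable names and shapes are illustrative; the 36-region feature layout follows Section 2.5):

import torch

def mask_regions(v_feat: torch.Tensor, m_rate: float = 0.15):
    # Zero out a random proportion m_rate of region feature vectors.
    masked = torch.rand(v_feat.shape[:2]) < m_rate          # (batch, num_regions)
    return v_feat.masked_fill(masked.unsqueeze(-1), 0.0), masked

def mfr_loss(reconstructed: torch.Tensor, original: torch.Tensor,
             masked: torch.Tensor) -> torch.Tensor:
    # reconstructed/original: (batch, num_regions, feat_dim) image features;
    # masked: boolean (batch, num_regions) marking the zeroed-out regions.
    sq_dist = ((reconstructed - original) ** 2).sum(dim=-1)  # squared L2 per region
    return (sq_dist * masked.float()).sum(dim=-1).mean()     # sum over masked regions, mean over batch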

2.2.3. Image to Text Matching (ITM)

Image to Text Matching falls within the contrastive learning category of pretext tasks and is a multimodal variant of next-sentence prediction [10]. During preprocessing, a selection of samples is randomly chosen and either the image or the text input is swapped with the corresponding modality from another data point, so that the encoder is presented with a misaligned input pair. The encoder's task is to determine whether the two modalities are paired or not. The representations are learned by maximizing (minimizing) the similarity of paired (unpaired) inputs, with the loss for one training pair implemented via binary cross entropy:
$$\mathcal{L}_\theta = -\left[\, y_i \log g_\theta(w, v) + (1 - y_i) \log\left(1 - g_\theta(w, v)\right) \right]$$
where $y_i \in \{0:\text{unmatched},\ 1:\text{matched}\}$.
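A sketch of the pair-corruption step and loss. The roll-by-one swap is a simplification we chose so that every corrupted sample receives a different study's image (assuming a batch size greater than one); the 0.50 corruption rate matches Section 2.7.

import torch
import torch.nn.functional as F

def corrupt_pairs(v_feat: torch.Tensor, itm_rate: float = 0.5):
    # Swap the image features of a random subset of the batch; return the
    # (possibly swapped) features and match labels (1 = original pair, 0 = swapped).
    swap = torch.rand(v_feat.size(0)) < itm_rate
    rolled = v_feat.roll(shifts=1, dims=0)          # another sample's features
    v_out = torch.where(swap.view(-1, 1, 1), rolled, v_feat)
    return v_out, (~swap).float()

def itm_loss(match_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Binary cross entropy on the pair / no-pair prediction.
    return F.binary_cross_entropy_with_logits(match_logits.view(-1), labels)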

2.3. Pretraining on Multimodal Radiology Data

Following the release of large publicly available multimodal datasets [14], VL BERT architectures and training protocols have been adapted for radiology tasks. Notably, two studies incorporated variants of the pretext tasks above to pretrain on radiology images and accompanying reports [4,5]. Each reported state-of-the-art (SOTA) results in their evaluations, in terms of accuracy, BiLingual Evaluation Understudy (BLEU) score, Area under the ROC (Receiver Operating Characteristic) Curve (AUC), and ranked retrieval precision. Yet neither study investigated the effect of varying the pretext task, focusing instead on novel problem domains (medical visual question answering) [4] or training data regimes [5]. Furthermore, neither study conducted repeated evaluations to assess the variance of results.
Within the radiology data domain, the closest work to the current study is a comparative study [23] evaluating four performant pretrained VL BERTs (LXMERT [10], VisualBERT [9], UNITER [8] and PixelBERT [15]) on multimodal and text-only pathoanatomical classification tasks, using the publicly available MIMIC-CXR and OpenI chest X-ray datasets as training and evaluation datasets, respectively. That study found that these models, pretrained on general-domain corpora only, transferred well to radiology data and tasks, and reported SOTA performance in terms of per-label and averaged AUC. Our study differs in two key ways: (1) we focused on a model-agnostic evaluation of the choice of pretext task, conducting in-domain pretraining with the same architecture and training protocol, and (2) we performed controlled fine-tuning experiments to evaluate the consistency of results.

2.4. Evaluating Pretraining Performance

The evaluation of pretraining methods typically involves evaluating the resulting model on downstream tasks and datasets relevant to the domain and/or application [1,6]. While domain-agnostic evaluation frameworks exist [6,7] and offer guidance on isolating components in a controlled manner, their focus has been on performance across domains at the expense of specificity.
Within radiology, common evaluation tasks include multi-label pathoanatomical classification, medical visual question answering, and medical report/image generation. Of these, pathoanatomical classification requires the fewest task-specific modelling components, which is relevant to our work on model-agnostic evaluation. Even so, there are limited evaluation studies specific to multimodal radiology data (refer to Section 2.3), and no known study has evaluated pretext tasks in isolation.
The wider machine learning literature contains established practices for conducting comparative studies in controlled environments. Our approach is closely related to a recent study [16] that theoretically and experimentally investigated modelling choices for VL BERTs. While that study included a large theoretical component (in contrast with ours) and focused on architecture choices in the general domain, aspects of our experimental approach are influenced by its framework. Controlled experimentation with analysis of uncertainty is a necessary condition for adoption within the medical domain, yet it is often neglected in the core machine learning literature [17].

2.5. Data

This study used the MIMIC-CXR dataset, a large publicly available chest X-ray dataset containing 377,110 radiograph images corresponding to 227,827 imaging studies with free-text radiology reports, sourced from the Beth Israel Deaconess Medical Center in North America [14]. Labels for 13 pathoanatomical findings were derived by the dataset authors using natural language processing (NLP) methods [18,19], with a reported recall of 0.91 and F1-score of 0.83. Iterative stratification [20,21] was performed to obtain pretraining (n = 112,124), fine-tuning (n = 6228) and test (n = 6228) splits.
This study also used the OpenI dataset, a publicly available chest X-ray dataset containing 8121 radiograph images corresponding to 3996 imaging studies with free-text radiology reports, sourced from different institutions across North America [22]. While no explicit labels are provided, pathoanatomical findings from 14 categories are available in the form of radiologist-provided Medical Subject Headings (MeSH) terms. Because of the limited amount of data in the OpenI dataset, it is not viable for use in self-supervised pretraining. However, we followed existing work [23] in using it as an external evaluation dataset (i.e., no OpenI data were used for training), deriving binary labels for the seven categories that also appear in the MIMIC-CXR dataset.

Data Preprocessing

The preprocessing of the study dataset was undertaken to split the data into pretraining, fine-tuning, and testing:
  • The samples were chosen by selecting image studies that contain Antero-Posterior (AP) views (the only view type to be present in all studies) and a ’findings’ section in the radiology text report. A paired sample contained both a text report and an image (i.e., AP view). This resulted in a set of 124,580 paired samples.
  • The sample labels were obtained from the provided label set [14]. We followed previous works [23] in processing the labels to obtain a binary label per finding category. Missing or uncertain findings were replaced with a negative (i.e., no finding).
  • The image features were extracted using the pretrained feature extractor described in Section 2.6. For studies containing multiple AP views, the first view was selected for feature extraction. After feature extraction, each image was represented by a set of position embeddings V_p ∈ R^{36×4} and feature embeddings V_f ∈ R^{36×1024}. Dimensions were chosen for compatibility with the existing pretrained extractor model parameters.
  • The text component was processed to remove all text prior to the ’findings’ section in the radiology report. The resulting string was truncated to 125 characters (due to computational constraints as discussed in Section 4.4). The ’summary’ section of the text report was moved to the beginning of the text, to ensure the summary text was not truncated.
  • A stratified split of 90:10 was conducted on the 124,580 paired samples using iterative stratification (equal distribution, order = 4) to maintain multi-label proportionality and obtain dataset splits that are representative of the parent distribution [20,21]. The training set (90%, 112,124 samples) was used for pretraining the model.
  • The remaining 10% was withheld for evaluation, with a 50:50 stratified split into a fine-tuning and a test set, each containing 6228 samples. The sample sizes were chosen based on the reported availability of fine-tuning data in related radiology use cases [24] and typical test set sizes in published datasets [14,22].
Table 2 shows the prevalence of positive findings for each label in the datasets used in our experiments. The labelled data were processed to create binary labels for each finding category, with missing or uncertain labels mapped to 'no finding' [23]. Details and links to the data processing code and notebooks are provided in Appendix A. Pleural Effusion had the highest proportion of positive findings among the pathology labels in the MIMIC-CXR dataset (Pretrain: 27.36%; Finetune: 29.80%; Test: 30.19%), whereas Cardiomegaly had the highest proportion in the OpenI test dataset (8.55%).
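The split procedure can be sketched with the scikit-multilearn implementation of iterative stratification [20,21]. The helper below only exposes a test_size argument, so the order-4 setting mentioned in the preprocessing list is not reproduced; treat this as an approximation of the authors' procedure with hypothetical variable names.

import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

def split_dataset(sample_ids: np.ndarray, labels: np.ndarray):
    # sample_ids: (N, 1) array of study identifiers; labels: (N, 13) binary matrix.
    # 90% for self-supervised pretraining, 10% held out for evaluation.
    ids_pre, y_pre, ids_held, y_held = iterative_train_test_split(
        sample_ids, labels, test_size=0.10)
    # Split the held-out portion evenly into fine-tuning and test sets.
    ids_ft, y_ft, ids_test, y_test = iterative_train_test_split(
        ids_held, y_held, test_size=0.50)
    return (ids_pre, y_pre), (ids_ft, y_ft), (ids_test, y_test)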

2.6. Model Components

The pretraining and evaluation framework in this study was agnostic to model components including the encoder, embedding layers, and projection heads (Figure 3). For our experiments, we modeled the text encoder as BERT [13], the image feature extractor as a Convolutional Neural Network (CNN) [25], and used VisualBERT with pretrained weights [9] for the joint modality encoder. The projection head was task specific and was discarded after pretraining. Following existing works [8,9,10], we performed no further training of the CNN image extractor model and computed the image features offline prior to pretraining. Further implementation details are provided in Table A1.
We followed established practices for pretraining VL BERT models in a multi-task setting [8]. For example, the following describes one forward pass of the pretraining process for the 'MLM, MFR, ITM' task combination:
1. A single pretext task was sampled uniformly from the pretraining scenario (i.e., either MLM, MFR, or ITM was chosen at random).
2. A batch of N inputs was sampled uniformly from the pretraining dataset (refer to Section 2.5). Each input pair comprised the text string of the radiologist report, w, together with the extracted feature vectors $v_f$ and position vectors $v_p$ of the paired image.
3. The input batch was processed according to the sampled pretext task:
   - For the MLM or MFR task, the relevant input elements were masked in each sample up to the masking budget, defined as $m_{budget} = m_{rate} \times |\hat{w}|$ (or $|\hat{v}|$ for MFR).
   - Alternatively, for the ITM task, a subset $\hat{N}$ of samples was randomly selected from the batch and, for each selected sample $\hat{n} \in \hat{N}$, $v_f$ and $v_p$ were replaced with those of another sample from the batch. The subset size was calculated as $|\hat{N}| = N \times ITM_{rate}$.
4. The input batch was then passed through the embedding layers and projected to fixed-dimensional representations (a code sketch follows this list): $\hat{w} = e_w(w)$, $\hat{v} = \frac{1}{2}\left(h_v(v_f) + h_p(v_p)\right)$. This resulted in $\hat{w} \in \mathbb{R}^{h_{dim} \times |\hat{w}|}$ and $\hat{v} \in \mathbb{R}^{h_{dim} \times |\hat{v}|}$, where $h_{dim}$ was the input dimension of the joint encoder.
5. A joint representation u was obtained with the joint encoder $f_\theta$ and projection function $g_\theta$ acting on the concatenated embeddings: $u = g_\theta\left(f_\theta(\hat{w} \,\|\, \hat{v})\right)$.
6. Losses were calculated based on the specific pretext objective (refer to Section 2.2).
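As an illustration of steps 4 and 5, the following is a minimal PyTorch sketch of the embedding and joint-encoding path. The class name and layer choices are illustrative: the actual framework wraps the pretrained VisualBERT encoder from the Hugging Face transformers library (Table A1) rather than a generic transformer, while the 1024-dimensional region features and 4-dimensional position vectors follow Section 2.5.

import torch
import torch.nn as nn

class JointEncoderSketch(nn.Module):
    def __init__(self, vocab=30522, h_dim=768, v_feat_dim=1024, v_pos_dim=4, out_dim=2):
        super().__init__()
        self.e_w = nn.Embedding(vocab, h_dim)       # word embedding encoder e_w
        self.h_v = nn.Linear(v_feat_dim, h_dim)     # image feature transform h_v
        self.h_p = nn.Linear(v_pos_dim, h_dim)      # position transform h_p
        layer = nn.TransformerEncoderLayer(d_model=h_dim, nhead=12, batch_first=True)
        self.f_theta = nn.TransformerEncoder(layer, num_layers=12)   # joint encoder f_theta
        self.g_theta = nn.Linear(h_dim, out_dim)    # task-specific projection head g_theta

    def forward(self, w_ids, v_f, v_p):
        w_hat = self.e_w(w_ids)                         # (batch, m, h_dim)
        v_hat = 0.5 * (self.h_v(v_f) + self.h_p(v_p))   # (batch, n, h_dim)
        joint = self.f_theta(torch.cat([w_hat, v_hat], dim=1))
        return self.g_theta(joint[:, 0])                # project the first token's representation

# Example shapes: 125 text tokens and 36 image regions per study (Table A1).
model = JointEncoderSketch()
w_ids = torch.randint(0, 30522, (2, 125))
v_f, v_p = torch.randn(2, 36, 1024), torch.randn(2, 36, 4)
out = model(w_ids, v_f, v_p)                            # (2, out_dim)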

2.7. Pretraining Scenarios

We considered all non-empty subsets of {MLM, MFR, ITM} for comparison, yielding seven pretraining scenarios, all pretrained using both medical radiology text and X-ray image data. We further considered a model pretrained with MLM on the radiology text-only inputs. We defined a fixed set of hyperparameters and random seeds across all training scenarios (details in Table A1) to ensure that the only difference was the choice of tasks. We followed established works [8,9,10,13] for task parameter selection, which included a masking rate of 0.15 for both MLM and MFR, and a sampling rate of 0.50 for the ITM task. For each scenario, we trained the model with the respective training objectives for 200,000 steps, equating to approximately 22 h of training time per model on an RTX 3090 GPU.
We also evaluated two baseline model instances: (1) random weight initialization (denoted 'No Pretrain'), and (2) weights loaded from the public VisualBERT repository (denoted 'VBert-COCO'). No domain-specific pretraining was conducted for either baseline.

2.8. Evaluation Tasks

The eight pretrained and two baseline models were fine-tuned for six epochs on the MIMIC-CXR fine-tuning dataset (Finetune) using a multi-label binary cross-entropy (BCE) loss to update network parameters. Models were then evaluated on multimodal multi-label classification on both the MIMIC-CXR and OpenI test sets. We assessed each scenario's fine-tuning sensitivity to random seed initialization by fine-tuning each pretrained model 20 times. The mean per-label AUC, average AUC and sample-weighted average AUC (wAvg AUC), along with their standard deviations, were reported.
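The reported metrics can be computed with scikit-learn as sketched below. We assume the sample-weighted average corresponds to scikit-learn's "weighted" averaging, which weights each label's AUC by its number of positive samples; this is our interpretation rather than a detail taken from the released code.

import numpy as np
from sklearn.metrics import roc_auc_score

def summarise_auc(y_true: np.ndarray, y_score: np.ndarray):
    # y_true, y_score: (n_samples, n_labels) binary labels and predicted probabilities.
    per_label = [roc_auc_score(y_true[:, j], y_score[:, j])
                 for j in range(y_true.shape[1])]
    avg = roc_auc_score(y_true, y_score, average="macro")       # "Avg" in Table 3
    w_avg = roc_auc_score(y_true, y_score, average="weighted")  # "wAvg" in Table 3
    return per_label, avg, w_avg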

3. Results

Table 3 reports the mean per-label and average AUC for each scenario on the MIMIC-CXR and OpenI test datasets. On the internal evaluation, 'MLM, MFR' reported the strongest average AUC (0.989) and the best or equal-best per-label AUC in 8/13 finding categories (Table 3). Other training scenarios that incorporated MLM ('MLM, MFR, ITM', 'MLM, ITM', 'MLM (text)', and 'MLM') achieved comparable performance, with average AUC greater than 0.982, and collectively contained the highest per-label AUCs. When the pretraining and fine-tuning distributions differed (external evaluation, Table 3), little benefit was observed in terms of mean AUC. All pretrained models obtained high AUC scores, with a minimum average AUC/weighted average AUC of 0.921/0.926 ('MFR'). The 'No Pretrain' baseline reported the highest weighted average AUC and the highest per-label AUC in 3/7 categories. Models lacking MLM pretraining reported increased variability in AUC across label categories, with categories having few training samples reporting lower AUC than those with relatively many. This pattern was not evident in the OpenI results, where label proportionality was more balanced.

4. Discussion

4.1. Classification Performance

Our results on the MIMIC-CXR test set demonstrate a benefit over both (a) no pretraining ('No Pretrain') and (b) general-domain pretraining ('VBert-COCO'). However, given the strong performance of 'MLM (text)', there appears to be little additional benefit from incorporating the visual inputs. We suggest the strong text-only performance is due in part to the demonstrated prior bias of these model types towards text, and in part to the task-relevant information summarized in the radiologist's text report.
Our results on the OpenI test suggest there may be less benefit provided by pretraining when the test data are from a different source. However, all models (including baseline) reported high performance (Avg. AUC ≥ 0.921), demonstrating a strong multimodal classification capability inherent to the VL BERT architecture. We suggest further investigation is required, using more demanding tasks, in order to assess generalization ability comprehensively.

4.2. Sensitivity to Label Imbalance

Our study found that models lacking MLM pretraining report increased variability in AUC across label categories. BCE loss provides a supervisory signal proportionate to the number of training samples, and our experiments suggest that incorporating suitable pretext tasks (MLM in this situation) is beneficial to training models that are robust to this supervision imbalance, at least when evaluated on multi-label classification problems.

4.3. Sensitivity to Random Initialisation

Each model was pretrained once and fine-tuned 20 times with different random seeds. A number of pretraining scenarios showed considerable fine-tuning sensitivity to random seed initialization when compared with the baseline models. On the MIMIC-CXR dataset, all single-task pretraining scenarios trained on multimodal input data (i.e., 'MLM', 'MFR', 'ITM') reported higher variability of average AUC than all composite pretraining scenarios incorporating MLM. This was also the case on the OpenI test set, although 'MLM (text)' additionally reported large variability (σ = 0.0235). These results suggest that composite pretraining tasks produce more consistent results, and highlight the benefit of investigating fine-tuning sensitivity as part of the pretext task evaluation process.

4.4. Strengths and Limitations

This study has a number of strengths. It focuses on controlled experimentation of the pretext task component, removing variation due to model architecture and training/data protocols. Furthermore, it includes repeated evaluations to report the variance of performance, which, while common in health research, is less so in machine learning. Another strength is its focus on reproducibility, publishing all data processing protocols and the framework implementation (code). Diverging from the trend of machine learning research towards cluster computing, the implementation was developed to run on a single consumer-grade computer, which may appeal to smaller research groups and health researchers with modest computing budgets.
A limitation of this study was that computational resources constrained our exploration of areas such as pretext task scenarios, hyperparameter tuning, variability analysis, and the choice of embedding layers. Only a single model architecture and implementation (VisualBERT joint encoder, Mask R-CNN visual encoder, BERT word embeddings) was tested, although similar results are expected with other choices.
Even though our results found little evidence to support pretraining when the test distribution differed from the training distribution, drawing a reliable conclusion would require considerably more investigation into distribution shift. This analysis was outside the scope of our study and is a topic for future research. The framework of this study provides a simple proof of concept for the analysis of pretext tasks used in pretraining models for pathoanatomical classification with multimodal radiology data. There is much opportunity to expand upon this work to gain deeper insight into the benefits and limitations of self-supervised learning applied within radiology.

5. Conclusions

This study introduced and implemented a model-agnostic framework for evaluating self-supervised pretext tasks with multimodal radiology data on pathoanatomical classification. We conducted controlled studies of a selection of widely used pretext tasks with a VL BERT type transformer architecture in order to understand their performance and limitations. Pretraining with a composite objective of pretext tasks was found to improve classification performance, reduce sensitivity to class imbalance in the multi-label setting, and reduce fine-tuning variance, predominantly when the training and testing data are sourced from the same distribution. When the training and testing distributions differed, pretraining provided little benefit to classification performance. Our results provide an evidence-based assessment of the relative performance and reliability of multimodal self-supervised pretext tasks used to train pathoanatomical classification models for radiology applications.

Author Contributions

Conceptualization, M.C., J.F.D. and L.D.; writing, M.C.; methodology, M.C.; implementation, M.C.; analysis, M.C.; data curation, M.C.; writing—review and editing, M.C., J.F.D., M.C.J. and L.D.; supervision, J.F.D. and L.D.; project administration, J.F.D. and L.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available from the original authors' works [14,22].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Our code repository can be accessed at https://github.com/mwcoleman/prerade/.
This includes the following:
  • Framework code, implementation instructions, and environment dependency information.
  • Visual feature extraction code and implementation instructions to extract features for custom datasets.
  • Data preprocessing Jupyter notebook.
  • Table and Figure generation Jupyter notebook.
  • Data splits.
For implementation instructions refer to the readme.md in the main directory.
Table A1. The experimental setup: model components and training protocol.
Hyperparameter | Value | Comment
Training Steps | 200,000 |
Learning Rate | 0.00005 |
Weight Decay | 0 |
Dropout | 0.3 |
LR Scheduler | Yes | Linear
Warmup ratio | 0.15 |
Joint encoder model | VisualBERT | transformers.VisualBertForPreTraining
Max sequence length (text tokens) | 125 |
Sequence length (image tokens) | 36 |
Batch size | 64 |
Number of Attn Heads | 12 |
Number of Attn Layers | 12 |
Hidden size | 768 |
Visual embedding dimension | 2048 |
Training hardware | One RTX 3090 | Approx. 22 h for 200,000 steps
Joint Encoder initialization | Pretrained | uclanlp/visualbert-vqa-coco-pre
Tokenizer initialization | Pretrained | bert-base-uncased
Visual encoder | Mask-RCNN | mask_rcnn_R_101_FPN_3x.yaml (Detectron2)

References

  1. Yang, X.; He, X.; Liang, Y.; Yang, Y.; Zhang, S.; Xie, P. Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms. arXiv 2020, arXiv:2007.04234.
  2. Jing, L.; Tian, Y. Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4037–4058.
  3. Moon, J.H.; Lee, H.; Shin, W.; Kim, Y.H.; Choi, E. Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training. IEEE J. Biomed. Health Inform. 2022, arXiv:2105.11333.
  4. Khare, Y.; Bagal, V.; Mathew, M.; Devi, A.; Priyakumar, U.D.; Jawahar, C.V. MMBERT: Multimodal BERT Pretraining for Improved Medical VQA. In Proceedings of the IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 1033–1036.
  5. Wang, X.; Xu, Z.; Tam, L.; Yang, D.; Xu, D. Self-supervised Image-Text Pre-training with Mixed Data in Chest X-rays. arXiv 2021, arXiv:2103.16022.
  6. Tamkin, A.; Liu, V.; Lu, R.; Fein, D.; Schultz, C.; Goodman, N. DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning. arXiv 2021, arXiv:2111.12062.
  7. Su, L.; Duan, N.; Cui, E.; Ji, L.; Wu, C.; Luo, H.; Liu, Y.; Zhong, M.; Bharti, T.; Arun, S. GEM: A General Evaluation Benchmark for Multimodal Tasks. arXiv 2021, arXiv:2106.09889.
  8. Chen, Y.C.; Li, L.; Yu, L.; Kholy, A.E.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: UNiversal Image-TExt Representation Learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
  9. Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557.
  10. Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. arXiv 2019, arXiv:1908.07490.
  11. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Adv. Neural Inf. Process. Syst. 2019, 32.
  12. Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv 2019, arXiv:1908.08530.
  13. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
  14. Johnson, A.E.W.; Pollard, T.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.; Peng, Y.; Lu, Z.; Mark, R.G.; Berkowitz, S.J.; Horng, S. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv 2019, arXiv:1901.07042.
  15. Huang, Z.; Zeng, Z.; Liu, B.; Fu, D.; Fu, J. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arXiv 2020, arXiv:2004.00849.
  16. Bugliarello, E.; Cotterell, R.; Okazaki, N.; Elliott, D. Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs. Trans. Assoc. Comput. Linguist. 2021, 9, 978–994.
  17. Aggarwal, R.; Sounderajah, V.; Martin, G.; Ting, D.S.; Karthikesalingam, A.; King, D.; Ashrafian, H.; Darzi, A. Diagnostic accuracy of deep learning in medical imaging: A systematic review and meta-analysis. NPJ Digit. Med. 2021, 4, 1–23.
  18. Peng, Y.; Wang, X.; Lu, L.; Bagheri, M.; Summers, R.; Lu, Z. NegBio: A high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits Transl. Sci. Proc. 2018, 2018, 188–196.
  19. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 590–597.
  20. Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the Stratification of Multi-label Data. In Machine Learning and Knowledge Discovery in Databases; Lecture Notes in Computer Science; Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 145–158. ISBN 978-3-642-23808-6.
  21. Szymański, P.; Kajdanowicz, T. A Network Perspective on Stratification of Multi-Label Data. In Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, Skopje, Macedonia, 22 September 2017; pp. 22–35.
  22. Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2016, 23, 304–310.
  23. Li, Y.; Wang, H.; Luo, Y. A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Virtual Event, 16–19 December 2020; pp. 1999–2004.
  24. Dipnall, J.F.; Page, R.; Du, L.; Costa, M.; Lyons, R.A.; Cameron, P.; de Steiger, R.; Hau, R.; Bucknill, A.; Oppy, A. Predicting fracture outcomes from clinical registry data using artificial intelligence supplemented models for evidence-informed treatment (PRAISE) study protocol. PLoS ONE 2021, 16, e0257361.
  25. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
Figure 1. The evaluation framework overview. Note: Eight pretraining tasks combinations were used to pretrain eight model instances. These were then fine-tuned, along with two baseline models, and evaluated 20 times on pathoanatomical classification.
Figure 2. VL BERT architectures: single stream with concatenated input embeddings and self-attention (left) vs. dual stream with explicit cross-modal attention (right). Note: ‘K’, ‘Q’, ‘V’ are the attention key, query, and value terms, ‘W’ is a text input and ‘V’ is a visual input.
Figure 3. An overview of the architecture used in the framework. Note: Model components shown within dotted boundary. e v was modelled as a fixed parameter image feature extractor, e w as a learnable word embedding encoder, h v and h p as learnable linear transform layers, f θ as a learnable joint modality encoder, and g θ as a learnable (task specific) projection head. An example MFR pretext process is shown in green to demonstrate interaction between model and pretraining framework.
Table 1. Glossary of notation.
Symbol | Description
w / v | a single input (word / image feature)
ŵ / v̂ | a single (word / image) embedding
w̄ / v̄ | a masked (word / image) embedding
w / v | {w_1, w_2, ..., w_m} / {v_1, v_2, ..., v_n}
m / n | the number of (word / image) features for a single input
m(·) | pretext manipulation function (e.g., masking)
e_θ | embedding encoder
f_θ | joint modality encoder
h_θ | intermediate linear transform
g_θ | projection head
θ | collective model and pretext parameters
D | a (training / testing) dataset
Table 2. The proportion of samples having positive findings in each label category for each dataset used in experiments.
Label | MIM Pretrain | MIM Finetune | MIM Test | OpenI Test
Atelectasis | 23.29% | 26.49% | 25.67% | 7.95%
Cardiomegaly | 23.04% | 26.09% | 26.08% | 8.55%
Consolidation | 5.52% | 6.63% | 6.07% | 0.76%
Edema | 15.80% | 17.37% | 17.58% | 1.09%
Enlarged CM | 3.50% | 3.60% | 3.64% | 0.00%
Fracture | 1.68% | 2.04% | 2.23% | 0.00%
Lung Lesion | 2.13% | 2.50% | 2.39% | 0.00%
Lung Opacity | 24.59% | 27.92% | 26.04% | 0.00%
Pleural Effusion | 27.36% | 29.80% | 30.19% | 1.41%
Pleural Other | 0.64% | 0.87% | 0.63% | 0.00%
Pneumonia | 7.07% | 7.68% | 7.79% | 0.98%
Pneumothorax | 5.31% | 5.67% | 5.67% | 0.60%
Support Devices | 38.40% | 42.63% | 42.39% | 0.00%
Number of Samples | 112,124 | 6229 | 6229 | 3684
Note: The proportions are given as a percentage of the total number of samples in the dataset (bottom row). MIM = MIMIC-CXR dataset.
Table 3. The per-label average and sample weighted average AUC for multimodal classification on MIMIC-CXR and OpenI test sets.
MIMIC-CXR
Label | # Cases | % Prev | MLM, MFR, ITM | MLM, MFR | MLM, ITM | MLM | MFR, ITM | MFR | ITM | No Pretrain | VBert-COCO | MLM (text)
Atelectasis | 1585 | 25.45 | 0.993 (0.0006) | 0.994 (0.0008) | 0.991 (0.0011) | 0.992 (0.0058) | 0.983 (0.0101) | 0.968 (0.0421) | 0.984 (0.0087) | 0.973 (0.0012) | 0.989 (0.0016) | 0.993 (0.0007)
Cardiomegaly | 1621 | 26.03 | 0.992 (0.0007) | 0.994 (0.0005) | 0.991 (0.0018) | 0.990 (0.0133) | 0.974 (0.0194) | 0.940 (0.0696) | 0.978 (0.0156) | 0.950 (0.0064) | 0.989 (0.0019) | 0.993 (0.0010)
Consolidation | 376 | 6.04 | 0.988 (0.0017) | 0.986 (0.0022) | 0.984 (0.0039) | 0.983 (0.0250) | 0.972 (0.0349) | 0.903 (0.0988) | 0.963 (0.0432) | 0.942 (0.0372) | 0.984 (0.0132) | 0.989 (0.0025)
Edema | 1076 | 17.28 | 0.995 (0.0007) | 0.995 (0.0009) | 0.993 (0.0012) | 0.993 (0.0087) | 0.989 (0.0059) | 0.965 (0.0533) | 0.988 (0.0059) | 0.971 (0.0018) | 0.992 (0.0013) | 0.995 (0.0009)
Enlarged CM | 224 | 3.60 | 0.953 (0.0193) | 0.954 (0.0130) | 0.929 (0.0272) | 0.950 (0.0602) | 0.800 (0.1476) | 0.798 (0.1606) | 0.794 (0.1433) | 0.759 (0.0155) | 0.944 (0.0188) | 0.964 (0.0107)
Fracture | 132 | 2.12 | 0.987 (0.0030) | 0.983 (0.0046) | 0.982 (0.0040) | 0.973 (0.0550) | 0.867 (0.1165) | 0.820 (0.1635) | 0.852 (0.1319) | 0.777 (0.0182) | 0.968 (0.0349) | 0.984 (0.0034)
Lung Lesion | 163 | 2.62 | 0.994 (0.0015) | 0.993 (0.0015) | 0.988 (0.0064) | 0.977 (0.0476) | 0.898 (0.0934) | 0.876 (0.1254) | 0.893 (0.0966) | 0.794 (0.0245) | 0.951 (0.0352) | 0.994 (0.0017)
Lung Opacity | 1602 | 25.72 | 0.993 (0.0005) | 0.994 (0.0008) | 0.992 (0.0018) | 0.992 (0.0045) | 0.982 (0.0103) | 0.973 (0.0480) | 0.982 (0.0107) | 0.973 (0.0012) | 0.991 (0.0013) | 0.993 (0.0008)
Pleural Effusion | 1865 | 29.95 | 0.995 (0.0005) | 0.995 (0.0006) | 0.993 (0.0010) | 0.993 (0.0073) | 0.987 (0.0078) | 0.974 (0.0287) | 0.987 (0.0065) | 0.968 (0.0009) | 0.993 (0.0011) | 0.995 (0.0007)
Pleural Other | 40 | 0.64 | 0.980 (0.0190) | 0.993 (0.0035) | 0.975 (0.0141) | 0.974 (0.0518) | 0.899 (0.0845) | 0.852 (0.1475) | 0.895 (0.0861) | 0.708 (0.0398) | 0.924 (0.0404) | 0.994 (0.0081)
Pneumonia | 479 | 7.69 | 0.981 (0.0022) | 0.982 (0.0030) | 0.974 (0.0052) | 0.971 (0.0327) | 0.909 (0.0704) | 0.883 (0.0979) | 0.901 (0.0742) | 0.880 (0.0201) | 0.971 (0.0211) | 0.981 (0.0066)
Pneumothorax | 345 | 5.54 | 0.991 (0.0020) | 0.987 (0.0024) | 0.987 (0.0025) | 0.987 (0.0195) | 0.977 (0.0139) | 0.922 (0.0856) | 0.973 (0.0200) | 0.928 (0.0106) | 0.988 (0.0034) | 0.991 (0.0015)
Support Devices | 2648 | 42.52 | 0.992 (0.0011) | 0.994 (0.0006) | 0.989 (0.0017) | 0.990 (0.0047) | 0.984 (0.0080) | 0.976 (0.0378) | 0.984 (0.0070) | 0.965 (0.0019) | 0.990 (0.0014) | 0.992 (0.0016)
Avg | | | 0.987 (0.0025) | 0.989 (0.0016) | 0.982 (0.0044) | 0.982 (0.0250) | 0.940 (0.0461) | 0.912 (0.0821) | 0.936 (0.0476) | 0.891 (0.0053) | 0.975 (0.0082) | 0.989 (0.0020)
wAvg | | | 0.992 (0.0007) | 0.993 (0.0008) | 0.989 (0.0019) | 0.989 (0.0109) | 0.973 (0.0178) | 0.954 (0.0495) | 0.973 (0.0170) | 0.952 (0.0024) | 0.988 (0.0025) | 0.992 (0.0011)
OpenI
Label | # Cases | % Prev | MLM, MFR, ITM | MLM, MFR | MLM, ITM | MLM | MFR, ITM | MFR | ITM | No Pretrain | VBert-COCO | MLM (text)
Atelectasis | 293 | 7.95 | 0.957 (0.0145) | 0.956 (0.0146) | 0.948 (0.0123) | 0.954 (0.0161) | 0.964 (0.0138) | 0.933 (0.1072) | 0.957 (0.0162) | 0.985 (0.0014) | 0.937 (0.0175) | 0.966 (0.0054)
Cardiomegaly | 315 | 8.55 | 0.973 (0.0037) | 0.970 (0.0030) | 0.975 (0.0033) | 0.971 (0.0160) | 0.974 (0.0039) | 0.924 (0.1142) | 0.973 (0.0050) | 0.962 (0.0038) | 0.967 (0.0034) | 0.966 (0.0262)
Consolidation | 28 | 0.76 | 0.956 (0.0139) | 0.951 (0.0127) | 0.955 (0.0220) | 0.951 (0.0219) | 0.951 (0.0154) | 0.907 (0.0865) | 0.953 (0.0204) | 0.929 (0.0139) | 0.937 (0.0197) | 0.943 (0.0270)
Edema | 40 | 1.09 | 0.984 (0.0095) | 0.981 (0.0068) | 0.989 (0.0069) | 0.986 (0.0063) | 0.986 (0.0082) | 0.962 (0.0923) | 0.983 (0.0118) | 0.993 (0.0016) | 0.971 (0.0158) | 0.988 (0.0058)
Pneumonia | 36 | 0.98 | 0.985 (0.0049) | 0.988 (0.0046) | 0.985 (0.0082) | 0.972 (0.0492) | 0.967 (0.0237) | 0.871 (0.1313) | 0.967 (0.0240) | 0.937 (0.0357) | 0.957 (0.0101) | 0.959 (0.0826)
Pneumothorax | 22 | 0.60 | 0.965 (0.0187) | 0.973 (0.0138) | 0.969 (0.0121) | 0.953 (0.0274) | 0.956 (0.0196) | 0.923 (0.0990) | 0.955 (0.0195) | 0.977 (0.0077) | 0.943 (0.0210) | 0.972 (0.0246)
Pleural Effusion | 140 | 3.80 | 0.960 (0.0163) | 0.967 (0.0102) | 0.955 (0.0106) | 0.975 (0.0095) | 0.959 (0.0143) | 0.928 (0.0912) | 0.957 (0.0133) | 0.951 (0.0043) | 0.956 (0.0098) | 0.970 (0.0125)
Avg | | | 0.968 (0.0054) | 0.969 (0.0036) | 0.968 (0.0045) | 0.966 (0.0141) | 0.965 (0.0062) | 0.921 (0.0937) | 0.964 (0.0071) | 0.962 (0.0053) | 0.953 (0.0054) | 0.966 (0.0235)
wAvg | | | 0.966 (0.0067) | 0.965 (0.0052) | 0.963 (0.0046) | 0.966 (0.0099) | 0.967 (0.0055) | 0.926 (0.0986) | 0.964 (0.0065) | 0.968 (0.0019) | 0.953 (0.0065) | 0.967 (0.0158)
Note: Results are reported as mean and standard deviation over 20 fine-tuning episodes, varying the random seed. Orange (teal) represents the highest (lowest) mean AUC across scenarios, with ties broken by standard deviation.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
