Article

COVID-19-Associated Lung Lesion Detection by Annotating Medical Image with Semi Self-Supervised Technique

1 Department of Computer Science and Engineering, Sungkyunkwan University, Suwon 16419, Korea
2 Respiratory Department, Cho Ray Hospital, Ho Chi Minh 700000, Vietnam
3 Department of Computing Science, Umeå University, 901 87 Umeå, Sweden
* Author to whom correspondence should be addressed.
Electronics 2022, 11(18), 2893; https://doi.org/10.3390/electronics11182893
Submission received: 12 August 2022 / Revised: 1 September 2022 / Accepted: 3 September 2022 / Published: 13 September 2022
(This article belongs to the Special Issue Recent Advances in Biomedical Image Processing and Analysis)

Abstract

Diagnosing COVID-19 infection through the classification of chest images using machine learning techniques faces many controversial problems owing to the intrinsic nature of medical image data and classification architectures. The detection of lesions caused by COVID-19 in the human lung with properties such as location, size, and distribution is more practical and meaningful to medical workers for severity assessment, progress monitoring, and treatment, thus improving patients’ recovery. We proposed a COVID-19-associated lung lesion detector based on an object detection architecture. It correctly learns disease-relevant features by focusing on lung lesion annotation data of medical images. An annotated COVID-19 image dataset is currently nonexistent. We designed our semi-self-supervised method, which can extract knowledge from available annotated pneumonia image data and guide a novice in annotating lesions on COVID-19 images in the absence of a medical specialist. We prepared a sufficient dataset with nearly 8000 lung lesion annotations to train our deep learning model. We comprehensively evaluated our model on a test dataset with nearly 1500 annotations. The results demonstrated that the COVID-19 images annotated by our method significantly enhanced the model’s accuracy by as much as 1.68 times, and our model competes with commercialized solutions. Finally, all experimental data from multiple sources with different annotation data formats are standardized into a unified COCO format and publicly available to the research community to accelerate research on the detection of COVID-19 using deep learning.

1. Introduction

We have witnessed the adverse effects of COVID-19 on humanity. Many machine learning (ML)-based solutions have emerged, suggesting that the classification of chest radiographs using artificial intelligence (AI) can serve as a diagnostic tool for COVID-19 detection. However, Cleverley et al. [1] claimed that no single feature of COVID-19 on chest radiography is specific or diagnostic for the precise detection of COVID-19. Consequently, COVID-19 image datasets cannot provide distinguishable features for detection.
Rather than detecting the COVID-19 infection using chest radiography, which is not approved by medical experts [2,3,4], the detection of injuries in the human lung caused by the disease, i.e., COVID-19-associated lung lesion, is more feasible. Because these lesions can accurately reflect disease evolution, they can facilitate the assessment of severity and monitoring of the treatment effect. Unfortunately, these lesions can be very subtle and, thus, may be a diagnostic challenge even for experienced thoracic radiologists. Therefore, this work demonstrates the automated COVID-19-associated lung lesion detection by AI as a computer-aided diagnosis (CAD) solution that assists medical professionals in the identification of the more subtle lesions that could escape the human eyes and alleviate their daily workload at overloaded hospitals.
Careful data preparation is critical to successfully training the ML model. Maguolo et al. [5] reported that an ML model for COVID-19 detection may learn features specific to the dataset rather than to the disease, resulting in incorrect predictions. Cohen et al. [6] presented evidence that an ML model trained solely on one data source achieved excellent results on test data from the same source but showed low accuracy on test data from different sources. COVID-19 is a viral infection (others include H1N1, SARS, and MERS); thus, the lesions of COVID-19 are the same as those of other community-acquired pneumonia. Therefore, in this work, we also employ pneumonia image data and incorporate datasets from various sources alongside the COVID-19 image dataset to increase the variety of data, thus mitigating the bias of single data sources and enhancing lesion detection.
Regulation of the ML model to learn the disease-relevant features for correct detection is another important factor for successfully training the ML model. Majeed et al. [7] demonstrated that the detection of the COVID-19 infection by AI is not reliable because the ML model will focus on artifacts (e.g., texts, markers, medical device traces) or regions/features that have no relation with the disease to build its prediction result. To overcome this weakness, we employ the object detection architecture for our ML model, which is trained by the indicated (annotated) region, thus guaranteeing that the model only learns from the right features.
Finally, we need large annotated image datasets to reinforce the accuracy of object detection-based ML models, especially for medical images. A COVID-19 chest image dataset with lung lesion annotations is currently nonexistent. We must annotate the lung lesions on COVID-19 chest images to employ them as training inputs. Only experienced radiologists and medical experts can perform the annotation task on medical images. The greatest challenge is the enormous time required to annotate massive datasets, which can contain thousands or millions of images. To address this challenge, this work utilizes a semi-self-supervised annotation method, which extracts the annotated lung lesions on pneumonia images into supervision that guides a novice annotator (a nonmedical professional) in producing annotations on COVID-19 images.
There are two main imaging modalities for chest radiography: computed tomography (CT) and chest X-ray (CXR). CT has a high diagnostic sensitivity. Nonetheless, CT image acquisition is time-consuming, manual, and requires expert involvement [8]. CXR imaging is relatively cost-effective, portable, and easy to perform in a ventilated patient bed, which minimizes the risk of cross-infection between healthcare workers and other patients. CXR can be repeated over time to monitor the evolution of lung diseases. Hence, CXR is an ideal imaging modality for routine monitoring. Therefore, this study employed only CXR image data.
The main contributions of this study are as follows:
  • We attempt to adapt the sophisticated deep neural network for COVID-19-associated lung lesion detection. To the best of our knowledge, our lung lesion detector is the first object detection architecture model designed for COVID-19 with the ability to provide the type of lesions and their properties on CXR.
  • We introduce a semi-self-supervised method, which sheds light on a controversial problem: the annotation of medical data by a novice under the supervision of AI. Extensive experiments demonstrate that COVID-19 images annotated by our method can boost the performance of our COVID-19-associated lung lesion detector by up to 1.68× in terms of the AP50 metric.
  • We standardized three public datasets into the COCO format, yielding a dataset of 2895 CXRs with 7943 bounding box annotations for three types of COVID-19 lung lesions. We also reannotated three public datasets to evaluate our model. Finally, we provide 235 lung-lesion-annotated COVID-19 CXRs with 1192 annotations, collected since the first outbreak in a province of northern Vietnam. All evaluation data were annotated by our experienced physicians. These data significantly encourage reproducibility.
The remainder of this paper is organized as follows. First, in Section 2, we introduce recent related research on the application of deep learning (DL) in COVID-19 detection. Subsequently, we sequentially describe the lung lesions, the COCO standardization process, our annotation method, and the object detection architecture of the deep neural network governing our lung lesion detector in Section 3. Section 4 contains the results of our initial experiments, which act as the baseline for later comparison. Next, in Section 5, we explain the training process with a variety of hyperparameter plannings, and the results of the evaluation are documented in detail. We deeply analyze our results and limitations and discuss future directions in Section 6. We end the paper in Section 7.

2. ML-Based COVID-19 Detection Models

Owing to privacy policies on patient data, it is difficult to acquire a sufficient volume of COVID-19 radiography data. Previous studies employed various approaches and methods to improve the detection accuracy of COVID-19 infection by classifying COVID-19 and non-COVID-19 images. Shervin et al. [9] pioneered the detection of COVID-19 infection using their proposed Deep-COVID framework of five popular convolutional networks on their COVID-Xray-5k dataset. Linda et al. [10] tailored their COVID-Net with a lightweight design pattern that enables enhanced representation capacity while maintaining reduced complexity to classify normal, non-COVID-19, and COVID-19 CXRs. Michael et al. [11] applied transfer learning to tune their model for COVID-19 detection against pneumonia or normal cases on three lung imaging modalities: CT, CXR, and ultrasound. Matteo et al. [12] presented a complete AI-based system that achieves both COVID-19 identification and lesion categorization on CT with an accuracy claimed to be on par with, or even higher than, that of human experts. According to medical expert consultants [1,2], no single feature of chest radiography is diagnostic of COVID-19; in fact, normal radiography does not exclude the possibility of COVID-19. The detection of COVID-19 via radiography with AI has no scientific basis, and it is impractical.
In other developments, researchers have attempted to automate the assessment of the severity of COVID-19-infected chest images by AI. In this approach, COVID-19 images are severity-assessed and assigned a quantitative value by radiologists. Neural networks are then trained to map the features associated with pneumonia/COVID-19 to a severity scoring system. Cohen et al. [13] scored their famous COVID-19 Image Data Collection [14] using the Radiographic Assessment of Lung Oedema (RALE) [15] scoring system, whereas Li et al. [16] modified that scoring system into their mRALE and developed their pulmonary X-ray severity score for severity quantification. Signoroni et al. [17] employed a multi-region and multivalued Brixia score. These AI solutions work as a black box, giving only their final estimated severity score without the medical evidence (lung lesions), and thus lack explainability. Hence, they will not be approved by medical specialists in sensitive environments, such as healthcare and hospitals. Moreover, this approach requires the data to be assessed and scored by medical experts, which is difficult to reproduce for more careful study and development.
Our work attempts to approach the COVID-19 problem scientifically, detecting the COVID-19-associated lung lesion using radiography, especially the CXR. This is a feasible approach because there are already medically approved commercial software solutions for the accurate detection of lung lesions based on the Hounsfield Unit of CT images (e.g., IntelliSpace COPD [18] by Philips, UK). Moreover, our work can alleviate the problem of explainability in previous COVID-19 severity assessments by AI through the provision of COVID-19-associated lung lesions (e.g., axial distribution, type, volume); thus, other researchers can derive their severity assessment and persuasively explain their result with the medical evidence.

3. Methods

As mentioned above, the training data should be carefully prepared to maximize the learning capability of the ML model. Hence, we conducted our research in the following order, which helped us verify our assumption about the relationship between lung lesions caused by pneumonia and COVID-19, thus, compiling a sufficient dataset to train our COVID-19-associated lung lesion detector:
  • To facilitate the involvement and combination of various domain datasets, we construct our framework to standardize multiple public datasets with their annotation meta-data into COCO format. This will facilitate later joint training and evaluation of our model as well as future research. The detailed process is described in a later section and in the Appendix A.
  • Preliminary experiments were conducted to verify our assumption. The results confirm that COVID-19 and community-acquired pneumonia cause the same injuries in the human lung; thus, we can additionally combine pneumonia CXRs to detect COVID-19-associated lung lesions (i.e., GGO (ground-glass opacity) and consolidation) and improve the detection accuracy.
  • To bypass the requirement for medical professionals and save the large amount of time required to annotate COVID-19 lung lesion images, we developed our semi-self-supervised method. It leverages a prototype model from the preliminary experiments (teacher), which extracts lung lesion features from pneumonia images and then helps us identify lung regions that potentially contain lesions on unannotated COVID-19 CXRs, in the form of pseudo-boxes. A novice annotator is supervised by these pseudo-boxes and then distills them into the annotation data of COVID-19 CXRs. Figure 1 illustrates our proposed method.
  • Finally, we tested our model (Student), which was trained on a combination of annotated pneumonia and COVID-19 images on an unseen public nonCOVID-19 dataset and our collected COVID-19 CXR and made comparisons with two former models (i.e., baseline and teacher models) from preliminary experiments for comprehensive evaluation.
  • Moreover, we compared our final model (Student) with previous studies from the U.S. National Institutes of Health (NIH) for the pneumonia localization task, which indicates that the object detection model is more efficient than the classification model and heatmap combination because our model was trained on a data volume that was 37 times smaller but achieved a competitive accuracy. Figure 2 shows the entire experimental process.

3.1. Lung Lesion Identification

The two most frequently occurring injuries caused by pneumonia in general and COVID-19, in particular, are GGO and consolidation. GGO manifests as a hazy increased opacity of the lung caused by the partial filling of airspaces, as well as interstitial thickening owing to fluids, cells, and/or fibrosis. GGO is less opaque than consolidation, which appears as a homogeneous increase in pulmonary parenchymal attenuation that obscures the margins of vessels and airway walls. Consolidation is a lung lesion caused by an exudate or other product of a disease that replaces the alveolar air, rendering the lung solid [19].
Typically, the objects to be identified in natural images have distinguishable shapes (e.g., tree, car, or human) and foreign-to-background contexts that are known from everyday life and “easily recognized by a four-year old” [20]. However, the identification of lung lesions on CXR is challenging because lung lesions do not have a clear, permanent shape, and they vary with the anatomical structure and pathology of patients. Moreover, CXR could be under/over-exposed by operating factors (e.g., inappropriate radiation dose and rotation of the patient), patient factors (e.g., high body mass index and chest wall abnormalities), or inappropriate image processing. These factors contribute to the deformity of the lung lesion appearance, which makes lesion interpretation more challenging.
The greatest difficulty arises from the nature of GGO and consolidation. GGO occurs when lung markings are partially obscured by increased whiteness. GGO becomes denser (whiter) and progresses to consolidation with complete loss of lung markings [1] (Figure 3). GGO and consolidation are categorized as two distinct lung lesions; nonetheless, their appearances are quite similar and they share the same form, making differentiating them difficult. This is a nontrivial task and demands specialized training and experience.
Finally, not all lesions detected by the ML model are true GGO and consolidation because there may be other lesion-like occurrences within the CXR. An intermediate bin is required to hold these detected lesion-like appearances. Infiltration describes a region of pulmonary opacification caused by airspace or interstitial disease and is similar to GGO and consolidation in terms of appearance. This term is no longer recommended and has been largely replaced by other descriptors [19] (e.g., replacing opacity with relevant qualifiers). However, we still decided to include it as a COVID-19-associated lung lesion in our study. Whenever the ML model recognizes a lesion-like appearance but cannot identify it as GGO or consolidation, it assigns that lesion to the infiltration class.
Generally, this poses a significant challenge with regard to correctly locating and identifying GGO and consolidation. AI has demonstrated high accuracy for the same task on CT images. Because of the intrinsic limitation of the two-dimensional projection image, in which various anatomic or pathologic structures overlap, CXR has a lower sensitivity than CT [21,22,23] and is prone to reading errors and inter- or intrareader variability [24]. To the best of our knowledge, our work is the first attempt to accomplish lung lesion detection in this much more difficult context: on the CXR image.

3.2. Semi Self-Supervised

There were inevitable difficulties in acquiring annotations for medical image data:
  • The annotation of medical images is tedious and time-consuming because specialized training and experience are needed to interpret medical images correctly.
  • Correct annotation requires extensive input from medical experts; in particular, multiple expert opinions are required to overcome human error.
  • Clinical experts, particularly qualified experts, are few, as are cases of rare diseases.
In practice, it is difficult to always have professional personnel handling the required volume of data, especially annotating the number of medical images required for DL. Can a novice annotator replace an expert annotator? According to previous studies [25,26], novices were less accurate than individuals with average expertise in tasks of recognizing the severity of breast cancer lesions on mammograms (another type of X-ray medical image). Moreover, numerous annotations are not beneficial if performed at a high error rate [27]. Unsurprisingly, the answer to the above question is, “No”. However, we still need to free specialists from repeated and tedious tasks, such as the annotation of the training data. Therefore, the development of a general solution is critical: it will assist other researchers to overcome similar problems and promote future medical DL-related studies, which typically require the annotation of thousands and millions of data samples.
Recently, researchers from Stanford University proposed a self-training approach to classifying CXR, which relabels uncertainty labels by trained models and then retrains the model until convergence [28]. Generally, researchers from Google reported that self-supervised (in which unannotated data are annotated based on the annotation given by a pre-trained model) and self-training combination can enhance the detection accuracy for normal object detection on natural images [29]. Based on these studies, we developed our self-supervised annotation method for medical CXR images. In this section, we formally describe our semi-self-supervised methodology.
Given the annotated CXR image set $D = \{(I_1, T_1), (I_2, T_2), \ldots, (I_n, T_n)\}$ and the unannotated image set $U = \{\tilde{I}_1, \tilde{I}_2, \ldots, \tilde{I}_m\}$, where $t_i^j = (x_i^j, y_i^j, w_i^j, h_i^j) \in T_i$ contains the top-left coordinate, width, and height of the $j$-th ground truth bounding box (interchangeable with the term instance hereafter) in the $i$-th CXR $I_i$. We first train the teacher model $f(I, \theta_t^*)$ on the annotated CXRs, where $\theta_t^*$ represents the model weights. Subsequently, we produce pseudo-boxes for each unannotated CXR $\tilde{I}_i \in U$:
$$(\tilde{T}_i, \tilde{L}_i, \tilde{S}_i) \leftarrow f(\tilde{I}_i, \theta_t^*), \quad i = 1, \ldots, m,$$
where each pseudo-box $\tilde{t}_i^j \in \tilde{T}_i$ is also associated with a lesion label $\tilde{l}_i^j \in \{\mathrm{GGO}, \mathrm{Consolidation}, \mathrm{Infiltration}\}$, $\tilde{l}_i^j \in \tilde{L}_i$, and a confidence score $\tilde{s}_i^j \in \tilde{S}_i$.
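In practice, this pseudo-box generation step can be run with Detectron2's standard inference utilities. The following is a minimal Python sketch, assuming a hypothetical teacher checkpoint file, a hypothetical input CXR path, and the class ordering GGO, consolidation, infiltration; it is illustrative rather than our exact implementation.

    import cv2
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"))
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3                 # GGO, consolidation, infiltration
    cfg.MODEL.WEIGHTS = "teacher_model_final.pth"       # hypothetical trained teacher weights
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.01        # keep low-confidence pseudo-boxes as well

    predictor = DefaultPredictor(cfg)
    outputs = predictor(cv2.imread("unannotated_covid_cxr.png"))   # hypothetical COVID-19 CXR
    instances = outputs["instances"].to("cpu")
    pseudo_boxes = instances.pred_boxes.tensor.numpy()  # pseudo-boxes as (x1, y1, x2, y2) rows
    labels = instances.pred_classes.numpy()             # lesion labels as class indices
    scores = instances.scores.numpy()                   # confidence scores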
Owing to the insufficient data volume for training, the teacher model generates numerous pseudo-boxes $\tilde{t}_i^j$ on each unannotated CXR $\tilde{I}_i$, which cannot precisely indicate the location of the lesions within the CXR. Nevertheless, we assume that these pseudo-boxes are distributed and converge around areas where potential lesions are likely to be present. In practice, pseudo-boxes were distributed into several clusters $k$ (usually two or three according to our observation); thus, there might exist $k$ lesions within an $\tilde{I}_i$.
Intuitively, the smallest area with the most overlapped pseudo-boxes likely represents a lesion. This is the basis of the supervision given by the teacher model for localizing potential lesions. Although the unannotated data (SIIM dataset) contain bounding box annotations indicating lung areas affected by COVID-19, most of them are large boxes that usually cover most of the lung field and hence cannot precisely localize the lesion areas. We consider them as heuristic boxes $h_i^j \in \{\text{Typical Appearance}, \text{Indeterminate Appearance}, \text{Atypical Appearance}\}$, $h_i^j \in H_i$, used for reference only when they overlap with our pseudo-boxes. Hence, the final localization of a lesion is the overlap between the most overlapped pseudo-box area and the heuristic boxes.
Next, the type $\tilde{l}_i^j$ is assigned to the localized potential lesions. The simplest approach is to assign $\tilde{l}_i^j$ following the pseudo-box with the highest confidence score (in the ideal case, higher than a predefined threshold) in the cluster containing the localized potential lesion. Sometimes, all pseudo-boxes carry a low confidence score (e.g., less than 5%), forcing us to select the most frequent lesion type in that cluster as the final label. We also focus on balancing the types of lung lesions. Between consolidation and GGO, the former appears brighter (higher contrast) than the latter. When we realized that one type dominated within a CXR, we labeled the remaining localized lesions as the other type based on their visual contrast compared with the other cluster-localized lesions. If we could not decide, we labeled the localized lesion as infiltration. This means that uncertain lesions detected later by our lung lesion detector will be categorized as infiltration.
Finally, inside the loss function of the Faster region-based convolutional neural network (R-CNN) (Equation (1)), which is introduced in a later section, the term $p_i^* L_{reg}$ represents the regression loss $L_{reg}$ (lesion localization loss, regardless of lesion type). It is activated only for foreground proposals ($p_i^* = 1$) and disabled otherwise ($p_i^* = 0$). This mechanism implies that more foreground proposals allow more regression loss $L_{reg}$ to contribute to the training, improving the lesion localization capability of the Faster R-CNN model. A practical way to increase the number of foreground proposals is to increase the number of annotated instances. Fortunately, the foreground proposal is non-label-specific; that is, regardless of whether we correctly label the lesion type $\tilde{l}_i^j$ of an instance, we increase the chance of having more foreground proposals as long as we obtain more instances.
We therefore justify increasing the number of annotated instances by splitting a large instance into multiple smaller ones. For an everyday object such as a car, splitting a complete car annotation simply transforms one correct annotation into multiple incomplete ones; such incorrect annotations may distort the training loss function and degrade accuracy. However, in the case of lung lesions, the splitting approach is an augmentation method similar to the bootstrapping method [30,31] for numeric data in traditional algorithms: we resample the dataset with replacement, so some data samples might be repeated.
As previously mentioned, lung lesions have no permanent or distinguishable shapes. Therefore, from our point of view, splitting a large lesion instance into smaller instances does not destroy its completeness. We hypothesize that this method improves both the localization capability (i.e., providing the position of the detected lesion) and the classification of the lesion (i.e., providing the type of detected lesion) by supplying a variety of lung lesion shapes to the learning model. It may be disadvantageous if the smaller instances cannot maintain the type label $\tilde{l}_i^j$ of the original instance because their visual appearance is distorted. However, even if the smaller instances are incorrectly labeled, as discussed, they still increase the number of foreground proposals. In the worst case, where incorrect labels negatively influence the classification capability of the model, we would have traded off classification capability to gain more lesion localization accuracy and improve the overall detection performance.
We decide that if an instance is larger than one-sixth of the area of the lung field (the area bounded with the green boundary in Figure 3), then it will be split into multiple instances equal to that size. We also attempted to create a bounding box of instances excluding artifacts (e.g., caption, cables, breathing tubes, and medical devices), which may deceive the ML model by focusing on irrelevant features of COVID-19-associated lung lesions. Figure 4 illustrates our semi-self-supervised method. Figure 4a shows pseudo-boxes (i.e., color rectangles attached with a confidence score and label of a lesion) detected by the teacher model and heuristic boxes (i.e., yellow rectangles attached with numeric identifications 1 and 2). Figure 4b shows the most overlapped areas obtained from the pseudo-boxes, which are two red-colored rectangles. Figure 4c shows the final annotated instances after being split from most overlapped areas (i.e., orange and cyan rectangles with numerical identifications 1, 2, 3, 4, 5). We label three instances (2, 4, 5) on the left lung as GGO because the blue pseudo-box (GGO 81%) had the highest confidence score. In contrast, two (1, 3) are labeled as consolidation owing to their brighter contrast than those of GGO instances. Finally, Algorithm 1 summarizes our entire semi-self-supervised method into a pseudo-code. We annotated the lung lesion on COVID-19 CXRs of the Society for Imaging Informatics in Medicine (SIIM) dataset using our method with pseudo-boxes predicted by the teacher model trained on pneumonia and other pulmonary abnormality CXRs of the RSNA3 and VinBig datasets.
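As a minimal sketch of this splitting rule (an illustrative helper of our own, not the exact implementation), an oversized instance can be tiled into a near-square grid of sub-boxes, each no larger than one-sixth of the lung field area:

    import math

    def split_instance(box, lung_area, max_fraction=1/6):
        # box = (x, y, w, h) in pixels; lung_area = area of the lung field in pixels.
        # Split an instance larger than max_fraction of the lung field into a grid
        # of smaller instances of roughly that size.
        x, y, w, h = box
        limit = max_fraction * lung_area
        if w * h <= limit:
            return [box]
        n = math.ceil(w * h / limit)          # number of pieces needed
        cols = math.ceil(math.sqrt(n))        # arrange pieces in a near-square grid
        rows = math.ceil(n / cols)
        sw, sh = w / cols, h / rows
        return [(x + c * sw, y + r * sh, sw, sh)
                for r in range(rows) for c in range(cols)]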
Algorithm 1 Semi Self-Supervised Annotating
Require: $\theta_t^*$, $U$
Ensure: $\tilde{T}$, $\tilde{L}$
for each image $\tilde{I}_i \in U$:
     acquire $(\tilde{T}_i, \tilde{L}_i, \tilde{S}_i)$ from $f(\tilde{I}_i, \theta_t^*)$
     find the number of clusters $c$: $c = \mathrm{CLUSTERING}(\tilde{T}_i)$
         for each cluster $k$ in $[1; c]$:
              $t_i^k = \mathrm{OVERLAPPING}(\mathrm{BELONGTO}(\tilde{T}_i, \text{cluster } k))$
             if $\exists\, \tilde{s}_i^j \in \mathrm{BELONGTO}(\tilde{S}_i, \text{cluster } k) \geq 0.5$:
                   $l_i^k = \tilde{l}_i^j$
             else
                   $l_i^k = \mathrm{MOST}(\mathrm{BELONGTO}(\tilde{L}_i, \text{cluster } k))$
              $t_i^k = \mathrm{OVERLAPPING}(t_i^k, H_i)$
             if $\mathrm{SQR}(t_i^k) > \frac{1}{6}\,\mathrm{SQR}(\text{lung field})$:
                   $\tilde{T}_{temp}, \tilde{L}_{temp} = \mathrm{SPLITTING}(t_i^k, l_i^k)$
                   $\hat{T}_i.\mathrm{APPEND}(\tilde{T}_{temp})$
                   $\hat{L}_i.\mathrm{APPEND}(\tilde{L}_{temp})$
             else
                   $\hat{T}_i.\mathrm{APPEND}(t_i^k)$
                   $\hat{L}_i.\mathrm{APPEND}(l_i^k)$
          $\tilde{T} = \tilde{T} \cup \hat{T}_i$
          $\tilde{L} = \tilde{L} \cup \hat{L}_i$
return $\tilde{T}$, $\tilde{L}$
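The helper operations used in Algorithm 1 can be approximated in a few lines of Python. The sketch below makes two simplifying assumptions of our own: CLUSTERING is realized with DBSCAN on box centers, and OVERLAPPING is taken as the plain intersection of all boxes in a cluster; it is not the exact implementation used in our experiments.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_pseudo_boxes(boxes, eps=100):
        # boxes: (N, 4) array of (x1, y1, x2, y2) pseudo-boxes from the teacher.
        # CLUSTERING step: group boxes whose centers lie close together
        # (eps, in pixels, is an assumed value tuned by eye in practice).
        centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                            (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
        return DBSCAN(eps=eps, min_samples=1).fit_predict(centers)

    def most_overlapped_area(boxes):
        # OVERLAPPING step (simplified): intersection of all boxes in one cluster,
        # i.e., the smallest area covered by every pseudo-box of that cluster.
        x1, y1 = boxes[:, 0].max(), boxes[:, 1].max()
        x2, y2 = boxes[:, 2].min(), boxes[:, 3].min()
        return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

    def label_cluster(labels, scores, score_thresh=0.5):
        # Labeling step: take the label of the most confident pseudo-box if it
        # exceeds the threshold, otherwise fall back to the most frequent label.
        best = int(np.argmax(scores))
        if scores[best] >= score_thresh:
            return labels[best]
        values, counts = np.unique(labels, return_counts=True)
        return values[int(np.argmax(counts))]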

COCO Standardization

Each dataset came with its own text-based annotation metadata alongside the image data, each formatted differently. These metadata must be standardized into a unified COCO format to be compatible with our machine learning framework and to facilitate the combination of multiple training datasets. The VGG Image Annotator (VIA) [32,33] is a simple, open-source, standalone, web-based manual annotation tool for images. VIA provides an intuitive GUI (Figure 5) for annotating images and a COCO export tool.
For datasets supplied with annotation metadata (i.e., RSNA-3, VinBig, VinTest, and ChestX-ray8), we converted their annotation metadata into the VIA core data structure [34] (in CSV format), imported it into VIA, and visualized it with the respective CXRs to verify whether the bounding boxes were placed inside the lung field, which indicates that the metadata had been correctly converted. Finally, we exported the annotations into the COCO standard (in JSON format).
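To make the target format concrete, the following sketch builds a COCO-style JSON file from simple per-box CSV rows; the column names and category list are our own assumptions and merely stand in for the VIA-to-COCO export step described above.

    import csv, json

    def rows_to_coco(csv_path, out_path, categories=("GGO", "Consolidation", "Infiltration")):
        # Convert rows of (file_name, width, height, x, y, w, h, label) into COCO JSON.
        coco = {"images": [], "annotations": [],
                "categories": [{"id": i + 1, "name": c} for i, c in enumerate(categories)]}
        image_ids, ann_id = {}, 1
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                name = row["file_name"]
                if name not in image_ids:
                    image_ids[name] = len(image_ids) + 1
                    coco["images"].append({"id": image_ids[name], "file_name": name,
                                           "width": int(row["width"]), "height": int(row["height"])})
                x, y, w, h = (float(row[k]) for k in ("x", "y", "w", "h"))
                coco["annotations"].append({"id": ann_id, "image_id": image_ids[name],
                                            "category_id": categories.index(row["label"]) + 1,
                                            "bbox": [x, y, w, h],   # COCO uses [x, y, width, height]
                                            "area": w * h, "iscrowd": 0})
                ann_id += 1
        with open(out_path, "w") as f:
            json.dump(coco, f)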
For data without annotation metadata (i.e., CheXpert, RSNA3, SIIM, and BG COVID), three physicians (all medical doctors (MDs)) with different degrees of expertise manually annotated the lung lesion on those images and delivered the highest-quality GGO and consolidation annotations on a variety of CXRs from different datasets employed in our research:
  • Physician 1 (H): MD from the University of Medicine and Pharmacy at HCMC (UMP), Vietnam.
  • Physician 2 (T): MD from the University of Medicine and Pharmacy at HCMC (UMP), Vietnam.
  • Physician 3 (D): MSc, MD, 10 years of experience at Department of Respiratory, Cho Ray Hospital, Vietnam.
Details of the annotation process followed by the physicians, as well as descriptions of each dataset, are provided in Appendix A. We then imported the hand-labeled CXRs, with rectangles locating the lesions drawn by our physicians, into VIA. Next, we used the region tool to draw bounding boxes over the physicians' rectangles and assigned their lesion type (GGO or consolidation) following the physicians' instruction notes. Finally, when all bounding boxes were completely drawn, they were exported to the COCO standard.
All standardized datasets, their annotation style, and their respective numbers of CXRs and lung lesions are listed in Table 1.

3.3. Model Architecture

A single CNN model consists of multiple convolution layers, pooling layers, and fully connected layers. The lower layers capture simple features, and the higher (deeper) layers capture more complex ones. By passing through these stacked layers, an image is convolved with filters (kernels) and then max pooled into a feature map. This process transforms nonlinear information into higher and more abstract levels. At the end of the process, the model has learned recognizable features (i.e., the model weights) that enhance the parts of the information significant for discrimination (i.e., the feature map) and suppress unimportant attributes. Finally, the feature map is fed into fully connected neural network layers, which are responsible for making decisions from the feature map to categorize the image and determine the classification classes.
The more layers a CNN model has, the more complex it is and the more abstract the shapes it can learn, improving image recognition accuracy. However, structuring and layering the neural network layers required to meet our objectives is a nontrivial task because, without careful design, the CNN model may collapse during training and fail to learn even the simplest feature. Initially, CNN models could only perform the image classification task, that is, predicting whether an image contains some specific object or not. Some primitive CNN architectures (e.g., ResNet, VGG, AlexNet) have achieved outstanding accuracy in the ImageNet Large-Scale Visual Recognition Challenge [35], which requires the classification of 1000 object types, and have recently been employed in studies [9,10,36,37] to detect COVID-19 infection. After intense research efforts and the integration of sophisticated strategies and designs, current state-of-the-art CNN architectures (e.g., R-CNN, YOLO (You Only Look Once), and SSD (Single Shot Detection)) can detect objects and localize the position of the detected objects in the image.
A CNN architecture with high accuracy with respect to abstract shapes of COVID-19-associated lung lesions and the ability to cope with the big data volume is required. Hence, we conducted numerous experiments to examine three state-of-the-art CNN architectures: Faster R-CNN [38], SSD300 [39], and YOLO version 5 (YOLOv5) [40] to find the architecture most suitable to our purpose.
SSD300 fixes the resolution of the input image at 300 × 300 pixels, dropping important information from medical images and thus yielding the lowest accuracy. YOLOv5 requires that the dataset follow its dedicated format, which is costly in terms of preparation and storage, thus limiting the integration of data from different sources and reproducibility. Faster R-CNN emerged as a potential candidate owing to its reasonable performance; it follows the COCO format, allowing us to combine data from multiple sources for training. We therefore chose Faster R-CNN as our foundation CNN architecture.
The overall architecture of the Faster R-CNN is shown in Figure 6. The Faster R-CNN is constructed using three key components: the feature pyramid network (FPN), region proposal network (RPN), and region of interest (ROI) heads with their dedicated mission.
The FPN contains a backbone of convolution layers grouped according to the classic CNN architecture (e.g., ResNet and VGG), and lateral convolution layers. When the image passed through the FPN, feature maps that highlight the characteristics of the lung lesions at different scales (P2 (1/4 scale), P3 (1/8), P4 (1/16), P5 (1/32), and P6 (1/64)) are extracted. Figure 7 shows the original input CXR and its feature maps corresponding to different scales. The smaller the scale, the larger the receptive field corresponding to the respective feature maps. That is, one pixel in the P6 feature map (larger receptive field) corresponds to a broader area of the input image than the P2 feature map; thus, P6 is more effective in highlighting large lesions, whereas P2 is suitable for smaller lesions. With the benefits of this architecture, Faster R-CNN can detect lung lesions of all sizes.
The RPN will place anchors, which are abstracted boxes of five sizes (32, 64, 128, 256, and 512 pixels) with three ratios (0.5, 1.0, and 2.0) for the extracted feature maps. The convolution layers of the RPN calculate the likelihood of bounding the lung lesion (regardless of the type of lesion) of the anchor and its relative position based on the experience learned from previous training data. The RPN then proposes 1000 box proposals that contain all the high-probability lung lesion boxes (foreground boxes) and randomly picked boxes with low probability (background boxes) to the ROI heads. Figure 8a shows the anchors corresponding to the P2 and P3 feature maps, whereas Figure 8b visualizes placing anchors on the P5 and P6 feature maps.
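As a small numeric illustration (our own sketch; Detectron2 generates these anchors internally), the five sizes and three aspect ratios above yield 15 anchor templates per feature-map location:

    # Five sizes x three aspect ratios give 15 anchor templates, which the RPN
    # places at every feature-map location.
    sizes = (32, 64, 128, 256, 512)          # anchor areas are size x size pixels
    ratios = (0.5, 1.0, 2.0)                 # height / width ratios

    anchors = []
    for s in sizes:
        for r in ratios:
            w = s / r ** 0.5                 # keep the area constant: w * h == s * s
            h = s * r ** 0.5
            anchors.append((round(w, 1), round(h, 1)))
    print(anchors)                           # 15 (width, height) templates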
The ROI heads resample the 1000 proposals down to 512 boxes to balance foreground and background boxes and alleviate the bias of the overall network toward the unimportant features of background boxes. Subsequently, based on these resampled proposals, the ROI heads crop (pool) the content from the respective feature maps into 512 ROIs, each of which can be described as the feature map of the proposed box only. Next, these ROIs are fed into the fully connected layers of the ROI heads, and their probability of being GGO, consolidation, or infiltration, as well as their exact localization considering the type of lesion, is calculated.
In the inference mode (i.e., evaluation of the ML model), only ROIs with a probability higher than a predefined threshold will be filtered and regressed to the final detection on the CXR. In the training mode, for each image, the RPN and ROI heads will compare their proposals and prediction with the ground truth and then calculate the multitask loss, which is defined as
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \qquad (1)$$
where $i$ is the index of an anchor, $p_i$ is the predicted probability of anchor $i$ being a lesion, $p_i^*$ is 1 if the anchor is foreground and 0 if it is background; $t_i$ is a vector representing the four parameterized coordinates of the predicted bounding box, and $t_i^*$ is that of the ground truth box. The classification loss $L_{cls}$ is the log loss of the prediction over the truth (lesion vs. not lesion for the RPN, predicted lesion type vs. ground truth for the ROI heads), defined as follows [42]:
$$L_{cls}(u, v) = -\log u_v$$
The regression loss $L_{reg}$ is the loss between the coordinates of the predicted bounding box and the ground truth box and is defined as follows:
$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$$
where $R$ is the robust loss function (smooth $L_1$), defined as [42]:
$$R(t_i - t_i^*) = \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
Hence, the final loss value is the accumulation of $L_{cls}$ and $L_{reg}$ of both the RPN and the ROI heads:
$$L = L_{cls}(RPN) + L_{reg}(RPN) + L_{cls}(ROI) + L_{reg}(ROI)$$
The entire Faster R-CNN network can be trained end-to-end by backpropagation and the stochastic gradient descent (SGD) [43]. In each training iteration, the loss value is accumulated, and the SGD gradient is derived and then backpropagated to all neural network layers. Based on the SGD gradient, layers adjust themselves to the ground truth of the training data, thus enhancing their accuracy in the extraction of feature maps, localization of the lung lesions, and classification into the correct type over training iterations.
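For readers who prefer code, the multitask loss of Equation (1) can be sketched in PyTorch as follows; the tensor layout is a simplification of our own and not Detectron2's internal implementation.

    import torch
    import torch.nn.functional as F

    def multitask_loss(cls_logits, labels, box_deltas, box_targets, fg_mask, lam=1.0):
        # cls_logits : (N, C) class scores, labels : (N,) ground-truth classes
        # box_deltas : (N, 4) predicted box offsets, box_targets : (N, 4) targets
        # fg_mask    : (N,) boolean, True for foreground proposals (p_i^* = 1)
        l_cls = F.cross_entropy(cls_logits, labels)            # -log u_v, averaged over proposals
        if fg_mask.any():
            # smooth L1 over foreground proposals only, i.e., R(t_i - t_i^*)
            l_reg = F.smooth_l1_loss(box_deltas[fg_mask], box_targets[fg_mask])
        else:
            l_reg = box_deltas.sum() * 0.0                      # no foreground: no regression term
        return l_cls + lam * l_reg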

4. Preliminary Experiments

We conducted two types of experiments. First, we sequentially trained ML models on the dataset originally annotated with pneumonia abnormalities, then on the dataset whose lung lesion annotations were merged as pneumonia abnormalities, and finally on the combination of both. Next, we benchmarked these three pneumonia detector models to determine their accuracy. From the results of pneumonia detection, we can determine whether to annotate the lung lesions on the pneumonia image data for the training and evaluation of the COVID-19-associated lung lesion model.
Second, we benchmarked the lung lesion detector trained on the lung lesion annotated public dataset and then combined it with a small pneumonia dataset that was reannotated by a physician (D) working on lung lesions. We aimed to establish the initial result as the baseline for later comparison and measured the contribution of an additional reannotated small dataset to the performance of the lung lesion detector.

4.1. Pneumonia Detection

We trained a Faster R-CNN model on the RSNA-3 (Pneumonia annotated) dataset. All parameters follow the default settings of the Detectron2 framework. We reported its average precision (AP) at 50% and 75% intersection over union (IoU) and the average AP from 50% to 95% IoU (AP50, AP75, and AP, respectively) on multiple test datasets in Table 2. In addition, the total number of annotations was indicated for the individual test datasets. We retained the original pneumonia annotations from the RSNA dataset in the RSNA3 test dataset. The five remaining test datasets were lung lesion annotations that were annotated by our physicians but are merged as pneumonia abnormalities.
Similarly, we trained two other models on the VinBig dataset and a combination of both RSNA-3 and VinBig with all lung lesion annotations merged as pneumonia abnormalities. The results of these two models are shown in Table 3 and Table 4 respectively. We also plot the performance of both models (AP50) by evaluating RSNA3, RSNA3(D), and VinTest in Figure 9.
The results in Figure 9 support our notion of combining the pneumonia annotation data of RSNA and GGO, as well as consolidation (and infiltration) annotation VinBig data for training the ML model. We obtained significant accuracy improvements for pneumonia detection.
  • When evaluating RSNA3 with original pneumonia annotations, the model trained on RSNA-3 achieved an accuracy of 12.7 AP50, whereas the model trained on VinBig with lung lesion annotations achieved an accuracy of 4.99 AP50. The large gap in volume data between RSNA-3 and VinBig (5888 CXRs against 1588 CXRs) resulted in a difference in accuracies. However, when we combined RSNA-3 and VinBig, the accuracy increased to 23.1 AP50.
  • Conversely, when evaluating VinTest with original lung lesion annotations treated as pneumonia, the model trained on VinBig achieves higher accuracy than the model trained on RSNA-3 (19.25 against 15.42) even though the VinBig data is less than the RSNA-3 data. Furthermore, when the two training datasets were combined, the accuracy was improved to 30.44.
  • The model trained on the combination of datasets improved the accuracy by up to 1.97 times on VinTest (against the model trained on RSNA-3) and 4.6 times on RSNA3 (against the model trained on VinBig). We also recorded an improved accuracy of 1.53 times on the RSNA3(D) dataset (against the model trained on RSNA-3), which is the RSNA3 but with lung lesion annotations (by physician (D)) merged as pneumonia abnormalities.
Pneumonia- and lung lesion-annotated images are compatible with each other and helped increase the detection accuracy in all scenarios by enriching and increasing the training data. Unfortunately, this compatibility also reinforces the visual similarity between GGO and consolidation; therefore, discriminating between them is a nontrivial task.

4.2. Lung Lesion Detection

To establish a baseline for the COVID-19-associated lung lesion detection target, we kept all settings of the pneumonia preliminary experiments and trained two other models: the first (baseline) on the VinBig dataset and the second (teacher) on VinBig and RSNA3(D) combined. Subsequently, we evaluated the two models and report their accuracy on the VinTest dataset in Table 5 and Table 6. Unfortunately, the improvement from adding RSNA3(D) to the training when scoring on the VinTest dataset is marginal, only 0.11 AP50. However, the improvement when evaluating on CheXpert(D) is noticeable, from 0.04 to 2.13 AP50.
We conjecture that the compatibility between the annotations of the training dataset (RSNA3(D)) and the test dataset (CheXpert(D)), which were annotated by the same physician (D), contributes to this enhancement. These baseline results suggest that although the training data are relatively larger than the test data (1588 CXRs against 208 CXRs, i.e., seven times larger), they are still insufficient to achieve a good result. Besides the volume of data, the consistency and quality of annotations between the training and test data also play an important role in model performance. Therefore, in addition to studies on model architecture, research on methods for data preparation, such as collecting, processing, and refining datasets, should receive more attention.

5. Experiments & Results

We used the Python programming language [44], PyTorch [45] as the DL framework, and Detectron2 [46] for the Faster R-CNN architecture implementation. All preliminary experiments were conducted on an Nvidia RTX 2080 SUPER GPU (8 GB RAM), whereas later experiments were conducted on an Nvidia RTX 3090 GPU (24 GB RAM). Furthermore, we constructed the Faster R-CNN with ResNet-101 as the backbone network for all experiments.
In summary, there are a total of 2895 CXRs (including 1274 COVID-19 CXRs) for training and 443 CXRs (including 235 COVID-19 CXRs) for testing in the experiments. Because the CXR datasets were provided in many formats, from DICOM (which maintains the highest dynamic contrast quality of the medical image) to normal image formats such as JPEG and PNG (lower dynamic contrast quality), we decided to convert all DICOM CXRs into PNG format so that all CXR datasets share the same dynamic contrast level when combined for training in later experiments. In the training process, the Detectron2 framework resizes the shorter edge of the CXR to one of the following sizes (640, 672, 704, 736, 768, 800), whichever is closest, and the other edge is resized to preserve the original aspect ratio of the CXR. No augmentation technique was applied to the CXRs, to maintain the dynamic contrast information.
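A minimal sketch of such a DICOM-to-PNG conversion is shown below, assuming a monochrome CXR; the rescaling to 8 bits is where part of the dynamic contrast range is lost. File paths and the helper name are illustrative only.

    import numpy as np
    import pydicom
    from PIL import Image

    def dicom_to_png(dicom_path, png_path):
        ds = pydicom.dcmread(dicom_path)
        pixels = ds.pixel_array.astype(np.float32)
        pixels -= pixels.min()
        pixels /= max(pixels.max(), 1e-6)                  # normalize to [0, 1]
        if getattr(ds, "PhotometricInterpretation", "") == "MONOCHROME1":
            pixels = 1.0 - pixels                          # invert so that air appears dark
        Image.fromarray((pixels * 255).astype(np.uint8)).save(png_path)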

5.1. Training Procedure

DL algorithms are data-hungry. To manage a large volume of data, DL models are preferably trained with minibatch gradient descent to fit the model into GPU memory and accelerate the training process. A group of samples (i.e., the batch size) is fed into the model to measure the loss value and obtain the average gradient of the mini-batch, which updates the model weights until they converge to the optimal point. Epoch and iteration are two terms/parameters used to describe the training process. An epoch indicates the number of passes over the entire training dataset completed by the machine learning algorithm, whereas an iteration indicates the number of times the parameters of the algorithm are updated. That is, an epoch is a measure of how many times each image of the training data has a chance to update the model weights, whereas an iteration measures how many batches have been fed into the model. The Detectron2 framework employs an iteration-based training process.
By upgrading to a GPU with larger memory, we can store more images per batch (8 against 4 in the preliminary experiments), thus accelerating the training process. However, we must also revise the parameters to train the Faster R-CNN model effectively. We mimicked the training procedure of the default Faster R-CNN model of Detectron2 [46] for normal object detection, which was trained on 118,000 images of the COCO train2017 dataset [47] for 37 epochs. According to research from Facebook [48], if we employ the ImageNet [49] pre-trained ResNet weights to initialize the Faster R-CNN backbone, the Faster R-CNN converges earlier than when training from scratch. We also employed this strategy in our Faster R-CNN model. We found from the preliminary experiments that model weights trained for fewer than 150 epochs made incorrect predictions outside the lung field with low confidence scores. We conjecture that the ResNet pre-trained weights are heavily trained to extract features in the normal object domain; thus, fewer than 150 epochs do not sufficiently adjust the weights of the ResNet convolution layers to effectively extract features in the medical image domain, specifically the CXR domain. In this experiment, we trained the student model for ten times the default setting, which means 370 epochs, to adapt the ResNet-101 backbone from the normal domain to our task domain.
The number of required iterations is derived from the number of epochs via the following equation:
$$\text{iterations} = \text{epochs} \times \frac{\#\,\text{images}}{\text{batch size}}$$
To train the Student detector model, we combined the VinBig, RSNA3(D), and SIIM, which was newly annotated by our semi-self-supervised method for a total of 2895 CXRs as the training data; thus, the number of iterations for training is
$$140{,}000 \approx 133{,}893 = \left\lfloor \frac{370 \times 2895}{8} \right\rfloor$$
Because we save the model weights after every 5000 iterations, the training process should be completed after 135,000 iterations; we continued training for 5000 additional iterations, that is, 140,000 iterations in total, to determine whether the total loss value could improve further. More than three annotations were typically present in a CXR; hence, we employed a learning rate of 0.005, which is five times higher than the default (i.e., 0.001), to capture all informative features inside the CXR. Halfway through the training, by which time the model weights were gradually saturating, the total loss value bounced around, indicating that the model could not learn further. We applied the learning rate scheduling method to overcome this limitation, which gradually decreases the learning rate, thus forcing the model to fully harvest the remaining training iterations and push the total loss value toward a global optimum. We planned to decrease the learning rate to 0.0005 and 0.00005 at 75% and 90% of the training process, at the 105,000th and 125,000th iterations, respectively.
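Expressed as a Detectron2 configuration, these solver settings look roughly as follows; this is a sketch whose values mirror the text, while the backbone config file and pretrained weights follow standard model zoo entries, and dataset registration is omitted.

    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml")
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3          # GGO, consolidation, infiltration
    cfg.SOLVER.IMS_PER_BATCH = 8                 # batch size
    cfg.SOLVER.BASE_LR = 0.005                   # five times the default 0.001
    cfg.SOLVER.MAX_ITER = 140000                 # ~370 epochs over 2895 CXRs
    cfg.SOLVER.STEPS = (105000, 125000)          # decay the learning rate at 75% and 90%
    cfg.SOLVER.GAMMA = 0.1                       # 0.005 -> 0.0005 -> 0.00005
    cfg.SOLVER.CHECKPOINT_PERIOD = 5000          # save weights every 5000 iterations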
After determining the hyperparameters (i.e., number of iterations, learning rate, and scheduling), we trained the student model for three days and six hours. The total loss value over each training iteration is shown in Figure 10. After 70,000 iterations, the total loss value bounced around; after 105,000 iterations, where the learning rate was scheduled to reduce, the total loss value indicated a significant reduction. After 125,000 iterations, we planned a second reduction in the learning rate; although there was a slight decrease in the total loss value, it was trivial. In the rest of the training process, the total loss value did not seem to improve, indicating that the model was successfully trained.

5.2. Evaluation

Overall Performance

After 370 epochs, we observed that all predictions made by the student model were trustworthy: they were distributed only inside the lung field and rarely expanded over the clavicles, scapulae, or chest wall. Thus, we set the confidence score threshold to 1% when evaluating model predictions. We evaluated the baseline, teacher, and student models on both the VinTest and BG COVID datasets to obtain their lung lesion detection performance on non-COVID-19 and COVID-19 CXRs. Owing to the high number of training epochs, the models could have overfitted the training data and thus failed to generalize properly to the unseen test data. Moreover, comparing only the final model weights may yield inconsistent and unfair results because models trained on different data and training processes may overfit at different stages.
We therefore evaluated the last 10 model weights (one every 10,000 iterations), which cover half of the entire training process, for all three models (baseline, teacher, and student) to guarantee the fairness of the results and comparisons. In Table 7, we report the AP50 on the VinTest and BG COVID datasets for the 10 model weights (Roman numeral notation) of the baseline, teacher, and student models with the respective iterations at which each model weight was saved. For easy comparison, these results are shown in Figure 11 and Figure 12.
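A rough sketch of this evaluation loop with Detectron2's COCO evaluator is given below; the registered dataset name "vintest", the checkpoint file names, and the output directory are placeholders, and cfg continues the configuration sketch above.

    from detectron2.data import build_detection_test_loader
    from detectron2.evaluation import COCOEvaluator, inference_on_dataset
    from detectron2.engine import DefaultPredictor

    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.01             # 1% confidence threshold
    for ckpt in ["model_0099999.pth", "model_0139999.pth"]:  # example saved checkpoints
        cfg.MODEL.WEIGHTS = ckpt
        predictor = DefaultPredictor(cfg)
        evaluator = COCOEvaluator("vintest", output_dir="./eval")
        loader = build_detection_test_loader(cfg, "vintest")
        results = inference_on_dataset(predictor.model, loader, evaluator)
        print(ckpt, results["bbox"]["AP50"])                 # AP50 for this checkpoint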
In the first half (from I to V), the model weights are not completely mature, thus yielding many untrustworthy predictions. This made the results fluctuate because untrustworthy predictions are sometimes correct only by chance. The models degraded around VI and VII because they could not learn further owing to the high learning rate and overfitting of the training data. After the learning rate reduction between VII and VIII, the models reinforce their learning and shape their predictions more confidently, thus enhancing the overall result. Eventually, the model weights between IX and X are mature, reflecting their true ability to detect lung lesions.
On the VinTest dataset, the student model outperformed the teacher and baseline models by approximately 1.66- and 1.68-fold (10 against 6.02 and 5.95 in terms of AP50), respectively. On the BG COVID dataset, an enhancement was also recorded: approximately 1.18 and 1.86 times (1.9 against 1.6 and 1.02 in terms of AP50). We can confirm that the student model is boosted by the SIIM dataset, which was annotated by our semi-self-supervised method that leverages the primitive ability of the teacher model to supervise the novice in the annotation process. Although we cannot guarantee the correctness of the SIIM annotations, in practice they did not distort the decision-making of the lung lesion detector or degrade its performance; on the contrary, they contributed noticeably to improving the accuracy.
Finally, two concerns remain:
  • First, why is there no improvement between the teacher and the baseline on the VinTest dataset, where we supplied the additional RSNA3(D) to train the teacher model?
  • Second, why is the performance on COVID-19 CXRs (BG COVID dataset) low compared to that on non-COVID-19 CXRs (VinTest dataset)?
We clarify these points in the next section.

5.3. The Bias of Annotating Style

As mentioned above, the annotation of medical images requires multiple opinions to overcome human error. The consensus among experts makes the final decision; thus, the bounding boxes are stretched to cover most of the private annotations of the different experts [50,51]. Figure 13 visualizes the annotations of the datasets: the VinBig and VinTest datasets are annotated by multiple experts, SIIM is annotated by our semi-self-supervised method, and BG COVID is annotated by physician 3 (D).
A difference in how the three physicians annotated the lung lesions on CXR was observed. Figure 14 shows the annotations of the three physicians on the same CXR (from the CheXpert dataset). Depending on the experience level and private opinion, physicians employ different sizes and bounding boxes to annotate lung lesions. For instance, the total number of instances annotated on the CheXpert dataset by physicians H, T, and D were 55, 92, and 160, respectively.
The compatibility of annotation styles between the training and test data biased the evaluation results. In the preliminary experiments, the CheXpert(D) dataset always achieved the lowest AP50 accuracy because physician 3 (D) annotated lung lesions with numerous bounding boxes of very small sizes. In addition, the teacher model achieved an accuracy similar to that of the baseline model when evaluated on VinTest; however, it markedly outperformed the baseline on BG COVID because it was boosted by RSNA3(D). The detailed annotation of the BG COVID dataset, however, inadvertently caused more difficulty in evaluating the model's accuracy. Fitting the predictions of a model trained on a different annotation style (i.e., the consensus style, meaning fewer but larger bounding boxes) to a more detailed annotation style (i.e., many smaller bounding boxes) seems unfair, if not extremely difficult. We must therefore carefully reevaluate the student model to investigate the difference in accuracy between non-COVID-19 and COVID-19 CXRs.
The US NIH Clinical Center (NIHCC) developed a CXR classifier to classify common thorax diseases and then extract a heatmap (likelihood map of pathology) to localize the pathology [52]. To comprehensively evaluate the pathology localization, they measured the AP at multiple IoU thresholds. They also demonstrated their performance in detecting pneumonia on their ChestX-ray8 test data. Similarly, we reevaluated the student model (model weight X) at seven different IoU thresholds (from 0.1 to 0.7) on VinTest and BG COVID, as reported in Table 8. When we decrease the IoU threshold, the accuracy of the student model becomes almost the same at 0.1 IoU on COVID-19 and non-COVID-19 CXRs. This reevaluation indicates that the accuracy of the student model is essentially the same for non-COVID-19 and COVID-19 CXRs.
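For reference, the IoU criterion behind these thresholds is simply the ratio of overlap to union between a predicted box and a ground-truth box; a minimal sketch:

    def iou(box_a, box_b):
        # Intersection over union of two (x1, y1, x2, y2) boxes; a detection counts
        # as a hit when its IoU with a ground-truth lesion exceeds the chosen
        # threshold (0.1 ... 0.7 in the reevaluation above).
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter) if inter > 0 else 0.0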
Finally, we aim to compare our method with NIHCC in terms of detecting pneumonia. We have proven that GGO, consolidation, and infiltration are related to pneumonia; we propose that pneumonia can also be detected by detecting lesions. Hence, we evaluated our model (at 140,000 iterations) on the same NIHCC ChestX-ray8 dataset. Regardless of the type of predictions made by our student model, we merged them as pneumonia type to calculate the AP against their hand-annotated ground truth. As a result, we reported both APs at the same IoU thresholds for pneumonia detection in Table 9. Our model competed with that of NIHCC and even outperformed it at 0.4 IoU (8.5 against 7.5 in terms of AP). This is an encouraging result because we trained our model on comparatively very limited data (2895 CXRs against 108,948 CXRs).

6. Discussion

Massive COVID-19 outbreaks have stressed hospital systems globally, especially in developing countries, resulting in a public health crisis with a shortage of medical resources. RT-PCR testing is often not timely enough to quickly triage suspected COVID-19 patients, which places medical workers at risk of viral transmission during examination and transportation of the patient and complicates environmental disinfection. The World Health Organization has suggested the use of chest imaging for diagnostic purposes in symptomatic patients with suspected COVID-19 when RT-PCR testing is delayed [53]. Recent international consensus statements suggest that radiologic examinations can be used as a triage tool in resource-constrained environments [2,54]. Our COVID-19-associated lung lesion detector can be deployed as a CAD system with integrated notification that informs medical workers of abnormal results immediately after CXR acquisition, thus facilitating the prioritization of patients under higher suspicion of COVID-19 for isolation to prevent further transmission. The extent of radiological findings of lung lesions can mirror the clinical severity of COVID-19 [55,56,57]. Automated detection and supply of lung lesion properties such as volume and distribution by our COVID-19-associated lung lesion detector can aid medical workers and non-radiologist physicians in hospitalization and intensive care decision-making. Further, it can be utilized to monitor the severity of the disease in resource-constrained environments where expert radiologists are not available.
A recent study in Korea, in which a clinically available CAD was employed to aid non-radiologist physicians in interpreting the CXRs of COVID-19 patients for triage, showed that non-radiologist physicians performed significantly better when assisted by the CAD [58]. The CAD is based on a commercialized DL algorithm [59] and is approved by the Ministry of Food and Drug Safety of Korea. Demonstrably, its performance is similar to that of thoracic radiologists and higher than that of non-radiologist physicians. The commercial CAD employs the same design as the above NIHCC model (i.e., a classification model combined with a heat map) and was trained on 54,221 normal CXRs and 35,613 abnormal CXRs covering four major thoracic diseases, namely pulmonary malignancy, pulmonary tuberculosis, pneumothorax, and pneumonia (6903 pneumonia CXRs), which is approximately 80% of the training data of the NIHCC model. The downside of this design in both the commercial CAD and the NIHCC model is that they cannot indicate whether each detected pneumonia abnormality is GGO or consolidation (or infiltration); further, their localization is extracted from a heat map without considering the type of lesion. The authors of that study noted that the CAD was not trained specifically for COVID-19 and suggested that additional training with COVID-19 CXRs may improve its performance [22].
In our study, we went a step further than the two previous studies. By employing an object detection model, our lung lesion detector can categorize the detected lesion into three types, which is more helpful to medical practitioners in terms of clinical evidence and theoretically more accurate in localizing the lesion because the lesion type is also considered while regressing its location. We also trained our lung lesion detector on COVID-19 CXRs. To obtain lung lesion annotations on COVID-19 CXRs using only a novice annotator, we proposed the semi-self-supervised method. Our annotation method is a general solution that can be applied not only to annotating COVID-19 CXRs but also to any type of medical image. The method frees researchers from dependence on specialists for annotation tasks and can enable future studies in which researchers annotate image data independently.
The evaluation results of an ML model are influenced by the difficulty of the test datasets and can be exaggerated on easy test datasets. The AP of state-of-the-art Faster R-CNN implementations on generic object detection benchmarks is approximately 40. Although our lung lesion detector achieved a modest accuracy of approximately 10 AP50, when we examined the detection visualizations we discovered several scenarios, arising from strict evaluation criteria, that limited the measured accuracy. First, when a detected lung lesion is larger than the ground truth, it is counted as an incorrect prediction (the ground truth lesion is GGO and the detection is GGO with 100% confidence, Figure 15a). Second, even when a detection fits the ground truth perfectly, it is counted as incorrect if its type is wrong (the ground truth lesion is consolidation, whereas the two predictions are GGO and infiltration, Figure 15b). In both cases, the detection does not contribute to the final evaluation accuracy; however, in our opinion, these predictions are still useful because they act as a warning to the medical practitioner. Finally, even when the lung lesion detector could not make any correct prediction, the incorrect predictions were frequently distributed around the ground truth lung lesions, thus serving the same purpose (Figure 15c).
These strict criteria arise because we evaluated our lung lesion detector in an object detection fashion with the AP metric (see Appendix A for a detailed description). The commercial CAD was evaluated in a classification fashion by examining its predictions on radiologist-read test data and measuring the accuracy with the area under the receiver operating characteristic curve; thus, it did not face the same strict criteria as ours. It is infeasible to compare our work directly with the commercial CAD owing to the different test data and metrics. However, based on the observed comparison in pneumonia detection with the NIHCC model (Table 9), which shares the same design architecture as the commercial CAD and was trained on a larger data volume, we conjecture that our lung lesion detector also competes with the commercial CAD, although only in terms of COVID-19-associated lung lesions rather than the detection of other lung anomalies. Hence, we consider that the accuracy of our lung lesion detector is acceptable for implementation in clinical practice and may help less-experienced users identify subtle evidence of COVID-19 on CXRs.
Finally, we demonstrated that our lung lesion detector is promising because its accuracy can be continually enhanced by repeating the same procedure: annotating more publicly available CXR datasets with our semi-self-supervised method and iteratively retraining the Faster R-CNN-based model in loops until it meets our expectations. To facilitate this, we plan to turn the Algorithm 1 pseudo-code into actual code in a future study, i.e., to convert our current method into a fully automatic self-supervised method that annotates CXRs without the need for a novice annotator. As a more advanced research direction, we may extract the ROI features produced by the ROI head of Faster R-CNN and feed them into a traditional ML algorithm (e.g., support vector machine, Bayesian classifier, or decision tree) as refined input data to improve lesion categorization accuracy.
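To make the envisioned fully automatic loop concrete, the following sketch outlines one possible structure. The helpers train_detector and predict_boxes and the confidence-based acceptance rule are simplified placeholders for illustration; they are not our Algorithm 1, which is defined in the main text.

```python
# A high-level sketch of the envisioned fully automatic loop. train_detector(),
# predict_boxes(), and the confidence-based acceptance rule are hypothetical
# placeholders for illustration; they are not part of our released code.
def self_supervised_annotation_loop(labeled_set, unlabeled_pool, rounds=3, score_thr=0.7):
    model = train_detector(labeled_set)                  # teacher trained on existing annotations
    for _ in range(rounds):
        pseudo_labeled = []
        for cxr in unlabeled_pool:
            boxes = predict_boxes(model, cxr)            # pseudo boxes proposed on unannotated CXRs
            confident = [b for b in boxes if b["score"] >= score_thr]
            if confident:
                pseudo_labeled.append((cxr, confident))  # accept only confident pseudo labels
        labeled_set = labeled_set + pseudo_labeled       # grow the training set
        model = train_detector(labeled_set)              # retrain the student on the enlarged set
    return model, labeled_set
```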
There are some limitations to our study. Our COVID-19 CXRs (the BG COVID test data) were collected at a single hospital, which might not reflect COVID-19 patients in other regions and countries. In addition, the annotations for RSNA3 (training data) and BG COVID were made by a single physician who was not a thoracic radiologist; without the consensus of multiple specialists, our model may be biased toward the subjective opinion of that physician and may not generalize to other CXRs when deployed in practice. These limitations were unavoidable under pandemic working conditions but should be addressed in future research.

7. Conclusions

The construction of a DL classification model for diagnosing COVID-19 is infeasible owing to the intrinsic problem of chest radiography, which cannot provide diagnostic features; thus, classification models are deceived into learning disease-irrelevant features to build their predictions. In this study, we presented a COVID-19-associated lung lesion detector based on an object detection model, which is more effective than the classic classification model with respect to learning capability because it exploits the annotations of the image data. We addressed the biggest barrier to employing object detection models for medical imaging, namely the dependence on specialists' experience in annotating large volumes of medical image data, by proposing our semi-self-supervised method, which leverages annotations of currently available public pneumonia datasets to supervise a novice annotating lung lesions on COVID-19 CXRs.
We conducted multiple experiments, evaluations, and analyses on our prepared datasets containing 2985 CXRs (including 1274 COVID-19 CXRs) for training, 443 CXRs (including 235 COVID-19 CXRs) for testing, and more than 6000 pneumonia CXRs from our collected data and five public sources. The results demonstrated that the COVID-19 data annotated by our method significantly boosted the accuracy of our lung lesion detector, which is on par with the commercialized solution. Finally, some data were annotated by our physicians; all data were standardized into the COCO format and are publicly available to the research community to accelerate the development of highly accurate and practical DL solutions for detecting COVID-19-associated lung lesions on CXR images and monitoring disease progression, thus promoting the effective treatment of COVID-19 patients.

Author Contributions

V.P. conducted this research, including formulating the idea, designing the experiments, evaluating performance, and preparing the final manuscript, under the guidance of E.S. D.D. provided support for the medical data, data collection, and revision of the medical problem statement. T.-M.C. supervised and approved the results. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-00990, Platform Development and Proof of High Trust & Low Latency Processing for Heterogeneous·Atypical·Large Scaled Data in 5G-IoT Environment).

Institutional Review Board Statement

All patient information on the CXRs was anonymized; hence, no personally identifiable information was included. We followed the anonymization process guided by the Sungkyunkwan University Institutional Review Board (IRB). The study was conducted in accordance with the Declaration of Helsinki and the relevant guidelines and regulations.

Informed Consent Statement

Informed consent was obtained from all participants at Bac Giang Lung Hospital.

Data Availability Statement

The converted CXR data and their annotation in COCO format are accessible from the Open Science Framework: https://osf.io/aqh6s/?view_only=52a8efd06ea143a6ac8abc37a6b13597 (accessed on 29 August 2020).

Acknowledgments

We would like to thank Thuy Huynh and Huyen Ngo for their support in annotating CXRs and medical discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

To comprehensively understand lung lesions, specifically GGO and consolidation, as well as construct the desired machine learning (ML) model for our purpose, we conducted various experiments on multiple datasets with various settings.
Data are the key to any DL/AI-related research [60]. The lack of suitable datasets is one of the biggest barriers to the success of DL in medical imaging. Only radiologists can prepare trustworthy, high-quality annotations of GGO and consolidation lesions for CXR datasets because the annotation requires extensive input from medical experts, and multiple expert opinions are required to overcome the problem of human error. Three physicians (all medical doctors, MD) with different degrees of expertise collaborated in this work to deliver quality GGO and consolidation annotations on a variety of CXRs from the different datasets employed in our research.

Appendix A.1. CheXpert

The CheXpert dataset [28] provides 224,316 high-quality CXRs collected from Stanford Hospital between October 2002 and July 2017. The dataset is supplied as training and validation sets. The validation set contains 234 CXRs annotated with 14 common chest radiographic observations (e.g., GGO, edema, and consolidation) at the image level (i.e., without specific locations) by three board-certified radiologists. We filtered 31 CXRs labeled as both GGO and consolidation from this validation set and asked physicians 1 and 2 to annotate these two lesions at the local level (i.e., specifically indicating lung lesion positions and sizes on the CXRs). GGO and consolidation are hazy, opaque regions inside the lung field on CXR and do not have a fixed shape, which makes them extremely difficult to segment. Moreover, a recent study [61] demonstrated that bounding box annotation is approximately 15 times faster than pixelwise manual segmentation. Therefore, following the workflow of experts from the US National Institutes of Health and the Radiological Society of North America [50], we asked the physicians to place a rectangle around GGO and consolidation-like opacities on the CXRs. The physicians used various personal computers and tools to view and annotate the images supplied to them.
When acquiring the annotation results, we noticed an inconsistency between the two physicians in the size of the bounding boxes drawn for the same lesions. Therefore, for verification, we asked physician 3 to annotate the same set of CXRs by drawing bounding boxes that were as small as possible while still encompassing the entire suspicious opacity. For two or more discontinuous opacities, multiple bounding boxes were drawn.
All physicians were blinded to the other physicians’ annotations. Our team carefully examined the results from two of the three physicians and employed them as testing data in our experiments.

Appendix A.2. RSNA Pneumonia

In 2018, the Radiological Society of North America (RSNA) organized an AI challenge to detect pneumonia. They provided a pneumonia CXR dataset containing 30,000 CXRs, of which 6012 CXRs have pneumonia-like opacities annotated with bounding boxes by 18 board-certified radiologists from 16 academic institutions.
At the initiation of this research, CXR datasets with GGO and consolidation annotations were seemingly unavailable. Because the term “pneumonia” refers to disease processes that can manifest as consolidation and GGO [50], we hypothesized that the pneumonia-like opacity annotations from the RSNA pneumonia dataset could be recategorized into the GGO and consolidation lesions used to develop the ML model. From the 6012 pneumonia-annotated CXRs, we filtered 123 CXRs that contained three or more pneumonia annotation bounding boxes and asked our physicians to annotate GGO and consolidation-like opacities, assuming that these CXRs contain more suspicious opacities worth reannotating. This saved time for the physicians because they only had to annotate a few images while still providing us with a sufficient number of lung lesion instances.
Following the same workflow as for the CheXpert dataset, we asked physician 3 to annotate the 123 CXRs from the RSNA Pneumonia dataset and acquired the GGO and consolidation annotated dataset, which we termed RSNA3. The physician was blinded to the original RSNA pneumonia annotations. The remaining annotated RSNA CXRs (5889 CXRs) were named RSNA-3 and employed in the preliminary experiments of our research.
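The filtering step above can be reproduced from the challenge label file. The sketch below assumes the column layout of the Kaggle release (stage_2_train_labels.csv with one row per annotated box); the file name and columns may differ in other mirrors of the dataset.

```python
import pandas as pd

# Assumes the Kaggle RSNA pneumonia label file: one row per annotated box with
# columns patientId, x, y, width, height, Target (Target = 1 means a box is present).
labels = pd.read_csv("stage_2_train_labels.csv")
boxes = labels[labels["Target"] == 1]

box_counts = boxes.groupby("patientId").size()
selected = box_counts[box_counts >= 3].index.tolist()     # CXRs with three or more boxes
print(f"{len(selected)} CXRs selected for reannotation")   # 123 CXRs in our filtering
```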

Appendix A.3. VinDr-CXR

The VinDr-CXR dataset [51] provides a large CXR dataset with high-quality abnormal lung findings and diagnostic annotations for the research community. Over 100,000 raw images were collected from two of the largest hospitals in Vietnam: 108 Hospital and Hanoi Medical University Hospital. The data were curated, compiled, and annotated into a training set of 15,000 scans and a test set of 3000 scans. The training and test sets were independently annotated by three and five radiologists, respectively, for the presence of 22 critical findings (local level, e.g., consolidation, infiltration, and lung opacity (GGO)) and six diagnoses (image level, e.g., pneumonia, lung tumor, and other diseases). All local-level annotations were made with bounding boxes, the same format we employ.
The VinDr-CXR dataset contains GGO, consolidation, and infiltration annotations. We processed it to extract 1588 CXRs for training and 208 CXRs for testing, named VinTrain (referred to as VinBig in the tables) and VinTest, respectively.

Appendix A.4. SIIM COVID-19

COVID-19 is a novel disease; thus, the study of lung injuries caused by COVID-19 is still ongoing, and these injuries are not yet clearly understood. Alongside known lesions (i.e., GGO and consolidation), unknown pathologies, i.e., latent patterns or characteristics, may also be involved. Therefore, to enhance the performance of the ML model and tailor it specifically for COVID-19, it is necessary to train it on a COVID-19 CXR dataset.
Similar to the RSNA Pneumonia challenge, the Society for Imaging Informatics in Medicine (SIIM) partnered with the Foundation for the Promotion of Health and Biomedical Research of Valencia Region, the Medical Imaging Databank of the Valencia Region, and the Radiological Society of North America (RSNA) to organize a competition for identifying and localizing COVID-19 abnormalities on chest radiographs. The SIIM COVID-19 dataset includes CXRs of COVID-19-positive patients from 11 hospitals in the Valencian Region (Spain) [62] and is annotated following the NIH mechanism [63] and the RSNA schema [64]. In addition, all CXRs are categorized into three groups [65], as follows:
  • Typical Appearance: Containing multifocal bilateral, peripheral opacities, and/or opacities with rounded morphology. Lower lung-predominant distribution (must be present with either or both of the first two opacity patterns).
  • Indeterminate Appearance: Absence of typical findings and unilateral, central, or upper lung predominant distribution of airspace disease.
  • Atypical Appearance: Containing pneumothorax or pleural effusion, pulmonary edema, lobar consolidation, solitary lung nodule or mass, diffuse tiny nodules, and cavity.
A panel of experienced radiologists then annotated the abnormal opacities on the CXRs in a bounding box fashion. However, these one-class annotations (i.e., Opacity) are unsuitable for training our model because they do not differentiate the COVID-19 abnormal opacities into GGO, consolidation, or infiltration. Specifically, the CXRs from the SIIM COVID-19 dataset had to be reannotated into three classes (i.e., GGO, consolidation, and infiltration) to fulfill our training requirements. As described in the semi-self-supervised section, the first author (Vinh Pham) applied our method and annotated the SIIM CXRs to construct our lung lesion annotated COVID-19 CXR dataset, as sketched below.
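The exact annotation rule is defined in the semi-self-supervised section of the main text; the sketch below only illustrates the general idea visible in Figure 4, assuming a simplified rule in which a novice's heuristic boxes are kept when they overlap the teacher model's pseudo boxes and inherit the lesion class from the matching pseudo box. The function names and the IoU threshold are illustrative assumptions, not our released implementation.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def merge_annotations(pseudo_boxes, heuristic_boxes, iou_thr=0.25):
    """Keep a novice (heuristic) box only when it overlaps a teacher (pseudo) box;
    the lesion class is inherited from the best-matching pseudo box."""
    final = []
    for h in heuristic_boxes:
        matches = [(box_iou(h["bbox"], p["bbox"]), p) for p in pseudo_boxes]
        best_iou, best_p = max(matches, key=lambda m: m[0], default=(0.0, None))
        if best_p is not None and best_iou >= iou_thr:
            final.append({"bbox": h["bbox"], "category": best_p["category"]})
    return final
```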

Appendix A.5. ChestX-ray8

In 2017, the US research hospital NIHCC released the hospital-scale ChestX-ray8 dataset [52], which was among the largest public chest X-ray datasets at the time we conducted our research.
ChestX-ray8 contains 108,948 CXRs (collected from 1992 to 2015) of 32,717 unique patients. Eight common disease labels (including pneumonia), mined from radiological text reports via natural language processing, were provided at the image level (i.e., without local-level pathology localization) for a large number of CXRs. Nevertheless, a smaller subset of 983 CXRs was provided with hand-labeled bounding boxes drawn by a board-certified radiologist, covering 1600 instances of eight pathologies (200 instances each). Although this hand-labeled annotation is not for GGO and consolidation, we assumed that pneumonia pathology causes GGO/consolidation (or infiltration); thus, any GGO/consolidation (or infiltration) prediction within a pneumonia annotation should be considered a correct prediction. Hence, we filtered 120 CXRs indicated as pneumonia to evaluate our model and compare it with the NIHCC model.

Appendix A.6. Bac Giang Lung Hospital COVID-19—BG COVID

Ideally, a paraclinical trial would be conducted by deploying and evaluating our solution in real-world clinical environments (e.g., hospitals and clinical centers) with the examination and consultation of medical practitioners and experts. However, given a lead time of approximately five years for a clinical trial, it was infeasible, if not impossible, for us to conduct a paraclinical trial.
We therefore attempted to simulate testing equivalent to a paraclinical trial. With the support of our physician, we collected 235 CXRs of COVID-19-positive patients at Bac Giang (a province in North Vietnam) Lung Hospital after the third wave of the COVID-19 outbreak began in Vietnam (end of May 2021). Following the same workflow as for the CheXpert dataset, physician 3 (D) annotated the GGO and consolidation lesions that served as the ground truth on all 235 collected CXRs to evaluate the performance of our solution on COVID-19 CXRs.

Appendix A.7. Evaluation Metric

In classification problems, in which a steady result is obtained from the probability of the prediction, researchers tend to use precision, recall, sensitivity, and specificity to evaluate a model trained on a medical CXR image dataset. Taking COVID-19 as the positive class, a COVID-19 CXR correctly predicted as COVID-19 is a true positive (TP), a non-COVID-19 CXR incorrectly predicted as COVID-19 is a false positive (FP), a non-COVID-19 CXR correctly predicted as non-COVID-19 is a true negative (TN), and a COVID-19 CXR incorrectly predicted as non-COVID-19 is a false negative (FN). The metrics are then defined as follows:
$$\mathrm{precision} = \frac{TP}{TP+FP}, \qquad \mathrm{recall} = \frac{TP}{TP+FN}$$
$$\mathrm{sensitivity} = \frac{TP}{TP+FN} = \frac{\#\ \text{images correctly predicted as COVID-19}}{\#\ \text{total COVID-19 images}} = \mathrm{recall}$$
$$\mathrm{specificity} = \frac{TN}{TN+FP} = \frac{\#\ \text{images correctly predicted as non-COVID-19}}{\#\ \text{total non-COVID-19 images}}$$
Precision measures “when the model guesses, how often does it guess correctly?”, whereas recall measures “has the model guessed every time that it should have guessed?”. Note that there is a trade-off between precision and recall. In COVID-19 screening, for instance, avoiding FNs is more important than avoiding FPs because it cuts off the spread of the disease and protects the community. Therefore, we lower the confidence threshold to encourage the trained model to detect all potential cases; however, this lowers the detection precision, producing more false alarms (i.e., more FP predictions).
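As a minimal illustration of these definitions (not part of our evaluation pipeline), the following function computes the quantities from per-image labels, with 1 denoting COVID-19 and 0 denoting non-COVID-19:

```python
def screening_metrics(y_true, y_pred):
    """y_true / y_pred: 1 for COVID-19, 0 for non-COVID-19, one entry per CXR."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision   = tp / (tp + fp) if tp + fp else 0.0
    recall      = tp / (tp + fn) if tp + fn else 0.0   # equals sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, specificity

# Lowering the confidence threshold usually raises recall (fewer missed COVID-19 CXRs)
# at the cost of precision (more false alarms).
print(screening_metrics([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))  # (0.667, 0.667, 0.5)
```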
Plotting the precision and recall of the trained model as a function of its confidence threshold yields the precision–recall curve (Figure A1a). To evaluate the overall performance, it is therefore necessary to have a metric that captures both precision and recall in a single value.
Figure A1. Illustration of (a) the precision–recall curve and (b) IoU.
The F1-score, the area under the curve (AUC), and the average precision (AP) are the most widely used metrics for this purpose (Figure A2). The F1-score is the harmonic mean of precision and recall at a specific confidence level and is used to find the optimal confidence threshold where precision and recall produce the highest F1 value. The AUC measures the area that falls underneath the precision–recall curve. Finally, the AP is calculated as the weighted mean of the precision achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
Figure A2. Illustration of the F1, AUC, and AP metrics [66].
For the detection task, the trained model must both classify the correct category of the target and localize the target on the image. In practice, it is difficult for the model to predict the exact position of the target. As illustrated in Figure A1b, a predicted localization (red box) that overlaps the ground truth (green box) is considered a correct prediction when the ratio of the overlapped region to the union of the prediction and the ground truth is larger than the intersection over union (IoU) threshold. Therefore, we can evaluate the performance over multiple IoU thresholds (e.g., 50%, 75%, or 95%) depending on our criteria and the object to be detected. In Figure A3, the precision–recall curves are plotted at five levels of IoU; red is drawn with the strictest IoU requirement, whereas orange is drawn with the most lenient. With this design, the AP naturally fits our needs by capturing the performance at multiple IoU levels in a single metric, because we can calculate the mean average precision (mAP) over all APs. Finally, the average of all class-specific APs gives the final mAP metric used for evaluation; a minimal sketch of the IoU and AP computation is given after Figure A3.
Figure A3. AP performance for each lesion type [66].
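The sketch below, written for a single image and a single class, shows how the IoU test and the weighted-mean-of-precision definition of AP fit together; it is an illustrative simplification of the COCO-style evaluation actually used in our experiments.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union for boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(detections, ground_truths, iou_thr=0.5):
    """detections: list of (box, score); ground_truths: list of boxes (one image, one class)."""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)   # rank by confidence
    matched, tps, fps = set(), [], []
    for box, _ in detections:
        candidates = [(iou(box, gt), j) for j, gt in enumerate(ground_truths) if j not in matched]
        best_iou, best_j = max(candidates, default=(0.0, None))
        if best_j is not None and best_iou >= iou_thr:
            matched.add(best_j); tps.append(1); fps.append(0)   # true positive
        else:
            tps.append(0); fps.append(1)                        # false positive (or wrong class/IoU)
    tp_cum, fp_cum = np.cumsum(tps), np.cumsum(fps)
    recall = tp_cum / max(len(ground_truths), 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1)
    # AP = sum of precision values weighted by the increase in recall at each rank
    return float(np.sum((recall - np.concatenate(([0.0], recall[:-1]))) * precision))
```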

References

  1. Cleverley, J.; Piper, J.; Jones, M.M. The role of chest radiography in confirming COVID-19 pneumonia. BMJ 2020, 370, m2426. [Google Scholar] [CrossRef] [PubMed]
  2. Rubin, G.D.; Ryerson, C.J.; Haramati, L.B.; Sverzellati, N.; Kanne, J.P.; Raoof, S.; Schluger, N.W.; Volpi, A.; Yim, J.J.; Martin, I.B.; et al. The role of chest imaging in patient management during the COVID-19 pandemic: A multinational consensus statement from the Fleischner Society. Radiology 2020, 296, 172–180. [Google Scholar] [CrossRef] [PubMed]
  3. UPDATED BSTI COVID-19 Guidance for the Reporting Radiologist; The British Society of Thoracic Imaging: London, UK, 2020.
  4. Simpson, S.; Kay, F.U.; Abbara, S.; Bhalla, S.; Chung, J.H.; Chung, M.; Henry, T.S.; Kanne, J.P.; Kligerman, S.; Ko, J.P.; et al. Radiological Society of North America Expert Consensus Document on Reporting Chest CT Findings Related to COVID-19: Endorsed by the Society of Thoracic Radiology, the American College of Radiology, and RSNA. Radiol. Cardiothorac. Imaging 2020, 2, e200152. [Google Scholar] [CrossRef] [PubMed]
  5. Maguolo, G.; Nanni, L. A critic evaluation of methods for COVID-19 automatic detection from X-ray images. Inf. Fusion 2021, 76, 1–7. [Google Scholar] [CrossRef] [PubMed]
  6. Cohen, J.P.; Hashir, M.; Brooks, R.; Bertrand, H. On the limits of cross-domain generalization in automated X-ray prediction. In Proceedings of the Medical Imaging with Deep Learning, PMLR, Montreal, QC, Canada, 6–8 July 2020; pp. 136–155. [Google Scholar]
  7. Majeed, T.; Rashid, R.; Ali, D.; Asaad, A. Problems of Deploying CNN Transfer Learning to Detect COVID-19 from Chest X-rays. medRxiv 2020. [Google Scholar] [CrossRef]
  8. Shi, F.; Wang, J.; Shi, J.; Wu, Z.; Wang, Q.; Tang, Z.; He, K.; Shi, Y.; Shen, D. Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19. IEEE Rev. Biomed. Eng. 2020, 14, 4–15. [Google Scholar] [CrossRef]
  9. Minaee, S.; Kafieh, R.; Sonka, M.; Yazdani, S.; Soufi, G.J. Deep-covid: Predicting COVID-19 from chest X-ray images using deep transfer learning. Med. Image Anal. 2020, 65, 101794. [Google Scholar] [CrossRef]
  10. Wang, L.; Lin, Z.Q.; Wong, A. COVID-net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci. Rep. 2020, 10, 19549. [Google Scholar] [CrossRef]
  11. Horry, M.J.; Chakraborty, S.; Paul, M.; Ulhaq, A.; Pradhan, B.; Saha, M.; Shukla, N. COVID-19 detection through transfer learning using multimodal imaging data. IEEE Access 2020, 8, 149808–149824. [Google Scholar] [CrossRef]
  12. Pennisi, M.; Kavasidis, I.; Spampinato, C.; Schinina, V.; Palazzo, S.; Salanitri, F.P.; Bellitto, G.; Rundo, F.; Aldinucci, M.; Cristofaro, M.; et al. An Explainable AI System for Automated COVID-19 Assessment and Lesion Categorization from CT-scans. Artif. Intell. Med. 2021, 118, 102114. [Google Scholar] [CrossRef]
  13. Cohen, J.P.; Dao, L.; Roth, K.; Morrison, P.; Bengio, Y.; Abbasi, A.F.; Shen, B.; Mahsa, H.K.; Ghassemi, M.; Li, H.; et al. Predicting COVID-19 pneumonia severity on chest x-ray with deep learning. Cureus 2020, 12, e9448. [Google Scholar] [CrossRef] [PubMed]
  14. Cohen, J.P.; Morrison, P.; Dao, L.; Roth, K.; Duong, T.Q.; Ghassemi, M. COVID-19 Image Data Collection: Prospective Predictions Are the Future. arXiv 2020, arXiv:2006.11988. [Google Scholar]
  15. Warren, M.A.; Zhao, Z.; Koyama, T.; Bastarache, J.A.; Shaver, C.M.; Semler, M.W.; Rice, T.W.; Matthay, M.A.; Calfee, C.S.; Ware, L.B. Severity scoring of lung oedema on the chest radiograph is associated with clinical outcomes in ARDS. Thorax 2018, 73, 840–846. [Google Scholar] [CrossRef] [PubMed]
  16. Li, M.D.; Arun, N.T.; Gidwani, M.; Chang, K.; Deng, F.; Little, B.P.; Mendoza, D.P.; Lang, M.; Lee, S.I.; O’Shea, A.; et al. Automated assessment and tracking of COVID-19 pulmonary disease severity on chest radiographs using convolutional siamese neural networks. Radiol. Artif. Intell. 2020, 2, e200079. [Google Scholar] [CrossRef] [PubMed]
  17. Signoroni, A.; Savardi, M.; Benini, S.; Adami, N.; Leonardi, R.; Gibellini, P.; Vaccher, F.; Ravanelli, M.; Borghesi, A.; Maroldi, R.; et al. End-to-end learning for semiquantitative rating of COVID-19 severity on chest X-rays. arXiv 2020, arXiv:2006.04603. [Google Scholar]
  18. IntelliSpace Portal 11: Philips Healthcare. Available online: https://www.philips.co.uk/healthcare/product/HC881103/intellispace-portal-11 (accessed on 1 August 2021).
  19. Hansell, D.M.; Bankier, A.A.; MacMahon, H.; McLoud, T.C.; Muller, N.L.; Remy, J. Fleischner Society: Glossary of terms for thoracic imaging. Radiology 2008, 246, 697–722. [Google Scholar] [CrossRef] [PubMed]
  20. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision—ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  21. Esayag, Y.; Nikitin, I.; Bar-Ziv, J.; Cytter, R.; Hadas-Halpern, I.; Zalut, T.; Yinnon, A.M. Diagnostic value of chest radiographs in bedridden patients suspected of having pneumonia. Am. J. Med. 2010, 123, 88.e1–88.e5. [Google Scholar] [CrossRef]
  22. Hwang, E.J.; Kim, H.; Yoon, S.H.; Goo, J.M.; Park, C.M. Implementation of a deep learning-based computer-aided detection system for the interpretation of chest radiographs in patients suspected for COVID-19. Korean J. Radiol. 2020, 21, 1150. [Google Scholar] [CrossRef]
  23. Nishio, M.; Noguchi, S.; Matsuo, H.; Murakami, T. Automatic classification between COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy on chest X-ray image: Combination of data augmentation methods. Sci. Rep. 2020, 10, 17532. [Google Scholar] [CrossRef]
  24. Donald, J.J.; Barnard, S.A. Common patterns in 558 diagnostic radiology errors. J. Med. Imaging Radiat. Oncol. 2012, 56, 173–178. [Google Scholar] [CrossRef]
  25. Nodine, C.F.; Kundel, H.L.; Mello-Thoms, C.; Weinstein, S.P.; Orel, S.G.; Sullivan, D.C.; Conant, E.F. How experience and training influence mammography expertise. Acad. Radiol. 1999, 6, 575–585. [Google Scholar] [CrossRef]
  26. Nodine, C.F.; Mello-Thoms, C. The nature of expertise in radiology. In Handbook of Medical Imaging; SPIE: Bellingham WA, USA, 2000; pp. 859–895. [Google Scholar]
  27. Rajchl, M.; Koch, L.M.; Ledig, C.; Passerat-Palmbach, J.; Misawa, K.; Mori, K.; Rueckert, D. Employing weak annotations for medical image analysis problems. arXiv 2017, arXiv:1708.06297. [Google Scholar]
  28. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 590–597. [Google Scholar]
  29. Li, Y.; Huang, D.; Qin, D.; Wang, L.; Gong, B. Improving object detection with selective self-supervised self-training. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 589–607. [Google Scholar]
  30. Efron, B. Bootstrap methods: Another look at the jackknife. In Breakthroughs in Statistics; Springer: Berlin/Heidelberg, Germany, 1992; pp. 569–593. [Google Scholar]
  31. Grunkemeier, G.L.; Wu, Y. Bootstrap resampling methods: Something for nothing? Ann. Thorac. Surg. 2004, 77, 1142–1144. [Google Scholar] [CrossRef] [PubMed]
  32. Dutta, A.; Zisserman, A. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the MM’19, 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; ACM: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
  33. Dutta, A.; Gupta, A.; Zissermann, A. VGG Image Annotator (VIA). Version: 2.0.10, 2016. Available online: http://www.robots.ox.ac.uk/~vgg/software/via/ (accessed on 1 March 2021).
  34. Code Documentation for VGG Image Annotator 2.0. Available online: https://gitlab.com/vgg/via/-/blob/via-3.x.y/via-2.x.y/CodeDoc.md#core-data-structures (accessed on 1 March 2021).
  35. ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Available online: https://www.image-net.org/challenges/LSVRC/ (accessed on 1 March 2021).
  36. Apostolopoulos, I.D.; Mpesiana, T.A. COVID-19: Automatic detection from x-ray images utilizing transfer learning with convolutional neural networks. Phys. Eng. Sci. Med. 2020, 43, 635–640. [Google Scholar] [CrossRef] [PubMed]
  37. Loey, M.; Smarandache, F.; Khalifa, N.E.M. Within the lack of chest COVID-19 X-ray dataset: A novel detection model based on GAN and deep transfer learning. Symmetry 2020, 12, 651. [Google Scholar] [CrossRef]
  38. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  39. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  40. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  41. Honda, H. Digging into Detectron 2. 2020. Available online: https://medium.com/@hirotoschwert/digging-into-detectron-2-47b2e794fabd (accessed on 1 August 2021).
  42. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  43. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  44. Van Rossum, G.; Drake, F.L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, USA, 2009. [Google Scholar]
  45. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
  46. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 1 August 2021).
  47. COCO Train2017. Available online: https://cocodataset.org (accessed on 1 August 2021).
  48. He, K.; Girshick, R.; Dollár, P. Rethinking imagenet pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 4918–4927. [Google Scholar]
  49. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  50. Shih, G.; Wu, C.C.; Halabi, S.S.; Kohli, M.D.; Prevedello, L.M.; Cook, T.S.; Sharma, A.; Amorosa, J.K.; Arteaga, V.; Galperin-Aizenberg, M.; et al. Augmenting the National Institutes of Health chest radiograph dataset with expert annotations of possible pneumonia. Radiol. Artif. Intell. 2019, 1, e180041. [Google Scholar] [CrossRef]
  51. Nguyen, H.Q.; Lam, K.; Le, L.T.; Pham, H.H.; Tran, D.Q.; Nguyen, D.B.; Le, D.D.; Pham, C.M.; Tong, H.T.; Dinh, D.H.; et al. VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. arXiv 2020, arXiv:2012.15029. [Google Scholar] [CrossRef]
  52. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2097–2106. [Google Scholar]
  53. Akl, E.A.; Blažić, I.; Yaacoub, S.; Frija, G.; Chou, R.; Appiah, J.A.; Fatehi, M.; Flor, N.; Hitti, E.; Jafri, H.; et al. Use of chest imaging in the diagnosis and management of COVID-19: A WHO rapid advice guide. Radiology 2021, 298, E63–E69. [Google Scholar] [CrossRef]
  54. Foust, A.M.; Phillips, G.S.; Chu, W.C.; Daltro, P.; Das, K.M.; Garcia-Peña, P.; Kilborn, T.; Winant, A.J.; Lee, E.Y. International Expert Consensus statement on chest imaging in pediatric COVID-19 patient management: Imaging findings, imaging study reporting, and imaging study recommendations. Radiol. Cardiothorac. Imaging 2020, 2, e200214. [Google Scholar] [CrossRef] [PubMed]
  55. Yang, R.; Li, X.; Liu, H.; Zhen, Y.; Zhang, X.; Xiong, Q.; Luo, Y.; Gao, C.; Zeng, W. Chest CT severity score: An imaging tool for assessing severe COVID-19. Radiol. Cardiothorac. Imaging 2020, 2, e200047. [Google Scholar] [CrossRef] [PubMed]
  56. Dane, B.; Brusca-Augello, G.; Kim, D.; Katz, D.S. Unexpected findings of coronavirus disease (COVID-19) at the lung bases on abdominopelvic CT. Am. J. Roentgenol. 2020, 215, 603–606. [Google Scholar] [CrossRef] [PubMed]
  57. Toussie, D.; Voutsinas, N.; Finkelstein, M.; Cedillo, M.A.; Manna, S.; Maron, S.Z.; Jacobi, A.; Chung, M.; Bernheim, A.; Eber, C.; et al. Clinical and chest radiography features determine patient outcomes in young and middle-aged adults with COVID-19. Radiology 2020, 297, E197–E206. [Google Scholar] [CrossRef]
  58. Hwang, E.J.; Kim, K.B.; Kim, J.Y.; Lim, J.K.; Nam, J.G.; Choi, H.; Kim, H.; Yoon, S.H.; Goo, J.M.; Park, C.M. COVID-19 pneumonia on chest X-rays: Performance of a deep learning-based computer-aided detection system. PLoS ONE 2021, 16, e0252440. [Google Scholar] [CrossRef]
  59. Hwang, E.J.; Park, S.; Jin, K.N.; Im Kim, J.; Choi, S.Y.; Lee, J.H.; Goo, J.M.; Aum, J.; Yim, J.J.; Cohen, J.G.; et al. Development and validation of a deep learning–based automated detection algorithm for major thoracic diseases on chest radiographs. JAMA Netw. Open 2019, 2, e191095. [Google Scholar] [CrossRef]
  60. Razzak, M.I.; Naz, S.; Zaib, A. Deep learning for medical image processing: Overview, challenges and the future. In Classification in BioApps; Springer: Cham, Switzerland, 2018; pp. 323–350. [Google Scholar]
  61. Papandreou, G.; Chen, L.C.; Murphy, K.; Yuille, A. Weakly- and Semi-Supervised Learning of a DCNN for Semantic Image Segmentation. arXiv 2015, arXiv:1502.02734. [Google Scholar]
  62. de la Iglesia Vayá, M.; Saborit, J.M.; Montell, J.A.; Pertusa, A.; Bustos, A.; Cazorla, M.; Galant, J.; Barber, X.; Orozco-Beltrán, D.; García-García, F.; et al. BIMCV COVID-19+: A large annotated dataset of RX and CT images from COVID-19 patients. arXiv 2020, arXiv:2006.01174. [Google Scholar]
  63. Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. J. Digit. Imaging 2013, 26, 1045–1057. [Google Scholar] [CrossRef]
  64. Tsai, E.B.; Simpson, S.; Lungren, M.P.; Hershman, M.; Roshkovan, L.; Colak, E.; Erickson, B.J.; Shih, G.; Stein, A.; Kalpathy-Cramer, J.; et al. The RSNA International COVID-19 Open Radiology Database (RICORD). Radiology 2021, 299, E204–E213. [Google Scholar] [CrossRef]
  65. Tsai, E.B.; Simpson, S.; Lungren, M.P.; Hershman, M.; Roshkovan, L.; Colak, E.; Erickson, B.J.; Shih, G.; Stein, A.; Kalpathy-Cramer, J.; et al. Data from Medical Imaging Data Resource Center (MIDRC)—RSNA International COVID Radiology Database (RICORD) Release 1c—Chest x-ray, COVID+ (MIDRC-RICORD-1c)—The Cancer Imaging Archive (TCIA). 2021. Available online: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70230281 (accessed on 1 March 2021).
  66. Jacob, S. What Is Mean Average Precision (mAP) in Object Detection? 2020. Available online: https://blog.roboflow.com/mean-average-precision (accessed on 1 August 2021).
Figure 1. Paradigm of the semi-self-supervised annotating method.
Figure 2. Train and test data on the three models in our experiment.
Figure 3. Visual appearance of GGO (red rectangle) and consolidation (yellow rectangle) within the lung field (green boundary).
Figure 4. Annotating lung lesions on a COVID-19 CXR (SIIM dataset) with our semi-self-supervised method. (a) Pseudo boxes and heuristic boxes. (b) Overlapping areas as red zones. (c) Final annotated boxes.
Figure 5. Open-source VIA tool employed to annotate and standardize the dataset into COCO format.
Figure 6. Overview of the Faster R-CNN architecture [41].
Figure 7. CXR and the different-scale feature maps extracted by the FPN. (a) Input CXR. (b) P2. (c) P3. (d) P4. (e) P5. (f) P6.
Figure 8. Illustration of (a) anchors in different sizes and ratios (anchors in three scales and ratios) and (b) placing anchors on feature maps [41].
Figure 9. Pneumonia detection performance comparison.
Figure 10. Loss value measurement during the training process of the student model.
Figure 11. Comparison of the 10 last model weights on the VinTest dataset.
Figure 12. Comparison of the 10 last model weights on the BG COVID dataset.
Figure 13. Differences in annotating styles between training and testing datasets. (a) VinBig (6 infiltration, 1 GGO, 1 consolidation). (b) SIIM (3 GGO, 1 consolidation). (c) VinTest (4 infiltration). (d) BG COVID (7 GGO, 6 consolidation).
Figure 14. Differences among physicians annotating the same CXR. (a) Physician 1 (H). (b) Physician 2 (T). (c) Physician 3 (D).
Figure 15. Incorrect detections that are still useful as warnings. (a) Larger detection. (b) Incorrect type detection. (c) Prediction distribution.
Table 1. COCO standardized dataset statistics. The GGO, Consolidation, Infiltration, and Pneumonia columns give the number of annotated instances (#Instance).
Dataset | Annotation Style | #CXR | GGO | Consolidation | Infiltration | Pneumonia
CheXpert (H) | private | 31 | 40 | 15 | 0 | 0
CheXpert (T) | private | 31 | 54 | 38 | 0 | 0
CheXpert (D) | private | 31 | 132 | 28 | 0 | 0
RSNA3 (D) | private | 123 | 420 | 297 | 0 | 0
RSNA-3 | consensus | 5889 | 0 | 0 | 0 | 9174
VinBig | consensus | 1588 | 2395 | 583 | 1308 | 0
VinTest | consensus | 208 | 62 | 136 | 102 | 0
SIIM | semi self-supervised | 1274 | 1574 | 1133 | 233 | 0
ChestX-ray8 | consensus | 120 | 0 | 0 | 0 | 120
BG COVID (D) | private | 235 | 866 | 326 | 0 | 0
Table 2. Pneumonia detection training on RSNA-3.
200,000 Iterations | RSNA3 | RSNA3(D) | CheXpert(H) | CheXpert(T) | CheXpert(D) | VinTest
AP50 | 12.70 | 1.34 | 17.94 | 8.49 | 0.08 | 15.42
AP75 | 1.62 | 0.03 | 0.68 | 0.06 | 0.00 | 0.079
AP | 3.86 | 0.22 | 5.33 | 2.00 | 0.02 | 3.98
#Instances | 381 | 717 | 55 | 92 | 160 | 300
Table 3. Pneumonia detection training on VinBig.
200,000 Iterations | RSNA3 | RSNA3(D) | CheXpert(H) | CheXpert(T) | CheXpert(D) | VinTest
AP50 | 4.99 | 1.78 | 11.80 | 7.64 | 1.19 | 19.25
AP75 | 0.20 | 0.99 | 1.06 | 0.50 | 0.00 | 1.27
AP | 1.23 | 0.75 | 5.08 | 2.56 | 0.16 | 6.04
#Instances | 381 | 717 | 55 | 92 | 160 | 300
Table 4. Pneumonia detection training on VinBig and RSNA-3.
165,000 Iterations | RSNA3 | RSNA3(D) | CheXpert(H) | CheXpert(T) | CheXpert(D) | VinTest
AP50 | 23.10 | 2.13 | 17.01 | 14.17 | 0.68 | 30.44
AP75 | 2.34 | 0.02 | 1.206 | 0.94 | 0.00 | 1.94
AP | 6.88 | 0.64 | 6.11 | 3.36 | 0.11 | 8.34
#Instances | 381 | 717 | 55 | 92 | 160 | 300
Table 5. Lung lesion detection training on VinBig.
285,000 Iterations | VinTest | RSNA3(D) | CheXpert(H) | CheXpert(T) | CheXpert(D)
AP50 | 6.81 | 0.56 | 3.74 | 3.04 | 0.04
AP75 | 0.87 | 0.01 | 0.00 | 0.00 | 0.00
AP | 2.34 | 0.17 | 0.968 | 0.84 | 0.00
#Instances | 300 | 717 | 55 | 92 | 160
Table 6. Lung lesion detection training on VinBig and RSNA3(D).
205,000 Iterations | VinTest | RSNA3(D) 1 | CheXpert(H) | CheXpert(T) | CheXpert(D)
AP50 | 6.92 | 100 | 2.46 | 3.03 | 2.13
AP75 | 0.71 | 100 | 0.00 | 0.00 | 0.00
AP | 2.25 | 90.75 | 0.84 | 0.74 | 0.81
#Instances | 300 | 717 | 55 | 92 | 160
1 The model was trained with RSNA3(D); thus, the accuracy was nearly perfect. We included the result on RSNA3(D) only for consistency between tables.
Table 7. Performance evaluation (AP50) on the VinTest and BG COVID datasets.
Run | Iterations (Student) | Iterations (Teacher/Baseline) | VinTest: Baseline | VinTest: Teacher | VinTest: Student | BG COVID: Baseline | BG COVID: Teacher | BG COVID: Student
I | 50,000 | 210,000 | 5.911 | 5.554 | 9.446 | 1.081 | 1.943 | 1.675
II | 60,000 | 220,000 | 5.576 | 6.278 | 9.573 | 0.941 | 1.58 | 2.315
III | 70,000 | 230,000 | 6.185 | 6.2 | 9.318 | 1.134 | 1.553 | 2.143
IV | 80,000 | 240,000 | 5.952 | 6.305 | 9.847 | 1.042 | 1.726 | 1.749
V | 90,000 | 250,000 | 6.264 | 5.897 | 8.835 | 1.052 | 1.629 | 2.124
VI | 100,000 | 260,000 | 6.136 | 6.186 | 8.247 | 1.072 | 1.629 | 2.14
VII | 110,000 | 270,000 | 6.041 | 5.965 | 9.165 | 1.04 | 1.576 | 1.994
VIII | 120,000 | 280,000 | 6.672 | 5.949 | 9.456 | 1.03 | 1.574 | 1.876
IX | 130,000 | 290,000 | 5.953 | 6.022 | 10.002 | 1.017 | 1.604 | 1.897
X | 140,000 | 300,000 | 6.311 | 5.969 | 9.965 | 1.016 | 1.683 | 1.939
Table 8. Student model evaluation (AP) at different IoU thresholds.
IoU | BG COVID | VinTest
0.1 | 18.9 | 18.6
0.2 | 10.6 | 16.8
0.3 | 6 | 14.5
0.4 | 3.7 | 11.5
0.5 | 1.9 | 10
0.6 | 1.2 | 5.4
0.7 | 0.6 | 2
Table 9. Comparison of pneumonia detection performance (AP) between the student and NIHCC models.
IoU | Student | NIHCC
0.1 | 45.3 | 63.3
0.2 | 29.3 | 35
0.3 | 16.5 | 16.6
0.4 | 8.5 | 7.5
0.5 | 2.6 | 3.3
0.6 | 1.1 | 1.6
0.7 | 0.6 | 0.8
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

