1. Introduction
A vast number of X-ray examinations are performed annually worldwide [1], of which chest X-ray (CXR) examinations account for the largest number of cases [2]. In the past, the X-ray intensity distribution obtained by irradiating the subject with X-rays was recorded on film, and the image could only be displayed after the film was developed (screen/film system). In recent years, however, digital imaging methods, such as computed radiography (CR), have been increasingly used, significantly shortening the time required to display the acquired images. Although the imaging plates used in CR require time for readout and processing, flat panel detectors (FPDs), which display images in real time, are becoming more widely used.
In these X-ray examinations, the target area must be accurately visualized; when it is not, the image must be retaken, which increases both the examination time and the radiation exposure of the subject. Lin et al. [3] reported the factors that cause retakes in general X-ray examinations, including CXR examinations; most were positioning errors, omission of required anatomy, and artifacts. Although the widespread use of digital imaging has enabled real-time image display, the need for a retake is still judged by the radiologist’s eyes. Therefore, to further improve the efficiency and accuracy of X-ray examinations, a system that can immediately determine this need is essential.
In recent years, several studies have performed medical image analyses using deep learning (DL) techniques [4,5,6]. Among the DL techniques based on convolutional neural networks, classification [7], semantic segmentation [8], and object detection [9] are well suited to medical image analyses, and there are many reports on these techniques. For CXR images, classification, semantic segmentation, and object detection have been used for lesion classification [10], the semantic segmentation of lung field areas [11,12], and the detection of diseases in lung field areas [13], respectively. In addition, with the recent global coronavirus disease 2019 pandemic, these technologies have been applied to detect the presence of pneumonia and to segment pneumonia within the lung field area [14,15]. The continued development of these technologies has broadened the range of tasks to which they can be applied.
Although these technologies have been improving, their use in determining the need for a retake has not yet been fully explored. Konica Minolta, Inc., has already commercialized the AeroDR solution to improve the efficiency of medical examinations, and its CS-7 console is equipped with functions to detect lung field defects and body motion in frontal CXR images [16]. In contrast, using the classification and semantic segmentation techniques of DL technology, the presence of lung field defects or obstacle shadows in an acquired image can be classified, and the location of the obstacle shadows, either inside or outside the lung field, can be determined, regardless of the imaging environment, such as a hospital room or an X-ray room. Junhao et al. [17] applied DL techniques to the construction of a quality assurance (QA) system for CXR images. Although this QA system can detect lung field defects and artifacts, it does not discriminate between medical and nonmedical devices and does not recognize images in a manner similar to humans, making it insufficient for determining the need for a retake. Although the application of DL technology to medical imaging has made progress, there are few applications in the field of medical image acquisition, and no applications have been identified for determining whether an image should be retaken.
Therefore, in this study, we developed software that evaluates CXR images to determine whether retaking is necessary, based on the combined application of DL techniques, and we evaluated its accuracy.
4. Discussion
To the best of our knowledge, this software is the first attempt to apply DL techniques to determine whether retaking a CXR image is necessary. In this software, four DL models were combined into a single system. However, because there are few directly related previous studies, the models and software developed in this study are discussed objectively through comparison with studies of QA systems for X-ray images, which are most relevant to this study, and with studies on the semantic segmentation of lung field regions not including the mediastinum. Below, we discuss each model and the software as a whole and describe the limitations and prospects of this study.
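As a rough illustration of how four independently trained models can be combined into a single retake decision, the following Python sketch uses hypothetical model outputs and decision rules; the actual models in this study were implemented in MATLAB, and the exact combination logic may differ.

```python
# Sketch of combining four independent DL model outputs into one retake decision.
# All names and rules here are hypothetical illustrations, not the actual interfaces.
from dataclasses import dataclass


@dataclass
class ModelOutputs:
    lung_defect: bool      # lung field defect CLM: defect present?
    obstacle: bool         # obstacle shadow CLM: shadow present?
    obstacle_inside: bool  # location CLM: shadow inside the lung field?
    lung_mask_ok: bool     # SSM: segmented lung field (incl. mediastinum) plausible?


def needs_retake(out: ModelOutputs) -> bool:
    """Return True if the image should be flagged for retaking."""
    if out.lung_defect:                       # missing lung field: always retake
        return True
    if out.obstacle and out.obstacle_inside:  # obstacle shadow inside the lung field
        return True
    if not out.lung_mask_ok:                  # implausible segmentation: flag for review
        return True
    return False


# Example: obstacle shadow present but outside the lung field -> no retake needed
print(needs_retake(ModelOutputs(False, True, False, True)))  # False
```

The point of the sketch is that each model answers one narrow question, and the final judgment is a simple combination of those answers that mirrors the radiological technologist's decision process.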
First, we discuss the lung field defect CLM. Although the classification technique has been widely applied to detecting defective products on factory production lines [22], its application to classifying defects in the lung field region has not yet been confirmed. In this context, Junhao et al. [17] classified the presence or absence of defects in the lung field region as part of the construction of a QA system for CXR images, using a combined application of the semantic segmentation and classification techniques. In their study, they performed QA pixel-wise using the semantic segmentation technique instead of applying image classification directly, because the target area for QA evaluation was small. As a result, the accuracy of image-level classification of the presence or absence of lung field defects was 92.50%, slightly higher than that of the present study, whereas the pixel-wise examination using the semantic segmentation technique showed an accuracy of 97.96%. In the data used in this study, there were some images in which the lung field defects occupied only small areas, such as the costophrenic angle. Considering the performance of hardware in actual clinical practice, complex data input may lower throughput because of slower processing speed. Nevertheless, it is important to consider input features that will allow the AI to identify the presence or absence of lung field defects more easily in the future. Second, we discuss the obstacle shadow CLM. For the classification of the presence or absence of obstacle shadows, Junhao et al. [17] attempted to classify the presence or absence of artifacts in the same way. In their case, the accuracy of image-level classification was 83.75%, slightly lower than the accuracy of 91.7% in the present study. One reason for this difference may be the difference in the number of chest images with artifacts used for training. The fact that the accuracy was better in the present study, in which a larger number of chest images with artifacts were used for training, suggests that further improvement in classification accuracy can be expected as the number of images increases. On the other hand, their classification using the semantic segmentation technique showed an accuracy of 94.90%, which was better than that of the present study.
Although we were unable to identify any studies that directly classified the types of obstacle shadows, Ue-Hwan et al. [23] investigated manufacturer classification, model group identification, and magnetic resonance imaging safety characterization of cardiac implantable electronic devices (CIEDs) using a DL-based algorithm. The overall accuracy rates against each internal test dataset were 99.7%, 97.2%, and 98.9%, respectively. In their study, images with the CIED portion cropped and resized as preprocessing were used for training and evaluation. Therefore, considering the differences in study purpose and image format, it is difficult to compare their accuracy directly with that of the present study. Third, we discuss the obstacle shadow location CLM. In this study, we applied data augmentation to 270 chest images belonging to the “In” category and 730 chest images belonging to the “Out” category. The numbers of training data were brought close to the same value by adjusting the expansion rate of the data augmentation for each class. As a result, although some difference remained in the number of training data between the two classes, the classification accuracy was high without being biased toward the features of either class. This result indicates the validity of the data augmentation method used to keep the numbers of data close to equal. However, one reason why the overall accuracy of the obstacle shadow location CLM was only 91.2% may be the cases in which obstacle shadows were located at the boundaries of the lung fields or straddled the boundary between the inside and outside of the lung field, as with necklaces. In such cases, it is difficult to determine whether the obstacle shadow lies inside or outside the lung field. Therefore, such situations may have caused the accuracy degradation; however, we have not been able to visualize the basis of the DL model’s decisions in this study.
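The per-class adjustment of the expansion rate described above can be sketched as follows; this is a minimal illustration assuming integer augmentation multipliers chosen so that every class approaches the size of the largest class, not the exact procedure used in this study.

```python
# Sketch: choose an integer augmentation multiplier per class so that all
# classes end up close to the size of the largest class after augmentation.
def augmentation_multipliers(class_counts):
    """class_counts: dict of class name -> number of original images."""
    target = max(class_counts.values())
    return {c: max(1, round(target / n)) for c, n in class_counts.items()}


# The "In"/"Out" counts from this study: 270 vs. 730 images
counts = {"In": 270, "Out": 730}
mults = augmentation_multipliers(counts)
print(mults)                                          # {'In': 3, 'Out': 1}
print({c: n * mults[c] for c, n in counts.items()})   # {'In': 810, 'Out': 730}
```

With a multiplier of 3 for the minority class, the two classes reach 810 and 730 images, which matches the text's description of keeping the class sizes close without making them exactly equal.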
In the future, we believe it will be necessary to incorporate a method, such as saliency maps [21,24], that can represent the basis of the model’s judgments as a heat map, and to examine the causes in more detail. Fourth, we discuss the SSM of the lung field region. In this study, we performed semantic segmentation of the lung field region including the mediastinum and compared the results with those of previous studies [11,12]. Among such previous studies, the study on CardioNet by Abbas et al. [25] performed semantic segmentation not only of the lung field but also of the heart and clavicle. Among the CardioNet variants used in that study, CardioNet-B performed semantic segmentation of the lung, heart, and clavicle with mIoU values of 0.9728, 0.9042, and 0.8674, respectively. Thus, its accuracy was higher than that of the present study when the comparison was limited to the lung. However, focusing on the mIoU values of the heart and clavicle, we can confirm that they were lower than that of the lung. This result suggests that semantic segmentation is more difficult in low-radiolucency tissues than in high-radiolucency tissues, such as the lung field. It also suggests that the accuracy of semantic segmentation of the lung field including the mediastinum tends to be lower than that of semantic segmentation of the lung field without the mediastinum.
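The mIoU metric used in these comparisons can be computed as in the following sketch, assuming boolean per-class masks as NumPy arrays; the toy masks here are illustrative, not data from this study.

```python
# Sketch of per-class IoU and mean IoU (mIoU) for semantic segmentation masks.
import numpy as np


def iou(pred, gt):
    """IoU for one class, given boolean masks of equal shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # empty class: count as perfect


def mean_iou(preds, gts):
    """mIoU over classes: preds/gts are lists of boolean masks, one per class."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))


# Toy example: ground truth covers 4 pixels, prediction covers 6,
# so intersection = 4, union = 6, IoU = 4/6
gt = np.zeros((4, 4), bool); gt[1:3, 1:3] = True
pred = np.zeros((4, 4), bool); pred[1:3, 1:4] = True
print(round(iou(pred, gt), 3))  # 0.667
```

A reported mIoU of 0.9728 for the lung class thus means that, averaged over the evaluation set, predicted and reference lung masks overlapped in about 97% of their union.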
However, the SSM developed in this study had a higher mIoU value than that of a previous study [15] that used DeepLabv3+, as in our method. Considering this finding, we believe that the semantic segmentation and data augmentation methods used in our approach are appropriate. To further improve the accuracy of semantic segmentation in the future, it is natural to use more data, and it will also be necessary to adopt measures such as those proposed by Johnatan et al. [26] to counter the degradation of segmentation accuracy caused by the abnormal shadows of lesions, thereby accurately recognizing the lung field regions of more examinees. Fifth, we discuss the evaluation software for CXR images that combines these DL models. We evaluated the response time (RT) per chest image of the software by summing the RTs per chest image of the four DL models. Considering that the image processing time of an FPD is approximately several seconds and that of CR is approximately several tens of seconds, the RT of 3.64 × 10⁻² s for this software is negligible compared with the time required for conventional imaging operations, and the software is thus considered able to provide an artificial intelligence judgment on whether retaking is necessary. However, because the RT of this software varies depending on the device used, it is important in the future to examine the response time on a PC with the specifications used in actual clinical practice. In addition, the present study did not evaluate FLOPs; only the response time of the software was calculated, which should be addressed in the future. FLOPs are useful for evaluating the performance and efficiency of models [27], but in this study, we were interested in evaluating the time from image input to the display of results in a simple software program. This is because the actual time for the software to display the results of multiple models is one of the criteria for clinical image confirmation.
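The RT evaluation described above, summing the per-image inference time of each model, can be sketched as follows; the callables are hypothetical stand-ins for the four DL models, which in this study ran in MATLAB.

```python
# Sketch: total response time per image as the sum of per-model inference times.
import time


def response_time(models, image):
    """Sum the wall-clock inference time of each model on one image.
    `models` is a list of callables (hypothetical stand-ins for the four DL models)."""
    total = 0.0
    for model in models:
        start = time.perf_counter()
        model(image)
        total += time.perf_counter() - start
    return total


# Toy stand-ins for the four models; real timings depend entirely on hardware
dummy_models = [lambda img: img for _ in range(4)]
rt = response_time(dummy_models, object())
print(f"{rt:.2e} s")
```

Using `time.perf_counter()` rather than `time.time()` is the usual choice for short intervals, since it is a monotonic high-resolution clock; as the text notes, the measured value depends on the device, so the 3.64 × 10⁻² s figure applies only to the evaluation hardware.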
This software was created by combining multiple DL models. The images in this study ranged from those of healthy patients to those of hospitalized patients under long-term management, including images depicting electrocardiograph equipment. Typically, inappropriate X-ray images with missing lung fields are retaken, so such images rarely remain in the data. The reason for using multiple segmentation and classification models is that an image with a defective lung field is not appropriate for lesion detection, and the radiological technologist must take this into consideration; therefore, one model should focus solely on detecting lung field defects. In addition, obstacle shadows may or may not be acceptable depending on their location, so the segmentation model of the lung field, including the mediastinum, needs to delineate the exact lung field. Accordingly, in this study, the software was developed based on the radiological technologist’s decision process in X-ray radiography. Each model was trained on images that were originally subject to retaking and could not be stored in the picture archiving and communication system, or on images in which a medical device was depicted because of the patient’s health condition. Therefore, for the effective use of the original images, each model was created independently without considering the overlap of training data. This makes it difficult to comprehensively evaluate the software that combines the models. However, because each model of this software achieved an accuracy of approximately 90%, the software is considered able to immediately and accurately provide the radiologist with a decision as to whether retaking is necessary and to encourage confirmation by the human eye. Finally, we describe the limitations and prospects of this study. One limitation is that the software was built and evaluated using CXR images collected from the “CXR8” dataset published by the NIH Clinical Center.
Considering that CXR examination is the most common imaging method, it is important for a QA-related system such as this software to be trained and evaluated on data taken from several facilities, because this contributes to higher generalization performance. Therefore, we aim to add more data by utilizing other datasets in the future and to construct software with high generalization performance and accuracy. Although the software developed in this study runs on MATLAB, models created in MATLAB can be converted to the Open Neural Network Exchange (ONNX) format, which allows improvements and refinements irrespective of the development framework. Another limitation is that the factors considered in this system alone cannot provide appropriate judgments for all cases. For example, we were unable to examine the effects of the scapula on the lung field, as examined by Junhao et al. [17]. Therefore, additional data collection and training will be necessary to apply the method to more situations. In addition, post-imaging efficiency should be considered to improve the overall efficiency of CXR examinations. Oura et al. [28] reported on the QA of CXR images using DL techniques; they applied DL to four tasks (correction of orientation, correction of angle, correction of left–right reversal, and judgment of the patient’s position) and proposed a method to improve the accuracy and efficiency of daily operations. Therefore, combined application with other QA systems in the future will enable the current CXR examinations to be performed with higher throughput. For example, a study that developed a computer-aided diagnosis (CAD) system utilizing a convolutional neural network ensemble, aiming to reduce the workload of physicians and radiologists and achieve quick and accurate diagnosis, showed a marked improvement in accuracy in the classification of chest X-ray images [29]. In addition, a study that developed a DL-based algorithm to reduce data acquisition time in 3-D X-ray microscopy showed an 8- to 10-fold increase in speed while maintaining image quality, even with several hundred X-ray projections [30]. Studies have thus achieved high accuracy for each of these purposes, and further study is needed for the retake-determination system developed in this study. Since the penetration of X-ray images varies depending on the subject’s physique and other factors, it is necessary to automatically control image quality and detect low image quality to improve the efficiency of medical image analysis, as in the system developed by Dovganich et al. [31] for automatically determining the penetration of pulmonary X-ray images. For future applications of inference, the software and hardware and their efficiency also need to be considered. These considerations include further study of how to construct classification and regression models in DL development, as shown by Sumathi et al. [32]; the early prediction of infectious diseases from chest X-ray images, as shown by Namburu et al. [33]; and the integration of DL algorithms with FPGA hardware for efficient analysis and low power consumption, among other improvements. Based on these improvements, further development of the method proposed in this study is considered possible in the future.