1. Introduction
With the rapid development of tools that support medical diagnosis using artificial intelligence (AI), expectations of AI have been increasing continuously [1,2,3,4,5]. In reality, however, the application of AI in clinical practice remains challenging. One of the major obstacles is the so-called “black box problem” of AI [4,6,7,8], in which the relationship between input and output learned from data is so complicated that no human, including the developer, can determine the rationale for an AI decision [9]. There are three major approaches to achieving explainable AI with deep neural networks (DNNs), the machine learning technology typically used in diagnostic imaging support. The first is to visualize or analyze the internal behavior of existing high-performance DNNs [10,11,12,13,14]. The second is to attach an explanatory module to a DNN externally [12,15,16,17,18,19]. The third is to make DNNs perform decisions via explainable representations, an approach also called “interpretable models” [20,21,22,23,24]. Of these, the third approach offers the highest explanatory power. Nevertheless, the first and second approaches have traditionally dominated explainable AI research because interpretable models may cause performance degradation.
In the present study, we employ the third approach, i.e., interpretable models. One reason for this choice is that the performance of conventional AI is already high, so a slight performance degradation is acceptable. The second reason is that our purpose in developing AI support for diagnostic imaging is not to improve the performance of the technology alone, but rather to enhance the performance of the medical professionals who use it; a more sophisticated explainable representation has the potential to further enhance their performance. We therefore propose a novel interpretable model targeting videos of fetal cardiac ultrasound screening, a crucial obstetric examination whose detection rate of congenital heart diseases (CHDs) nevertheless remains low [25,26,27]. This interpretable model is an auto-encoder that incorporates two novel techniques, a cascade graph encoder and a view-proxy loss, and generates a “graph chart diagram” as an explainable representation. The graph chart diagram visualizes the detection of substructures of the heart and vessels in the screening video as a two-dimensional trajectory, and an abnormality score is then calculated by measuring the deviation from normal. The examiner uses the graph chart diagram and the abnormality score to perform fetal cardiac ultrasound screening.
Studies comparing AI and humans, or examining their collaboration, are vital to gaining insight into the clinical implementation of AI, and many such studies have been conducted [12,28,29,30,31,32,33], several of them on ultrasound [34,35,36,37,38]. Performance improvements from combining human and AI scores have also been studied in dermatology [39], breast oncology [40], and pathology [41,42,43]. A small number of studies have reported the performance of examiners actually using AI [44]. Regarding the use of explainable AI, Yamamoto et al. used explainable AI to gain new insights into pathology [42], and Tschandl et al. [44] educated medical students using insights obtained from Grad-CAM [10]. However, to the best of our knowledge, no study in medical AI has examined examiners directly utilizing deep learning-based explainable representations (e.g., heatmaps, compressed representations, and graphs). We believe the reason is that the current mainstream techniques have low consistency between decisions and explanations [4,20]. Because decisions (i.e., AI scores) and explanations are generated by the same process in interpretable models, the consistency between explanations and decisions is high, and performance enhancement from adding explanations to decisions is most likely to be realized [20].
In this study, we investigated whether the deep learning-based explainable representation, the “graph chart diagram”, could enhance the detection of CHD anomalies by 27 examiners (8 experts, 10 fellows, and 9 residents). Quantitative evaluation using the arithmetic mean of the area under the curve (AUC) of the receiver operating characteristic (ROC) curve showed that screening performance improved when the graph chart diagram was utilized in all three groups: experts, fellows, and residents. This is the first report to demonstrate improved screening performance for CHD using explainable AI, and it presents a new direction for the introduction of explainable AI into medical testing and diagnosis.
4. Discussion
In this study, we proposed a deep learning-based explainable representation (the graph chart diagram) that compresses and represents the information in fetal cardiac ultrasound screening videos, and we introduced two techniques, a cascade graph encoder and a view-proxy loss, to realize its depiction. We also demonstrated that the graph chart diagram and the abnormality score can improve the ability of examiners to detect abnormalities.
Research on explainability in deep learning has concentrated on analyzing models [10,11,12,13,14] or developing external modules [12,15,16,17,18] for explainability. Limited research has been conducted on interpretable models that modify the structure of the model itself. Some interpretable models improve explanatory power by replacing modules [21,24]; however, domain-specific methods remain scarce [19,22] because of the need for domain-specific knowledge [20]. Interpretable models can also be discussed in a broader context. Studies have been extensively conducted to obtain human-interpretable representations from highly complex data [51,52,53,54], and several of these have addressed compressing time-series information to a lower dimensionality [55]. In the deep learning field, TimeCluster was proposed to reduce the dimensionality of time-series information with a kernel and to represent it as a two-dimensional diagram [56]. TimeCluster targets a single, very long time series and finds anomalies within parts of it. To handle such data, TimeCluster compresses the dimensions using auto-encoders and applies principal component analysis [51] or other projection methods to the intermediate representation. TimeCluster learns the network weights for each instance of time-series information; therefore, different representations can be obtained for the same data. This indicates that TimeCluster is not designed to process time-series information from many inspection videos of approximately 10 s each and to identify anomalies in an entire video. In contrast, our proposed method for the graph chart diagram learns from many instances of normal videos; the intermediate layer learns a two-dimensional representation directly, and no network weight is trained on the test videos.
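To make this contrast concrete, the following is a minimal, hypothetical sketch rather than the implementation of TimeCluster or of our method: a TimeCluster-style pipeline fits a projection such as PCA to the intermediate representation of each video, whereas a graph-chart-style encoder has a two-dimensional bottleneck trained once on many normal videos. All array names, dimensions, and the stand-in encoders are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# frames: (T, D) per-frame feature vectors from one screening video (hypothetical).
frames = np.random.rand(300, 128)

# --- TimeCluster-style: compress with an (already trained) auto-encoder, then
# project the intermediate representation to 2D with PCA fitted per video.
latents = frames @ np.random.rand(128, 32)        # stand-in for an encoder
trajectory_tc = PCA(n_components=2).fit_transform(latents)  # per-video fitting

# --- Graph-chart-style: the bottleneck is itself 2D and its weights are trained
# once on many normal videos, so test videos are only passed forward.
W = np.random.rand(128, 2)                        # stand-in for trained weights
trajectory_gc = frames @ W                        # no fitting on the test video
```

Because the projection in the first branch is refitted for each video, the same data can yield different diagrams, whereas the second branch always produces the same trajectory for the same input.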
Next, we discuss the two proposed techniques, the view-proxy loss and the cascade graph encoder. The view-proxy loss improves performance (Table 1) and reduces the standard deviation by fixing the coordinates at which the ideal 4CV and 3VTV appear on the graph chart diagram (Figure 2). The view-proxy loss can be regarded as a type of proxy loss [57,58,59]. A proxy loss creates a proxy from the data belonging to one class and includes the loss between the proxy and other samples. The view-proxy loss treats the point corresponding to the ideal diagnostic plane as the proxy and considers the loss between this proxy and the synthesized barcode-like timelines corresponding to 4CV and 3VTV. The view-proxy loss is unique in that the ideal diagnostic plane is known and is utilized as the proxy, and in that it synthesizes barcode-like timelines to compensate for the absence of 4CV and 3VTV annotations. The cascade graph encoder improves performance (Table 1) and explainability by creating sub-graph chart diagrams for sets of substructures (Figure 4), followed by a main-graph chart diagram of all the sets. Although the cascade graph encoder resembles the hierarchical auto-encoder [60] and the stacked auto-encoder [61], it is unique in that our graph chart diagrams comprise both partial and comprehensive explanatory representations.
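As an illustration only, the following sketch shows one way a view-proxy loss of this kind could be written: per-frame coordinates on the graph chart diagram are pulled toward fixed proxy coordinates for the ideal 4CV and 3VTV, weighted by the synthesized barcode-like timelines. The proxy coordinates, the soft view weights, and all tensor names are assumptions rather than the implementation used in this study.

```python
import torch

# Fixed proxy coordinates on the graph chart diagram for the two ideal
# diagnostic planes (placeholder values, not those used in the paper).
PROXIES = {"4CV": torch.tensor([0.0, 1.0]), "3VTV": torch.tensor([1.0, 0.0])}

def view_proxy_loss(z, view_weights):
    """z: (T, 2) per-frame coordinates on the graph chart diagram.
    view_weights: dict mapping view name to a (T,) tensor of soft weights
    taken from the synthesized barcode-like timeline for that view."""
    loss = z.new_zeros(())
    for view, proxy in PROXIES.items():
        w = view_weights[view]                 # (T,) soft membership per frame
        d = ((z - proxy) ** 2).sum(dim=1)      # squared distance to the proxy
        loss = loss + (w * d).sum() / (w.sum() + 1e-8)
    return loss
```

In this sketch, frames that the synthesized timeline marks as likely 4CV or 3VTV contribute more strongly, so those portions of the trajectory are anchored near the fixed proxy coordinates without requiring frame-level annotations.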
We next analyze the qualitative features of the graph chart diagrams. The graph chart diagram discards information that is unnecessary for fetal cardiac ultrasound screening and emphasizes what is necessary. Backward and forward movements of the probe, the speed of probe movement, and movement of the fetus during video recording are not necessary information. The shape of the graph chart diagram does not change even if a phase similar to one already passed appears multiple times, which reduces the noise caused by fetal and probe movement. The spacing between points likewise does not affect the shape, which reduces the effect of the speed at which the probe is moved. Thus, the graph chart diagram is robust to the intrinsic noise caused by probe movement. Furthermore, the graph chart diagram is helpful in terms of explainability. The coordinates corresponding to the planes of normal structures are scattered over the two-dimensional diagram and serve as checkpoints. If a checkpoint cannot be seen in the video, part of the shape will be missing; therefore, the area of the shape functions as an indicator of the degree of abnormality. Recognizing a shape from a trajectory of points is a nontrivial task; in Python, the Shapely package is a standard tool for this purpose, although a more advanced algorithm may improve the performance of the abnormality score. In the experiment on collaboration between the examiners and AI, we provided raw point trajectories, as shown in Figure 3, instead of the shapes shown in Supplementary Figure S1, because we expect human shape recognition ability to outperform the algorithm.
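As a concrete illustration of the area-based indicator described above, the following sketch uses Shapely to measure the area enclosed by a trajectory and to derive a deviation score. The polygon repair via buffer(0), the reference normal area, and the function names are assumptions for illustration, not the abnormality score used in this study.

```python
from shapely.geometry import Polygon

def chart_area(trajectory):
    """trajectory: list of (x, y) points from the graph chart diagram."""
    poly = Polygon(trajectory)
    # Trajectories usually self-intersect; buffer(0) is a common way to
    # obtain a valid polygon before measuring the enclosed area.
    if not poly.is_valid:
        poly = poly.buffer(0)
    return poly.area

def area_deviation_score(trajectory, normal_area):
    """Fraction of the reference (normal) area that is missing; larger
    values suggest that some checkpoints were not reached in the video."""
    return max(0.0, 1.0 - chart_area(trajectory) / normal_area)
```

A missing checkpoint removes part of the traced shape, shrinking the measured area and raising the deviation score, which matches the intuition described in the text.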
Deep learning-based methods for automatically detecting diagnostic planes [62,63,64,65,66,67] and methods for detecting abnormalities using diagnostic planes [19,35] have been proposed. However, these approaches require many images from hundreds of CHD cases to develop a system that detects any CHD, including rare ones. In addition, forms differ widely even within a given type of CHD. In contrast, there is no structural difference among normal fetal hearts, and any deviation from the normal structure increases the possibility of CHD. Hence, to detect many types of abnormalities, our proposed method employs abnormality detection technology that detects deviations from the normal structure. Furthermore, the conventional screening procedure in the clinical field requires the examiner to determine the planes that contribute to the diagnosis and to record images. This task is difficult for unskilled examiners (especially for CHD), because identifying diagnostic planes requires a high level of skill, closely related to that required for diagnosis. To address this issue, we focused on the ultrasound video obtained by scanning the entire fetal heart, which contains the diagnostic planes. Deep learning-based abnormality detection methods for videos have been studied [68,69,70]; however, these methods were designed for surveillance videos and exhibit poor performance on videos with moving backgrounds [46]. Komatsu et al. proposed an abnormality detection method for fetal cardiac ultrasound screening videos that uses the 20 sequential video frames around the diagnostic planes to calculate abnormality scores [46]. Our proposed method utilizes all the video frames, and the calculation of the abnormality score does not require any preprocessing, such as specifying the diagnostic planes. Therefore, the proposed graph chart diagrams and abnormality scores are highly applicable to fetal cardiac ultrasound screening. Moreover, we consider the potential for further development of AI technology in fetal cardiac ultrasound. The graph chart diagram and the calculated abnormality score can be used only for screening; they cannot simultaneously be used for further diagnoses, such as identifying the type of CHD. Analyses of various metrics, using image segmentation [71,72,73] and other methods, should be considered to effectively support the analysis of anomalies [35].
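The screening-time flow described here, which uses all frames without specifying diagnostic planes, could look like the following hypothetical sketch; the encoder interface, the scoring callable (for example, the area-deviation sketch shown earlier), and the variable names are assumptions.

```python
import numpy as np

def score_video(video_frames, encoder, area_score):
    """video_frames: iterable of ultrasound frames covering the whole sweep.
    encoder: trained only on normal videos; maps one frame to an (x, y)
    coordinate on the graph chart diagram.
    area_score: callable that turns a 2D trajectory into an abnormality score.
    No diagnostic-plane detection or other preprocessing is required."""
    trajectory = np.array([encoder(frame) for frame in video_frames])  # (T, 2)
    return trajectory, area_score(trajectory)
```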
Our study demonstrates that graph chart diagrams improve abnormality detection by examiners with a wide range of experience, from experts to residents. For the residents, the mean AUC of the ROC curve with AI assistance was not as high as that of AI alone, so detection by AI alone may perform better than collaboration with examiners who have little experience. This result indicates that less experience makes it more difficult to decide how to use the AI information. Nevertheless, given the large performance improvement of residents using AI, the proposed methods could serve as educational and training tools. For the fellows, the mean AUC increased when using AI, and both their recall and precision increased, as shown in Table 3; across all examiners, recall and precision likewise increased. Fellows tended to place slightly more weight on recall than on precision. Because the purpose of fetal cardiac ultrasound screening is not to miss CHDs, fellows, who are the main workforce in obstetrics, may place more importance on recall. AI usage also improved the mean AUC of the ROC curve for the experts, and for them the increase in precision was greater than the increase in recall. This is probably because experts, who are estimated to be fewer in number than fellows in Japan, are required to make secondary judgments on cases classified by fellows; therefore, they used AI to improve precision rather than recall. We found that fellows and experts can make good use of AI according to their respective roles, with fellows focusing on recall and experts on precision. The residents improved their performance with AI assistance but could not reach the performance obtained by AI alone; this result implies that examiners need experience in order to understand the explanations of explainable AI.
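For reference, the group-level metrics discussed above could be computed as in the following sketch with scikit-learn; the labels, scores, pooling across examiners, and decision threshold are hypothetical and are not the values from this study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score, precision_score

# y_true: 1 for CHD cases, 0 for normal cases (hypothetical labels).
# scores_by_examiner: one array of per-case scores for each examiner in a group.
y_true = np.array([0, 1, 0, 1, 1, 0])
scores_by_examiner = [np.array([0.2, 0.9, 0.1, 0.7, 0.6, 0.3]),
                      np.array([0.3, 0.8, 0.4, 0.9, 0.5, 0.2])]

# Arithmetic mean of the per-examiner AUCs of the ROC curve.
mean_auc = np.mean([roc_auc_score(y_true, s) for s in scores_by_examiner])

# Recall and precision after binarizing each examiner's scores at a threshold
# and pooling the decisions within the group.
threshold = 0.5
y_pred = np.concatenate([(s >= threshold).astype(int) for s in scores_by_examiner])
y_rep = np.tile(y_true, len(scores_by_examiner))
recall = recall_score(y_rep, y_pred)
precision = precision_score(y_rep, y_pred)
```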
There are several limitations to this study. First, our proposed graph chart diagram is robust to probe movement, but it has not been tested and evaluated for the influence of acoustic shadows in ultrasound videos; preprocessing such as shadow detection [74] may need to be considered. Second, owing to the low incidence of CHD, we used a limited number of abnormal cases to test our proposed method. Furthermore, we mainly targeted severe CHDs and have not yet tested this method on mild abnormalities, such as small ventricular septal defects. Multicenter joint research should be considered to collect further CHD data for evaluating the validity and reliability of our explainable AI technology in future studies. Third, although the method is robust to probe movement, backward motion of the probe, and probe speed, robustness across devices has not been evaluated because of the limited number of ultrasound devices used for training. Finally, all training, validation, and test data in this study were acquired using the same type of ultrasound machine, and we did not perform experiments on other machines. The generalization performance of the proposed explainable AI is a subject for future studies.