1. Introduction
Transthoracic echocardiography (TTE) is the most commonly used cardiac imaging tool; it provides comprehensive observation of cardiac structure and function and assists in the diagnosis and management of heart failure, ischemia, valve disease, and congenital abnormalities, among others [1,2]. Initially, echocardiography was a highly specialized diagnostic tool performed only by professionally trained experts, but because it is non-invasive, cost-effective, and convenient, it has now been rapidly extended to other medical specialties, especially primary and emergency care settings [3].
However, there is concern that the training of medical staff performing echocardiography in other specialties may be insufficient to yield accurate and reliable results. For example, incorrect quantification of left ventricular ejection fraction (LVEF) may lead to inappropriate clinical decisions [3], potentially harming patients and increasing healthcare costs [4]. Moreover, almost all echocardiographic examinations depend on locating the standard heart views, yet training to find standard views is time-consuming and requires expert support [5].
To obtain consistent echocardiographic examinations, especially in primary and emergency care settings, it is important to reduce dependence on operators [4]. Artificial intelligence is expected to provide automated analysis tools [6].
The main challenges of ultrasound medicine are low image quality, noise, and artifacts. Because machine learning methods based on hand-crafted or manually selected features lack robustness, deep learning based on feature learning has been applied to ultrasound image analysis in recent years [7], for example in the classification of breast cancer and benign lesions [8,9], liver cancer [10], and thyroid nodules [11]. Other applications include quality control of fetal ultrasound and standard fetal views [12], and the segmentation of non-rigid [13] and rigid organs [14]. Three-dimensional analysis has not been widely used because of its computational cost and limited datasets [15].
Recently, deep learning has been applied to echocardiography in four areas [16]. The first is the evaluation of image quality [17]. The second is view classification and segmentation of cardiac structures [18]. The third is measurement, for example, quantification of left ventricular size and function [19]. The fourth is the detection of abnormalities, such as wall motion abnormalities [20], the assessment of heart failure with preserved ejection fraction [21], and the diagnosis of myocardial disease [19].
The classification of cardiac views can support automated detection of appropriate views in TTE. For example, effective standard view recognition can help less skilled operators determine whether the obtained view is a standard view, notifying them as soon as a standard view is found.
Several studies have reported good classification of cardiac views, with accuracies of 84–98%. Zhang et al. trained a convolutional neural network (CNN) on multiple tasks including view classification, achieving an overall accuracy of 84% on 23 viewpoints [19]. Madani et al. proposed a fast and accurate cardiac view recognition method for 15 views and Doppler images, which achieved overall accuracies of 91.7% (image classification) and 97.8% (video classification) [22]. Kusunose et al. reported a newly developed CNN for the classification of cardiac views with an overall accuracy of up to 98.1%, acceptable for a feasible identification model in clinical practice; however, their CNN predicts video classes for only five cardiac views [23].
The challenge comes from large intra-class differences and small inter-class differences among cardiac views. Individual factors, such as sex, race, age, and heart disease, may alter the appearance of the same cardiac view. The cardiac surface also changes periodically and non-linearly during the cardiac cycle, and the shapes of some views are relatively similar, which further increases the difficulty of recognition; even echocardiographers may not be able to identify deformed cardiac views accurately enough. See Supplementary Figure S1 for the nine standard cardiac views in a TTE examination: parasternal long-axis (PSLA), parasternal short-axis at the level of the great vessels (SB), parasternal short-axis at the level of the papillary muscles or mitral valve (SM), apical four-chamber (A4C), apical five-chamber (A5C), apical two-chamber (A2C), apical three-chamber (A3C), subcostal four-chamber (SUB4C), and suprasternal notch aortic arch (SUPAO).
This paper proposes an automatic recognition method to identify nine standard cardiac views. The presented method is based on a CNN and includes three effective strategies: graph regularization learning (GRL) [24,25], a spatial transformer network (STM) [26], and a channel attention mechanism (squeeze-and-excitation network, SE) [27]. The highlights are as follows:
(1) The STM serves as an independent pre-processing module that learns the deformation during the cardiac cycle to reduce intra-class variability, while the SE recalibrates channel-wise responses to enhance the features relevant to recognition.
(2) The similarity between samples is ignored in conventional deep learning. In the presented method, the structural signals of sample similarity are defined as a graph-based embedding, which acts as an unsupervised regularization constraint and achieves more accurate classification than known methods.
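As a concrete illustration of highlight (2) (a minimal numpy sketch under our own assumptions, not the authors' implementation; the function names, the embedding matrix, and the weighting `lam` are illustrative), the graph-based constraint can be written as a penalty that pulls the embeddings of similar samples together and is added to the supervised loss:

```python
import numpy as np

def graph_regularization(embeddings, weights):
    """Unsupervised graph penalty: sum over pairs of w_ij * ||z_i - z_j||^2.

    embeddings: (n, d) array of per-sample feature vectors.
    weights:    (n, n) symmetric similarity matrix (e.g. mutual information).
    """
    n = embeddings.shape[0]
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff = embeddings[i] - embeddings[j]
            penalty += weights[i, j] * np.dot(diff, diff)
    return penalty

def total_loss(cross_entropy, embeddings, weights, lam=0.1):
    """Supervised loss plus the graph-based unsupervised constraint."""
    return cross_entropy + lam * graph_regularization(embeddings, weights)
```

With this form, samples connected by a large edge weight are penalized for having distant embeddings, while a zero weight leaves the pair unconstrained.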
3. Results
The datasets from the two hospitals include 171,792 images (Table 1) and 37,883 images (Table 2), respectively. Training and testing are first performed on Dataset 1, and independent tests are then performed on Dataset 2.
The accuracy of Inception V3 is 88.78%. After channel attention is introduced, the overall accuracy of Inception V3 + SE improves significantly. The STM reduces the variability of cardiac deformation and further improves the accuracy to 96.50%. Because the graph regularization serves as a robust unsupervised loss, the proposed method achieves the best overall accuracy of 99.10% (Table 4).
The evaluation on individual cardiac views is shown in Table 5. PSLA, SB, SM, SUB4C, and SUPAO are all recognized with a sensitivity of 100%; no images are misclassified into other categories, and the AUC reaches 100%. A5C, A2C, and A3C are slightly misclassified; in particular, the sensitivity of A2C is 94.63%.
The evaluation on the independent test set is shown in Table 6. SM, SB, and SUB4C are all correctly identified, but a few images of PSLA and SUPAO are misclassified. Similarly, some images of A4C, A5C, A2C, and A3C are easily confused; in particular, the sensitivity of A2C is reduced to 94.15%. Although the results in Table 6 are slightly worse than those in Table 5, the accuracy of each category remains higher than 97%, and the mean AUC exceeds 98%.
To locate the classification errors among cardiac views, confusion matrices are computed. As shown in Figure 4, the horizontal axis gives the true labels and the vertical axis the predicted labels. The numbers in Figure 4 are the percentages of predicted labels; on the diagonal, the closer a number is to 100, the more accurate the predictions.
Figure 4a shows the confusion matrix of the test set in Dataset 1, and Figure 4b the confusion matrix for Dataset 2. The classification of SB, SM, and SUB4C is sufficiently accurate; misclassification mainly occurs among A4C, A5C, A2C, and A3C. In particular, A2C and A3C are easily confused: in Figure 4b, about 3.9% of A2C images are misclassified as A3C, and 2.67% of A3C images as A2C.
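Percentage entries like those in Figure 4 can be obtained by normalizing each true-label class by its sample count, so that the diagonal directly reads as per-class sensitivity. The following generic sketch (our own illustration, not the authors' code; here rows denote true labels) makes the convention explicit:

```python
import numpy as np

def percent_confusion(y_true, y_pred, n_classes):
    """Confusion matrix with rows = true labels, entries in percent.

    Each row is normalized by the number of samples of that true class,
    so the diagonal gives per-class sensitivity (recall) in percent.
    """
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    return 100.0 * cm / np.maximum(row_sums, 1)  # guard empty classes
```

For example, if 3 of 4 true class-0 samples are predicted correctly, the first row reads 75.0 and 25.0.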
As shown in Supplementary Figure S2, after deep learning, SB, SM, and SUB4C are completely distinguishable; only a few samples of A4C, A5C, A2C, and A3C are mixed together, consistent with the confusion matrices. Supplementary Figure S3 and Table S3 show that our method can find the important heart tissues (obscured areas) in the images.
4. Discussion
TTE is one of the most important cardiac examinations because it is non-invasive, cost-effective, and convenient. The accuracy and reproducibility of TTE rely on the accurate recognition of cardiac views. However, this recognition depends on the echocardiographer's experience, and artificial intelligence is expected to provide a good solution.
The datasets came from nearly 700 patients at two hospitals. Four echocardiographers with excellent TTE skills recorded all videos of the cardiac views. To ensure the independence of the subsequent study, two other echocardiographers reviewed all the images and excluded unqualified ones.
The main challenges in the recognition of cardiac views are low-quality images and shape changes during the cardiac cycle. Inception V3 is one of the most commonly used networks for image recognition, but its overall accuracy here is only 88.78%. Because Inception V3 outputs 2048 channels, explicitly modeling the interdependencies between channels can be expected to improve performance across multiple datasets and tasks [27]. After recalibration of the channel-wise feature responses is introduced by the SE, recognition becomes more effective through channel attention. The STM is also useful because it models the geometric deformation of cardiac views by an affine transform, which effectively reduces the impact of the cardiac cycle on recognition; the accuracy increases to 96.5%. To our knowledge, this result is better than previous results [23].
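The channel recalibration performed by the SE can be sketched in a few lines of numpy (a toy illustration under our own assumptions, not the network's actual implementation; `w1` and `w2` stand in for the two learned fully connected layers of the excitation step):

```python
import numpy as np

def squeeze_excite(features, w1, w2):
    """Channel recalibration in the style of squeeze-and-excitation.

    features: (C, H, W) feature maps.
    w1: (C//r, C) and w2: (C, C//r) illustrative placeholder weights
    for the bottleneck FC layers (r is the reduction ratio).
    """
    squeezed = features.mean(axis=(1, 2))         # squeeze: global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)       # excitation: FC + ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # FC + sigmoid -> per-channel gate in (0, 1)
    return features * scale[:, None, None]        # recalibrate channel-wise responses
```

Each channel is multiplied by a learned gate between 0 and 1, so informative channels can be emphasized and uninformative ones suppressed without changing the spatial layout of the features.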
Unlike conventional deep learning, structural signals are introduced through the similarity between samples to learn the relationships among them. Ideally, graph regularization can reduce the amount of labeled data required and the generalization error. The first step of graph regularization is to build a graph. In general, the similarity between two images is not easy to evaluate through pixel-level comparisons; however, cardiac images in the same cardiac cycle are similar and appear periodically, so the mutual information between two images can be used as the edge weight. Introducing graph regularization into the STM-Inception V3-SE network further improves the accuracy by about 2%.
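A standard histogram-based estimate of the mutual information between two images, of the kind that could serve as such an edge weight, can be sketched as follows (an illustrative sketch, not the authors' exact implementation; the bin count is an assumed hyperparameter):

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Mutual information between two grayscale images, estimated from
    their joint intensity histogram; usable as a graph edge weight."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                     # joint probability
    px = pxy.sum(axis=1)                          # marginal of image a
    py = pxy.sum(axis=0)                          # marginal of image b
    nz = pxy > 0                                  # avoid log(0)
    outer = (px[:, None] * py[None, :])[nz]
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / outer)))
```

An image compared with itself attains the maximal value (its own entropy under this binning), while unrelated images score near zero, which matches the intuition that frames from the same phase of the cardiac cycle should be strongly connected in the graph.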
Nine common cardiac views are studied for automatic recognition. The overall accuracies of the four networks are tested, confirming that the presented method achieves the best accuracy of 99.10%. The sensitivity, specificity, accuracy, and AUC are also calculated for each of the nine categories. The recognition of PSLA, SB, SM, SUB4C, and SUPAO shows good performance, with a sensitivity of 100% and an AUC of more than 99%; A4C, A5C, A2C, and A3C are slightly misclassified among themselves, but the mean AUC is higher than 98%.
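The per-class sensitivity and specificity reported above follow from a raw-count confusion matrix in the usual one-vs-rest way; the following is a generic sketch of that computation, not the authors' evaluation code:

```python
import numpy as np

def per_class_metrics(cm):
    """Sensitivity and specificity per class from a raw-count confusion
    matrix (rows = true labels, columns = predicted labels)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                   # correctly predicted per class
    fn = cm.sum(axis=1) - tp           # missed samples of the class
    fp = cm.sum(axis=0) - tp           # samples wrongly assigned to it
    tn = cm.sum() - tp - fn - fp       # everything else
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity
```

Treating each of the nine views as the positive class in turn gives the nine sensitivity/specificity pairs; AUC additionally requires the per-class prediction scores rather than hard labels.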
The confusion matrix analysis further confirms these results. In particular, A2C and A3C are not easily separated, which indicates the next direction for improvement, especially for A2C. Moreover, the overall accuracy on the independent test set is 97.73%, suggesting that the proposed method can generalize to new datasets.
Table 7 compares our method with other recent methods in terms of test set size, accuracy, and AUC. Zhang et al. [19] designed a fully automated method, but their overall view classification accuracy is only 84%. Madani's method achieves 91.7% accuracy on 15 kinds of still images, including cardiac views and Doppler images, which is not satisfactory for clinical application. Kusunose et al. report a better method, with an overall accuracy of 98.1% on an independent test set, but with only five cardiac views; moreover, that accuracy is the average of 10 selected images per video classification. In contrast, our method achieves an accuracy of 97.73% on nine kinds of cardiac views using still images.
It is worth noting that the proposed method is trained on datasets of standard views. In clinical practice, the classification of standard views can serve as an assistive tool: if the view obtained by the operator cannot be recognized as one of the standard views, the view should be non-standard, which will help less skilled operators find more accurate views. Because the recognition or evaluation of non-standard views is also valuable, a large number of non-standard views will be included in model training in future studies.
The main contribution is an effective method for the recognition of standard cardiac views. To the best of our knowledge, the obtained results are the most accurate to date. Because our dataset is not large, we believe the accuracy can be further improved with more training data. Moreover, to confirm the feasibility of deep learning in echocardiography, more data from other hospitals, including non-standard views, should be used for testing.