1. Introduction
The development of cardiac diagnostic imaging tools has increased reliance on these methods for the diagnosis and treatment of cardiovascular diseases [1,2]. Consequently, the use of traditional diagnostic techniques, including auscultation and electrocardiography, is declining. However, the number of hospitals with diagnostic imaging facilities is limited, which restricts patient access [3]. Enhancing the diagnostic accuracy of traditional testing methods, which are accessible even in small-scale clinics, can promote early diagnosis and treatment. This may not only alleviate the burden on patients but also reduce economic strain [4].
Recent research has increasingly highlighted the efficacy of artificial intelligence (AI) in diagnostic processes that use simple testing methods [5,6,7]. Nevertheless, these studies typically focus on a single modality. This contrasts with real-world clinical scenarios, where physicians integrate data from a variety of tests for patient evaluation and diagnosis. Furthermore, recent work suggests that stacking, which combines multiple machine learning models, enhances the performance of predictive models [8].
Based on this, we hypothesized that combining various simple diagnostic approaches might significantly improve the accuracy of AI-driven diagnostics.
These simple diagnostic sensors should be easily adoptable in daily clinical practice. Therefore, 12-lead electrocardiography (ECG) and auscultation were used as sensors. In this study, we aimed to create an AI model using data from 12-lead ECGs and auscultation to detect severe valvular disease (severe aortic valve stenosis (sAS) and severe mitral valve regurgitation (sMR)) and left ventricular dysfunction (LVD) with high accuracy. Furthermore, we compared the contribution of both techniques in the diagnosis of each of the conditions.
2. Materials and Methods
2.1. Research Participants and Data Collection
In this study, we extracted data from a previously reported database [9]. The inclusion criteria were (1) age ≥ 18 years, (2) availability of a complete set of echocardiography tests, (3) availability of printed 12-lead ECG recordings on the same day as echocardiography, and (4) willingness to provide written informed consent. The exclusion criteria were (1) a history of heart valve surgery or transcatheter aortic valve implantation and (2) use of a cardiac implantable electronic device, except an implantable loop recorder. The diagnosis was based on existing guidelines for echocardiography [10,11]. None of the patients were enrolled twice in the present study.
The auscultators were blinded to the patients’ clinical information at the time of recording. A phonocardiogram (PCG) was recorded using an Eko Duo system (Eko Devices Inc., Oakland, CA, USA) on the day of echocardiography (±1 d). The patients’ heart sounds were recorded for 15 s at 4000 Hz (.wav format) at three auscultation locations: the second intercostal space along the right sternal border (L1: 2RSB), Erb’s area (third intercostal space along the left sternal border, L2: ERB), and the apex (fifth intercostal space in the midclavicular line, L3: APX).
Furthermore, a 12-lead ECG was conducted on the same day. The ECG was printed and scanned in PNG format (12 lead, limb lead, and precordial leads; resolution, 3187 × 1840 pixels). Clinical data, including medical and treatment history, were collected on the day of the PCG recording.
Overall, 1052 individuals who underwent ECG were enrolled in this study (Table 1 and Table 2). Moreover, to enroll an equal number of participants with target cardiovascular pathologies, participants with sAS, sMR, and severe left ventricular dysfunction (sLVD) were preferentially assigned as previously documented [9].
All data were collected after written informed consent was obtained from the patients, and the participants’ data were pseudonymized at the data center. Furthermore, the data were anonymized during the analyses in this project.
2.2. Dataset Preparation
A modified stratified 4-fold cross-validation method was adopted to train the model. We organized the established dataset according to the severity of AS, MR, and LVD to balance the number of AS, MR, and LVD cases in the training, development, and test datasets and to build four independent training/development data groups without overlap. The test cases were common to all four groups.
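The grouping described above can be illustrated with a minimal sketch. It assumes one plausible reading of the text: a stratified held-out test set shared by all groups, and four disjoint development folds, each paired with the remaining cases as training data. The function name `split_with_common_test`, the `test_fraction` value, and the toy labels are hypothetical and are not taken from the study; the actual modification to the stratified procedure is not specified here.

```python
import random
from collections import defaultdict


def split_with_common_test(case_labels, n_folds=4, test_fraction=0.2, seed=0):
    """Build n_folds train/dev groups that all share one test set.

    case_labels maps a case ID to a stratum label (e.g., disease severity).
    test_fraction is an assumed value, not one reported in the study.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for case_id, label in sorted(case_labels.items()):
        by_stratum[label].append(case_id)

    test = []
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_stratum):
        ids = by_stratum[label]
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        # Deal the remaining cases of this stratum round-robin into the folds.
        for i, case_id in enumerate(ids[n_test:]):
            folds[i % n_folds].append(case_id)

    groups = []
    for k in range(n_folds):
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        groups.append({"train": train, "dev": folds[k], "test": test})
    return groups


# Toy example: 100 cases with made-up strata.
cases = {i: ("sAS" if i % 5 == 0 else "sMR" if i % 5 == 1 else "none")
         for i in range(100)}
groups = split_with_common_test(cases)
```

Each stratum is split independently, so the proportion of each severity class stays roughly constant across the training, development, and test sets.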
Details of the training/development dataset (cross-validation set) are presented in Table 1, and details of the test dataset are presented in Table 2.
2.3. Data Preprocessing
2.3.1. PCG
In this study, the heart sound data were not standardized or normalized, as these procedures did not affect the efficiency of model training during the preliminary experiments. For preprocessing, 128-dimensional log-Mel spectrograms were extracted from the sound (.wav format) data using the following steps (Figure 1):
Oversampling: Each record was divided into multiple 4 s segments using predefined time steps that depended on the severity of each target disease (Table 3).
Fourier transform: The short-time Fourier transform was computed using a Hann window with a window length of 512 samples and a hop size of 64 samples.
Mel-scaled filter bank: We applied a Mel-scaled filter bank (128 filters) followed by the logarithm.
We used SpecAugment (frequency mask and time mask) and mix-up for data augmentation only during training. The data shape after preprocessing was 243 × 128, which is the input shape of the convolutional neural network (CNN).
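The preprocessing steps above can be sketched with NumPy alone. This is an illustrative reconstruction, not the authors' code: the 1 s segment step, the HTK-style Mel formula, the symmetric Hann window from `np.hanning`, the power spectrum, and the small constant added before the logarithm are all assumptions. Framing without padding is also an assumption, chosen because it yields 1 + (16000 − 512) / 64 = 243 frames for a 4 s segment at 4000 Hz, which is consistent with the reported 243 × 128 input shape.

```python
import numpy as np


def segment_starts(n_samples, seg_len, step):
    """Start indices of fixed-length segments taken every `step` samples."""
    return list(range(0, n_samples - seg_len + 1, step))


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the Mel scale from 0 Hz to sr/2."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(signal, sr=4000, n_fft=512, hop=64, n_mels=128, eps=1e-10):
    """Unpadded STFT (Hann window) -> power -> Mel filter bank -> log."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filter_bank(sr, n_fft, n_mels).T
    return np.log(mel + eps)


# Synthetic 15 s recording at 4000 Hz (a 100 Hz tone stands in for a heart sound).
sr = 4000
record = np.sin(2.0 * np.pi * 100.0 * np.arange(15 * sr) / sr)
starts = segment_starts(len(record), seg_len=4 * sr, step=1 * sr)  # assumed 1 s step
segment = record[starts[0]:starts[0] + 4 * sr]
spec = log_mel_spectrogram(segment)
```

A shorter step produces more overlapping segments per record, which is how the number of training samples could be varied by disease severity.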
2.3.2. ECG
The ECG scans were cropped to extract the relevant regions and thereby improve the accuracy and efficiency of disease detection. Three cropping techniques were used to focus on specific areas of interest.
A square region (size, 1840 × 1840 pixels) was extracted from the original ECG image (3187 × 1840 pixels). This region was selected to extract the complete waveform information across the 12 leads simultaneously. This approach provides information on the overall cardiac activity and detects anomalies that may appear across different leads.
A square region of size 960 × 960 pixels was extracted from either the upper or lower half of the original ECG image to analyze specific regions. The upper half represents the limb-lead region, while the lower half represents the precordial-lead region.
After extracting the relevant regions using the cropping process, the images were resized for standardization and ease of further processing. The 12-lead images were resized to 512 × 512 pixels, while the limb- and precordial-lead images were resized to 256 × 256 pixels. This resizing ensures a balance between preserving important details and reducing computational complexity.
The horizontal shifting process involves repeatedly cropping and resizing the original ECG image as described above while shifting the cropping window horizontally by a predetermined offset during each iteration. The number of images extracted from an original image depended on the severity of the target disease (Table 3).
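The horizontal shifting described above amounts to generating a series of crop boxes. The sketch below computes such boxes for the reported image and crop sizes; the function name `horizontal_crop_boxes`, the 100-pixel offset, the count of five crops, and the vertical position used for the precordial-lead region are hypothetical placeholders rather than values from the study. The boxes use the (left, upper, right, lower) convention, so they could be passed to an image library such as Pillow (`Image.crop` followed by `Image.resize`).

```python
def horizontal_crop_boxes(image_width, crop_size, top, offset, n_crops):
    """Return up to n_crops square boxes, each shifted right by `offset` pixels.

    Boxes are (left, upper, right, lower). Shifts that would run past the
    right edge of the image are dropped.
    """
    boxes = []
    for i in range(n_crops):
        left = i * offset
        if left + crop_size > image_width:
            break
        boxes.append((left, top, left + crop_size, top + crop_size))
    return boxes


WIDTH, HEIGHT = 3187, 1840  # scan resolution reported in the text

# 12-lead pattern: full-height 1840 x 1840 squares, later resized to 512 x 512.
boxes_12lead = horizontal_crop_boxes(WIDTH, 1840, top=0, offset=100, n_crops=5)

# Limb-lead pattern: 960 x 960 squares anchored at the top of the scan.
boxes_limb = horizontal_crop_boxes(WIDTH, 960, top=0, offset=100, n_crops=5)

# Precordial-lead pattern: 960 x 960 squares anchored at the bottom (assumed).
boxes_precordial = horizontal_crop_boxes(WIDTH, 960, top=HEIGHT - 960,
                                         offset=100, n_crops=5)
```

Increasing `n_crops` for the severe class would yield more images per scan for that class, in the spirit of the severity-dependent counts in Table 3.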
2.4. Training and Development of Models including Stacking
2.4.1. Architecture of Model
We used a 10-layer CNN based on our previous publications (Figure 2) [9,12]. The output of our models indicated whether the input heart sound or ECG signal indicated severe disease. The models were separately trained for each disease (AS, MR, and LVD). Python 3.7.6 and PyTorch 1.4.0 were used for this project. The equipment and the training conditions are provided in Supplementary Table S1.
Our model had 10 convolutional layers, followed by a ReLU activation function combined with batch normalization in the first and last convolutional blocks, based on our previous work. The 4th, 6th, 9th, and 10th convolutional blocks had max-pooling layers (MaxPool). Following the 10th convolutional block, a global pooling layer was applied, and the final layer was activated using a softmax function.
We trained the CNN models using the 4 s heart sound data from all three collection locations and using the preprocessed ECG image data from all three patterns (12 lead, limb lead, and precordial lead), and then separately for each heart sound location and each ECG pattern. Six models were trained per training/development data group; thus, twenty-four models were trained in four groups.
2.4.2. Model Stacking
We used stacking CNNs for both the ECG and heart sound modalities.
Figure 3 presents the sequential flow from input data to stacking. Model stacking was performed using two types of models: Random Forest (RF) and XGBoost (XGB).
The stacking patterns for the models were defined based on the four patterns presented in Table 4. Each stacking pattern was constructed separately for each training/development group. The goal was to leverage the complementary information extracted from the ECG and auscultation. During stacking, anomaly probabilities from individual CNNs were used as intermediate inputs and served as features to train the stacking models. By combining the predictions from the stacked models, we aimed to improve the disease detection capabilities.
The stacking model was trained using the development dataset. Hyperparameter tuning was performed using a grid search to maximize the F1 score. The stacking process using Random Forest and XGBoost was performed using scikit-learn 0.23.2 and xgboost 1.6.2.
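As an illustration of this stacking step, the sketch below trains a Random Forest meta-model on per-case probabilities from several base models and tunes it by grid search on the F1 score with scikit-learn. The input probabilities are synthetic, and the number of base models, the parameter grid, and the 4-fold inner split are assumptions made for the example; the study's actual grid and features are not reproduced here. An XGBoost classifier could be substituted for the Random Forest in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for a development set: 200 cases, 4 base CNN outputs.
n_cases, n_models = 200, 4
y_dev = rng.integers(0, 2, size=n_cases)

# Each column mimics one CNN's anomaly probability (noisy, label-correlated).
noise = rng.normal(0.0, 0.25, size=(n_cases, n_models))
X_dev = np.clip(0.3 + 0.4 * y_dev[:, None] + noise, 0.0, 1.0)

# Hypothetical hyperparameter grid; the grid used in the study is not given.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # select hyperparameters by F1 score
    cv=4,          # assumed inner cross-validation split
)
search.fit(X_dev, y_dev)

stacker = search.best_estimator_
stacked_prob = stacker.predict_proba(X_dev)[:, 1]  # probability of the positive class
```

Because the meta-model sees only one probability per base model, it stays small, and the relative weight it gives each input can later be inspected.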
2.4.3. Model Selection
For model selection, we evaluated all available models, including individual CNNs for heart sound and ECG data, as well as the stacking models. The evaluation was performed using test data. The area under the curve (AUC) of the receiver operating characteristic (ROC) curve was used to assess the performance of the models, considering their ability to accurately detect positive cases.
As detailed above, we trained 14 distinct models, comprising both single and stacking models, across each cross-validation fold. The model with the highest AUC for the test data was subsequently selected from the models of each method within the four cross-validation folds. To statistically compare the performances of these models, we used the bootstrapping method, sampling with replacement 2000 times. We then used analysis of variance (ANOVA) for these bootstrapped metrics to discern differences among the models. Subsequently, we performed post hoc comparisons using the Tukey–Kramer honestly significant difference (HSD) test to identify specific group discrepancies. Based on this analysis, the top three models that emerged as the most promising candidates for disease detection were identified.
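The bootstrapping step can be sketched in plain Python as follows. The example resamples cases with replacement 2000 times and recomputes the AUC on each resample; the labels and scores are toy values, and skipping resamples that contain only one class is an assumption of this sketch, not a detail taken from the study. The subsequent ANOVA and Tukey–Kramer comparisons are not shown.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative case.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_boot=2000, seed=0):
    """AUC values from n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip single-class resamples
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    return aucs


# Toy test set: 10 negatives followed by 10 positives with made-up scores.
labels = [0] * 10 + [1] * 10
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.70,
          0.30, 0.50, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]

aucs = bootstrap_auc(labels, scores)
ordered = sorted(aucs)
ci_low = ordered[int(0.025 * len(ordered))]
ci_high = ordered[int(0.975 * len(ordered)) - 1]
```

The resulting distributions of bootstrapped AUCs, one per model, are what a test such as ANOVA would then compare.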
2.4.4. Clinical Validation
We prospectively enrolled 103 patients as the clinical validation cohort to assess the performance of the models and the diagnostic contribution of each dataset (Table 2). These cases did not overlap with the cases used to establish the models.
2.4.5. Modal Contribution Analysis
In this study, we aimed to estimate the significance of each modality incorporated into the stacking process via modal contribution analysis. We quantified these contributions using both the scikit-learn and XGBoost libraries. To bolster the reliability of our findings, we trained the same model 50 times with different random seeds for each training/development set (i.e., within the same cross-validation cohort). Subsequently, we computed statistical metrics, including the mean and standard deviation, for the modal contributions from all models within each training/development set. This analysis provided insights into the relative significance of each modality within the stacking model and their respective roles in disease detection.
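The aggregation step can be sketched as below. The sketch assumes that each of the 50 runs yields one importance value per modality, for example from the `feature_importances_` attribute of a fitted tree ensemble, which is one common choice and not necessarily the measure used in the study. The values here are random placeholders, and `summarize_contributions` is a hypothetical helper name.

```python
import random
from statistics import mean, stdev


def summarize_contributions(runs):
    """Mean and standard deviation of each modality's importance across runs.

    runs is a list of dicts, one per random seed, mapping a modality name
    to its importance in that run.
    """
    summary = {}
    for modality in sorted(runs[0]):
        values = [run[modality] for run in runs]
        summary[modality] = (mean(values), stdev(values))
    return summary


MODALITIES = ["2RSB", "ERB", "APX", "12-lead ECG"]

# Placeholder importances for 50 seeds, normalized to sum to 1 in each run.
runs = []
for seed in range(50):
    rng = random.Random(seed)
    raw = {m: rng.random() for m in MODALITIES}
    total = sum(raw.values())
    runs.append({m: v / total for m, v in raw.items()})

summary = summarize_contributions(runs)
for modality, (avg, sd) in summary.items():
    print(f"{modality}: mean={avg:.3f}, sd={sd:.3f}")
```

Reporting the standard deviation alongside the mean shows how much a modality's apparent contribution depends on the random seed alone.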
2.5. Statistical Analysis
Continuous data were presented as mean ± standard deviation for normally distributed data. Categorical data were presented as numbers and percentages. Non-normally distributed data were presented as median values (lower–upper quartiles). The chi-square test, Kruskal–Wallis test, Student’s t-test, or Fisher’s exact test was performed as appropriate. For the global test statistics, we used a significance level of 5%. Analyses were performed using the JMP software version 14 (SAS Institute, Cary, NC, USA) and custom Python scripts on macOS computers.
3. Results
3.1. Performance of Single-Modal AI
Figure 4A presents the results of each single-modal AI in the test dataset to detect sAS, sMR, and reduced ejection fraction (rEF, corresponding to LVD), trained using a 12-lead ECG (ALLECG), limb-lead ECG (LCECG), precordial-lead ECG (PCECG), and heart sounds from three locations.
In the AS model, the model trained with heart sounds from 2RSB performed the best (AUC = 0.97), followed by the models trained with heart sounds from Erb (AUC = 0.94) and apex (AUC = 0.91). The performance of the models trained with ECGs was significantly inferior to that of the heart sound models.
For the MR model, although the performance disparity between the models was smaller than that of the AS models, the models trained with ECGs, particularly from the precordial-lead ECG, tended to perform better than the other models, followed by the models trained with the Erb heart sounds.
In the rEF model, the performance of models trained with ECGs, particularly the precordial-lead ECG and 12-lead ECG, was superior to that of the models trained with heart sounds.
3.2. Stacking and Diagnostic Performance
We combined the models trained on single-modal data using the proposed stacking techniques and tested their performance in detecting the three cardiac pathologies. Based on the results from the single-modal AI, we used 12-lead ECG information in all cases. To evaluate the contribution of the limb and precordial leads, we tested the performance of the 12-lead ECG as a single diagram or as separately stacked limb and precordial leads.
Figure 4B presents the performance of the stacked models and a comparison of the models. These results suggested that the optimal sensor may differ based on the target pathology. However, the superior model could not be determined conclusively owing to the lack of statistical significance.
Therefore, bootstrapping was performed to statistically evaluate the performance of each model for the three pathologies, as presented in Table 5, Table 6, and Table 7.
In the AS model, the model with the highest AUC stacked all three heart sounds and the 12-lead ECG using XGB. The model trained only with heart sounds from 2RSB and the model stacking the three heart sounds with the limb-lead and precordial-lead ECGs separately using RF also performed well.
In the MR model, the model with the best performance was the standalone model with a 6-lead precordial ECG, followed by the model trained with all three heart sounds and electrocardiograms separately stacked by the limb and precordial leads using XGB, followed by the model using RF.
For the rEF model, the model stacking the limb- and precordial-lead ECGs separately using XGB performed the best, followed by the model trained with the three heart sounds and the limb- and precordial-lead ECGs stacked separately using XGB and the standalone model trained with the 12-lead ECGs as a single diagram.
3.3. Performance in Clinical Validation
We tested the selected top three models in the clinical validation cohort (Figure 5). The performance differences observed for detecting severe AS were small among all models. However, performance variations were observed in the clinical validation of the three chosen models for MR and rEF. Compared to the test data, the performance in the clinical validation cohort was generally inferior, and the decline in performance was particularly pronounced in the rEF detection models.
3.4. Contribution of Each Modal to Detection
Figure 6 illustrates the contribution of each modality to the selected models, calculated after 50 training cycles in each cross-validation cohort. For AS detection in model 1, the contribution of the 2RSB heart sound data was the highest, followed by the apex and Erb heart sounds. The contribution of the 12-lead ECG was significantly lower than that of the heart sounds. Model 2 was a standalone 2RSB heart sound model. In model 3, the contributions were in the order of 2RSB > Erb > apex heart sounds. Although limb- and precordial-lead ECGs were incorporated into model 3, their contributions were significantly lower than those of heart sounds for the detection of severe AS.
For MR detection, model 1 was selected as the standalone precordial-lead ECG model. In model 2, the contribution of the precordial-lead ECG was the highest, followed by that of the limb-lead ECG, which was significantly higher than that of the heart sound. In model 3, the contribution of the precordial-lead ECG was the highest, followed by the 2RSB heart sound and limb-lead ECG. Moreover, the contribution of the apex heart sounds was lower than that of the other information.
In the case of rEF in model 1, the precordial-lead ECG contributed significantly more than the limb-lead ECG. In model 2, the contribution of ECG in detection, particularly by the precordial lead, was higher than that of heart sounds. Model 3 incorporated the 12-lead ECG as a single diagram.
4. Discussion
This study revealed that efficient detection of AS, MR, and LVD is possible using the appropriate combination of multiple sensors, i.e., multiple points of auscultation and ECG. This multi-sensor strategy not only offers a novel method but also may enhance model performance in the medical field.
Moreover, this study demonstrated that the contribution of auscultation and ECG data to the detection of each disease varies. These results suggest that when considering detection models for these diseases, a “one-size-fits-all” approach is inadequate, and modeling should be tailored to the target disease. The integration of information from multiple sensors, including simple and traditional diagnostic tests using AI, can result in increased diagnostic capabilities. The tailored modeling approach incorporating simple clinical tests may enhance clinical decision-making processes.
Auscultation was significantly more important than ECG for detecting AS. This was corroborated by clinical validation, with no difference in accuracy between model 1, trained with ECG data, and model 2, trained using only auscultation data. In a previous study, we reported favorable screening capability of heart sound classification AI in identifying severe AS [1]. The achieved AUC was 0.93, which was comparable to the models in the present study. Furthermore, Cohen-Shelly et al. reported that ECG-AI can detect AS, which could allow asymptomatic severe AS to be screened efficiently to help avoid sudden cardiac death [13]. Their ECG model achieved an AUC of 0.85, which was comparable to our models using only ECG data. The stacked models in the present study combining ECG and PCG data demonstrated an AUC of 0.93, while the model combining ECG and age/sex by Cohen-Shelly et al. showed an AUC of 0.87. Future research is needed to determine whether adding a variety of data improves the models.
Data from precordial-lead ECG are crucial for MR diagnosis. The stacking model that combined the 12-lead ECG (integrating both limb leads and precordial leads) with auscultation data demonstrated marginally improved detection accuracy in the clinical validation group as compared to the model trained only with precordial leads. A previous study by Chorba et al. demonstrated that the deep learning model could detect severe MR efficiently (AUC = 0.86), though they built the models using the auscultation data from severe MR and no MR [14]. Elias et al. reported higher MR detection (AUC = 0.83) based solely on ECG [15]. Their use of approximately ten times more ECG data may account for the enhanced performance of their models.
For the detection of LVD, the contribution of auscultation data was much lower than that of ECG. The performance of model 2 suggests that the addition of the auscultation data to the ECG data did not improve disease detection in the clinical validation cohort. Attia et al. reported that even a single-lead ECG could detect a low ejection fraction (AUC = 0.88) [16]. The study by Bachtiger et al. also showed that ECG screening could detect reduced LVEF (AUC = 0.85) [17]. In their studies, a digital stethoscope recording a single-lead ECG was used, which may enable the use of a combination of auscultation and ECG data in clinical practice. In the present study, the clinical validation cohort had a higher percentage of severe LVD than the training/development/test cases and than the reports by Attia et al. and Bachtiger et al. [16,17]. This difference in patients’ characteristics might have affected the results.
In each case, two of the top three models were stacked models, integrating multiple modalities. This indicates that the integration of different modalities through stacking may improve diagnosis efficiency. Thus, the ability to automatically adjust the weight of each modality and use the appropriate modality depending on the disease may enable high-accuracy detection, even when the target disease changes.
Moreover, by using stacking models, the effect of each modality’s output on disease detection becomes evident, enabling assessment of the relative importance of different modalities. Thus, the interpretability of machine learning models can be improved, and decisions on modality selection and combination may become more evidence-based [18]. This approach is also effective for gaining insights into each modality and target disease and may contribute to the development of more effective disease diagnostic methods. In this study, for MR and LVD detection, the precordial leads played a greater role than the limb leads. To the best of our knowledge, this is the first study to validate this in a clinical setting. Further research is needed to identify the precordial-lead waveforms that have diagnostic value. Even in the modern age, with advancements in imaging diagnosis, screening with high accuracy using simple tools such as ECG and auscultation can alleviate the economic burden on the patients. This study suggests that evaluating the contribution of data to AI-based diagnosis can help reevaluate the value of traditional medical information.
Furthermore, one of the top three models in each case used only a single modality. Thus, high-accuracy detection may be attained even with a single modality, provided that the appropriate modality is selected. This was supported by the clinical validation for LVD, where the model using only a 12-lead ECG, combining the limb- and precordial leads in a single diagram, yielded the best accuracy.
We explored three common cardiac pathologies in the present study. Future research should focus on enhancing the precision of models that integrate simple clinical tests, such as blood pressure and oxygen saturation measurements. These tests could become a routine part of clinical practice, especially in non-specialized healthcare settings. We anticipate that validating AI models using these simple clinical tests will revolutionize clinical routines in the age of digital transformation.
This study had some limitations. This was a single-center study with no external validation. However, in the clinical validation group, the time of data collection was separated from that of the construction group with no duplication of patients; therefore, no data leakage was expected. From an informatics perspective, the performance of the original model is crucial, as the effectiveness of stacking relies on the underlying neural network’s ability to make accurate inferences. If the original neural network is unable to perform well, stacking may not yield significant improvements. Additionally, depending on the specific disease being targeted, it may be necessary to modify the model architecture, including the incorporation of non-DNN models. Furthermore, there is a possibility of improving accuracy by combining various model types in addition to multimodal stacking. However, caution is required when increasing the number of models or input modalities. There is a risk of overfitting, so careful consideration of appropriate model design (types and quantity of models) is necessary.