1.1. Background
Heart sound classification, which distinguishes the heart sounds of patients with cardiovascular diseases (CVDs) from healthy ones and identifies the specific disease type, is one of the most effective non-invasive methods for the early detection and diagnosis of CVDs.
CVDs are a group of diseases, including heart valve disease, heart failure, and hypertension, that contribute significantly to premature deaths worldwide. According to the World Health Organization, approximately 17.9 million people died from CVDs in 2019, representing 32% of all global deaths [1]. In China, an estimated 5.09 million people died from CVDs in 2019, with an age-standardized mortality rate of 2.76‰ [2]. Owing to the accelerating pace of society and changes in lifestyle, the morbidity of CVDs continues to rise. The causes of CVDs are complex and variable, onset can be rapid, and the mortality rate is extremely high. Therefore, developing heart sound auscultation methods that rapidly and accurately diagnose CVDs has become increasingly important.
Heart sounds originate from the vibrations produced when blood strikes the heart valves and vascular walls during systole and diastole, and they reflect the hemodynamic conditions of the heart. Heart sound is therefore a significant metric for diagnosing CVDs. A cycle of the heart sound signal can be divided into four components [3], which are shown in Table 1. The fundamental heart sounds are crucial for diagnosing CVDs. Extra low-frequency murmurs that accompany the fundamental heart sounds are generally a sign of heart valve disease and require further diagnosis.
At present, traditional heart sound diagnosis relies on the experience of clinical experts, and previous studies have shown that its precision is only about 81% to 83% [4]. Developing computer-aided heart sound diagnosis methods is therefore of great significance for clinical practice.
Deep learning (DL) is a natural tool for this problem. As a major branch of machine learning, DL tackles hard modeling problems by composing simple representations into progressively more abstract ones. DL methods, especially convolutional neural networks (CNNs), are now widely used for image recognition and audio processing, and within heart sound diagnosis they have attracted increasing attention. Most prior works adapt image processing models to time series classification, either with minor modifications or by constructing new architectures, aiming to enhance accuracy and robustness while reducing parameter counts and computational cost.
1.2. Related Works
Most recent studies on heart sound classification have effectively enhanced model performance. This section reviews work related to feature fusion, attention mechanisms, and transfer learning.
In recent years, multi-scale feature fusion in CNN models has been widely researched and applied in image processing; for example, the Feature Pyramid Network (FPN) [5] proposed by Lin et al. and the skip-connection model [6] proposed by Shrivastava et al. have been widely used. These studies show that fusing feature maps of different resolutions from different layers into one layer integrates deep semantic features with shallow detail features, significantly improving accuracy and generalization ability.
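As a toy sketch of this idea (not the exact operation used in any of the cited models), an FPN-style top-down step upsamples a deep, low-resolution feature map and merges it element-wise with a shallower, higher-resolution one:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_top_down(shallow, deep):
    """FPN-style fusion: the upsampled deep map (coarse semantic features)
    is added element-wise to the shallow map (fine spatial detail)."""
    return shallow + upsample2x(deep)

# A 1-channel example: an 8x8 shallow map fused with a 4x4 deep map.
shallow = np.ones((1, 8, 8))
deep = np.full((1, 4, 4), 0.5)
fused = fuse_top_down(shallow, deep)
print(fused.shape)  # (1, 8, 8)
```

Real models typically apply a 1×1 convolution before the addition to match channel counts; the sketch omits that step for brevity.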
Tschannen et al. [7], early adopters of feature fusion, presented a novel classification method in 2016 that fused CNN-extracted features with state statistics and Power Spectral Density (PSD) features. Li et al. [8] fused time-domain features extracted by a Gated Recurrent Unit (GRU) with frequency-domain features extracted by a group convolutional layer, expanding the feature space.
U-Net, which is built on skip connections, has been applied to heart sound detection [9,10]. Taresh et al., the first to apply the feature fusion mechanism to heart sound auscultation, proposed a U-Net-based model that decodes deep-layer features by upsampling and fuses them with shallow-layer features for heart sound denoising. Xu et al. [11] constructed HMSNet, the first model to introduce multi-scale feature fusion into heart sound classification; its hierarchical architecture combines the advantages of different levels and improves algorithm performance. Local tests showed that HMSNet performed approximately 1% better than a traditional ResNet. PCTMF-Net [12], proposed by Wang et al., uses a two-way parallel module that fuses feature maps processed by a CNN and a Transformer, learning information from different convolutional layers more effectively and raising heart sound classification accuracy to 93%.
The channel attention mechanism (CAM), which mimics the way human vision concentrates on salient cues, has been successfully applied in computer vision. By focusing on the most informative channels of a feature map, CAM enables models to capture the key features of signals more precisely. Tian et al. [13] first introduced CAM to heart sound classification, proposing a model that adds an Efficient Channel Attention (ECA) block. A growing body of research on CAM in heart sound detection [14,15,16,17] has shown that it can significantly improve model accuracy and robustness.
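To illustrate how a channel attention block re-weights features, the following is a simplified NumPy sketch of an ECA-style block; the 1D convolution kernel stands in for learned weights and is purely illustrative:

```python
import numpy as np

def eca_block(x, kernel):
    """Simplified ECA-style channel attention sketch.
    x: (C, H, W) feature map; kernel: illustrative 1D conv weights
    of odd length (a stand-in for trained parameters)."""
    c = x.shape[0]
    k = len(kernel)
    y = x.mean(axis=(1, 2))                        # global average pooling -> (C,)
    yp = np.pad(y, k // 2)                         # zero-pad for a 'same' 1D conv
    conv = np.array([np.dot(kernel, yp[i:i + k]) for i in range(c)])
    gate = 1.0 / (1.0 + np.exp(-conv))             # sigmoid -> per-channel weights
    return x * gate[:, None, None]                 # re-weight each channel
```

The 1D convolution across pooled channel descriptors is what makes ECA "efficient": it captures local cross-channel interaction without the fully connected layers of earlier channel attention designs.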
Currently, most studies focus on binary classification owing to the lack of heart sound datasets labeled with different types of CVDs. Yaseen et al. [18] created a five-class dataset with a high signal-to-noise ratio (SNR), which is widely used for research on heart sound multi-classification. Nevertheless, classification algorithms are still limited by dataset problems, chiefly sample imbalance and insufficient dataset size for deep learning training. Because deep learning models contain a large number of parameters, a lack of sufficient training data degrades generalization ability and, in turn, final training performance. To address these dataset problems, non-traditional machine learning approaches, including transfer learning, unsupervised learning, and semi-supervised learning, have been introduced. Most studies on transfer learning focus on heterogeneous transfer learning, which relies on large-scale pretraining datasets and pretrained weights to overcome the limitations of heart sound datasets. For example, both Koike et al. [19] and Malty et al. [20] used AudioSet, a large-scale audio dataset, as the source domain for heart sound transfer learning, transferring weights between time series models. Mukherjee et al. [21] proposed a transfer learning method that uses the ImageNet dataset as the source domain to diagnose heart valve diseases from heart sound spectral features.
At present, most heart sound classification algorithms employ end-to-end feature extraction from raw heart sound signals, so their performance depends heavily on model structure and parameters. The robustness of deep learning models to noise and imbalanced datasets still has room for improvement; stronger generalization is essential for feasibility in clinical practice.
Furthermore, the lack of larger datasets, especially datasets with more diverse disease types, has hindered the validation of previous algorithms on a broader range of cardiovascular diseases and populations. Non-traditional machine learning methods can partially address these issues. However, research on heart sound classification based on non-traditional machine learning methods is still relatively scarce, particularly in the area of homogeneous transfer learning methods.
1.3. Proposed Works
In this paper, we combine CAM with multi-scale feature fusion in the proposed CAFusionNet. We fetch feature maps from three pooling layers and apply Gated Channel Transformation (GCT) attention blocks to weight each one. All GCT-weighted feature maps are fused in the deepest layer, combining pathological features across different resolutions to improve classification performance. We use gradient-based heat maps to visualize and compare feature maps at different depths, demonstrating the importance of feature fusion. To overcome the scarcity of datasets labeled with different types of CVDs, we propose a homogeneous transfer learning method that carries over weights trained on a heart sound binary classification task. Local tests on public heartbeat sound datasets and on datasets collected by our group show that CAFusionNet achieves an accuracy of 0.9323 in binary classification, outperforming existing models and a traditional residual network. The transfer learning method reaches an accuracy of 0.9665, surpassing the traditional deep learning approach.
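For intuition, a minimal NumPy sketch of a GCT block following the published GCT formulation is given below; the per-channel parameter shapes are illustrative assumptions, not our exact implementation:

```python
import numpy as np

def gct(x, alpha, beta, gamma, eps=1e-5):
    """Gated Channel Transformation, NumPy sketch.
    x: (C, H, W) feature map; alpha, beta, gamma: per-channel
    learnable vectors of shape (C,)."""
    # Global embedding: scaled L2 norm of each channel.
    s = alpha * np.sqrt((x ** 2).sum(axis=(1, 2)) + eps)
    # Channel normalization, so channels compete for attention.
    s_hat = s * np.sqrt(len(s)) / np.sqrt((s ** 2).sum() + eps)
    # Gating around 1: the tanh lets GCT both excite and suppress channels.
    gate = 1.0 + np.tanh(gamma * s_hat + beta)
    return x * gate[:, None, None]
```

With gamma and beta initialized to zero, the gate is exactly 1 and the block is an identity mapping, which is why GCT can be inserted into a pretrained backbone without disturbing it at the start of training.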
The proposed model enhances the clinical applicability of intelligent heart sound diagnosis systems: improved performance metrics and visualized heat maps make such diagnosis both more feasible and more interpretable. Transfer learning is particularly valuable for the intelligent diagnosis of specific CVDs in clinical practice.
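The weight-carrying idea behind homogeneous transfer learning can be sketched as follows; the parameter dictionaries and the `classifier` prefix are illustrative assumptions, not our model's actual layer names:

```python
def transfer_weights(source, target, skip_prefixes=("classifier",)):
    """Homogeneous-transfer sketch: copy every parameter tensor from the
    binary-task model into the multi-class model, except task-specific
    head layers, which keep their fresh initialization."""
    out = dict(target)
    for name, w in source.items():
        if name in out and not name.startswith(skip_prefixes):
            out[name] = w
    return out
```

Because source and target share the same input modality (heart sounds), the copied backbone weights already encode useful acoustic features, and only the new head must be learned from the smaller multi-class dataset.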
The remainder of this paper is organized as follows: Section 2 presents the proposed methods, including the CNN-based CAFusionNet and the transfer learning method. Section 3 describes the experimental process, methods, and results. Section 4 discusses our present research, and Section 5 concludes the paper and outlines future work.