Detection at the segment level is an intermediate step for model evaluation; the ultimate goal of this study is to detect the murmur quality for each patient (the patient level). As shown in
Figure 1, the steps of murmur quality detection include data segmentation, log-Mel spectrogram feature extraction, and deep neural network feature extraction and detection. The analysis dataset used for the proposed method and the analysis dataset after segmentation are defined in
Section 3. The details are given below in this section.
4.1. Feature Extraction
A 2D log-Mel spectrogram representation [
37] was extracted for each PCG segment. In deep learning, this representation is widely used in the preprocessing of acoustic signals; it analyzes the sound spectrum based on the mechanism of human hearing. Using log-Mel spectrograms as inputs to a neural network characterizes the sound more successfully than raw waveforms. For the extraction of the log-Mel spectrograms, a frame length of 25 ms, a frame shift of 10 ms, and the “Povey” window type were chosen, and the heart sounds were analyzed in the frequency range of 0 to 2000 Hz using 128 Mel filters.
Figure 2 shows the waveforms and log-Mel spectrograms of two typical heart sound segments (two with systolic murmurs described as “Harsh” and “Blowing”). Since the average length of the segments is 4.1 s, the length of the log-Mel spectrogram was fixed at 400 frames, and log-Mel spectrograms longer (shorter) than 400 frames were cut (zero-padded). This preparation is necessary before input into the deep neural network. Finally, a
128 × 400 log-Mel spectrogram representation matrix was extracted for each segment. Before use, the log-Mel spectrograms were normalized as follows:

X_norm = (X − mean_value) / std_value

where X denotes each value of the log-Mel spectrogram, mean_value denotes the mean value of the log-Mel spectrogram, and std_value denotes the standard deviation of the log-Mel spectrogram.
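As a minimal sketch, the length adjustment and normalization described above might look as follows in NumPy (one assumption of this sketch: the normalization statistics are computed per spectrogram after the length fix; the text does not specify the order):

```python
import numpy as np

TARGET_FRAMES = 400  # fixed spectrogram length, as stated above

def prepare_segment(log_mel: np.ndarray) -> np.ndarray:
    """Cut or zero-pad a (n_mels, n_frames) log-Mel spectrogram to
    TARGET_FRAMES frames, then standardize to zero mean, unit variance."""
    n_mels, n_frames = log_mel.shape
    if n_frames >= TARGET_FRAMES:
        log_mel = log_mel[:, :TARGET_FRAMES]            # cut long segments
    else:
        pad = TARGET_FRAMES - n_frames
        log_mel = np.pad(log_mel, ((0, 0), (0, pad)))   # zero-pad short ones
    # per-spectrogram normalization: (X - mean_value) / std_value
    return (log_mel - log_mel.mean()) / log_mel.std()
```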
4.2. Neural Network Model for Segments’ Detection
The designed neural network model is based on two-dimensional convolutional neural networks (2D-CNNs), channel attention, and bidirectional gated recurrent units (Bi-GRU). The CNN-Block was used for initial image feature extraction from the log-Mel spectrograms. Subsequently, the feature maps were flattened into a long feature sequence and fed into the Bi-GRU for temporal feature extraction. In particular, a Squeeze-and-Excitation (SE)-Block was added between the CNN-Block and the Bi-GRU to assign different weights to the different channels of the feature map, since the information in different channels may be of different importance for the detection of the murmur quality (the results verified this).
Figure 3 shows the structure of the designed neural network model, which included the following:
CNN-Block: It contains three 2D convolution layers; each convolution layer is followed by an activation function (ReLU) and a batch-normalization (BN) layer. The first convolution layer has a kernel size of 5 × 5, a stride of 2, and a padding of 2; the second and third convolution layers both have a kernel size of 3 × 3, a stride of 1, and a padding of 1. Bias is added to all convolutions, and the padding is zero padding. Through the CNN-Block, the 1-channel log-Mel spectrogram becomes a 32-channel feature map.
SE-Block: This kind of block was proposed by Hu et al. [
38]. It models the interdependencies between channels using the information of the different channels of the feature map; that is, it generates a corresponding weight for each channel of the feature map so as to recalibrate the features by the importance of the different channels. The structure of the added SE-Block is as follows. First, global average pooling changes the feature map from a C × H × W matrix to a C × 1 × 1 matrix; this step is called squeeze. Then, two fully connected (FC) layers (with a ReLU activation function between them) with output sizes of C/2 and C are applied, followed by a Sigmoid activation function; this step is called excitation. Finally, a C × 1 × 1 matrix is obtained, which means that each channel receives a weight, and the different channels are then weighted by a channel-wise multiplication with the input feature map (Scale).
Bi-GRU: It is a bidirectional GRU module for extracting features from long sequences. The size of the hidden state in this module is 64, and bias is added to the module.
Linear Prediction Head: It contains an FC layer and a Sigmoid activation function that produces the final predicted labels for each segment.
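The SE-Block's squeeze-excitation-scale computation can be sketched in NumPy as follows (the weight matrices `w1` and `w2` are hypothetical stand-ins for the learned FC parameters):

```python
import numpy as np

def se_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Squeeze-and-Excitation on a (C, H, W) feature map.

    w1: (C, C//2) and w2: (C//2, C) are the weights of the two FC
    layers (random stand-ins here; learned in the actual model)."""
    s = x.mean(axis=(1, 2))                  # squeeze: C x H x W -> C
    e = np.maximum(s @ w1, 0.0)              # FC (C -> C/2) + ReLU
    w = 1.0 / (1.0 + np.exp(-(e @ w2)))      # FC (C/2 -> C) + Sigmoid
    return x * w[:, None, None]              # Scale: channel-wise reweighting
```

Because the Sigmoid keeps each channel weight in (0, 1), the scale step can only attenuate channels, never amplify them.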
When training the neural network, the Adam optimizer was used with β1 = 0.9 and β2 = 0.98, and the weight decay was 0.01. β1 refers to the exponential decay rate of the first moment estimate, and β2 refers to the exponential decay rate of the second moment estimate. The weight decay setting adds L2 regularization to the loss function, thus reducing overfitting to some extent. The loss function used was the cross-entropy loss. The batch size was set to 64.
4.3. Method of Feature Weighting for Patients’ Detection
The method of feature weighting is inspired by the SE-Block and based on the attention mechanism; it is called Feature Weighting. In this method, the linear prediction head of the neural network was eliminated, so that the neural network outputs the features F_i (i.e., the output of the Bi-GRU in the designed neural network model), where i = 1, 2, ..., N, for the N segments of each patient. Using the attention mechanism, different weights were assigned to the different F_i. Specifically, this was accomplished in the following way:
As shown in
Figure 4a, the F_i (where i = 1, 2, ..., N) are concatenated into a new feature F, which is then fed into the Feature Attention module (
Figure 4b). In the Feature Attention module, F undergoes both global maximum pooling (MaxPool) and global average pooling (AvgPool), and the two pooled results are concatenated. The concatenated features are passed through two fully connected layers with a ReLU activation function between them. Finally, the weights W are obtained from this process. The Feature Attention module can be described as

W = W2 σ(W1 (F_max + F_avg))

where σ denotes ReLU, + denotes concatenation, W1 and W2 denote the weights of the two fully connected layers, F_max denotes the feature after F performs global maximum pooling, and F_avg denotes the feature after F performs global average pooling.
After the above processing, the feature weighting is performed, and the maximum value of each matrix column is taken, which can be described as

F_new = max(W ⊙ F)

where ⊙ denotes the Hadamard product, i.e., feature weighting by element-wise multiplication, and max denotes taking the maximum value of each matrix column.
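A minimal sketch of these two steps follows. The exact tensor shapes are not specified in this excerpt, so the sketch assumes each segment feature F_i is a D-dimensional vector stacked into an N × D matrix F, that W is a length-D vector broadcast across segments, and that `w1` and `w2` are random stand-ins for the learned FC weights:

```python
import numpy as np

def feature_attention(f: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Feature Attention: pool F (shape (N, D)) over the segment axis,
    concatenate the pooled vectors, and pass them through two FC
    layers with ReLU in between to obtain the weights W."""
    f_max = f.max(axis=0)                                      # global max pooling -> (D,)
    f_avg = f.mean(axis=0)                                     # global average pooling -> (D,)
    h = np.maximum(np.concatenate([f_max, f_avg]) @ w1, 0.0)   # FC (2D -> hidden) + ReLU
    return h @ w2                                              # FC (hidden -> D) -> weights W

def weight_and_reduce(f: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Hadamard-weight F by W, then take the maximum of each column
    to obtain a single patient-level feature."""
    return (f * w).max(axis=0)
```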
Finally, the new feature was fed into the linear prediction head to obtain the prediction label of the patient.
To investigate whether the proposed method of feature weighting is effective, it was compared with three other methods. In these three methods, the linear prediction head of the neural network in
Figure 3 was not eliminated, so each segment had its own detection likelihood. These three methods are described in detail below:
Method of “Voting”: It is a traditional method. When a patient had only one segment, the predicted class for that segment was considered as the predicted class for that patient; when a patient had several segments, the predicted class for that patient was determined by the majority of the segments.
Method of “Max”: It takes the maximum value of the detection likelihoods as the final likelihoods. Specifically, for the output of each patient, i.e., (p_i^H, p_i^B) for i = 1, 2, ..., N, where N is the number of segments and p_i^H (p_i^B) is the likelihood of the specific segment being predicted as “Harsh” (“Blowing”), the patient’s detection result is

(max_i p_i^H, max_i p_i^B)
Method of “Average”: It takes the average value of the detection likelihoods as the final likelihoods. Specifically, for (p_i^H, p_i^B) as above, the patient’s detection result is

((1/N) Σ_i p_i^H, (1/N) Σ_i p_i^B)
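The three comparison rules can be sketched as follows (a minimal illustration; the class order (Harsh, Blowing) and tie-breaking toward “Harsh” are assumptions of this sketch, not stated in the text):

```python
import numpy as np

def aggregate(likelihoods: np.ndarray, method: str) -> str:
    """Patient-level detection from per-segment likelihoods.

    likelihoods: (N, 2) array of (harsh, blowing) likelihoods,
    one row per segment of the same patient."""
    if method == "voting":
        votes = likelihoods.argmax(axis=1)              # per-segment predicted class
        cls = np.bincount(votes, minlength=2).argmax()  # majority of the segments
    elif method == "max":
        cls = likelihoods.max(axis=0).argmax()          # max likelihood per class
    elif method == "average":
        cls = likelihoods.mean(axis=0).argmax()         # mean likelihood per class
    else:
        raise ValueError(method)
    return ["Harsh", "Blowing"][cls]
```

Note that the rules can disagree: a single very confident “Harsh” segment can dominate under “Max” even when most segments vote “Blowing”.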
For the patient-level detection, the optimizer, learning rate, β1, β2, weight decay, and loss function were the same as for the segment-level detection. The batch size was set to eight patients, since one patient may have multiple segments.
4.4. Method of Finding the Contribution of Other Murmur Characteristics
In this section, the method for finding the contribution of other murmur characteristics is given. At the segment level, the Grading, Pitch, Timing, and Shape labels described in
Section 2 were sequentially fed into the neural network model as a priori information with the same conditions as mentioned in
Section 4.2. Since these characteristics are not easily accessible, they can only be used to explore the connection between other characteristics and murmur quality and were not used for patient-level detection above. As shown in
Figure 5, the labels are first one-hot encoded; the Embedding layer then maps the discrete labels to continuous embedding vectors; finally, these vectors are concatenated with the features (i.e., F, the output of the Bi-GRU in the designed model) extracted by the neural network model and fed into the linear prediction head to produce the prediction labels.
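A minimal sketch of this encode-embed-concatenate step (the embedding matrix `emb` is a hypothetical stand-in; in the actual model the embedding is learned):

```python
import numpy as np

def embed_and_concat(label: int, n_classes: int,
                     emb: np.ndarray, features: np.ndarray) -> np.ndarray:
    """One-hot encode a murmur-characteristic label, map it to a
    continuous embedding vector, and concatenate it with the
    Bi-GRU features F before the linear prediction head.

    emb: (n_classes, E) embedding matrix."""
    one_hot = np.eye(n_classes)[label]       # one-hot encoding of the label
    vec = one_hot @ emb                      # embedding lookup (row of emb)
    return np.concatenate([features, vec])   # concatenate with features F
```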
4.5. Method of Data Augmentation
The limited data volume of the analysis dataset (it contained only 1266 segments with an average length of 4.1 s, whereas the entire dataset had 7338 segments with an average length of 4.2 s if the same segmentation was applied) greatly hampered the effectiveness of the neural networks. Thus, data augmentation was tried on the analysis dataset.
The data augmentation methods included speed increasing, speed decreasing, frequency shifting, time and frequency masking, and combined speed decreasing and increasing. Specifically, the method of speed increasing sped up the original heart sounds to 1.5× and 2.0× speed. The method of speed decreasing slowed down the original heart sounds to 0.8× and 0.6× speed. The method of frequency shifting shifted the values of the log-Mel spectrogram up by 50 Mel bins (from approximately 0–900 Hz to 490–2000 Hz) and by 25 Mel bins (from approximately 0–1380 Hz to 220–2000 Hz) in the frequency dimension. The method of time and frequency masking randomly masked 15% and 30% of the values in both the time and frequency dimensions [
39]. Each of these methods doubled the amount of original data (1266 segments). The method of combined speed decreasing and increasing contained both the speed decreasing and speed increasing described above, which quadrupled the amount of original data. After performing data augmentation on the training set of each fold, the data were fed into the neural network model for five-fold cross-validation, as in
Section 4.2 to evaluate the model performance.
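Two of these transforms, operating directly on the log-Mel spectrogram, can be sketched in NumPy. This is a simplified interpretation: the masking in [39] (SpecAugment-style) masks contiguous blocks, whereas this sketch masks randomly chosen rows and columns:

```python
import numpy as np

def freq_shift(log_mel: np.ndarray, bins: int) -> np.ndarray:
    """Shift a (n_mels, n_frames) log-Mel spectrogram up by `bins`
    Mel bins, zero-filling the vacated low-frequency rows."""
    out = np.zeros_like(log_mel)
    out[bins:, :] = log_mel[:log_mel.shape[0] - bins, :]
    return out

def time_freq_mask(log_mel: np.ndarray, frac: float, rng) -> np.ndarray:
    """Randomly zero out a fraction `frac` of the rows (frequency
    dimension) and columns (time dimension) of the spectrogram."""
    out = log_mel.copy()
    n_mels, n_frames = out.shape
    out[rng.choice(n_mels, int(frac * n_mels), replace=False), :] = 0.0
    out[:, rng.choice(n_frames, int(frac * n_frames), replace=False)] = 0.0
    return out
```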
4.6. Evaluation Metrics
The metrics of accuracy, precision, recall, and F1-score were used to evaluate all the algorithms above, and the exact calculation methods are described below.
The special values are defined as follows:
TH (True Harsh): number of correctly detected “Harsh”.
TB (True Blowing): number of correctly detected “Blowing”.
FH (False Harsh): number of incorrectly detected “Harsh”.
FB (False Blowing): number of incorrectly detected “Blowing”.
In addition, H is for “Harsh” and B is for “Blowing”.
The evaluation metrics were calculated as follows:
Accuracy: percentage of correctly detected segments (patients) out of the total number of segments (patients), i.e., (TH + TB) / (TH + TB + FH + FB).
Precision: percentage of correctly detected “Harsh” (“Blowing”) segments (patients) out of the total detected “Harsh” (“Blowing”) segments (patients), i.e., TH / (TH + FH) for “Harsh” and TB / (TB + FB) for “Blowing”.
Recall: percentage of correctly detected “Harsh” (“Blowing”) segments (patients) out of the total real “Harsh” (“Blowing”) segments (patients), i.e., TH / (TH + FB) for “Harsh” and TB / (TB + FH) for “Blowing”.
F1-score: harmonic mean of precision and recall, i.e., 2 × Precision × Recall / (Precision + Recall).
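Under the definitions above, the metrics can be computed as in the following sketch (per-class precision, recall, and F1 are shown; how the two classes are combined into a single reported score is not specified in this excerpt):

```python
def metrics(th: int, tb: int, fh: int, fb: int) -> dict:
    """Accuracy, per-class precision/recall, and F1 from the counts
    TH (true Harsh), TB (true Blowing), FH (false Harsh), FB (false Blowing)."""
    accuracy = (th + tb) / (th + tb + fh + fb)
    prec_h, rec_h = th / (th + fh), th / (th + fb)   # "Harsh" class
    prec_b, rec_b = tb / (tb + fb), tb / (tb + fh)   # "Blowing" class
    f1_h = 2 * prec_h * rec_h / (prec_h + rec_h)     # harmonic mean
    f1_b = 2 * prec_b * rec_b / (prec_b + rec_b)
    return {"accuracy": accuracy,
            "precision_H": prec_h, "recall_H": rec_h, "f1_H": f1_h,
            "precision_B": prec_b, "recall_B": rec_b, "f1_B": f1_b}
```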
A comprehensive and thorough understanding of the model’s performance is given by these metrics, as they indicate the model’s ability to correctly detect the murmur quality (accuracy), the extent to which it produces misdetections (precision) as well as misses of the correct class (recall), and the different performances in detecting the two classes (F1-score).