1. Introduction
In the modern era, health monitoring systems play an increasingly important role in daily life, and older people benefit the most from having their health monitored. Smart sensors attached to various body parts under a healthcare system can sense and record the required features of the human body, and such sensors can be embedded in a smart-watch or smartphone. The introduction of the Internet of Things (IoT) in the healthcare domain has further upgraded these facilities [1]. Health-based alarms, personalized smart medical recommendations, etc. have reduced the life-risks caused by sudden health problems.
One such problem in the health monitoring domain is sleep apnea detection. Sleep Apnea-Hypopnea Syndrome (SAHS), or simply apnea, is a common sleep disorder involving interruptions in breathing during sleep. In most cases, older people are affected by this syndrome [2]. Obstructive Sleep Apnea (OSA) is a category of apnea that causes partial or complete blockage of the airway. OSA may further cause sleepiness, fatigue, morning headache, etc. [
3]. Statistics indicate that almost 9% of men and 4% of women among middle-aged people suffer from this sleep disorder. The sleep loss caused by OSA may lead to long-term conditions such as cardiovascular disease [
4]. Thus, a smart health monitoring system is beneficial for the diagnosis of apnea so that early measures can be taken.
The most common method for apnea diagnosis is the electrocardiogram (ECG) [5]. ECG records the electrical signals generated by the human heart and serves to detect whether the heart is working abnormally. In this work, we have divided the whole time-series ECG data into time intervals of equal length. Then we have used Convolutional Neural Network (CNN) [
6] based deep learning models along with ensemble learning to detect apnea in the given time-span. We have chosen three previously proposed CNN models as base models: (i) CNN architecture proposed by Wang et al. [
7], (ii) CNN model proposed by Sharan et al. [
8], and (iii) combination of CNN and LSTM network [
9] proposed by Almutairi et al. [
10]. To aggregate the base models’ predictions and to yield better results, we have applied four ensemble approaches: (i) Majority Voting, (ii) Sum rule, (iii) Choquet Integral based fuzzy fusion and (iv) a trainable ensemble using a Multi-Layer Perceptron (MLP). Our work involves an experimental comparison of these four ensemble techniques.
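As an illustration, the two simplest of these fusion rules can be sketched as follows. This is an illustrative sketch, not the exact implementation used in this work; `probs` is assumed to hold each base model's [P(normal), P(apnea)] output for one segment.

```python
def majority_voting(probs):
    """Each model votes for its argmax class; ties favour the lower class index."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probs]
    return max(range(len(probs[0])), key=lambda c: votes.count(c))

def sum_rule(probs):
    """Add the class scores across models and pick the class with the largest sum."""
    n_classes = len(probs[0])
    return max(range(n_classes), key=lambda c: sum(p[c] for p in probs))
```

For example, with scores [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]] both rules predict class 1 (apnea), even though one base model disagrees.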
The main advantage of ensemble learning is that it considers and combines all the decisions by different models rather than relying on a single classifier [
11]. An ensemble is successful when its component classifiers are diverse in their predictions; conversely, forming an ensemble serves no purpose if all of its components generate too many inaccurate predictions [
11].
We have chosen the PhysioNet Apnea-ECG Database [
12], a standard and publicly available dataset, to conduct all the required experiments. To summarize: first, we form segments from the raw ECG data of the benchmark database, perform the necessary pre-processing to derive important features, and then train the three base models. Next, all three deep learning models make predictions on the test data, and the final prediction is generated by applying the chosen ensemble technique.
Figure 1 pictorially represents the above-mentioned process. The rest of the paper consists of four sections, namely, Related Work, Materials and Methods, Results and Discussion, and Conclusions.
2. Related Work
Since the detection of OSA, or any other kind of apnea, is a two-class classification problem (normal vs. apnea), machine learning classifiers like Support Vector Machine (SVM) [
13], k-Nearest Neighbours (kNN) [
14], Random Forest (RF) [
15], etc., and deep learning classifiers like CNNs are well suited to this domain. Like any other clinical diagnosis, the detection of sleep apnea has become an important research topic in healthcare.
Ng et al. [
16] have used thoracic and abdominal signals as input features for sleep apnea indication and have achieved 70.29–86.25% sensitivity. Alvarez et al. [
17] have worked on the non-linear analysis of blood oxygen saturation (SaO2) obtained from nocturnal oximetry. In their experiments, they identified 111 out of 187 subjects as OSA-positive. Qin et al. [
18] have studied the effect of OSA on Heart Rate Variability (HRV). They have conducted experiments on 426 normal and 826 OSA-affected subjects and have found that HRV tends to decrease with the severity of the disease.
Although there are many statistical body measures such as ECG, acoustic speech signals, SaO2, and Electroencephalogram (EEG) available for apnea diagnosis [
5], we have solely focused on ECG signal for our work. A lot of research works on apnea diagnosis from ECG signals have already been performed. Almazaydeh et al. [
5] have extracted the relevant statistical features such as mean, standard deviation, median, inter-quartile range and some of their derivations for an RR interval (interval between two consecutive R peaks) of the raw ECG signals of the PhysioNet Apnea-ECG database [
12]. They have applied SVM on these extracted features and have achieved a maximum of 96.5% accuracy. Cheng et al. [
19] also have conducted experiments on RR intervals of the ECG signal of the PhysioNet Apnea-ECG database. By applying the Recurrent Neural Network (RNN) [
20], they have achieved an accuracy of 97.80%.
Nguyen et al. [
21] have considered the Recurrence Quantification Analysis (RQA) statistics of the HRV data of the PhysioNet Apnea-ECG database as features. Initially, they performed the classification task using both SVM and an Artificial Neural Network (ANN). They have used soft decision fusion to aggregate the two classifiers’ scores and have obtained an accuracy of 85.26%. Hassan et al. [
22] have pre-processed the raw ECG signal of the PhysioNet Apnea-ECG database by applying the Tunable-Q factor Wavelet Transform (TQWT). They have used Adaptive Boosting (AdaBoost) [
23], an ensemble method applied to decision trees, and have achieved an accuracy of 87.33%.
Wang et al. [
24] have considered past time-windows for training an MLP architecture. Each such time-window spans one minute, and each sample within a window carries six time-domain RR Interval (RRI) features: MRR (mean of RRIs), MHR (mean of heart rates), RMSSD (root mean square of differences between adjacent RRIs), SDNN (standard deviation of RRIs), NN50 (number of pairs of adjacent RRIs differing by more than 50 milliseconds) and pNN50 (NN50 divided by the number of RR intervals), together with frequency-domain R-peak amplitude features: Very Low Frequency (VLF), Low Frequency (LF), High Frequency (HF), LF/(LF + HF), and HF/(LF + HF). They have achieved a best accuracy of 87.3%. Shen et al. [
25] have proposed MultiScale Dilation Attention 1-D CNN (MSDA-1DCNN) for extracting features from the RRI and have applied Weighted-Loss Time-Dependent (WLTD) classification model for OSA detection and have achieved 89.4% accuracy on the PhysioNet Apnea-ECG database [
12].
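The time-domain RRI features used by Wang et al. above can be computed directly from a window's RR intervals. A minimal sketch follows, with the RR intervals assumed to be given in seconds and pNN50 computed per the definition quoted above (NN50 divided by the number of RR intervals):

```python
import math

def rri_time_features(rri):
    """Time-domain features for one window of RR intervals (in seconds)."""
    n = len(rri)
    diffs = [rri[i + 1] - rri[i] for i in range(n - 1)]
    mrr = sum(rri) / n                                  # mean RR interval
    mhr = sum(60.0 / r for r in rri) / n                # mean heart rate (bpm)
    sdnn = math.sqrt(sum((r - mrr) ** 2 for r in rri) / n)
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    nn50 = sum(1 for d in diffs if abs(d) > 0.050)      # adjacent diffs > 50 ms
    pnn50 = nn50 / n                                    # per the definition above
    return {"MRR": mrr, "MHR": mhr, "SDNN": sdnn,
            "RMSSD": rmssd, "NN50": nn50, "pNN50": pnn50}
```

The frequency-domain features (VLF, LF, HF and their ratios) would additionally require a spectral estimate of the RRI series and are omitted here.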
Chang et al. [
26] have proposed a novel 1-D CNN architecture for OSA detection. In their work, each one-minute segment of the raw ECG signal first undergoes band-pass filtering followed by Z-score normalization before being fed into the CNN model. Overall, they have achieved 87.9% accuracy on the PhysioNet Apnea-ECG database [
12], whereas the performance increased up to 97.1% in the case of pre-recorded samples. Thompson et al. [
27] have proposed a 1-D CNN architecture comprising a convolution layer, a max-pooling layer, a fully connected MLP and a softmax output layer. They have applied a windowing strategy with window sizes of 500, 1000, 1500, 2000 and 2500 to validate their model, which achieved 93.77% accuracy for a window size of 500 on the PhysioNet Apnea-ECG database [
12]. Mashrur et al. [
28] have proposed a novel scalogram-based CNN to detect OSA using ECG signals. In their work, they have obtained hybrid scalograms from the ECG signals using continuous wavelet transform (CWT) and empirical mode decomposition (EMD). They have trained a CNN model on these scalograms to extract deep features for OSA detection, achieving an accuracy of 94.30% on the PhysioNet Apnea-ECG database [
12].
The majority of the previous works have considered ECG, and this fact motivates us to choose ECG signal data for our work. The PhysioNet Apnea-ECG database is also a popular choice for OSA detection research. We have chosen deep learning models as they are well suited to time-series data [6]. However, raw samples alone cannot produce outstanding results when fed into CNN models, as discussed in the Results and Discussion section; hence, some pre-processing is required. Since our main concern is ensemble approaches in the apnea detection domain, some established works based on ensemble techniques are also discussed.
Faußer et al. [
29] have applied Temporal Difference (TD) and Residual-Gradient (RG) update methods to a set of agents, each with its own nonlinear function approximator (for instance, an MLP), adapting the weights to learn from joint decisions such as Majority Voting and averaging of the state-values. Also, Glodek et al. [
30] have worked on ensemble approaches for density estimation using Gaussian Mixture Models (GMMs), combining individually diverse mixture models to create a more stable and accurate model. Chakraborty et al. [
31] have performed an ensemble of filter-based feature selection methods, deriving optimal feature subsets using Mutual Information (MI), Chi-square, and ANOVA F-test, and building learning models on the selected features using an MLP-based classifier.
Kächele et al. [
32] have used an ensemble of RF and Radial Basis Function (RBF) networks to determine pain intensity based on video and physiological features such as ECG and Electromyography (EMG). They have trained an MLP on the classification scores obtained from the individual base models for score-level fusion. Dey et al. [
33] have used a weighted ensemble of three CNN-based models (ADNet, IRCNN and DnCNN) to remove white Gaussian noise from images. The three models’ outputs are aggregated in the ratio 2:3:6, respectively. Bellmann et al. [
34] have applied various fusion approaches for Multi-Classifier Systems (MCS) to effectively measure pain intensity levels. Their case study includes two of the most popular fusion techniques, bagging and boosting. Kundu et al. [
35] have proposed a fuzzy-rank-based classifier fusion approach that uses the Gompertz function for determining the fuzzy ranks of the base classifiers. They have conducted experiments on the SARS-COV-2 [
36] and Harvard Dataverse [
37] datasets for diagnosing COVID-19 from CT-scans and have achieved their best results, with 98.93% and 98.80% accuracy respectively, using an ensemble of the pre-trained models VGG-11, Wide ResNet-50-2 and Inception v3.
All these previously established works show that the application of ensembles spans many research fields. The great success and research scope of classifier fusion are the main reasons for its popularity. Still, to the best of our knowledge, no ensemble-based work has been conducted for apnea diagnosis so far. This has motivated us to conduct experimental studies based on ensemble techniques in the OSA detection domain. Additionally, we have chosen three deep learning models: (i) Wang et al.’s [
7] proposed CNN model, (ii) Sharan et al.’s [
8] proposed CNN model, (iii) Almutairi et al.’s [
10] proposed CNN-LSTM model as base models. We have made this choice because all three models are CNN-based, and CNNs are robust, excellent classifiers in general. The fact that these three models have previously been used for OSA detection further encourages us to work with them. Thus, we have conducted our work by applying an ensemble of CNN-based architectures to the popular PhysioNet Apnea-ECG database [
12].
4. Results and Discussion
In the present work, we used five classification measures: (i) accuracy, (ii) precision, (iii) recall, (iv) F1-score, and (v) specificity to evaluate the performance of the base models and their ensembles. Since our only concern was binary classification, we expressed all the measures in terms of two classes: (i) the positive class and (ii) the negative class. Any classifier predicts one of these two classes for each sample. When the predicted class of a sample matched its actual class, the prediction was said to be True, otherwise False. Thus, we defined the five chosen classification metrics based on the terms True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) as follows:
Accuracy: It is defined as the ratio of the number of correctly classified samples to the total number of samples.
Precision: Precision of a class is defined as the ratio of correctly classified samples to the total number of samples predicted as that class.
Recall: Recall of a class is defined as the ratio of correctly classified samples to the total number of samples actually belonging to that class.
F1-score: Sometimes, Precision and Recall alone are not enough to measure the performance of a classifier. The F1-score combines both aspects, being evaluated as the harmonic mean of the two.
Specificity: Specificity measures the proportion of negatives that are correctly identified. It is defined as the ratio of true negatives predicted to the total number of samples that belong to the negative class.
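The five measures above can be written out directly from the four confusion-matrix counts; a small sketch for reference:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1-score and specificity from TP/TN/FP/FN."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity
```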
Since the current problem consists of only two classes, we used binary cross-entropy as the loss function. We applied the Adam optimizer to optimize the loss value over 100 epochs. The training procedure was performed in batches of 64 samples each. The change in training accuracy and loss with epochs for all three models can be observed in
Figure 6a–c.
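For reference, the binary cross-entropy loss minimized here averages the negative log-likelihood of the true labels over a batch; a minimal sketch (the clipping constant `eps` is a common numerical safeguard, not a value taken from this work):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

An uninformative prediction of 0.5 yields a per-sample loss of ln 2 ≈ 0.693, while confident correct predictions drive the loss toward zero.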
After training, all the chosen classification measures were evaluated based on the test data and the corresponding predictions for each case in
Table 1.
Table 1 suggests that the three chosen CNN-based models were well suited for ensembling, as each ensemble technique successfully increased the best single-model accuracy by at least 1%. Majority voting gave the least accurate results because it only considered the predicted labels instead of the exact probabilistic scores, whereas the other three, being score-level fusion methods, produced somewhat better results. The trainable ensemble technique performed a little better than the non-trainable ones, probably because the weights of the classification scores were learned by a classifier instead of being set by a pre-defined weight allocation rule. Besides, the MLP itself worked as an excellent classifier because of its utilization of additional hidden features [
46]. Thus, it was able to identify patterns in the classification scores as well. Among the non-trainable ensemble techniques, Choquet integral fusion worked better than the sum rule because, unlike the sum rule, it did not assign equal weights to all three models. Giving equal importance to all base models’ scores may fall short of expectations, as the poor performance of an individual model may affect the overall result. The Choquet integral fusion, on the other hand, assigned more weight to the model that gave more confident predictions. Among the individual models, CNN-LSTM performed better than the other two base models because the LSTM considers the context (i.e., previous samples) along with the present sample, which is beneficial for time-series data such as ECG signals.
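The discrete Choquet integral underlying this fusion weights each model's class score by a fuzzy measure defined on coalitions of models. A simplified sketch follows; the fuzzy measure here is supplied explicitly for illustration, whereas in practice it would be derived from the models' confidences:

```python
def choquet_integral(scores, measure):
    """Discrete Choquet integral of one class's scores from the base models.

    `measure` maps frozensets of model indices to fuzzy-measure values,
    with the empty coalition worth 0 and the full set worth 1.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending
    total, prev = 0.0, 0.0
    for rank, i in enumerate(order):
        coalition = frozenset(order[rank:])   # models scoring at least scores[i]
        total += (scores[i] - prev) * measure[coalition]
        prev = scores[i]
    return total
```

With an additive uniform measure the Choquet integral reduces to the plain average of the scores, matching the sum rule up to scaling; it is a non-additive measure that lets the fusion emphasize more confident models.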
We also performed experiments on the raw 1-min signal windows using the winner from Table 1, the MLP-based trainable ensemble, to compare performance on the raw data versus the feature-extracted data.
Table 2 contains the results for raw ECG segments and clearly shows that the features extracted from the signal greatly outperformed the raw data, as the final prediction made by the ensemble on raw data was only 70.77% accurate. A possible explanation is that classifiers can recognize patterns more efficiently from features that summarize the raw data.
Since, with this amount of data, the distributions of the train and test sets might not be uniform, we applied a two-fold cross-validation approach by swapping the train and test sets and averaging both results. The results of the base models and ensembles after this two-fold cross-validation are shown in
Table 3.
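The swap-based scheme amounts to evaluating the pipeline twice and averaging; a trivial sketch, where `fit_and_score` is a hypothetical stand-in for any train-then-evaluate routine returning an accuracy:

```python
def two_fold_swap(set_a, set_b, fit_and_score):
    """Train on one half and score on the other, swap the halves, average."""
    return (fit_and_score(set_a, set_b) + fit_and_score(set_b, set_a)) / 2.0
```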
We also performed five-fold cross-validation over the combined dataset of the train and test sets. The performances of the base models and ensembles after five-fold cross-validation are shown in
Table 4.
From
Table 4, we observe that the models and the MLP-based ensemble delivered better results with five-fold cross-validation than without it. In five-fold cross-validation, every sample from the dataset appears in both a training fold and a test fold at least once, and the amount of training data increased to include some of the data previously used only for testing, which resulted in better identification of the test data.
Furthermore, we also used standard classifiers, namely SVM, an ANN with one hidden layer of 100 units, and Random Forest, to compare how they perform with the MLP-based trainable ensemble. We flattened the features for these classifiers and, after prediction, reshaped the outputs for the two classes before performing the ensemble on them. The results obtained by the standard classifiers and their ensemble are shown in
Table 5.
Table 5 shows that the simple machine learning classifiers performed somewhat worse than the deep learning based models, so the overall ensemble was also affected by the choice of base models. Next, we compared the best performance achieved in our work on the original train and test sets with some of the previous methods’ performance in
Table 6.
From
Table 6, we observe that the MLP-based ensemble delivered better results than several of the previous works. Still, Chang et al. [
26] and Shen et al. [
25] have obtained better results in their respective works. Although the ensemble worked well for the given combination of models, their individual accuracies could not exceed 84%, which acted as a limiting factor on the performance of the overall architecture. Additionally, class imbalance may have prevented a higher accuracy. Still, the current work holds a good place among the existing works evaluated on the dataset under consideration.