1. Introduction
Sleep apnea is a highly prevalent sleep disorder that can cause significant daytime sleepiness and lead to many cardiovascular comorbidities [1,2,3,4]. It is characterized by repetitive, significant airflow reductions during sleep that cause recurrent hypoxia and sleep fragmentation [5,6,7]. When breathing does not stop completely but the volume of air entering the lungs is significantly reduced, the respiratory event is called a hypopnea. More than 200 million patients worldwide are affected by sleep apnea [8].
Sleep apnea events are of three types: obstructive, central, and mixed [9]. Obstructive sleep apnea (OSA) is characterized by repetitive upper airway obstruction that limits airflow into the lungs despite continued respiratory effort. Central sleep apnea (CSA) is characterized by the loss of all respiratory effort during sleep due to a neurological disorder. Mixed sleep apnea (MSA) is a combination of both obstructive and central sleep apnea symptoms.
Polysomnography (PSG), often called a sleep study, is the gold standard for detecting sleep apnea. PSG records basic body activities during sleep in an attended setting (sleep laboratory). These recordings include the electrocardiogram (ECG) for the heart, the oronasal thermal airflow signal (FlowTH) and the nasal pressure signal (NPRE) for respiration, the electroencephalogram (EEG) for the brain, the electromyogram (EMG) for the muscles, and the oxygen level in the blood (SpO2) [10,11]. Connecting a large number of sensors and wires to a subject in a dedicated sleep lab makes PSG uncomfortable, expensive, and unavailable to a large number of sleep patients in many parts of the world [9]. Moreover, clinicians need to inspect the recordings offline to score apnea events and derive the apnea-hypopnea index (AHI), which is the parameter used to establish sleep apnea and its severity [12]. Thus, the analysis process is labor-intensive and time-consuming, leading to a delayed diagnosis and longer patient waiting lists [13,14,15,16], and it is highly susceptible to human error [17].
To overcome the limitations of PSG, several studies have proposed automated detection of sleep apnea using a limited subset of the signals recorded in PSG [18]. These include respiratory signals [19,20,21,22,23,24,25,26,27,28,29], ECG [16,30,31,32], SpO2 [33,34,35], tracheal sound signals [36], or combinations of the signals listed above [37]. A number of portable devices for sleep apnea monitoring and diagnosis have also been developed and are available. LifeShirt, SleepStrip, and ApneaLink are among the most popular products [38].
The respiratory airflow signal is a straightforward choice when looking for simpler alternatives to PSG, since apneas are primarily defined on the basis of its amplitude oscillations [12]. According to the American Academy of Sleep Medicine (AASM), the primary sensor for identifying apnea in sleep diagnostic studies is the oronasal thermal airflow sensor [11]. Thus, several studies have focused on automated detection of sleep apnea events based exclusively on the analysis of this signal. In these studies, the airflow signal is analyzed in different analytic domains (linear, nonlinear, time, and frequency domains) to extract features, which are then used in a rule-based threshold classifier or in a “black box” machine learning model. Rule-based threshold classification has been used in [24,39,40,41,42]. Support Vector Machines (SVM) [43,44], Artificial Neural Networks (ANN) [20,45,46,47], linear discriminant analysis (LDA), and classification and regression trees (CART) with AdaBoost (AB) [22] are among the most popular machine learning models that have used the respiratory flow signal.
Despite their popularity in sleep apnea problems, a major limitation of classical rule-based threshold detectors is that they classify by simply comparing the features against experimentally derived thresholds, overlooking the statistical distributions of the input features and of the output classes. Even more complex discriminative (black box) methods learn a function that maps the features directly into decisions. Little research has considered a probabilistic view of classification for sleep apnea detection.
A Gaussian Mixture Model (GMM) is a probabilistic machine learning framework that provides a richer class of density models than a single Gaussian by using a finite weighted mixture of Gaussian densities. It is well known as a rich framework capable of approximating any continuous density, given enough components. This framework has also shown promising results in classification problems involving noisy features [48]. Nevertheless, it has not been well evaluated for sleep apnea detection problems.
The contributions of this paper are threefold. First, we develop a probability-based classification approach for automated detection of sleep apnea events using a single oronasal airflow record. Second, we study the performance of the proposed approach over a large data set of 96 patients with different sleep apnea severity levels. Third, we conduct a comprehensive evaluation and comparison of the proposed probabilistic framework against a rule-based classifier that uses the same input features, as well as against two previously published algorithms for apnea detection using the airflow signal.
This paper is organized as follows. Section 2 describes the data set, the proposed algorithm, the classification methods, and the evaluation metrics. Section 3 presents the results for the proposed algorithm along with a detailed comparison with related works. Section 4 discusses the results and lessons learned, and Section 5 summarizes the conclusions of the paper.
2. Materials and Methods
2.1. Data Set
The Institutional Review Board (IRB) at the University of Michigan approved this study (IRB#HUM00069035). Full polysomnography (PSG) data were collected for 96 patients at the University of Michigan Sleep Disorders Center. For each patient, polysomnography consisted of electroencephalography (EEG), electrooculography (EOG), submental and tibial electromyography (EMG), electrocardiography (ECG), two piezoelectric belts for recording plethysmography (PPG), an oronasal airflow sensor (FlowTH), an air pressure transducer (NPRE), a digital microphone, and a pulse oximeter.
The oronasal airflow sensor used in this study is a thermocouple-based one from Respironics Model: Pro-Tech-
(Philips Healthcare, Eindhoven, The Netherlands). Clinical annotations for respiratory events were carried out by expert clinicians from the Sleep Disorders Center at the University of Michigan (Ann Arbor, MI) according to the recommendations of the AASM [11]. Apneic events in the data set are either obstructive (OSA), central (CSA), or mixed (MSA). The data set spans different apnea severity levels as reflected by the apnea-hypopnea index (AHI) computed overnight for the patients in the study: 10 are none/minimal sleep apnea patients (AHI < 5), 36 are mild sleep apnea patients (5 ≤ AHI < 15), 27 are moderate sleep apnea patients, and 23 are severe sleep apnea patients (AHI ≥ 30). Table 1 provides the distribution of the numbers and types of apneic events for each patient severity class.
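To illustrate how the severity classes relate to the AHI, the following minimal sketch computes the index as events per hour of sleep and maps it to a class. The cut-offs of 5, 15, and 30 events per hour are the conventional clinical thresholds and are assumed here rather than taken from the study; the function names are hypothetical.

```python
def apnea_hypopnea_index(n_apneas, n_hypopneas, sleep_hours):
    """AHI: number of apnea plus hypopnea events per hour of sleep."""
    if sleep_hours <= 0:
        raise ValueError("sleep_hours must be positive")
    return (n_apneas + n_hypopneas) / sleep_hours


def severity_class(ahi):
    """Map an AHI value to a severity class (conventional cut-offs, assumed)."""
    if ahi < 5:
        return "none/minimal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


# Hypothetical night: 30 apneas and 18 hypopneas over 8 hours of sleep.
example_ahi = apnea_hypopnea_index(30, 18, 8)
example_class = severity_class(example_ahi)
```

Because the index needs both apnea and hypopnea counts, a detector that scores apneas only gives a partial estimate of it.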
The oronasal airflow signals of 66 patients (with different apnea severity levels) were used for training the proposed modeling framework. The developed framework was then tested on 30 patients (distinct from the training data) composed of 5 none/minimal apnea, 5 mild, 5 moderate, and 15 severe sleep apnea patients.
2.2. A Data-Driven Approach for Characterizing Changes in Respiratory Baseline
According to the AASM, an apnea event is scored if there is a drop in the peak thermal airflow signal excursions by ≥90% of the corresponding baseline for a duration ≥10 s [11]. Nevertheless, the airflow baseline is precisely defined neither in the AASM Scoring Manual nor in the sleep literature. To overcome this limitation, a data-driven approach is used to derive the respiratory flow baseline from the airflow signal (FlowTH). The derived baseline is then used to characterize dynamic changes in respiration with respect to this (dynamic) baseline in order to detect the occurrence of apneic events. To establish the respiratory baseline, we consider two important respiratory features: inter-breath intervals and breath amplitudes.
A sliding window method is used for detecting apnea events in the oronasal airflow signal (FlowTH). At time step $t$, two windows are established. The first window (the baseline window, $W_b$) is used to derive the local respiratory baseline. The second window (the detection window, $W_m$) is used to detect apneic events based on relative changes in inter-breath intervals and breath amplitudes in $W_m$ with respect to those in $W_b$. In this study, we considered a $W_b$ of length 600 s that contains the airflow measurements up to time $t$ and a $W_m$ of length 100 s that contains the airflow measurements starting from the time step that follows $t$.
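The two-window construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sampling rate `fs` is a hypothetical parameter, and the detection window is assumed to begin at the sample immediately after the baseline window ends.

```python
def split_windows(signal, t_index, fs, baseline_s=600, detection_s=100):
    """Return (baseline, detection) slices of `signal` around sample t_index.

    baseline : the `baseline_s` seconds of samples that end at t_index
    detection: the `detection_s` seconds of samples that start at t_index
    """
    n_base = int(baseline_s * fs)
    n_det = int(detection_s * fs)
    if t_index < n_base or t_index + n_det > len(signal):
        raise ValueError("not enough samples around t_index for both windows")
    baseline = signal[t_index - n_base:t_index]
    detection = signal[t_index:t_index + n_det]
    return baseline, detection


# Synthetic 1 Hz "signal" whose values equal their sample indices.
demo_signal = list(range(1000))
demo_base, demo_det = split_windows(demo_signal, t_index=600, fs=1)
```

With these defaults, the first usable time step is 600 s into the recording, because a full baseline window must precede it.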
After constructing the baseline window $W_b$ and the detection window $W_m$, peaks and valleys of the respiratory airflow signal are detected in both windows. An example of $W_b$ and $W_m$ that both include an apneic event, along with the peak and valley detections, is illustrated in Figure 1a. The inter-breath intervals and the breath amplitudes can now be extracted from $W_b$ and $W_m$ as follows:

$$I_i = t_{i+1} - t_i$$

$$A_i = P_i - V_i$$

where the airflow breath $i$ has a peak $P_i$ that occurs at time instance $t_i$, a valley $V_i$, an inter-breath interval $I_i$, and a breath amplitude $A_i$. These equations generate sequences of inter-breath intervals and breath amplitudes in $W_b$ and $W_m$, as illustrated in Figure 1b,c,f,g.
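A minimal sketch of this extraction step is given below. It rests on stated assumptions rather than on the study's detector: peaks and valleys are taken as simple local extrema, the inter-breath interval is the time between consecutive peaks, and the breath amplitude is the peak value minus the value of the next valley. The function names and the sampling rate `fs` are hypothetical.

```python
import math


def find_extrema(x):
    """Return (peak_indices, valley_indices) of local extrema in x."""
    peaks, valleys = [], []
    for k in range(1, len(x) - 1):
        if x[k - 1] < x[k] >= x[k + 1]:
            peaks.append(k)
        elif x[k - 1] > x[k] <= x[k + 1]:
            valleys.append(k)
    return peaks, valleys


def breath_features(x, fs):
    """Return (inter_breath_intervals_in_seconds, breath_amplitudes)."""
    peaks, valleys = find_extrema(x)
    # Interval: time between consecutive peaks.
    intervals = [(peaks[i + 1] - peaks[i]) / fs for i in range(len(peaks) - 1)]
    # Amplitude: peak value minus the value of the first valley after it.
    amplitudes = []
    for p in peaks:
        later = [v for v in valleys if v > p]
        if later:
            amplitudes.append(x[p] - x[later[0]])
    return intervals, amplitudes


# Synthetic breathing: 0.25 Hz sine sampled at 10 Hz for 20 s.
fs_demo = 10
airflow = [math.sin(2 * math.pi * 0.25 * n / fs_demo) for n in range(200)]
demo_intervals, demo_amplitudes = breath_features(airflow, fs_demo)
```

Real airflow recordings are noisy, so a practical detector would typically smooth the signal or enforce a minimum peak distance before applying such a rule.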
After obtaining these sequences, the (inter-breath) intervals and (breath) amplitudes in the baseline window $W_b$ that contribute most to the respiratory baseline estimate of this window must be extracted. Similarly, the intervals and amplitudes in the detection window $W_m$ that belong to the apneic event to be detected must be extracted. This can be done effectively by sorting the sequences of intervals and amplitudes in both $W_b$ and $W_m$ by their values. For convenience, the intervals and amplitudes in $W_b$ are sorted in descending order while those in $W_m$ are sorted in ascending order, as illustrated in Figure 1d,e,h,i. This process generates the ordered sequences $I_b^d$, $A_b^d$, $I_m^a$, and $A_m^a$, where the subscripts $b$ and $m$ specify $W_b$ and $W_m$, respectively, and the superscripts $a$ and $d$ specify ascending and descending orders, respectively.
Although the lengths of the baseline window $W_b$ and the detection window $W_m$ are fixed, the number of airflow breaths in these windows typically varies across sleep stages and across patients. Thus, the ordered sequences ($I_b^d$, $A_b^d$, $I_m^a$, and $A_m^a$) are filtered to keep only the intervals and amplitudes that contribute most to the baseline estimate in $W_b$ and to the apneic events in $W_m$, respectively. A cut-off filter is applied to each ordered sequence $s$ of length $N_s$ to generate a filtered sequence of length $n_s$, where $n_s \le N_s$, by retaining only its leading elements. Accordingly, the filtered sequences include the highest inter-breath intervals and breath amplitudes in $W_b$ and the lowest intervals and amplitudes in $W_m$. The filter values were defined individually for each of the sequences so that they can be tuned separately to maximize the ability to detect apneic windows. The mathematical mean of each filtered sequence can now be expressed as follows:

$$\bar{s} = \frac{1}{n_s} \sum_{j=1}^{n_s} s_j$$

where $\bar{s}$ is the mathematical mean of the filtered sequence $s$ and $s_j$ is its $j$-th element. The relative changes in the inter-breath intervals and in the amplitudes of the breaths, with respect to the respiratory baseline, are then computed from the means of the filtered sequences in $W_m$ and $W_b$.
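The sort, filter, mean, and relative-change steps can be sketched as below. Two points are assumptions made only for illustration: the cut-off filter is treated as a fixed number of retained elements per sequence, and the relative change is taken as (detection mean minus baseline mean) divided by the baseline mean. The `keep` values shown are hypothetical, not the tuned values of the study.

```python
def filtered_mean(values, keep, descending):
    """Sort `values`, keep the first `keep` elements, and return their mean."""
    ordered = sorted(values, reverse=descending)
    kept = ordered[:keep]
    if not kept:
        raise ValueError("no values left after filtering")
    return sum(kept) / len(kept)


def relative_changes(base_int, base_amp, det_int, det_amp, keep=(2, 2, 2, 2)):
    """Return (relative interval change, relative amplitude change).

    Baseline sequences are sorted in descending order and detection
    sequences in ascending order before the cut-off is applied.
    """
    mean_ib = filtered_mean(base_int, keep[0], descending=True)
    mean_ab = filtered_mean(base_amp, keep[1], descending=True)
    mean_im = filtered_mean(det_int, keep[2], descending=False)
    mean_am = filtered_mean(det_amp, keep[3], descending=False)
    # Assumed definition of relative change with respect to the baseline.
    return (mean_im - mean_ib) / mean_ib, (mean_am - mean_ab) / mean_ab


# Hypothetical sequences: breath amplitudes collapse in the detection window.
d_int, d_amp = relative_changes(
    base_int=[4.0, 4.0, 5.0, 3.0], base_amp=[2.0, 2.0, 2.0],
    det_int=[4.0, 9.0, 3.0], det_amp=[0.1, 0.3, 2.0],
)
```

In this example the amplitude feature is close to -0.9, which corresponds to the 90% drop relative to baseline that the AASM criterion describes.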
2.3. Detection of Apnea Events based on Relative Changes in Respiratory Baseline
For the classification part, we propose a probabilistic view of classification for automated detection of apnea events. We leverage a Gaussian Mixture Model (GMM) to derive a decision boundary based on probabilistic assumptions about the underlying distribution of the respiratory input features. We denote this modeling scheme as a Gaussian Mixture Model (GMM) classifier. In order to demonstrate the improvement achieved by considering GMM as a generative machine learning model, we compare results with a classical threshold-based detector that uses the same input features for automated detection of apnea events.
To prepare the data for classification, a sliding window that is successively updated every 20 s (step size = 20 s) was applied over the oronasal airflow signal. At each step, the baseline window $W_b$ and the detection window $W_m$ are constructed. Then, Equations (1)–(11) are applied to compute the relative changes in inter-breath intervals and in breath amplitudes (the input features to the classification model), while the detection window provides the classification label based on whether or not an apnea event was clinically scored in this window. Considering our data set with 66 patients for training and 30 for testing, Table 2 shows the distribution of the data segments and corresponding labels for both the training and test sets.
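The stepping and labeling procedure can be sketched as follows. This is an illustration under assumptions: a detection window is labeled apneic when any clinically scored event overlaps it, which is one plausible reading of "scored in this window", and events are given as hypothetical (onset, offset) pairs in seconds.

```python
def window_starts(n_samples, fs, baseline_s=600, detection_s=100, step_s=20):
    """Sample indices at which successive detection windows begin."""
    first = int(baseline_s * fs)             # a full baseline must precede
    last = n_samples - int(detection_s * fs)  # a full detection window must fit
    return list(range(first, last + 1, int(step_s * fs)))


def label_window(start_s, events, detection_s=100):
    """Return 1 if any (onset, offset) event overlaps the detection window."""
    end_s = start_s + detection_s
    return int(any(onset < end_s and offset > start_s
                   for onset, offset in events))


# Hypothetical 1000 s record sampled at 1 Hz with one scored apnea.
starts = window_starts(n_samples=1000, fs=1)
scored_events = [(650.0, 670.0)]
labels = [label_window(s, scored_events) for s in starts]
```

Because consecutive detection windows overlap by 80 s, a single event contributes to several positive windows.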
2.3.1. Rule-Based Threshold Classification
The rule-based classifier detects apnea in a detection window ($W_m$) when the two input features (the relative changes in inter-breath intervals and in breath amplitudes) both activate their classification rules. Two classification thresholds define the rule for one feature, and a single classification threshold defines the rule for the other. An exhaustive search is applied for each of these thresholds in order to learn their optimal values. A novel approach was used to fit the rule-based classifier and learn the classification rules using a two-step optimization method. The thresholds of the two-threshold rule are optimized first to obtain the receiver operating characteristics (ROC) curve with the maximum area under the curve (AUC) over our training data. Once this classification rule is learned, the optimal threshold of the remaining rule is learned by searching along the selected ROC curve (the one with maximum AUC) for the threshold that provides the maximum sensitivity (TPR) while constraining the false positive rate (FPR) not to exceed the maximum acceptable limit of 20%. More details about the derivation and tuning of the rule-based classifier can be found in our recent study [49].
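The constrained search in the second step can be sketched as an exhaustive scan over candidate thresholds. The sketch below covers only a single-threshold rule and assumes, for illustration, that larger feature values are more apnea-like; the scores and labels are hypothetical.

```python
def tpr_fpr(scores, labels, threshold):
    """Sensitivity and false positive rate of the rule `score >= threshold`.

    Assumes both classes are present in `labels` (1 = apnea, 0 = normal).
    """
    tp = fp = pos = neg = 0
    for s, y in zip(scores, labels):
        if y == 1:
            pos += 1
            tp += s >= threshold
        else:
            neg += 1
            fp += s >= threshold
    return tp / pos, fp / neg


def select_threshold(scores, labels, max_fpr=0.20):
    """Exhaustive search: maximize TPR subject to FPR <= max_fpr."""
    best = None
    for th in sorted(set(scores)):
        tpr, fpr = tpr_fpr(scores, labels, th)
        if fpr <= max_fpr and (best is None or tpr > best[1]):
            best = (th, tpr, fpr)
    return best  # None if no threshold satisfies the constraint


demo_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.45, 0.6, 0.7, 0.8, 0.9]
demo_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
chosen = select_threshold(demo_scores, demo_labels)
```

Only observed feature values are tried as candidates, which is sufficient because the TPR and FPR change only at those values.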
2.3.2. Classification with Gaussian Mixture Models (GMM)
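A Gaussian mixture model (GMM) is a probabilistic modeling framework. In this model, the probability density function (PDF) of $\mathbf{x}$ is defined as a finite weighted sum of $k$ Gaussian distributions:

$$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{k} w_i \, g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$$

such that $\mathbf{x}$ is the 2-dimensional feature vector of relative changes in inter-breath intervals and breath amplitudes computed every time the baseline and detection windows are constructed, $\sum_{i=1}^{k} w_i = 1$, $\lambda$ is the mixture model, $w_i$ corresponds to the weight of component $i$, and the density of each component is given by the normal probability distribution:

$$g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{2\pi \, |\boldsymbol{\Sigma}_i|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{T} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right)$$

The parameters $w_i$, the mean $\boldsymbol{\mu}_i$, and the covariance $\boldsymbol{\Sigma}_i$ are optimized during the training process using the expectation maximization algorithm [50] such that the log-likelihood of the model is maximized. During testing, a likelihood estimate is obtained for the apnea class, defined by the model $\lambda_a$, and for the non-apneic (normal respiration) class, defined by the model $\lambda_n$. Using the Bayesian classification formula, the likelihood estimates are combined to compute the posterior probability of apnea for the sample $\mathbf{x}$:

$$P(\text{apnea} \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \lambda_a) \, P(a)}{p(\mathbf{x} \mid \lambda_a) \, P(a) + p(\mathbf{x} \mid \lambda_n) \, P(n)} \tag{15}$$

where $P(a)$ and $P(n)$ are the prior probabilities of the apnea and non-apnea (normal) classes, respectively. These probabilities were set by symmetry to be equal ($P(a) = P(n) = 0.5$), assuming we have no prior knowledge about them. The combination of the two GMMs and the Bayesian classification formula in Equation (15) forms the GMM classifier [51].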
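The test-time computation can be sketched with two already-trained mixtures. This minimal illustration assumes diagonal covariances and uses hypothetical component parameters; it does not include the expectation maximization training step.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (variances `var`) at x."""
    norm = 1.0
    expo = 0.0
    for xj, mj, vj in zip(x, mean, var):
        norm *= 2.0 * math.pi * vj
        expo += (xj - mj) ** 2 / vj
    return math.exp(-0.5 * expo) / math.sqrt(norm)


def gmm_pdf(x, components):
    """Weighted sum of component densities; components = [(w, mean, var), ...]."""
    return sum(w * diag_gaussian_pdf(x, m, v) for w, m, v in components)


def apnea_posterior(x, apnea_gmm, normal_gmm, prior_apnea=0.5):
    """Posterior probability of the apnea class by Bayes' rule."""
    like_a = gmm_pdf(x, apnea_gmm) * prior_apnea
    like_n = gmm_pdf(x, normal_gmm) * (1.0 - prior_apnea)
    return like_a / (like_a + like_n)


# Hypothetical one-component mixtures over (interval change, amplitude change).
apnea_model = [(1.0, (1.0, -0.9), (0.1, 0.01))]
normal_model = [(1.0, (0.0, 0.0), (0.1, 0.01))]
p_apnea = apnea_posterior((1.0, -0.9), apnea_model, normal_model)
```

A window is then declared apneic when the posterior exceeds a decision threshold; with equal priors, 0.5 is the natural default. For samples far from both mixtures the two likelihoods can underflow to zero, so a production version would work with log-likelihoods.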
2.4. Evaluation of Apnea Detection Results
2.4.1. Classification Performance over Detection Windows
In this paper, five statistical metrics, namely accuracy ($ACC$), true positive rate ($TPR$), true negative rate ($TNR$), positive predictive value ($PPV$), and $F_1$ score, are applied to assess the performance of the proposed modeling framework over all detection windows ($W_m$):

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$TPR = \frac{TP}{TP + FN}$$

$$TNR = \frac{TN}{TN + FP}$$

$$PPV = \frac{TP}{TP + FP}$$

$$F_1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR}$$

where $TP$ (true positive) is the number of apneic windows that were correctly classified as such, $TN$ (true negative) is the number of normal windows that were correctly classified as such, $FP$ (false positive) is the number of normal windows that were falsely classified as apneic, and $FN$ (false negative) is the number of apneic windows that were missed by the classifier. $ACC$ is a classical measure for binary classification but is not sufficient in this problem due to the class imbalance between the apnea and normal classes [52,53,54]. Thus, $TPR$, $TNR$, and $PPV$ are used to report a more detailed performance in detecting apneic and normal windows. The $F_1$ score considers $TPR$ and $PPV$ simultaneously and thus accounts for the $TPR$/$PPV$ tradeoff, giving a more comprehensive view of the overall performance of the proposed model.
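The five metrics follow directly from the four window counts, as the short sketch below shows. The counts in the example are hypothetical and chosen only to illustrate the effect of class imbalance.

```python
def window_metrics(tp, tn, fp, fn):
    """Accuracy, TPR, TNR, PPV, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)   # sensitivity: fraction of apneic windows detected
    tnr = tn / (tn + fp)   # specificity: fraction of normal windows kept
    ppv = tp / (tp + fp)   # precision: fraction of detections that are apneic
    f1 = 2 * ppv * tpr / (ppv + tpr)
    return {"ACC": acc, "TPR": tpr, "TNR": tnr, "PPV": ppv, "F1": f1}


# Hypothetical imbalanced test set: 100 apneic and 1000 normal windows.
demo_metrics = window_metrics(tp=80, tn=900, fp=100, fn=20)
```

In this example the accuracy is about 0.89 while the PPV is only about 0.44, which illustrates why accuracy alone is not a sufficient measure when normal windows dominate.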
2.4.2. Receiver Operating Characteristics (ROC) Curve
The Receiver Operating Characteristics (ROC) curve is an effective tool used to graphically illustrate the diagnostic ability of a binary classification system as its classification threshold is varied [55]. This curve plots the TPR against the false positive rate (FPR) at various discrimination threshold settings. The Area Under the receiver operating characteristics Curve (AUC) is used as a measure of the overall ability of the classification model to automatically detect sleep apnea events. A greater AUC indicates a more useful and effective classification model. Additionally, the ROC curve can be used for optimizing classification models by finding the operating threshold that provides the highest TPR for the allowable FPR level. This approach was used in learning the classification rules of the rule-based classifier.
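A minimal sketch of building an ROC curve and integrating it with the trapezoidal rule is shown below. The scores and labels are hypothetical, and the decision rule `score >= threshold` is an assumption made for illustration.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold from high to low.

    Assumes both classes are present in `labels` (1 = apnea, 0 = normal).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
inverted = auc(roc_points([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1]))
```

A classifier that separates the classes perfectly reaches an area of 1, while one that ranks every normal window above every apneic window reaches 0; random ranking gives about 0.5.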
3. Results
For the proposed GMM classifier (AICPV with GMM) and the rule-based threshold classifier (AICPV with Threshold), we used the PSGs of 66 patients for training, tuning, and optimizing the classifiers. The trained classifiers were then tested over the PSGs of the other distinct 30 patients.
3.1. Rule-Based Threshold Classifier
The optimal filter values were set to
,
,
, and
. The classification rules for detecting apneic windows were identified as follows
such that an apneic window is detected by the rule-based classifier whenever both rules are active.
3.2. GMM Classifier
Cross validation over the training data set was used for investigating and selecting the choices of parameters for the GMM models. These parameters include the number of Gaussian distributions needed to model each class, and the type of covariance matrix used (diagonal or full symmetrical). Our results show that 12 Gaussian components are needed to model the GMM of the apneic class and 11 Gaussian components are needed to model the GMM of the normal class along with diagonal covariance matrices for both GMMs.
3.3. Classification Performance Comparison over the Testing Data Set
We performed an overall evaluation of the proposed model (AICPV with GMM, AICPVwGMM) over the 30-patient test data and compared the results with the rule-based classification model that uses the same input features (AICPV with Threshold, AICPVwTH). We also compared the performance with two well-known published algorithms for automated apnea detection that use the oronasal thermal airflow signal and similar time-domain features [39,45]. The first algorithm implements a classical threshold-based classification model [39], while the other uses an artificial neural network classification model [45]. Note that [39] includes an additional module for classifying the type of detected apnea using a neural network classifier trained on the thoracic effort signal. Nevertheless, we included only the apnea detection module from this study since the proposed algorithm uses only a single channel of oronasal airflow.
For a fair comparison, the four algorithms (AICPVwGMM, AICPVwTH, and Refs. [39,45]) were all trained, evaluated, and tested on identical data within our data set. Table 3 comprehensively compares the classification performance of these four algorithms over the test data. First, AICPVwGMM and AICPVwTH outperform the two previously published algorithms in all performance measures of this paper. The AICPV algorithms demonstrate a higher ability to detect apnea events (reflected by their TPR) as well as a higher ability to detect normal respiration patterns (reflected by their TNR) compared to [39,45]. The overall better classification performance of the AICPV algorithms is also reflected in the remaining metrics compared to [39,45]. The improvement achieved with the AICPV algorithms is mainly due to the dynamic approach adopted in these algorithms, in which apneic events are characterized based on relative changes in airflow breath amplitudes and inter-breath intervals with respect to the local respiration baseline.
Comparing the performance obtained with the proposed GMM-based classifier (AICPVwGMM) and the rule-based classifier (AICPVwTH), we notice an improvement in the apnea detections obtained with the GMM-based classifier relative to the rule-based threshold classifier. Recognizing the TPR/PPV tradeoff, and that TPR and PPV are performance metrics of competing natures, the GMM-based classifier achieves a high TPR with a significantly improved PPV compared to the rule-based classifier. This can also be seen in the significant increase in the $F_1$ score for the detections with the proposed GMM model compared to the rule-based one. Also, a higher TNR is obtained with the proposed algorithm, reflecting a higher ability to detect normal respiratory patterns. The overall classification performance also shows improved detection with the proposed GMM modeling framework.
To provide an in-depth analysis and understand the sources of improvement of the proposed model over the rule-based classification model, we analyzed the test performance of the proposed algorithms over different apnea types and different apnea severity conditions. Table 4 and Table 5 provide a detailed comparison between the GMM-based classifier and the rule-based classifier over different apnea types and different apnea severity levels. A clear increase in the ability to detect OSA and CSA events can be seen in the TPR of these detections using the proposed algorithm, along with a significant improvement for all types of apnea detections. It can also be seen that the TPR of the MSA detections is superior with both algorithms and that it did not change between them, which is mainly because MSA events are a minority compared to OSA and CSA events.
Looking at the detailed performance of the proposed GMM model and the rule-based classifier over different apnea severity conditions, we notice that the rule-based threshold classifier performs best in severe apnea patients and that its performance degrades significantly in less severe cases. On the other hand, the GMM-based classifier maintains a high ability to detect apnea events in severe patients and, more importantly, has a significantly higher ability to detect apneic events in less severe apnea patients, which can be clearly seen in the TPR of these detections. Moreover, the overall performance values show an excellent overall ability of the GMM modeling framework to detect apnea events in severe patients, as well as a significantly improved overall ability to detect apnea in less severe patients compared to the rule-based classification model.
Finally, we performed a comprehensive analysis of the performance per class of apnea type among the different patient severity levels. Table 6 evaluates how the proposed model performs in detecting different apnea types in each of the apnea severity classes. As can be seen in the table, the proposed model AICPVwGMM maintains a high ability to detect different apnea types regardless of disease severity in the test patients. MSA events are very rare in mild patients, which caused low and skewed detection rates for these patients. Also, the final row in Table 6 was left empty since there are no MSA events in the class of none/minimal apnea patients. In general, the none/minimal apnea class reflects patients who are healthy or have few significant apnea events, and so it is a class of less interest compared to the other disease states. Nevertheless, we kept the performance results for all disease severity classes to report a comprehensive assessment of the modeling framework.
It is worth mentioning that the implemented algorithms were trained and tested using full overnight PSG records. Our goal is to test the proposed framework and to compare it with existing works in a more practical setting, as opposed to many previous studies that considered shorter records, thereby avoiding the class imbalance problem and excluding segments with a low signal-to-noise ratio (SNR). False positives were affected by the class imbalance problem, segments of low airflow signal quality, and irregular breathing patterns and artifacts. High airflow peak amplitudes resulting from increased respiratory effort after the end of apnea events may also produce false positives by contributing falsely to the respiratory baseline. Future work may consider adding more advanced signal filtering algorithms that allow more accurate detection in the presence of artifacts and that reject airflow segments with very low SNR.
4. Discussion of Results
A probabilistic framework was developed in this study for automated apnea detection using single-channel data from the oronasal airflow record (FlowTH). The proposed framework leverages the AASM recommendations to define apnea based on relative changes with respect to the respiratory airflow baseline. To overcome the absence of a precise mathematical definition of the airflow baseline, a data-driven method was developed to represent it based on two features: the breath amplitudes and the inter-breath intervals. Apnea is then characterized based on relative changes in these features between two sliding windows: the baseline window, which represents the current respiratory baseline, and the detection window, where an apneic event is to be detected.
For automatic detection of apneic events, we considered classification based on a probabilistic view using a GMM-based modeling framework. The proposed framework showed significantly improved performance in detecting apnea compared to a rule-based classifier that uses the same input features, as well as to two previously published algorithms that respectively use threshold-based classification and neural networks applied to time-domain features from the same respiratory signal. Using relative changes in respiratory features to define apnea enabled a dynamic approach that accounts for continuous changes in the respiratory baseline, making the AICPVwTH and AICPVwGMM algorithms significantly more capable of detecting apneic events than previous studies that considered absolute changes in classification features and overlooked relative ones. Comparing the proposed model AICPVwGMM with AICPVwTH, which uses a rule-based classifier, showed significantly improved performance for the GMM model, characterized by a high TPR together with a significantly improved overall performance for all types of apnea detections. The proposed model also performs much better across different apnea severity levels than the rule-based classifier, which performed best in severe apnea patients only and degraded significantly in less severe disease classes.
In recent years, many studies have considered the oronasal thermal airflow signal for automated apnea detection [19,20,21,39,42,43,44,45,56]. Nevertheless, many of these studies have focused on patients with severe and moderate OSA conditions and have not given sufficient attention to less severe patient populations or to other types of apnea events. The present results highlight the importance of evaluating apnea models on patients of varying severity conditions as well as on different apnea types. This also agrees with previous literature, which demonstrated that the high accuracies achieved in patients with high severity levels may not generalize to other groups of patients [57,58]. The detailed analysis presented in this paper, using patients of varying apnea severity conditions and different apnea types, provides the basis for a more comprehensive understanding of the performance of apnea detection systems.
Comparative analysis between the performance of the proposed modeling framework and previously published research highlights some of the previously reported limitations of apnea detection methods based on a single respiration channel. In particular, previous studies have reported a significant fall in the diagnostic accuracy of automated sleep systems that measure two or fewer physiological parameters compared to those that measure three or more physiological variables [38,57]. Although many automated sleep systems have proved effective in assisting lab PSGs and at home, they still cannot completely replace dedicated centers for sleep studies [59].
The present study leverages the AASM recommendations for apnea detection in many aspects. We considered the AASM-recommended sensor for scoring apneic events in PSG diagnostic studies, which is the oronasal thermal airflow sensor. The criteria in the AASM manual were also employed for scoring an event after a drop in the peak thermal airflow signal excursions by ≥90% of the corresponding baseline for a duration ≥10 s [11]. Nevertheless, there are many sources of uncertainty in the criteria defined in the literature. First, the AASM does not provide a precise definition of the respiratory baseline. Second, the published criteria for mathematically scoring apnea are not consistent and vary between standards [60]. Moreover, the high intraobserver and interobserver variability due to human scoring and human error [17] significantly affects the robustness and performance of automated sleep systems. Adopting a probabilistic framework in this study provides an efficient approach to propagating different sources of uncertainty using a data-driven modeling framework optimized with respect to the ability to detect apnea events. Interestingly, the proposed GMM-based classification system showed an overall improved performance in apnea detection and a more consistent and generalizable performance across patients with different severity levels.
Finally, the presented approach focuses on automated detection of apneic events using the oronasal airflow respiration signal. Future work may extend the algorithm by adding the nasal pressure signal as an additional input in order to study dynamical changes in this signal during hypopnea, as recommended by the AASM. Using both the oronasal airflow and the nasal pressure signals will enable detection of both the apnea and the hypopnea events needed to estimate the AHI and thereby provide a diagnosis of sleep apnea severity. Additionally, incorporating two signals instead of one should improve the ability to estimate the respiratory baseline, potentially leading to fewer false positive detections. Moreover, future work may consider adding input from the respiratory effort signal to diagnose the type of detected events (OSA, CSA, or MSA).