1. Introduction
Obstructive sleep apnea syndrome (OSAS) is characterized by complete or partial obstruction of the upper airway during sleep. The main symptoms of OSAS are light sleep, excessive daytime sleepiness, and snoring; these are reported to increase the risk of developing serious illnesses, such as ischemic heart disease, hypertension, stroke, and cognitive dysfunction [1]. Furthermore, an estimated 6–19% of females and 13–33% of males have OSAS, with the prevalence increasing with age [2,3]. A definitive diagnosis of OSAS is currently made using polysomnography (PSG). However, this test requires multiple measurement sensors (e.g., oral thermistor, nasal pressure cannula, chest belt) to be worn directly on the body all night, which imposes a heavy burden on the patient. Previous studies have suggested that the discomfort of wearing multiple sensors during PSG and the resulting restriction of movement affect sleep efficiency, electroencephalographic (EEG) spectral power, and rapid eye movements [4,5,6,7].
To resolve these problems, research is being conducted with the aim of establishing OSAS screening methods based on noncontact microphones. These studies include (i) technological development to detect snoring/breathing episodes (SBEs) [8,9,10,11,12,13], (ii) studies of the snoring that characterizes OSAS [14,15,16,17,18], and (iii) snoring-based OSAS screening methods and their evaluation [19,20,21].
In [8,9,10,11,12,13], snoring characteristics were extracted from patients' respiratory sounds during sleep using the zero-crossing rate (ZCR), mel-frequency cepstral coefficients (MFCC), and other statistical processing; it was then shown that SBE sections could be classified using deep learning with accuracies in the range of 75.1–96.8% in various environments, including noisy ones.
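As a concrete illustration of the kind of frame-level features mentioned above, the sketch below computes the zero-crossing rate and short-time log energy with NumPy. The frame length, hop size, and toy input are illustrative choices, not parameters taken from the cited studies.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes within each frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def log_energy(frames, eps=1e-12):
    """Short-time log energy of each frame."""
    return np.log(np.sum(frames ** 2, axis=1) + eps)

# toy input: 1 s of white noise at 16 kHz, 25 ms frames with a 10 ms hop
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)
frames = frame_signal(x, frame_len=400, hop=160)
zcr = zero_crossing_rate(frames)
e = log_energy(frames)
print(frames.shape)  # (98, 400)
```

For white noise, the ZCR sits near 0.5; tonal snoring segments would show a lower, pitch-dependent ZCR and higher energy.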
In [14,15,16,17,18], the effectiveness of various temporal, frequency, intensity, and clinical features in OSAS and non-OSAS patients was evaluated to characterize OSAS-related upper airway obstruction.
In [19,20,21], snoring sounds obtained from noncontact microphones were segmented, features were extracted by statistical processing and formant acoustic analysis, and machine learning tools, such as logistic regression and AdaBoost, were used to classify OSAS/non-OSAS and sleep/waking states at sensitivities in the range of 80–90%, all at a low cost.
As shown in [22,23,24,25], it has recently been suggested that sleep–wake activity and sleep quality can be estimated based on the analysis of respiratory sounds obtained during sleep. These results emphasize the importance of detecting SBEs during sleep, and the automatic detection of SBEs from sleep sounds is the first step toward automatic, snoring-based OSAS screening. However, SBEs have a high dynamic range, exceeding 90 dB, and low-intensity SBEs are barely audible. Specifically, a method is needed to automatically detect low-intensity SBEs without any contact, even in a low-SNR environment.
Therefore, our research group has been developing a system that automatically detects low-intensity SBEs from sleep sounds obtained by noncontact recording [10,12]. This system has been shown to detect low-intensity SBEs with higher performance than other methods proposed in recent studies. However, its calculation speed and performance must be improved further for practical use.
The purpose of our study was to develop a more efficient method to detect low-intensity SBEs in sleep sound recordings.
Even if low-intensity SBEs are present in sleep sounds, human hearing can distinguish them from the surrounding sleep sounds by careful listening. This is because the human auditory pathway has an innate ability to analyze the fine temporal characteristics of sound. The auditory image model (AIM) [26,27,28], which simulates the human auditory mechanism from an engineering perspective, was developed by Patterson in 1995 [26].
To generate a stabilized auditory image (SAI), the AIM describes a process of strobed temporal integration that transforms the signal flow from the cochlea up the auditory nerve to the brain. For sound event classification [1,13,29], front-end, ear-like audio analysis has been conducted by generating features extracted from an SAI. However, the calculation of an SAI incurs large computational and memory costs. Conversely, sound events have been detected based on the peaks corresponding to glottal pulses that are already apparent in the neural activity pattern (NAP), the intermediate representation that is converted into an SAI [30,31,32]. Furthermore, spectral profiles produced from the NAP stage of the AIM have been used for sound recognition and for the analysis of cochlear implant representations [33,34]. Based on these reports, we hypothesize that the NAP carries information on the presence or absence of sound events even before SAI modeling.
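To make the NAP stage concrete, the following is a deliberately simplified sketch: an ideal FFT-based band-pass bank stands in for the cochlear (gammatone/gammachirp) filter bank, followed by half-wave rectification and power-law compression. The channel spacing, band edges, and compression exponent are illustrative assumptions, not the AIM's actual parameters.

```python
import numpy as np

def simplified_nap(x, fs, center_freqs, compress=0.3):
    """Rough NAP stand-in: per-channel band-pass filtering (ideal FFT masks
    in place of a gammatone/gammachirp bank), half-wave rectification (a
    crude hair-cell model), and power-law amplitude compression.
    Returns an array of shape (n_channels, len(x))."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    nap = np.empty((len(center_freqs), len(x)))
    for i, fc in enumerate(center_freqs):
        band = (freqs >= 0.8 * fc) & (freqs <= 1.2 * fc)
        y = np.fft.irfft(X * band, n=len(x))  # ideal band-pass (stand-in)
        y = np.maximum(y, 0.0)                # half-wave rectification
        nap[i] = y ** compress                # amplitude compression
    return nap

# toy input: 1 s of a 440 Hz tone at 16 kHz, 16 channels from 100 Hz to 4 kHz
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
nap = simplified_nap(x, fs, np.geomspace(100.0, 4000.0, 16))
print(nap.shape)  # (16, 16000)
```

As expected, activity concentrates in the channel whose pass-band covers 440 Hz, while distant channels stay near zero; the AIM's actual cochlear stage produces a far richer, temporally fine-grained pattern.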
A novel aspect of this study is that we propose a new feature, NAP-based cepstral coefficients (NAPCC), for the automatic, accurate, and faster detection of low-intensity SBEs in sleep sound recordings.
Based on leave-one-out cross-validation of sleep sound data stored in a database, the performance of the proposed method was investigated and compared with that of the low-intensity SBE detection method developed in our previous study in 2018 [10,12].
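The leave-one-out protocol can be sketched as follows; the per-subject grouping shown here is an assumption about how folds are formed, not the paper's exact split.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (train_idx, test_idx) index pairs, holding out all frames of
    one subject at a time, in the spirit of the leave-one-out protocol
    described above."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test = np.where(subject_ids == s)[0]
        train = np.where(subject_ids != s)[0]
        yield train, test

# toy example: 6 frames belonging to 3 subjects
ids = [0, 0, 1, 1, 2, 2]
folds = list(leave_one_subject_out(ids))
print(len(folds))  # 3 folds, one per held-out subject
```

Holding out whole subjects (rather than random frames) prevents frames from the same recording from leaking between training and test sets.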
To date, sleep–wake evaluation methods and OSAS screening methods have been developed using SBEs obtained with a noncontact approach [19,20,21]. High-intensity SBEs can be detected by an energy-based approach; if low-intensity SBEs can also be detected efficiently and automatically, as pursued in this study, then the presence or absence of a patient's breathing can be estimated from the recorded data regardless of SBE intensity.
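A minimal sketch of the energy-based approach mentioned above, assuming a simple fixed threshold relative to the loudest frame (the frame length and threshold value are illustrative):

```python
import numpy as np

def energy_based_segments(x, fs, frame_ms=50, thresh_db=-30.0):
    """Flag frames whose short-time energy exceeds a threshold relative to
    the loudest frame. High-intensity SBEs clear such a threshold easily;
    low-intensity SBEs near the noise floor do not, which is the limitation
    this study targets."""
    frame_len = int(fs * frame_ms / 1000)
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len)
    e = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    return e - e.max() > thresh_db   # boolean mask of "active" frames

# toy input: quiet background noise with one loud burst ("snore")
fs = 8000
rng = np.random.default_rng(1)
x = 0.001 * rng.standard_normal(fs)
x[2000:2400] += np.sin(2 * np.pi * 300 * np.arange(400) / fs)
mask = energy_based_segments(x, fs)
print(mask.sum(), len(mask))  # 1 active frame out of 20
```

The loud burst is detected, but a burst only a few dB above the noise floor would fall below the threshold, illustrating why an auditory-feature approach is needed for low-intensity SBEs.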
A noncontact approach based on sleep sound analysis has been developed with the objective of providing a cost-effective alternative for OSAS diagnosis. Incorporating the proposed method into such approaches may enable more accurate OSAS screening and sleep stage evaluation.
4. Discussion and Conclusions
In this study, we proposed a new method that uses auditory-property-based NAPCC features and ANN discriminators to automatically detect low-intensity SBEs from recordings acquired with a noncontact microphone. The effectiveness of the proposed method was investigated by detecting SBEs in recorded data from 25 individuals comprising silence and SBEs (Exp-1) and in recorded data from 15 individuals that also included non-SBE noise generated in an actual environment (Exp-2). Leave-one-out cross-validation was used to evaluate the performance in each experiment. The results suggested that SBEs could be detected with an average accuracy of 85.83% in Exp-1 and 85.99% in Exp-2. A comparison with the MLP-ANN-based SBE detection method proposed in our previous work [12] showed that the proposed method was approximately 3% better in Exp-1 and approximately 10% better in Exp-2. In particular, the standard deviation in both Exp-1 and Exp-2 was smaller with the proposed method than with the conventional method; hence, the influence of individual subjects appears to be reduced. Furthermore, the large improvement in specificity and positive predictive value (PPV) suggests that the new method is useful for the effective and automatic detection of silent or apneic sections contained in sleep sounds.
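For reference, the evaluation measures used above can be computed from a binary confusion matrix as follows; the toy labels are illustrative, not data from the experiments.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and PPV from binary labels
    (1 = SBE, 0 = non-SBE)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }

m = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(m["accuracy"])  # 0.666...
```

Specificity and PPV are the measures most sensitive to false positives, which is why their improvement indicates better rejection of silent or apneic sections.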
To date, gammatone frequency cepstral coefficients (GFCC) [48] and BMM-based cepstrum coefficients [49] have been developed, but both require gammatone filter bank outputs or mean normalization. To model human auditory perception more accurately, the DCGA filter was developed as an extension of the gammatone auditory filter; this filter bank accommodates the nonlinear behavior observed in human psychophysics and can be useful for perceptual signal processing [50]. The DCGA filter bank, which forms the front-end of NAPCC, may therefore be more noise-robust than the gammatone filter bank. Furthermore, the results obtained in this work showed that the noise robustness of NAPCC was improved by using sigmoid normalization instead of the mean normalization used in GFCC and BMM-based cepstrum coefficients. Therefore, NAPCC, which combines the DCGA filter bank with sigmoid normalization, should outperform GFCC and BMM-based cepstrum coefficients.
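The sigmoid-normalization-plus-cepstrum idea can be sketched as below. The sigmoid slope, channel count, and the unnormalized DCT-II are illustrative assumptions; the exact NAPCC definition is given in the paper's Methods rather than in this excerpt.

```python
import numpy as np

def napcc_like(channel_energies, n_coeffs=12, slope=1.0):
    """Illustrative NAPCC-style feature: sigmoid squashing of per-channel
    NAP energies (in place of GFCC's mean normalization), followed by an
    unnormalized DCT-II across channels to obtain cepstral coefficients."""
    z = 1.0 / (1.0 + np.exp(-slope * channel_energies))  # sigmoid normalization
    n_ch = z.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_ch)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_ch))  # DCT-II basis
    return z @ basis.T

# toy NAP spectral profiles: 5 frames x 32 auditory channels
rng = np.random.default_rng(2)
feats = napcc_like(rng.standard_normal((5, 32)))
print(feats.shape)  # (5, 12)
```

Because the sigmoid bounds each channel to (0, 1) regardless of the overall signal level, the resulting coefficients are less sensitive to broadband noise offsets than mean-normalized variants.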
In particular, for detecting low-intensity SBEs in sleep sound recordings, the method developed in our previous study in 2018 outperformed the techniques published up to that point [12]. Since 2018, newer technologies have been developed to classify snoring and non-snoring episodes in sleep sounds.
Lim et al. [9] proposed a recurrent neural network (RNN)-based method capable of classifying snoring and non-snoring episodes using features obtained from sleep data recorded with a smartphone. Their RNN-based classifiers achieved an accuracy of 98.9% on relatively small datasets. However, the snoring segments used in that work were created from the peak points of the snoring signals obtained via a peak-detection algorithm, meaning that relatively high-intensity snoring was selected for the analysis.
Jiang et al. [51] proposed an automatic snore detection method using sound maps and a series of neural networks. Their results demonstrated that the method is appropriate for identifying snores, with accuracies in the range of 91.8–95.1%. However, in that study, potential snoring episodes were segmented using an improved sub-band spectral entropy method, which is based on sub-band energy calculation.
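Sub-band spectral entropy, the statistic underlying that segmentation step, can be illustrated as follows; the band count and frame length are illustrative choices.

```python
import numpy as np

def subband_spectral_entropy(frame, n_bands=8):
    """Spectral entropy over coarse sub-bands of one frame: tonal snores
    concentrate energy in a few bands (low entropy), while broadband noise
    spreads it across bands (high entropy)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(spec, n_bands)
    p = np.array([b.sum() for b in bands])
    p = p / (p.sum() + 1e-12)                 # normalize to a distribution
    return -np.sum(p * np.log2(p + 1e-12))    # Shannon entropy in bits

# a 200 Hz tone (snore-like) vs. white noise, 1024-sample frames at 8 kHz
fs = 8000
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 200 * t)
rng = np.random.default_rng(3)
noise = rng.standard_normal(1024)
print(subband_spectral_entropy(tone) < subband_spectral_entropy(noise))  # True
```

Like the energy-based approach, such entropy thresholds still presume the candidate segment stands out from the background, which is exactly what low-intensity SBEs fail to do.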
Shen et al. [8] proposed the use of MFCC feature extraction and an LSTM model for the binary classification of snoring data. The experimental results showed that the developed method yielded an accuracy of up to 87%. However, very weak snoring sounds were not labeled in the PSG data used in that study.
Furthermore, an AIM-based classification method has been proposed for sleep sounds extracted by an energy-based approach, and it has been confirmed that these sleep sounds can be classified with high accuracy [13]. However, that method requires multiple acoustic features obtained from the SAI, which is converted from the NAP using strobed temporal integration (STI).
This study has the following advantages. The proposed method can be run at a low computational cost because it eliminates the computationally expensive STI processing used in the AIM and is built using only the stages up to the NAP. Additionally, it was confirmed that the proposed method detects low-intensity SBEs with higher performance than our previous method [12], and its computational speed was also significantly improved. Given that the proposed method outperformed our previous method even in Exp-2, the new feature proposed in this study (NAPCC) appears to be an acoustic feature that is robust against noise.
However, our study has some limitations: (i) the dataset was relatively small and did not cover a sufficient variety of sounds; (ii) for SBEs of short duration, the performance of the proposed method degraded because the NAPCC output corresponding to those SBEs became small in the NAPCC spectrogram (the ordered series of the NAPCC of each frame of the recorded data); and (iii) the proposed method uses the DCGA filter bank, which has the highest calculation cost in the NAPCC calculation procedure.
The proposed method is expected to contribute as a pretreatment step to OSAS screening based on snoring and respiratory sounds. It is thought to be useful for the effective and automatic identification of respiratory sound information, particularly apneic sections and silence, from sleep sounds acquired without contact.