Article

Voice Pathology Detection Using a Two-Level Classifier Based on Combined CNN–RNN Architecture

1 Department of Information Systems, College of Computer and Information Science, Princess Nourah bint Abdulrahman University, Riyadh 11671, Saudi Arabia
2 Computer Science Department, College of Computer Science and Information Technology, Jazan University, Jazan 45142, Saudi Arabia
3 Department of Computer Science, College of Science & Art at Mahayil, King Khalid University, Abha 62529, Saudi Arabia
4 College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
* Authors to whom correspondence should be addressed.
Sustainability 2023, 15(4), 3204; https://doi.org/10.3390/su15043204
Submission received: 11 December 2022 / Revised: 10 January 2023 / Accepted: 11 January 2023 / Published: 9 February 2023

Abstract

The construction of an automatic voice pathology detection system employing machine learning algorithms to study voice abnormalities is crucial for the early detection of voice pathologies and for identifying the specific type of pathology from which patients suffer. This paper’s primary objective is to construct a deep learning model for accurate speech pathology identification. Manual audio feature extraction was employed as a foundation for the categorization process. The most critical aspect of this work was incorporating an additional piece of information, i.e., voice gender, via a two-level classifier model. The first level determines whether the audio input is a male or female voice, and the second level determines whether the voice is pathological or healthy. As in the bulk of earlier efforts, the current study analyzed the audio signal by focusing solely on a single vowel, such as /a/, and ignoring phrases and other vowels. The analysis was performed on the Saarbruecken Voice Database. The two-level cascaded model attained an accuracy and F1 score of 88.84% and 87.39%, respectively, which is superior to earlier attempts on the same dataset and provides a stepping stone towards a more precise early diagnosis of voice complications.

1. Introduction

Speech is one of the most fundamental human faculties, and the voice is one of its subsystems. The natural voice is the auditory outcome of pulmonary air bursts interacting with the larynx, which causes the vocal folds to adduct and vibrate, producing intermittent and aperiodic sounds. Vocal hyperfunction, the term for repeated abusive vocal patterns, can occasionally cause speech disorders, including aphonia (the total loss of one’s voice) and dysphonia (the partial loss of one’s voice) [1]. Voice disorders can result from a variety of causes, including fatigue, environmental changes, muscular dystrophy, facial pain, and infections of the vocal tissue [2]. Vocal pathology affects the vocal folds’ mobility, functionality, and morphology, causing irregular vibrations and increased acoustic noise [3]. According to [4,5], such a voice significantly decreases voice quality by sounding strained, harsh, weak, and breathy [6,7]. Voice disorders are a serious and widespread issue because they make it difficult for people to interact effectively in social situations. Vocal pathological problems can occur in those who work in occupations that typically place high demands on the voice, such as actors, singers, auctioneers, lawyers, and instructors [8]. Moreover, diagnosing such conditions requires skilled professionals, who are often unavailable due to the lack of adequate medical services in remote places, and the examination procedures are uncomfortable and occasionally frightening. Consequently, a thorough investigation of options for voice pathology diagnosis is being carried out.
In this regard, the assessment of voice quality is typically conducted using one of two approaches: a subjective perceptual evaluation of the patient’s voice, where a score is assigned based on the listener’s assessment, or an objective method based on acoustic analysis, which quantifies specific aspects of the vocal acoustic signal [9]. The so-called in-hospital auditory–perceptual and visual examination of the vocal folds falls under the first category (subjective evaluation) [10,11]. Laryngostroboscopy is frequently utilized for visual assessment [12]. Several clinical rating scales have been established for the auditory–perceptual assessment to identify and rate the severity of vocal disorders [13,14,15]. However, subjective evaluation techniques are affected by inter-rater heterogeneity [16,17]. This assessment takes time, and physicians must carefully review and score it. Additionally, it demands the patient’s attendance at the clinic, which can be problematic, particularly in more advanced stages of the condition.
Clinical professionals might benefit from automated vocal pathology detection systems because they enable the objective evaluation and diagnosis of voice pathologies in their early stages. Additionally, because perceptual analysis is subjective, there is a growing demand for automatic systems that identify disordered voices or for objective criteria to rate voice quality [6]; this kind of judgment is naturally free of subjectivity. Modern intelligent devices also make it simple to record a voice and process it remotely utilizing cloud technology. Therefore, a system capable of precise differentiation between healthy and pathological voices can be created using signal processing techniques (to quantify vocal manifestations of the pathology under focus) and machine learning algorithms (to automate the process of voice pathology diagnosis).
Traditional machine learning and deep learning-based technologies, as well as their combination, can be used to handle challenges in voice pathology detection [18]. In feature extraction and analysis-based machine learning techniques, voice signals are first processed to collect features before being classified as pathological or expected based on these features. This method has drawbacks and difficulties, such as choosing an appropriate classification algorithm or manually choosing applicable voice attributes. Deep learning-based techniques that automatically extract features for improved classification performance may be more effective in overcoming these issues and improving the performance of speech pathology detection systems.
This study suggests a deep learning-based multi-modal fusion method for automatically detecting speech pathology. To identify voice disorders in this work, we used the Saarbruecken Voice Database (SVD) [19]. In contrast to previous studies, the gender feature was added to the pathology diagnosis model alongside the conventionally extracted audio features. This method therefore effectively blends handcrafted and deep characteristics to exploit the complementary information each offers. The deep learning architecture combined a convolutional neural network (CNN) and a recurrent neural network (RNN). Hence, three different models were proposed for pathology detection from speech: (1) a simple detection model using a single classifier to identify healthy or pathological speech samples; (2) a model classifying the speech samples into five disordered classes and one healthy class; and (3) a hierarchical gender-based model consisting of gender categorization followed by a pathology detection model for each gender.
The significant contributions of this study include:
  • A novel multi-model architecture, a coupled CNN–RNN, for the classification of healthy and pathological audio samples.
  • A two-level cascaded architecture that enables the accurate identification of pathological voices from the input dataset by incorporating gender information and manually extracted features.
The rest of the paper is organized as follows: Section 1 presents the introduction, followed by Section 2, which offers a literature review. Section 3 presents the study’s dataset, the applied preprocessing techniques, and the methodology. Section 4 shows the study’s results. Section 5 is dedicated to discussion. The paper then concludes with Section 6.

2. Literature Review

Diseased and healthy voices are classified based on both automated feature extraction utilizing deep learning and classical feature extraction using machine learning.
The most widely used voice features for detecting voice pathology are based on time, frequency, and cepstral domains. These include jitter, shimmer, the Harmonics-to-Noise Ratio (HNR), Relative Spectra, Perceptual Linear Prediction (RASTA-PLP), Discrete Wavelet Energy, Discrete Wavelet Entropy [20], and Linear Prediction Cepstral Coefficients [21].
Shimmer, jitter, and seven more parameters were utilized by [22] with an iterative residual signal estimator; jitter alone achieved 54.8% accuracy across 21 pathologies.
In another study [23], the authors used a GMM classifier to categorize normal and pathological voice or speech patterns using one of the most widely used acoustic feature extraction methods, Mel-frequency cepstral coefficients (MFCCs), based on voice samples extracted from the Massachusetts Eye and Ear Infirmary (MEEI) database. In the study of [24], 22 acoustic characteristics were chosen; these parameters were calculated for each sample (50 dysphonic and 50 normal) and then fed to six different classifiers to compare their accuracy. The authors employed two independent feature reduction approaches before applying the classification methods. The binary SVM classifier provided the best recognition accuracy, i.e., 94.26%.
Another study [25] proposed a system for voice disorder detection using an SVM on the SVD [19]. It concentrated on developing a reliable and robust feature extraction scheme to recognize and classify voice disorders through autocorrelation and entropy analysis in various frequency bands. Three language datasets (English, German, and Arabic) were mined for various continuous vocal examples of both normal and disordered voices. A support vector machine was employed as the classifier. The highest accuracy recorded on the SVD was 92%.
Another study [26] applied a deep learning model employing practically all SVD samples and obtained an accuracy rate of 68%. Specifically, an LSTM, a type of recurrent neural network, was used for pathology identification.
Using a machine learning model for the selected /a/, /i/, and /u/ vowels and a sentence comprising all 71 pathological subsets, Refs. [27,28] reported accuracies of 74.32% and 85.71%, respectively. On the other hand, some of the studies only used three disease subgroups and statistical techniques to obtain above 90% accuracy [29].
The SVD was subjected to feature extraction by [30]. After feature extraction, the system input was fed into 27 convolutional and recurrent neural network layers. After 10-fold validation, the dataset was separated into training and testing sets, and the reported accuracies of the CNN and RNN were 87.11% and 86.52%, respectively.
To create deep features, Ref. [8] integrated two concurrent convolutional neural networks (CNNs), one for voice signals and the other for EGG signals; traditional handcrafted features were obtained in parallel. These features were concatenated to create a more prominent feature set, and a feature selection technique was used to eliminate unnecessary characteristics. Finally, the vocal pathology was detected using an SVM classifier. The Saarbruecken Voice Database (SVD) was used for the tests. The experimental findings demonstrate that the suggested voice pathology identification approach, when applied to all speech and EGG samples, achieves an accuracy of up to 90.10%, with F1-score results of 92.9%, 84.6%, and 92.57%.
The authors of [31] proposed the OSELM algorithm as a classifier and the Mel-frequency cepstral coefficient (MFCC) as the feature extraction method; the Saarbrücken voice samples were used for this study. In the study of [32], a unique method was proposed for diagnosing diseased voices by applying signal processing techniques and models of cochlear implants. In the experiment, two different cochlear implant models were evaluated. The speech samples were processed using the proposed method, and a CNN was used to make a definitive diagnosis of the pathological voice. The findings indicate that the two proposed models, using bandpass and gammatone filter banks, could differentiate between pathological and healthy voices, resulting in F1 scores of 77.6% and 78.7%, respectively, on speech samples.
In addition to studies based on the SVD and MEEI datasets, there were also studies based on additional datasets. A unique method for classifying four common voice problems (i.e., functional dysphonia, neoplasm, phonotrauma, and vocal palsy) was proposed by [33]. This method makes use of continuous Mandarin speech rather than a single vowel. First, they converted the acoustic data into Mel-frequency cepstral coefficients. Then, they used a bi-directional long short-term memory network (BiLSTM) to represent the sequential characteristics of the signal. The experimental findings show that the suggested framework improves accuracy from 78.12% to 89.27% and unweighted average recall from 50.92% to 80.68% compared with systems that use only a single vowel.
By examining previous related studies, it is noticeable that they employ a variety of approaches, including RNNs, CNNs, MFCCs, LSTMs, and others, although many failed to reach high accuracy. Therefore, the main goal of this study is to develop a deep learning model for precise speech pathology detection. Manual audio feature extraction served as the framework for classification. Unlike previous studies, this study applies a two-level classifier model to incorporate gender as an additional piece of data: the first level determines whether the audio input contains a male or female voice, and the second level determines whether the voice is pathological or healthy. The suggested CNN–RNN model is explained in the next section.

3. Materials and Methods

Figure 1 depicts the methodology followed. The first two steps include the preparation of the dataset and the extraction of features. Data collection involved gathering both healthy and diseased samples from the available open-source resources. The audio features were extracted from the audio samples so that they could be loaded into the deep learning model for classification. However, the total number of auditory instances differed between classes. To address this issue, audio preprocessing was carried out to increase the total number of voice samples in each category.

3.1. Dataset Preparation

3.1.1. Dataset

The SVD (Saarbruecken Voice Database) is a well-known audio pathology dataset. This database offers a comprehensive collection of normal and abnormal speech samples from more than 2000 individuals, all captured in the same environment. The dataset contains the average, high, and low-pitch pronunciations of the German vowels /i/, /a/, and /u/, as well as the sentence “Guten Morgen, wie Geht es Ihnen?” The 1 to 3 s speech samples were collected at 50 kHz with 16-bit resolution.
As a first step towards exploring this dataset, the current study concentrated solely on the vowel sound /a/, as did most earlier work for audio classification tasks. Figure 2 depicts the number of healthy and pathological voice samples in the retrieved /a/ vowel dataset.

3.1.2. Data Preprocessing

As shown in Figure 2, the dataset was unbalanced, which may consequently impact the model’s performance. To address this deficiency, audio data augmentation was applied. We concentrated on the four most common techniques, described below (a minimal code sketch of all four follows the list).
(a)
Noise injection
White noise can be generated with NumPy’s random number routines and simply added to the data; the noise can be either fixed or random, depending on whether or not a threshold value is maintained. Introducing noise during the training phase of a neural network model produces a regularization effect, which, in turn, improves the model’s robustness.
(b)
Time shifting
This moves the audio to the left or right by an arbitrary number of seconds. If the audio is shifted to the left (fast forward), the initial x seconds are marked as 0 (i.e., silence). If the audio is shifted to the right (backward), the final x seconds are marked as 0 (i.e., silence).
(c)
Changing pitch
This is a practical application of the pitch scaling method used in musical instruments: a method for altering the pitch of a sound without affecting its speed. Librosa’s pitch-shift function implements it, raising or lowering the pitch by a specific or random amount. The wave samples and sampling rate are required to perform the operation.
(d)
Stretch (speed)
The process of altering the rate or duration of an audio transmission without affecting its pitch is referred to as time stretching. In contrast, pitch scaling refers to altering a sound’s pitch while maintaining its original speed. Pitch shift is a pitch scaling designed for live performances and implemented in effects units. A recording can be slowed down or sped up during the pitch control process, which makes it possible to concurrently alter both the pitch and the speed of the recording.
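As a concrete illustration of these four techniques, the sketch below applies them with NumPy and Librosa; the noise amplitude, shift range, semitone range, and stretch factors are illustrative choices, not the exact values used in this study.

```python
import numpy as np
import librosa

def augment(y, sr):
    """Return four augmented variants of a waveform y sampled at rate sr
    (parameter values below are illustrative assumptions)."""
    # (a) Noise injection: add low-amplitude white noise
    noisy = y + 0.005 * np.random.randn(len(y))
    # (b) Time shifting: roll the samples by up to +/- 0.5 s and silence the gap
    shift = np.random.randint(-int(0.5 * sr), int(0.5 * sr))
    shifted = np.roll(y, shift)
    if shift > 0:
        shifted[:shift] = 0
    elif shift < 0:
        shifted[shift:] = 0
    # (c) Changing pitch: shift the pitch by a random number of semitones
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2, 2))
    # (d) Stretch (speed): change duration without changing pitch
    stretched = librosa.effects.time_stretch(y, rate=np.random.uniform(0.8, 1.2))
    return noisy, shifted, pitched, stretched
```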

3.2. Feature Extraction

3.2.1. Zero-Crossing Rate (ZCR)

The number of times a signal flips from positive to zero to negative, or from negative to zero to positive, within an audio frame is called the Zero-Crossing Rate (ZCR). It can also serve as an indicator of the amount of background noise in the audio signal. Equations (1) and (2) define the ZCR. The plot of the ZCR is given in Figure 3.
Z(i) = \frac{1}{2W_L} \sum_{n=1}^{W_L} \left| \mathrm{sgn}[x_i(n)] - \mathrm{sgn}[x_i(n-1)] \right|   (1)
where x_i(n) is the audio signal, W_L is the length of the window, and sgn is the sign function, which can be calculated as
\mathrm{sgn}[x_i(n)] = \begin{cases} 1, & x_i(n) \ge 0 \\ -1, & x_i(n) < 0 \end{cases}   (2)
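As a minimal sketch, Equations (1) and (2) can be implemented directly in NumPy as follows, assuming `frame` is a 1-D array of samples:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Zero-Crossing Rate of a single frame, following Equations (1) and (2)."""
    signs = np.where(frame >= 0, 1, -1)                      # sgn[x_i(n)] per Equation (2)
    return np.sum(np.abs(np.diff(signs))) / (2 * len(frame))
```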

3.2.2. Root-Mean-Square Energy (RMSE)

The RMS Energy of the audio signal can be considered the overall magnitude of the movement. In the context of audio signals, this corresponds to how loud the sound is. The formula for determining the signal’s energy is Equation (3).
\mathrm{Energy} = \sum_{n} |x(n)|^2   (3)
where x(n) is the audio signal.
The Root-Mean-Square (RMS) calculation is a helpful tool for determining the average of a quantity over time. When working with sound, the value of the signal (the amplitude) is squared, the result is averaged over some period of time, and, finally, the square root of that average is taken. Equation (4) presents the mathematical definition of the Root-Mean-Square Energy (RMSE). The RMSE and the audio signal’s energy are shown in Figure 4.
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n} |x(n)|^2}   (4)
where N is the length of the signal, and x(n) is the audio signal.
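A minimal NumPy sketch of Equations (3) and (4), again assuming `frame` is a 1-D array of samples:

```python
import numpy as np

def energy(frame):
    """Signal energy per Equation (3)."""
    return np.sum(np.abs(frame) ** 2)

def rms_energy(frame):
    """Root-Mean-Square Energy per Equation (4)."""
    return np.sqrt(energy(frame) / len(frame))
```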

3.2.3. Mel-Frequency Cepstral Coefficients (MFCCs)

Mel-frequency cepstral coefficients (MFCCs) are a well-known method for extracting 39 features from an audio signal. The procedure for the MFCC feature extraction is depicted in Figure 5.
(a)
A/D conversion
The process of converting analog audio signals to their digital counterparts. Frequently, sampling frequencies of 8 or 16 kHz are employed.
Preemphasis: Preemphasis amplifies the energy at higher frequencies. When researchers examine the frequency domain of voiced segments such as vowels, they find that the lower frequencies contain more energy than the higher frequencies, a phenomenon known as spectral tilt. Increasing the amount of energy at high frequencies helps to improve phone detection and, consequently, the model’s performance as a whole.
Windowing: The MFCC technique aims to extract features from the audio signal that can be used to identify the phones in speech. However, the given audio signal contains many phones; hence, we divide it into several segments, with each segment lasting 25 milliseconds and successive segments spaced ten milliseconds apart. The windowing process involves dividing the audio waveform into frames that slide along the signal. We derived 39 features from each segment.
Let us assume that the window applied to the primary audio clip in the time domain is denoted by w, as in Equation (5):
x[n] = w[n] \cdot s[n]   (5)
where x[n] denotes the sliced frame, and s[n] is the original audio signal.
In addition, as we break the signal, if we directly cut it off at the edges of the signal, it would cause noise in the high-frequency domain due to the sudden drop in amplitude at the boundaries. The signal is chopped up using Hamming and Hanning windows rather than a rectangular window, since these windows do not generate noise in the high-frequency range like rectangular windows.
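The preemphasis and windowing steps can be sketched with Librosa as follows; the file name, the 16 kHz sampling rate, and the preemphasis coefficient of 0.97 are illustrative assumptions, not values prescribed by this study.

```python
import numpy as np
import librosa

# Load a vowel recording (hypothetical file name) and resample to 16 kHz
y, sr = librosa.load("svd_sample_a.wav", sr=16000)

# Preemphasis: boost high-frequency energy to counteract spectral tilt
y_pre = librosa.effects.preemphasis(y, coef=0.97)

# Windowing: 25 ms frames every 10 ms, each multiplied by a Hamming window (Equation (5))
frame_len, hop_len = int(0.025 * sr), int(0.010 * sr)
frames = librosa.util.frame(y_pre, frame_length=frame_len, hop_length=hop_len)
windowed = frames * np.hamming(frame_len)[:, np.newaxis]   # shape: (frame_len, n_frames)
```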
(b)
DFT (Discrete Fourier Transform)
The signal at the time domain is converted to the frequency domain by applying the DFT, as the analysis is more straightforward in the frequency domain. The equation for the DFT is given in Equation (6).
X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j\frac{2\pi}{N}kn}   (6)
where N is the number of samples, n is the current sample, and k is the current frequency, with k ∈ [0, N − 1]; x[n] is the signal value at sample n, and X[k] is the DFT output, which includes information on both amplitude and phase.
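For illustration, the DFT of one windowed frame can be computed with NumPy’s FFT routines; the frame below is a synthetic stand-in rather than real data.

```python
import numpy as np

frame = np.hamming(400) * np.random.randn(400)   # one illustrative windowed frame
X = np.fft.rfft(frame)                           # spectrum X[k] per Equation (6), real-input FFT
magnitude, phase = np.abs(X), np.angle(X)        # amplitude and phase information
```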
(c)
Mel Filter Bank
Machines perceive sound frequencies differently from the way people do. Humans are less sensitive to higher-frequency sounds: as frequency increases, the perceived frequency resolution decreases. The resolution of machines, by contrast, is unaffected by the frequency at which they operate. Modeling this property of human hearing during the feature extraction step can therefore improve the model’s performance. The Mel scale is used to translate the actual frequency into the frequency scale human beings perceive. Equation (7) provides the formula used for the mapping.
\mathrm{mel}(f) = 1127 \, \ln\left(1 + \frac{f}{700}\right)   (7)
where f is the frequency of the signal in Hz.
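A one-line sketch of Equation (7):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the Mel scale per Equation (7)."""
    return 1127.0 * np.log(1.0 + f_hz / 700.0)

print(round(hz_to_mel(1000.0)))   # 1000 Hz maps to roughly 1000 mel
```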
(d)
Applying Log
The Mel filter bank produces a power spectrum. The logarithm mimics the way humans perceive changes in the energy of an audio source: at higher energies, larger changes are needed before they are noticed. When x is small, the gradient of the log function is larger, whereas when x is large, the gradient is smaller. Therefore, taking the logarithm of the Mel filter bank output simulates the human auditory system and helps reduce acoustic variations that are not essential for speech recognition.
(e)
IDFT
This is the inverse transform of the output from the previous step. After the IDFT has been applied, the MFCC model (Figure 6) uses the top 12 coefficients of the signal. In addition to these 12 coefficients, the energy of the signal frame is used as a feature. The equation used to compute the energy is given as Equation (8):
\mathrm{Energy} = \sum_{t=t_1}^{t_2} x^2[t]   (8)
where x[t] is the discrete-time signal.
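In practice, the whole pipeline above is typically obtained in a few lines with Librosa. The sketch below assumes the conventional 39-dimensional set of 13 cepstral coefficients plus their delta and delta-delta coefficients; the file name is hypothetical.

```python
import numpy as np
import librosa

y, sr = librosa.load("svd_sample_a.wav", sr=None)      # hypothetical SVD /a/ recording

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # 13 cepstral coefficients per frame
delta = librosa.feature.delta(mfcc)                     # first-order (delta) coefficients
delta2 = librosa.feature.delta(mfcc, order=2)           # second-order (delta-delta) coefficients
features_39 = np.vstack([mfcc, delta, delta2])          # shape: (39, n_frames)
```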

3.3. Model Development

3.3.1. Model Architecture

Our proposed solution is a convolutional neural network (CNN) [35] integrated with a recurrent neural network (RNN) [34] with LSTM layers. The CNN typically performs very well with image data, whereas the RNN usually does very well with data that have a specific order, such as time series. Since the data in this case consisted of audio, which unquestionably has a sequence, selecting the RNN was the most natural option. However, the features were extracted into a Pandas data frame; hence, the input is analogous to image data with several dimensions. The CNN was therefore an ideal architecture for the classification task because it excels at discovering hidden characteristics present in a dataset. Thus, RNN layers were integrated into a CNN design so that the benefits of both architectures increase the accuracy of the intended task. The proposed multi-modal architecture is shown in Figure 7. The model was given the data frame containing the retrieved audio features for further processing. The first five blocks were responsible for convolution; each block contained three layers to extract more useful information from the generated feature data frame: a Conv 1D layer, a batch normalization layer, and a max-pooling layer. These features were then fed into an RNN block, which includes an LSTM layer and an attention layer. Finally, the compact and distinctive features were passed to the dense layers to be classified into the required number of categories. A minimal sketch of this architecture is given below.
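The sketch below illustrates one way to realize this architecture in Keras, as referenced above. The filter counts, kernel sizes, LSTM width, input dimensionality, and dense-layer size are illustrative assumptions; only the overall structure (five Conv1D/batch-normalization/max-pooling blocks, an LSTM with attention, and dense classification layers) follows the description in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_rnn(input_dim, n_classes):
    """Sketch of the combined CNN-RNN classifier of Figure 7 (hyperparameters assumed)."""
    inputs = layers.Input(shape=(input_dim, 1))          # extracted features as a 1-D sequence
    x = inputs
    # Five convolution blocks: Conv1D + batch normalization + max pooling
    for filters in (32, 64, 64, 128, 128):
        x = layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    # RNN block: LSTM followed by an attention layer over its outputs
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.Attention()([x, x])                       # self-attention over the LSTM outputs
    x = layers.GlobalAveragePooling1D()(x)
    # Dense classification head
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_cnn_rnn(input_dim=162, n_classes=2)        # 162-dimensional feature vector is an assumption
```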

3.3.2. Experimental Design

The current work also proposed a cascaded two-stage model capable of classifying the pathology of voices. The architecture is shown in Figure 8.
We suggest using a two-level classifier to solve the problem of automatically classifying audio samples while exploiting the gender information contained in them. This architecture comprises three different models: one is used in the first level, while the other two are used in the second. The first step identifies whether the voice comes from a male or a female speaker, as demonstrated in Figure 8; the model employed at this level is therefore a binary gender classifier trained on the feature data frame extracted from the audio dataset, with male and female as the class labels. At the second level, we decide whether the audio is healthy or pathological using separate models trained for each gender. The result obtained at the first level thus contributes to an accurate assessment of the voice as either healthy or pathological, with a gender-specific binary classification model identifying pathological voices for each gender. A sketch of the cascaded inference is shown below.
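A minimal sketch of the cascaded inference, as referenced above, assuming three trained Keras models and a single feature vector; the label orderings (0 = female/healthy, 1 = male/pathological) are assumptions for illustration.

```python
import numpy as np

def two_level_predict(features, gender_model, female_model, male_model):
    """First predict gender, then route the sample to the gender-specific
    pathology classifier (all models follow the Keras predict() interface)."""
    batch = features[np.newaxis, ...]                              # add a batch dimension
    is_male = gender_model.predict(batch).argmax(axis=-1)[0] == 1  # assumed: 0 = female, 1 = male
    second_level = male_model if is_male else female_model
    label = second_level.predict(batch).argmax(axis=-1)[0]         # assumed: 0 = healthy, 1 = pathological
    return ("male" if is_male else "female",
            "pathological" if label == 1 else "healthy")
```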

3.3.3. Training Pipeline

The initial feature extraction from the audio dataset was performed using several libraries, particularly Librosa, NumPy, OpenCV, and Scikit-Learn, with other open-source modules used for all additional processing and analysis. All three developed models used the proposed multi-model architecture depicted in Figure 7, were constructed with the Keras framework supported by TensorFlow, and were coded in Python. We trained each model for 50 epochs with a batch size of eight. The ratio between the training and test datasets was 80:20, and 10% of the training samples were designated for validation. The training was conducted on an NVIDIA Quadro P1000 GPU with 32 GB of memory. The optimizer was “rmsprop” with the “categorical cross-entropy” loss function. A dynamic learning rate technique was implemented by observing the change in validation accuracy over three epochs and reducing the learning rate by a factor of 0.5 until it hit a minimum of 0.00001. A sketch of this training setup follows.
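The training setup described above can be sketched as follows; `model` is the CNN-RNN built in the previous sketch, and `X_train`/`y_train` stand for the extracted feature array and one-hot labels (assumed names).

```python
import tensorflow as tf

# Optimizer and loss as described in the text
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Dynamic learning rate: halve when validation accuracy stalls for 3 epochs,
# down to a floor of 1e-5
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_accuracy",
                                                 factor=0.5,
                                                 patience=3,
                                                 min_lr=1e-5)

# 50 epochs, batch size 8, 10% of the training split held out for validation
history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=8,
                    validation_split=0.1,
                    callbacks=[reduce_lr])
```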

4. Results

A. Detection of the pathological samples by the proposed multi-model architecture
(1) Binary classification (Healthy vs. Pathological)
We began by training the suggested multi-model architecture on the training dataset to determine how well it performed in the binary classification of audio recordings (Healthy vs. Pathological). The loss and accuracy plots generated during the training procedure are depicted in Figure 9. The loss value for each epoch is obtained by calculating the loss function over the data items seen in that epoch; when the loss is charted as a curve over iterations, each point reflects only a fraction of the whole dataset. Analyzing both the validation and training loss gives a deeper understanding of the training behavior, and the accuracy curve is another curve frequently used to understand how neural networks learn. Figure 9 illustrates that, as the epochs advance, the validation loss decreases while the accuracy rises steadily until it reaches an optimal level. Consequently, there is no clear indication that the model was overfitting.
The model’s performance in terms of the different metrics on the test dataset is presented in Table 1, and the confusion matrix is shown in Figure 10. As seen in Table 1, the model performed well, achieving an accuracy of 87% on the test set. From the confusion matrix, it is clear that there were relatively few instances of incorrect classification, which contributed significantly to the model’s overall performance. However, more healthy voice samples were incorrectly predicted as pathological than the reverse.
(2) Multiclass classification
The same multi-model architecture, with a slight modification to the number of nodes in the output/softmax layer, was used to classify the voice samples into six classes: Healthy and five pathological classes (Balbuties, Dysodie, Dysphonie, Laryngitis, and Vox senilis). The loss and accuracy plots are shown in Figure 11, the performance is analyzed in Table 2, and the class-wise performance is depicted in the confusion matrix in Figure 12.
B. Detection of the pathological samples by the two-level architecture
The two-level architecture consists of three models: the gender detection model, the pathology detection model on female voices, and the pathology detection model on male voices for the binary classification task.
(1) Gender detection
Gender prediction is the first phase of the two-level classification model. Since the current dataset contains only male and female voice samples, the gender detection model is also binary. Table 3 depicts the performance of the gender detection model. The gender classification model achieved the best overall performance, which shows that the dataset had enough features to separate the gender classes correctly. A model with such strong performance can contribute significantly to the pathology detection model. For this purpose, the proposed CNN–RNN architecture was utilized.
The class-wise performance is given in Table 4. It demonstrates that the model performed consistently for both the female and male demographics. Given that there is no evidence of bias, the model is ideal for gender detection. A result of this kind can be incorporated as a feature in a pathology detection model.
(2) Binary classification (Healthy vs. Pathological)
The two-level architecture for binary classification consists of two steps. First, we predict the gender of each voice sample. This gender information then contributes to the classification of the sample as either pathological or healthy, as we use separate models for the female and male voices. The overall performance of the two-level cascaded architecture is depicted in Table 1, and the confusion matrix on the test dataset is shown in Figure 13.

5. Discussion

The voice, an intricate array of sounds emanating from our vocal cords, is a rich source of information and plays an essential part in social interaction [36]. This is because it enables us to communicate our feelings, fears, and excitement by varying the tone or pitch of our voice. However, various disorders can impact organs such as the heart, lungs, brain, muscles, and vocal folds, altering a person’s voice. Patients from all around the world suffer from a variety of vocal issues. Because of the inherent difficulties in identifying vocal disorders without specialized equipment and experienced staff, many patients either go undiagnosed or receive an evaluation biased by the examiner’s personal experience [4].
Additionally, due to the high cost of the diagnostic tests, many did not follow up on the suggested surgical treatments [37]. The use of data analysis, which can effectively detect and diagnose persons at a fraction of the cost, has been increasingly growing in recent years as a means to reduce the financial burden associated with the diagnostic process. Therefore, applying Artificial Intelligence (AI) to voice analysis paves the way for new possibilities in the medical field.
However, the subfields of AI, such as ML and DL, require enormous amounts of data to construct a model capable of detecting problematic voices. The SVD used in the present work had a sufficient number of healthy and diseased speech samples. However, it had intrinsic limits. As stated by [38], this dataset showed an uneven distribution of healthy and pathological audio data, as shown in Figure 2. Some of the samples lacked information regarding the degree of the pathology and the appearance of the pathology in phonation, meaning that some samples may sound healthy, despite being categorized as pathological, and vice versa. In addition, some recordings were tagged with more than one sort of pathology, making it challenging to discern numerous pathology classes.
Before extracting features from the audio samples, we applied various audio augmentation techniques to rebalance the dataset and thus obtain more reliable results. In addition to these operations on the raw audio samples, a manual feature extraction technique was utilized because earlier research established that, when working with such a small amount of training data, it is preferable to use inputs with reduced dimensionality rather than raw waveform inputs; alternatively, transfer learning or data augmentation can be utilized [38]. Ref. [38] used an XGBoost architecture in their work; however, it obtained an accuracy of 77.21% for the classification of Healthy vs. Pathological voices. Instead of transfer learning methods, we employed a hybrid CNN–RNN architecture for the same task. We anticipated that the convolutional layers would learn to recognize a variety of patterns that help distinguish normal voices from voices with pathological conditions. The time-distributed abstract feature vectors output by the convolution stacks are subsequently transformed by long short-term memory (LSTM) layers into a representation understandable by fully connected dense layers, which execute the final classification. A similar model was designed by [26], but that model was developed on the raw audio signal rather than an extracted feature set. Similarly, in [30], separate CNN and RNN models were developed on the SVD. Nevertheless, their performance was worse than that of the proposed system (Table 5).
In addition, it was evident that the acoustic signals produced by male and female participants were somewhat different. Hence, we devised a two-level architecture in which all samples were separated into two groups based on the gender of the speaker. Then, for each group, a gender-dependent classifier was generated. In the recognition step, separate classifiers recognized samples based on the group to which they belong. The two-level architecture outperformed the typical classification model, as seen in Table 1. Table 3 and Table 4 show that the first-stage gender classification model had the best performance. Hence, gender information contributed significantly to the effectiveness of pathology detection tasks.
However, the language element may substantially impact performance when the application is used in a real-world setting [39]. We plan to collect data from patients who are both healthy and pathological under the guidance of medical professionals. Additionally, to overcome the labeling ambiguity in the SVD, more detailed data collection will be performed, or audio samples will be combined from multiple datasets in future works. Explainability is another open issue: the system should be able to indicate which section of the acoustic signal contributed to its conclusion. In this regard, a potential future extension would be the development of a network architecture with the capacity to explain its reasoning for a prediction.

6. Conclusions

The proposed work merges deep learning-based applications to provide an integrated speech pathology audio categorization system in which a CNN, an RNN, and LSTM layers work together. We found that applying data augmentation before manual audio feature extraction has a positive effect. The two-level classifier created in this study first determines whether the input audio corresponds to a male or female speaker and then uses a CNN–RNN model trained independently for each gender class to identify abnormal speech samples. This two-level architecture, which incorporates gender information, was able to accurately categorize 88.83% of the audio samples in the test dataset, a significant improvement over earlier research on the same dataset. Due to the labeling ambiguity in the SVD, it was not possible to create a three-level architecture by further categorizing pathological samples into various classes in this work. More effort must be made to generate an error-free dataset for the challenge of multiclass pathology identification. In addition, more extensive feature extraction and selection must be performed, and the contribution of each feature must be investigated in depth to create a more accurate and effective classification model.

Author Contributions

Conceptualization, A.K. and N.A.H.; methodology, M.Z., N.A.; software, M.M.A.; validation, M.Z., M.A. and M.Z.; formal analysis, M.Z., N.A.; investigation, M.A.; resources, A.K.; data curation, N.A.H.; writing—original draft preparation, M.Z., N.A.; writing—review and editing, N.A.H.; visualization, A.K.; supervision, A.K.; project administration, M.A.; funding acquisition, A.K. All authors have read and agreed to the published version of the manuscript.

Funding

Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R333), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset details are available in the manuscript.

Acknowledgments

The authors acknowledge the funding support provided by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R333), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Titze, I.R.; Verdolini, K. Vocology: The Science and Practice of Voice Habilitation; National Center for Voice and Speech: Salt Lake City, UT, USA, 2012. [Google Scholar]
  2. Al-Dhief, F.T.; Latiff, N.M.A.; Malik, N.N.N.A.; Salim, N.S.; Baki, M.M.; Albadr, M.A.A.; Mohammed, M.A. A Survey of Voice Pathology Surveillance Systems Based on Internet of Things and Machine Learning Algorithms. IEEE Access 2020, 8, 64514–64533. [Google Scholar] [CrossRef]
  3. Muhammad, G.; Altuwaijri, G.; Alsulaiman, M.; Ali, Z.; Mesallam, T.A.; Farahat, M.; Malki, K.H.; Al-Nasheri, A. Automatic voice pathology detection and classification using vocal tract area irregularity. Biocybern. Biomed. Eng. 2016, 36, 309–317. [Google Scholar] [CrossRef]
  4. Hillenbrand, J.M.; Houde, R.A. Acoustic Correlates of Breathy Vocal Quality: Dysphonic Voices and Continuous Speech. J. Speech Lang. Hearing Res. 1996, 39, 311–321. [Google Scholar] [CrossRef] [PubMed]
  5. Teager, H. Some observations on oral air flow during phonation. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 599–601. [Google Scholar] [CrossRef]
  6. Mekyska, J.; Janousova, E.; Gomez-Vilda, P.; Smekal, Z.; Rektorova, I.; Eliasova, I.; Kostalova, M.; Mrackova, M.; Alonso-Hernandez, J.B.; Faundez-Zanuy, M.; et al. Robust and complex approach of pathological speech signal analysis. Neurocomputing 2015, 167, 94–111. [Google Scholar] [CrossRef]
  7. Brabenec, L.; Mekyska, J.; Galaz, Z.; Rektorova, I. Speech disorders in Parkinson’s disease: Early diagnostics and effects of medication and brain stimulation. J. Neural. Transm. 2017, 124, 303–334. [Google Scholar] [CrossRef]
  8. Omeroglu, A.N.; Mohammed, H.M.; Oral, E.A. Multi-modal voice pathology detection architecture based on deep and handcrafted feature fusion. Eng. Sci. Technol. Int. J. 2022, 36, 101148. [Google Scholar] [CrossRef]
  9. Barsties, B.; De Bodt, M. Assessment of voice quality: Current state-of-the-art. Auris Nasus Larynx 2015, 42, 183–188. [Google Scholar] [CrossRef]
  10. Oates, J. Auditory-Perceptual Evaluation of Disordered Voice Quality. Folia Phoniatr. Logop. 2009, 61, 49–56. [Google Scholar] [CrossRef]
  11. Song, P. Assessment of vocal cord function and voice disorders. In Principles and Practice of Interventional Pulmonology; Springer: Berlin/Heidelberg, Germany, 2013; pp. 137–149. [Google Scholar]
  12. Uloza, V.; Vegiene, A.; Saferis, V. Correlation between the quantitative video laryngostroboscopic measurements and parameters of multidimensional voice assessment. Biomed. Signal Process. Control 2015, 17, 3–10. [Google Scholar] [CrossRef]
  13. Gerratt, B.R.; Kreiman, J.; Antonanzas-Barroso, N.; Berke, G.S. Comparing Internal and External Standards in Voice Quality Judgments. J. Speech Lang. Hearing Res. 1993, 36, 14–20. [Google Scholar] [CrossRef]
  14. De Bodt, M.S.; Wuyts, F.L.; Van de Heyning, P.H.; Croux, C. Test–retest study of the grbas scale: Influence of experience and professional background on perceptual rating of voice quality. J. Voice 1997, 11, 74–80. [Google Scholar] [CrossRef]
  15. Dejonckere, P.H.; Bradley, P.; Clemente, P.; Cornut, G.; Crevier-Buchman, L.; Friedrich, G.; Van De Heyning, P.; Remacle, M.; Woisard, V. A basic protocol for functional assessment of voice pathology, especially for investigating the efficacy of (phonosurgical) treatments and evaluating new assessment techniques. Eur. Arch. Oto-Rhino-Laryngology 2001, 258, 77–82. [Google Scholar] [CrossRef]
  16. Armstrong, D.; Gosling, A.; Weinman, J.; Marteau, T. The Place of Inter-Rater Reliability in Qualitative Research: An Empirical Study. Sociology 1997, 31, 597–606. [Google Scholar] [CrossRef]
  17. Gwet, K.L. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters; Advanced Analytics LLC: Montgomery, UK, 2014. [Google Scholar]
  18. Islam, R.; Tarique, M.; Abdel-Raheem, E. A Survey on Signal Processing Based Pathological Voice Detection Techniques. IEEE Access 2020, 8, 66749–66776. [Google Scholar] [CrossRef]
  19. Barry, W.J.; Putzer, M. Saarbrucken Voice Database. May 2018. Available online: http://www.stimmdatenbank.coli.uni-saarland.de/ (accessed on 20 May 2018).
  20. Muhammad, G.; Alhamid, M.F.; Hossain, M.S.; Almogren, A.S.; Vasilakos, A.V. Enhanced living by assessing voice pa-thology using a co-occurrence matrix. Sensors 2017, 17, 267. [Google Scholar] [CrossRef]
  21. Alhussein, M.; Muhammad, G. Voice Pathology Detection Using Deep Learning on Mobile Healthcare Framework. IEEE Access 2018, 6, 41034–41041. [Google Scholar] [CrossRef]
  22. Rosa, M.D.O.; Pereira, J.; Grellet, M. Adaptive estimation of residue signal for voice pathology diagnosis. IEEE Trans. Biomed. Eng. 2000, 47, 96–104. [Google Scholar] [CrossRef]
  23. Chang, C.C. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27. [Google Scholar] [CrossRef]
  24. Arjmandi, M.K.; Pooyan, M.; Mikaeili, M.; Vali, M.; Moqarehzadeh, A. Identification of Voice Disorders Using Long-Time Features and Support Vector Machine With Different Feature Reduction Methods. J. Voice 2011, 25, e275–e289. [Google Scholar] [CrossRef]
  25. Al-Nasheri, A.; Muhammad, G.; Alsulaiman, M.; Ali, Z.; Malki, K.H.; Mesallam, T.A.; Ibrahim, M.F. Voice Pathology Detection and Classification Using Auto-Correlation and Entropy Features in Different Frequency Regions. IEEE Access 2017, 6, 6961–6974. [Google Scholar] [CrossRef]
  26. Harar, P.; Alonso-Hernandezy, J.B.; Mekyska, J.; Galaz, Z.; Burget, R.; Smekal, Z. Voice pathology detection using deep learning: A preliminary study. In Proceedings of the 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI), Funchal, Portugal, 10–12 July 2017. [Google Scholar] [CrossRef]
  27. Kadiri, S.R.; Alku, P. Analysis and Detection of Pathological Voice Using Glottal Source Features. IEEE J. Sel. Top. Signal Process. 2019, 14, 367–379. [Google Scholar] [CrossRef]
  28. Dankovičová, Z.; Sovák, D.; Drotár, P.; Vokorokos, L. Machine learning approach to dysphonia detection. Appl. Sci. 2018, 8, 1927. [Google Scholar] [CrossRef]
  29. Dahmani, M.; Guerti, M. Glottal signal parameters as features set for neurological voice disorders diagnosis using K-Nearest Neighbors (KNN). In Proceedings of the 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), Algiers, Algeria, 25–26 April 2018; pp. 1–5. [Google Scholar]
  30. Syed, S.A.; Rashid, M.; Hussain, S.; Zahid, H. Comparative Analysis of CNN and RNN for Voice Pathology Detection. BioMed Res. Int. 2021, 2021, 6635964. [Google Scholar] [CrossRef]
  31. Al-Dhief, F.T.; Baki, M.M.; Latiff, N.M.A.; Malik, N.N.N.A.; Salim, N.S.; Albader, M.A.A.; Mahyuddin, N.M.; Mohammed, M.A. Voice Pathology Detection and Classification by Adopting Online Sequential Extreme Learning Machine. IEEE Access 2021, 9, 77293–77306. [Google Scholar] [CrossRef]
  32. Islam, R.; Abdel-Raheem, E.; Tarique, M. A Novel Pathological Voice Identification Technique through Simulated Cochlear Implant Processing Systems. Appl. Sci. 2022, 12, 2398. [Google Scholar] [CrossRef]
  33. Wang, S.S.; Wang, C.T.; Lai, C.C.; Tsao, Y.; Fang, S.H. Continuous Speech for Improved Learning Pathological Voice Disorders. IEEE Open J. Eng. Med. Biol. 2022, 3, 25–33. [Google Scholar] [CrossRef]
  34. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  35. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  36. Grossmann, T.; Vaish, A.; Franz, J.; Schroeder, R.; Stoneking, M.; Friederici, A.D. Emotional voice processing: Investigating the role of genetic variation in the serotonin transporter across development. PLoS ONE 2013, 8, e68377. [Google Scholar] [CrossRef]
  37. Alhussein, M.; Muhammad, G. Automatic Voice Pathology Monitoring Using Parallel Deep Models for Smart Healthcare. IEEE Access 2019, 7, 46474–46479. [Google Scholar] [CrossRef]
  38. Harar, P.; Galaz, Z.; Alonso-Hernandez, J.B.; Mekyska, J.; Burget, R.; Smekal, Z. Towards robust voice pathology detection. Neural Comput. Appl. 2020, 32, 15747–15757. [Google Scholar] [CrossRef]
  39. Bhattacharjee, S.; Xu, W. VoiceLens: A multi-view multi-class disease classification model through daily-life speech data. Smart Health 2022, 23, 100233. [Google Scholar] [CrossRef]
Figure 1. Data preparation and feature extraction.
Figure 2. Healthy and pathological sample distribution.
Figure 3. Zero-Crossing Rate (ZCR) of (a) pathological and (b) healthy audio signals.
Figure 4. RMSE of the (a) pathological and (b) healthy audio signal.
Figure 5. MFCC feature extraction.
Figure 6. MFCC of the (a) pathological and (b) healthy audio signal.
Figure 7. Multi-model architecture.
Figure 8. Experimental design.
Figure 9. The (a) loss and (b) accuracy plots of the binary classification model.
Figure 10. The confusion matrix of the proposed multi-model on the test dataset.
Figure 11. The (a) loss and (b) accuracy plots of the multiclass model.
Figure 12. The confusion matrix of the proposed multi-model on the test dataset for multiclass classification.
Figure 13. The confusion matrix of the proposed two-level architecture on the test dataset for binary classification.
Table 1. Performance evaluation of the proposed models (binary classification).
Model                               Accuracy   F1 Score   Recall    Precision
Proposed multi-model architecture   87%        84.99%     84.24%    85.96%
Proposed two-level architecture     88.83%     87.39%     87.91%    86.95%
Table 2. Performance evaluation of the proposed models (multiclass classification).
Accuracy   F1 Score   Recall    Precision
80.70%     72.42%     72.75%    73.65%
Table 3. Performance evaluation of the gender classification model.
Accuracy   F1 Score   Recall    Precision
97.86%     97.82%     97.90%    97.75%
Table 4. Class-wise performance of the gender detection model.
Gender Class   F1 Score   Recall   Precision
Female         98%        98%      99%
Male           98%        98%      97%
Table 5. Comparative performance analysis.
Previous Works    Model                                                Accuracy
[26]              CNN–RNN                                              68.08%
[30]              CNN                                                  87.11%
                  RNN                                                  86.52%
[31]              Online Sequential Extreme Learning Machine (OSELM)   77.21%
[38]              XGBoost                                              73.3%
Proposed System   Two-level architecture (CNN–RNN)                     88.83%