Multi-Scale Recurrence Quantification Measurements for Voice Disorder Detection

Zhu, Xin-Cheng; Zhao, Deng-Huang; Zhang, Yi-Hua; Zhang, Xiao-Jun; Tao, Zhi

doi:10.3390/app12189196

Open Access Article


by Xin-Cheng Zhu, Deng-Huang Zhao, Yi-Hua Zhang, Xiao-Jun Zhang ^* and Zhi Tao ^*

School of Optoelectronic Science and Engineering, Soochow University, Suzhou 215006, China

^* Authors to whom correspondence should be addressed.

Appl. Sci. 2022, 12(18), 9196; https://doi.org/10.3390/app12189196

Submission received: 7 August 2022 / Revised: 11 September 2022 / Accepted: 12 September 2022 / Published: 14 September 2022

(This article belongs to the Section Acoustics and Vibrations)


Abstract:

Due to the complexity and non-stationarity of the voice generation system, the nonlinearity of speech signals cannot be accurately quantified. Recently, recurrence quantification analysis has been used for voice disorder detection. In this paper, multi-scale recurrence quantification measurements (MRQMs) are proposed. The signals are reconstructed in a high-dimensional phase space at the equivalent rectangular bandwidth scale. Recurrence plots (RPs) that incorporate the characteristics of human auditory perception are drawn with an appropriate recurrence threshold. The nonlinear dynamic recurrence features of the speech signal are then quantified from the recurrence plot of each frequency channel. Furthermore, this paper explores the recurrence thresholds that are most suitable for pathological voices. Our results show that the proposed MRQMs with support vector machine (SVM), random forest (RF), Bayesian network (BN) and Local Weighted Learning (LWL) classifiers achieve an average accuracy of 99.45%, outperforming traditional features and other complexity measures. In addition, MRQMs show potential for multi-class classification of voice disorders, achieving an accuracy of 89.05%. This study demonstrates that MRQMs can characterize the recurrence characteristics of pathological voices and effectively detect voice disorders.

Keywords:

voice disorder; recurrence quantification analysis; speech signal

1. Introduction

As society develops, a growing number of factors increase the likelihood of people suffering from voice diseases. Voice diseases directly affect phonation and psychological health [1], especially for professionals who need to communicate frequently. Timely detection and early prevention of voice diseases can provide practical help in treating patients. Compared with traditional medical diagnosis methods, such as laryngeal electromyography, an automatic voice pathology detection (AVPD) system has the advantages of non-invasiveness, objectivity and portability. It makes it convenient for doctors to confirm the effectiveness of treatment plans and easier for patients to self-diagnose. AVPD usually consists of two procedures: the first extracts the parameters that characterize the acoustic properties of the voice signal, and the second detects whether the voice is pathological or healthy through machine learning algorithms [2]. The current work focuses on the first procedure.

The acoustic features of speech signals can be roughly grouped into three categories: perturbation features, cepstral features and complexity measures [3,4]. Perturbation features, such as jitter and shimmer, describe the aspiration noise generated by the irregular vibration of the vocal folds that is caused by voice diseases [5,6]. Jitter represents the short-term perturbations of the fundamental frequency (F0), and shimmer represents the short-term perturbations in amplitude [7]. Various statistics of jitter and shimmer, such as relative jitter, relative jitter average perturbation, absolute shimmer and relative shimmer, have been utilized in AVPD [8]. However, the calculation of perturbation features depends on selecting an appropriate window length and on accurate estimation of F0, which is difficult for pathological voices. Cepstral features [3], such as mel-frequency cepstral coefficients (MFCC) [9], can effectively capture the characteristics of the vocal tract system. MFCC are the coefficients of a linear transformation of the logarithmic energy spectrum on the nonlinear mel scale of frequency.

Perturbation features and cepstral features have been widely used in pathological voice detection. Both are based on linear acoustic theory: the linear source/filter model assumes speech production with planar sound propagation. With the development of nonlinear dynamics, Vazir et al. pointed out that there are extensive nonlinear phenomena in the process of speech production. Thompson et al. used aerodynamic theory to study voice signals and found that voice production was neither a deterministic linear process nor a random process, but a nonlinear process [10]. The research of Thyssen et al. showed that the turbulence formed in the vocal tract during speech production is the reason for the chaotic characteristics of speech signals [11]. The physical model of the throat [12] and the fluid dynamics model [13] also indicated that a description of sound by nonlinear fluid dynamics was more realistic. Unlike perturbation, spectral and cepstral features, complexity measures tend to represent the aperiodic, non-stationary and nonlinear characteristics of the speech signal [14]. Voice diseases directly affect the vibration of the vocal folds, and nonlinear dynamic analysis can effectively capture this change. Complexity measures [15,16] used in previous works include the largest Lyapunov exponent (LLE), Hurst exponent (HE), sample entropy (SE), correlation dimension (CD) and Shannon entropy. These popular features tend to describe the signal’s dynamics, periodicity and regularity [17,18].

However, the non-stationary nature of speech causes significant fluctuations in the computed nonlinear representations [19]. Traditional speech signal processing methods use short-term analysis to avoid the nonlinear and non-stationary problems of speech signals. Eckmann et al. [20] proposed recurrence plots for nonlinear data analysis, which represent the global state correlation of a signal over the full time scale. Recurrence is one of the most essential features of a dynamic system: some states of the system have similar characteristics at specific times. Recurrence characteristics exist in both nonlinear and chaotic systems. An RP makes it possible to study the periodicity of a high-dimensional phase space trajectory through a two-dimensional representation. Therefore, RPs allow the system’s evolution over time to be understood more deeply. The recurrence quantification analysis (RQA) technique developed by Zbilut and Webber [21] is the most commonly used tool in non-stationary time series analysis. It obtains quantitative information on nonlinear dynamical systems by analyzing the distribution of structures in RPs, such as diagonal lines and vertical lines. Compared with traditional nonlinear analysis methods, RQA has clear advantages in sensitivity to changing dynamic characteristics, including non-stationarity. In recent years, RQA has been widely used in various fields, such as life sciences, earth sciences, finance and physics [22,23,24,25,26]. In the field of pathological voice detection, Vieira and Costa et al. directly extracted recurrence quantification measurements (RQMs) from the original voice signal to identify normal and distinct pathological voices, with an average recognition accuracy of over 90% [27]. Lopes et al. [28] introduced the embedding dimension and delay time into the RQMs and analyzed the accuracy of combinations of different RQMs in distinguishing individuals with and without voice disorders. The results show that RQA has substantial discriminative potential in pathological voice detection.

In addition, some studies have pointed out the effectiveness of multi-scale features in pathological voice detection [29]. The perceptual characteristics of the auditory system prompt researchers to explore auditory features for pathological voice detection. Typically, the extracted auditory feature is MFCC, which meets the principles of human auditory perception in the mel scale. According to pathological voice energy distribution characteristics, Zhang [30] used the frequency division method to improve the accuracy rate in the bark scale. Zhou [31] extracted multi-scale nonlinear features GTSLs in the equivalent rectangular bandwidth scale. These multi-scale features provide better resolution on frequency scales.

In this paper, we propose Multi-scale Recurrence Quantification Measurements (MRQMs). The proposed feature can decompose non-stationary, nonlinear complex sequences into a set of frequency subsequence features by multi-scale auditory analysis. MRQMs can effectively resolve the significant fluctuations in nonlinear representations caused by the non-stationarity of speech. Furthermore, it can use the acoustic characteristics of the auditory system to improve the accuracy of its pathological voice detection.

2. Extraction of Multi-Scale Recurrence Quantification Measurements (MRQMs)

In this work, the glottal signal was studied as the source signal for the intelligent detection and classification of vocal fold diseases [27].

Figure 1 illustrates the framework of the pathological voice detection and classification system. Most pathological voice detection and classification methods use voice signals. The glottal signal is produced directly by the vibration of the vocal folds caused by the airflow from the lungs. Glottal waveforms can directly reflect the difference between normal and pathological vocal fold vibration [4]. Therefore, as a source signal, the glottal signal is closer to the vibration and sound production mechanism of the vocal folds. Human auditory perception experiments have found that the human auditory system is a unique nonlinear system. Glottal signals are decomposed into 24 channels at the equivalent rectangular bandwidth (ERB) scale, which is a multi-scale analysis that matches the auditory characteristics. The signal in each frequency channel is embedded in a high-dimensional phase space to construct an RP according to its recurrence characteristics. RQMs are used to quantify the non-stationary recurrence characteristics of the signal in each channel. The recurrence characteristics form a 312-dimensional feature vector (24 channels × 13 RQMs) for pathological voice detection and classification.

2.1. Gammatone Filter Bank

Research on auditory perception has found that the analysis and processing of acoustic signals by the basilar membrane in the human cochlea is equivalent to filtering and frequency decomposition. The gammatone filter bank closely matches the acoustic characteristics of the auditory system, and it needs only a few parameters to simulate the filtering and frequency decomposition functions of the basilar membrane. Each filter in the gammatone filter bank is a band-pass filter whose impulse response is built from the Gamma function [32].

g_{i} (t) = \alpha t^{N - 1} e^{- 2 \pi B t} \cos (2 \pi f_{i} t + \phi) u (t)

(1)

where \alpha is a constant, N is the filter order, f_{i} is the center frequency of the ith channel of the gammatone filter bank, \phi is the phase, usually set to 0 because the human auditory perception system is not sensitive to phase, and u (t) is the unit step function. B is determined by the ERB of the gammatone filter [33].

B = 1.019 \cdot \mathrm{ERB} (f_{i})

(2)

The relationship between the ERB and the center frequency of the ith channel of the gammatone filter is as follows [34]:

\mathrm{ERB} (f_{i}) = 24.7 + 0.108 f_{i}

(3)

In the frequency range of 0–12.5 kHz, this work uses a 24-channel gammatone filter bank to decompose the glottal signal after iterative adaptive inverse filtering. The corresponding magnitude response is shown in Figure 2.
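As a rough illustration of Equations (1)–(3), the following sketch samples a single gammatone impulse response. The filter order of 4, the 50 ms duration and the unit amplitude are assumptions made here for illustration; the text above does not state them. The function names `erb` and `gammatone_ir` are hypothetical.

```python
import numpy as np

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz, following Equation (3)."""
    return 24.7 + 0.108 * fc_hz

def gammatone_ir(fc_hz, fs_hz, order=4, duration_s=0.05, amplitude=1.0, phase=0.0):
    """Sampled gammatone impulse response for t >= 0, following Equation (1).

    order, duration_s and amplitude are illustrative assumptions.
    """
    # Sampling only t >= 0 plays the role of the unit step u(t).
    t = np.arange(int(round(duration_s * fs_hz))) / fs_hz
    b = 1.019 * erb(fc_hz)  # Equation (2)
    envelope = amplitude * t ** (order - 1) * np.exp(-2.0 * np.pi * b * t)
    return envelope * np.cos(2.0 * np.pi * fc_hz * t + phase)

# Channel centered at 549 Hz, sampled at 25 kHz (the rate used in this work).
h = gammatone_ir(fc_hz=549.0, fs_hz=25000.0)
```

Filtering a signal with one channel then amounts to convolving it with `h`, for example with `np.convolve(signal, h, mode="same")`; a full bank would repeat this for each of the 24 center frequencies.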

Figure 2 shows that, in the low-frequency part, the bandwidths of the gammatone filters are relatively narrow and their center frequencies are closely spaced. In the high-frequency part, the bandwidths are relatively large and the center frequencies are relatively far apart. Most of the effective information of the pathological voice lies in the low-frequency bands, so multi-scale features extracted from gammatone-filtered signals can effectively detect pathological voice. The specific values of the gammatone filter bank used in this work are listed in Table 1. In particular, in the first eight frequency bands, the bandwidth is less than 100 Hz, so low-frequency signals are processed at a very fine resolution. The center frequency of the eighth frequency band is 549 Hz, which is close to the upper limit of the main frequency range of the glottal signal, so the combination of glottal signals and the gammatone filter bank can produce good results.

2.2. Recurrence Quantification Analysis (RQA)

Nonlinear dynamics theory has become an effective tool for studying the characteristics of speech production mechanisms. Recurrence quantification analysis is a nonlinear method used to quantitatively study the recurrence characteristics of states in the phase space of a dynamic system [35]. It has an excellent ability to deal with non-stationary, nonlinear and relatively short data series. Compared with traditional signal processing methods, it is more sensitive to changes in dynamic characteristics. Pathological voice detection generally uses recordings of sustained vowels. RQA can analyze the complex system behind the chaotic sound based on the vibration of the vocal folds during the sustained vowel. The first step in analyzing a signal using nonlinear dynamics theory is to reconstruct the multi-dimensional phase space trajectory of the signal. Assuming that a time series \{x (1), x (2), \dots, x (N)\} of length N is obtained from the microphone, the phase space can be reconstructed by Takens’ embedding theorem [36].

X = \begin{bmatrix} X_{1} \\ X_{2} \\ \vdots \\ X_{n} \end{bmatrix} = \begin{bmatrix} x_{1} & x_{1 + τ} & \dots & x_{1 + (m - 1) τ} \\ x_{2} & x_{2 + τ} & \dots & x_{2 + (m - 1) τ} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n} & x_{n + τ} & \dots & x_{n + (m - 1) τ} \end{bmatrix}

(4)

where τ is the time delay and m is the best embedding dimension. The total number of points in the reconstructed phase space, represented by the vectors \{X_{1}, X_{2}, X_{3}, \dots, X_{n}\}, is n = N − (m − 1) τ.

As long as m and τ are selected appropriately, the dynamic characteristics of the original system can be restored in the sense of topological equivalence. Well-chosen values of the time delay, τ, and the embedding dimension, m, allow the reconstructed phase space to represent the time sequence of the speech signal more faithfully. In this article, the traditional mutual information method is used to obtain the time delay, τ, of the reconstruction of the chaotic time series [37], and the false nearest neighbors (FNN) algorithm is used to find the best embedding dimension [38]. RPs are then used to study the non-stationary nature of the high-dimensional phase space and to describe the fundamental dynamics of the system.
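The delay embedding of Equation (4) can be sketched in a few lines. This is a minimal illustration, assuming m and τ have already been chosen (for example by FNN and mutual information, which are not implemented here); the function name `delay_embed` is hypothetical.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Return the n x m matrix of delay vectors from Equation (4).

    Row i is (x[i], x[i + tau], ..., x[i + (m - 1) * tau]) and
    n = N - (m - 1) * tau, where N is the length of x.
    """
    x = np.asarray(x, dtype=float)
    n = x.size - (m - 1) * tau
    if n <= 0:
        raise ValueError("time series is too short for this m and tau")
    # Column k holds the series shifted by k * tau samples.
    return np.column_stack([x[k * tau : k * tau + n] for k in range(m)])

# Toy series 0..9 with m = 3 and tau = 2 gives n = 10 - 2 * 2 = 6 points.
X = delay_embed(np.arange(10), m=3, tau=2)
```

Each row of `X` is one point of the reconstructed trajectory, so the first row here is (0, 2, 4) and the last is (5, 7, 9).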

2.2.1. Recurrence Plots (RPs)

A recurrence plot is a two-dimensional graph for analyzing the recurrence of a signal. The recurrence characteristics of the time series are represented by black and blue dots on two time axes. Mathematically, the RP is defined as:

R_{i j} = θ (ε - \| X_{i} - X_{j} \|), \quad i, j = 1, 2, \dots, n

(5)

where θ (x) is the Heaviside function, ε is a threshold parameter chosen empirically, and \| \cdot \| is the Euclidean norm. The recurrence plot defines a neighborhood with X_{i} as the center and ε as the radius. Through Equation (5), an n × n distance matrix is converted into a 0–1 matrix. If X_{j} is in this neighborhood, then R_{i j} = 1 under the action of the Heaviside function, and the one-dimensional time series is considered to have recurrence properties at this point. If X_{j} is not in this neighborhood, R_{i j} = 0. Black dots represent recurrent states of the dynamic system, and a blue dot is drawn where the dynamic system is non-recurrent.
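A minimal sketch of Equation (5) follows. It assumes the convention θ(0) = 1, so that a distance exactly equal to ε counts as recurrent; the function name `recurrence_matrix` is hypothetical, and the three toy points are chosen only so the result can be checked by hand.

```python
import numpy as np

def recurrence_matrix(X, eps):
    """Binary recurrence matrix R[i, j] = 1 when ||X_i - X_j|| <= eps."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances by broadcasting (O(n^2 * m) memory).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    return (dist <= eps).astype(int)

# Points 0 and 1 are 1.0 apart; point 2 is far from both.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
R = recurrence_matrix(pts, eps=1.5)
```

For long signals the full distance matrix becomes large, so a practical implementation would likely process the matrix in blocks or use a neighbor search instead.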

The recurrence threshold is an essential parameter of RPs. Different systems will correspond to different types of signals. With the transformation of signal properties, the choice of recurrence threshold should also be different due to non-stationarity, vibration characteristics or noise influence.

At present, the main methods for selecting the recurrence threshold are as follows:

Phase space diameter: the threshold is set as a percentage of the phase space diameter, usually 10% [39].
Standard deviation of noise: the standard deviation, σ, of the observation noise is taken as the benchmark and a multiple of it is chosen, generally 5σ [40].
Recurrence rate (rr): the threshold is chosen so that the recurrence rate reaches a given percentage; 1%rr is usually used as the recurrence threshold [41].
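The three selection rules listed above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in this work: the phase space diameter is taken to be the largest pairwise distance, and "x%rr" is read as the threshold at which x% of all matrix entries are recurrent, obtained here as a quantile of the distance matrix. All function names are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of phase space points."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def eps_from_diameter(dist, fraction=0.10):
    """Threshold as a fraction of the phase space diameter (max distance)."""
    return fraction * dist.max()

def eps_from_noise(sigma, multiple=5.0):
    """Threshold as a multiple of the observation noise standard deviation."""
    return multiple * sigma

def eps_from_recurrence_rate(dist, rr=0.01):
    """Threshold whose recurrence rate is approximately rr (quantile rule)."""
    return np.quantile(dist, rr)

# Synthetic 3-D points stand in for an embedded signal (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
D = pairwise_distances(X)
eps40 = eps_from_recurrence_rate(D, rr=0.40)
achieved_rr = np.mean(D <= eps40)  # fraction of recurrent entries
```

The quantile rule makes the recurrence rate itself the free parameter, which is convenient when thresholds are compared across signals whose amplitudes differ.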

The above threshold selection methods must be validated experimentally on the signal at hand. For pathological voice signals, 5σ of the observation noise is not an appropriate threshold because the speech signals were recorded in a quiet acoustic laboratory. We tested this threshold, and the results were not ideal. The abnormal waveforms in the time series carry different characteristics of the pathological voice. The recurrence threshold is an empirical parameter, and a threshold that is too large or too small will destroy the structure of the recurrence plot. If the threshold is too large, too many false neighboring points are included, which makes the black points in the recurrence plot too dense. If the threshold is too small, many true neighboring points are missed, so the blue area of the recurrence plot becomes larger. Vieira et al. [27] consider 1%rr to be the most appropriate recurrence threshold. Building on that work, we continue to explore the recurrence threshold best suited to pathological voices.

Figure 3 shows the phase spaces and RPs of normal and pathological glottal signals. The values of m and τ are obtained using the FNN and mutual information methods. Figure 3, which uses a typical 1%rr as the recurrence threshold, shows a diagonal recurrence structure. The phase space of the normal glottal signal is regular, but the phase space of the pathological glottal signal is chaotic. A healthy sustained vowel is approximately a quasi-periodic process. It forms a regular long diagonal structure in the RP, with parallel diagonal lines, no vertical or horizontal lines and no scattered, isolated points. Non-periodic vibrations of the vocal folds, incomplete glottal closure and interstitial vibrations can all lead to pathological voices. The recurrence points on the long diagonal lines of the pathological glottal signal are discontinuous, which reflects the degree of signal amplitude fluctuation. Vocal fold lesions cause the speaker to work harder to produce vowels. The vibration period of the vocal folds determines the pitch frequency of the voice. Therefore, the diagonals on the RP cannot be kept parallel, and there are many scattered and isolated points. A detailed analysis is given in Table 2.

2.2.2. Recurrence Quantification Measurements

The recurrence properties of a time series are reflected in the geometric properties of the RP. RQA is a method of quantifying the dynamics of the system based on the RP [42]. A series of statistical parameters can be obtained by analyzing the density of recurrence points and of diagonal, vertical or horizontal lines. This work uses 13 RQMs, including mean diagonal line length, maximal diagonal line length, clustering coefficient and transitivity.

Recurrence rate (RR) is the ratio of the number of recurrence points in the RP to the total number of entries of the n \times n matrix.

\mathrm{RR} = \frac{1}{n^{2}} \sum_{i, j = 1}^{n} R_{i j}

(6)

Determinism (DET) is the ratio of the recurrence points that form diagonal line segments in the RP to the total number of recurrence points. It distinguishes isolated recurrence points that are scattered irregularly from recurrence points that form regular patterns.

\mathrm{DET} = \frac{\sum_{l = l_{\min}}^{n} l P^{ε} (l)}{\sum_{i, j = 1}^{n} R_{i j}}

(7)

where l is the length of a diagonal line segment and l_{\min} is its minimum value. P^{ε} (l) is the frequency distribution of the lengths of the diagonal lines \{l_{i}; i = 1 \dots n_{l}\}, and n_{l} is the absolute number of diagonal lines.

Maximal diagonal line length (L_{\max}) refers to the length of the longest diagonal line in the RP structure.

L_{\max} = \max (\{l_{i}; i = 1 \dots n_{l}\})

(8)

Entropy of the diagonal line lengths (ENTR) refers to the Shannon entropy of the diagonal structure length distribution on the RP, which measures the amount of information contained in the RP structure.

\mathrm{ENTR} = - \sum_{l = l_{\min}}^{n} P (l) \ln P (l)

(9)

where P (l) is defined as P (l) = \frac{P^{ε} (l)}{\sum_{l = l_{\min}}^{n} P^{ε} (l)}.

Mean diagonal line length (<L>) is highly correlated with the mean prediction time of the dynamic system and the divergence of the system.

\langle L \rangle = \frac{\sum_{l = l_{\min}}^{n} l P (l)}{\sum_{l = l_{\min}}^{n} P (l)}

(10)
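The diagonal-line measures of Equations (6)–(10) can be sketched from a binary recurrence matrix as below. Two choices here are assumptions not stated in the text: l_min defaults to 2, and the main diagonal (every point trivially recurs with itself) is excluded when counting diagonal lines, so DET is computed relative to the off-diagonal recurrence points. The function names are hypothetical.

```python
import numpy as np
from collections import Counter

def run_lengths(bits):
    """Lengths of the runs of consecutive 1s in a 1-D 0/1 sequence."""
    lengths, run = [], 0
    for b in bits:
        if b:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

def diagonal_measures(R, l_min=2, exclude_main=True):
    """RR, DET, Lmax, ENTR and mean diagonal length from a binary matrix."""
    R = np.asarray(R)
    n = R.shape[0]
    hist = Counter()  # hist[l] = number of diagonal lines of length l
    for k in range(-(n - 1), n):
        if exclude_main and k == 0:
            continue  # assumption: skip the trivial main diagonal
        hist.update(run_lengths(np.diagonal(R, offset=k)))
    rr = R.sum() / n ** 2
    total_points = sum(l * c for l, c in hist.items())
    long_lines = {l: c for l, c in hist.items() if l >= l_min}
    n_lines = sum(long_lines.values())
    if n_lines == 0:
        return {"RR": rr, "DET": 0.0, "Lmax": 0, "ENTR": 0.0, "L": 0.0}
    points_in_lines = sum(l * c for l, c in long_lines.items())
    p = np.array(list(long_lines.values()), dtype=float) / n_lines
    return {
        "RR": rr,
        "DET": points_in_lines / total_points,
        "Lmax": max(long_lines),
        "ENTR": float(-np.sum(p * np.log(p))),
        "L": points_in_lines / n_lines,
    }

# Toy RP of a strictly period-3 signal: states recur every 3 steps.
idx = np.arange(6)
R_periodic = (idx[:, None] - idx[None, :]) % 3 == 0
measures = diagonal_measures(R_periodic)
```

In this periodic toy case every off-diagonal recurrence point lies on one of two diagonal lines of length 3, so DET is 1 and the length distribution has zero entropy.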

Laminarity (LAM) refers to the ratio of recurrence points forming vertical structures to all recurrence points in the RP, which can reflect the complexity of the dynamic system.

\mathrm{LAM} = \frac{\sum_{v = v_{\min}}^{n} v P^{ε} (v)}{\sum_{v = 1}^{n} v P^{ε} (v)}

(11)

where v is the length of a vertical line segment, v_{\min} is its minimum value and P^{ε} (v) is the frequency distribution of the lengths of the vertical lines \{v_{i}; i = 1 \dots n_{v}\}.

Trapping time (TT) indicates the mean length of vertical lines in the RP structure. It measures the average time that the system is in a very slowly changing state.

\mathrm{TT} = \frac{\sum_{v = v_{\min}}^{n} v P^{ε} (v)}{\sum_{v = v_{\min}}^{n} P^{ε} (v)}

(12)

Maximal vertical line length (V_{\max}) indicates the maximal length of vertical lines in the RP structure.

V_{\max} = \max (\{v_{i}; i = 1 \dots n_{v}\})

(13)
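The vertical-line measures of Equations (11)–(13) follow the same pattern, scanning columns instead of diagonals. This sketch assumes v_min = 2, takes the LAM denominator over all vertical runs (length 1 and up) and the TT denominator over lines of length at least v_min; the function names are hypothetical.

```python
import numpy as np
from collections import Counter

def run_lengths(bits):
    """Lengths of the runs of consecutive 1s in a 1-D 0/1 sequence."""
    lengths, run = [], 0
    for b in bits:
        if b:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

def vertical_measures(R, v_min=2):
    """LAM, TT and Vmax from the columns of a binary recurrence matrix."""
    R = np.asarray(R)
    hist = Counter()  # hist[v] = number of vertical lines of length v
    for column in R.T:
        hist.update(run_lengths(column))
    total_points = sum(v * c for v, c in hist.items())  # all recurrence points
    long_lines = {v: c for v, c in hist.items() if v >= v_min}
    n_lines = sum(long_lines.values())
    if n_lines == 0:
        return {"LAM": 0.0, "TT": 0.0, "Vmax": 0}
    points_in_lines = sum(v * c for v, c in long_lines.items())
    return {
        "LAM": points_in_lines / total_points,
        "TT": points_in_lines / n_lines,
        "Vmax": max(long_lines),
    }

# Hand-checkable toy matrix: column runs are 3, 1, 1 and 2.
R_toy = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
])
v = vertical_measures(R_toy)
```

Here 5 of the 7 recurrence points sit in vertical lines of length 2 or more, giving LAM = 5/7, and the two such lines have mean length 2.5.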

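Recurrence time of 1st type (T_{1}) and recurrence time of 2nd type (T_{2}):

\begin{matrix} T_{1} (i) = t_{i + 1} - t_{i}, & i = 1, 2, \dots \\ T_{2} (i) = t_{i + 1} - t_{i}, & i = 1, 2, \dots \end{matrix}

(14)

where t_{i} are the successive times at which the trajectory recurs to the neighborhood of a reference point. For T_{2}, only the first point of each run of consecutive recurrence times is kept before the differences are taken.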

(14)

Recurrence time entropy (RPDE) has been successfully applied in biomedical testing. It has advantages in detecting subtle changes in biological time series and can indicate the degree to which the time series repeat the same sequence.

\mathrm{RPDE} = - (\ln T_{\max})^{- 1} \sum_{t = 1}^{T_{\max}} P (t) \ln P (t)

(15)

For each point of the time series \{x (1), x (2), \dots, x (N)\}, the time taken to return to within the recurrence threshold is recorded, and these times are collected in a histogram. P (t) is the normalized histogram, T_{\max} is the maximum recurrence period and t is the time between returns.
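One possible reading of Equation (15) is sketched below. It assumes that the return time of a point is the number of steps until the trajectory has first left the ε-ball around that point and then re-entered it; this is an interpretation for illustration, not necessarily the toolbox's exact procedure. The function names are hypothetical.

```python
import numpy as np

def recurrence_times(X, eps):
    """First return time to the eps-ball of each point, after leaving it."""
    X = np.asarray(X, dtype=float)
    times = []
    for i in range(len(X)):
        # dist[d] is the distance between point i and point i + 1 + d.
        dist = np.linalg.norm(X[i + 1:] - X[i], axis=1)
        outside = np.flatnonzero(dist > eps)
        if outside.size == 0:
            continue  # the trajectory never leaves the ball
        back = np.flatnonzero(dist[outside[0]:] <= eps)
        if back.size:
            times.append(int(outside[0] + back[0] + 1))
    return times

def rpde(times):
    """Normalized entropy of the recurrence time histogram, Equation (15)."""
    times = np.asarray(times, dtype=int)
    if times.size == 0 or times.max() < 2:
        return 0.0
    t_max = times.max()
    counts = np.bincount(times, minlength=t_max + 1)[1:]  # t = 1..T_max
    p = counts / counts.sum()
    p = p[p > 0]  # treat 0 * ln(0) as 0
    return float(-np.sum(p * np.log(p)) / np.log(t_max))

# Points on a circle visited with an exact period of 10 samples.
k = np.arange(50)
circle = np.column_stack([np.cos(2 * np.pi * k / 10), np.sin(2 * np.pi * k / 10)])
times = recurrence_times(circle, eps=0.1)
value = rpde(times)
```

A perfectly periodic trajectory returns after the same number of steps every time, so the histogram has a single bin and the entropy is 0; a uniform spread of return times gives a value of 1.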

Clustering coefficient (Clust), a measure from complex network theory, represents the probability that two neighbors of any state in the RP structure are also neighbors of each other.

\mathrm{Clust} = \sum_{i = 1}^{n} \frac{\sum_{j, k = 1}^{n} R_{i j} R_{j k} R_{k i}}{\mathrm{RR}_{i}}

(16)

\mathrm{RR}_{i} = \sum_{j = 1}^{n} R_{i j}

(17)

where \mathrm{RR}_{i} indicates the local recurrence rate.

Transitivity (Trans) quantifies the geometric properties of the phase space trajectory and is useful for characterizing different dynamic systems.

\mathrm{Trans} = \frac{\sum_{i, j, k = 1}^{n} R_{i j} R_{j k} R_{k i}}{\sum_{i, j, k = 1}^{n} R_{i j} R_{k i}}

(18)
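The triple sums in Equations (16)–(18) reduce to matrix products, which the sketch below uses. It transcribes the equations literally as written above; network-theory definitions often remove self-loops (the main diagonal) and normalize differently, so treat this as an illustration of the algebra rather than a reference implementation. The function names are hypothetical.

```python
import numpy as np

def transitivity(R):
    """Literal evaluation of Equation (18) with matrix products."""
    A = np.asarray(R, dtype=float)
    # sum_{i,j,k} R_ij R_jk R_ki equals the trace of A^3.
    closed = np.trace(A @ A @ A)
    # sum_{i,j,k} R_ij R_ki = sum_i (row sum of i) * (column sum of i).
    paths = np.sum(A.sum(axis=1) * A.sum(axis=0))
    return float(closed / paths) if paths else 0.0

def clustering_sum(R):
    """Literal evaluation of Equations (16) and (17)."""
    A = np.asarray(R, dtype=float)
    local_rr = A.sum(axis=1)          # Equation (17)
    closed_i = np.diag(A @ A @ A)     # sum_{j,k} R_ij R_jk R_ki for each i
    mask = local_rr > 0               # skip states with no recurrences
    return float(np.sum(closed_i[mask] / local_rr[mask]))

# Fully connected toy matrix and a 3-node path with no closed triangles.
full = np.ones((3, 3), dtype=int)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
trans_full = transitivity(full)
trans_path = transitivity(path)
```

The path graph has no closed three-step walks, so its transitivity is 0, while the fully connected matrix gives 1.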

With the further development of nonlinear analysis, researchers realized that the nonlinearity of speech limits the performance of speech signal processing. For example, it is difficult to improve the accuracy of the linear prediction of speech. The speech production system is a nonlinear time-varying system. In this work, RQMs are used to quantify the recurrence characteristics of glottal signals.

3. Material and Evaluation

Pathological voice contains a wealth of information, including not only the original voice information to be conveyed but also the speaker’s age, gender and voiceprint. In order to evaluate the method proposed in this work, we utilized the Massachusetts Eye and Ear Infirmary (MEEI) database [43], which contains the normal and pathological voice /a:/ with the patient’s expert diagnosis results, gender and smoking status. The vowel /a:/ sound will be more affected by acoustic characteristics.

From this database, 53 normal voice samples and 173 pathological voice samples with various disorders were selected as a data subset. This subset was selected by considering the different voice disorders, gender, and the mean and standard deviation of age in the normal and pathological voice databases [44]. The statistics of the voice samples selected from the MEEI database are shown in Table 3. All voice disorders are listed in Table 4. Among them, 53 normal voice samples, 20 vocal fold polyp samples and 67 vocal fold paralysis samples were selected as multi-class samples. All voice signals were sampled with 16-bit resolution. The sampling rate of healthy voice signals was 50 kHz, and that of pathological voice signals was 25 kHz or 50 kHz. Before any processing, we down-sampled all the voice signals to 25 kHz.

This work presents multi-scale recurrence quantification measurements for voice disorder detection and classification. We used Matlab 2020b to extract this feature; the cross recurrence plot toolbox was used to build recurrence plots and extract recurrence quantification measurements [45]. All machine learning algorithms and tests were run in the Waikato Environment for Knowledge Analysis (Weka) software. Four traditional machine learning algorithms were utilized for the pathological voice detection experiments. After repeated parameter tuning, the final model parameters were determined as follows: (1) The support vector machine (SVM) used the PolyKernel kernel function with a penalty factor of one. (2) The random forest (RF) used two hundred trees with a maximum depth of four. (3) The Bayesian network (BN) used the tree augmented naive Bayes algorithm. (4) Local Weighted Learning (LWL) used a random forest classifier and FilteredNeighbourSearch. The same four machine learning algorithms were also utilized for the pathological voice classification experiments, with the following parameters: (1) The support vector machine used the PUK kernel function with a penalty factor of two. (2) The random forest used fifty trees with a maximum depth of twenty. (3) The Bayesian network used the K2 search algorithm. (4) Local Weighted Learning used a random forest classifier and FilteredNeighbourSearch. Because the pathological samples were unbalanced, the FC-SMOTE algorithm was adopted for sample balancing in the multi-classification experiments [46]. A 10-fold cross-validation was used to test the accuracy of the algorithm. All voice samples were divided into ten equal parts. In each validation, nine folds were used as the training set and one fold as the test set. The process was repeated ten times, so that each of the ten folds was used once as the test set.
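The 10-fold protocol described above can be sketched as an index split. The experiments in this work were run in Weka; the NumPy code below is only an illustration of the fold logic, with a random shuffle and a fixed seed as assumptions. The function name `k_fold_indices` is hypothetical, and 226 is the total of the 53 normal and 173 pathological samples.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index arrays; each fold is the test set once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)  # near-equal parts when k does not divide n
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# 53 normal + 173 pathological samples = 226 samples in total.
splits = list(k_fold_indices(226, k=10))
all_test = np.sort(np.concatenate([test for _, test in splits]))
```

Because 226 is not divisible by 10, the folds differ in size by at most one sample, and every sample appears in exactly one test fold.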

4. Results and Discussion

Although some recurrence states of pathological voice can be obtained by recurrence quantification analysis, the dynamic recurrence characteristics of voice signals cannot be described comprehensively. By the definition of the recurrence plot, the selection of the recurrence threshold plays a vital role in extracting MRQMs. When the recurrence threshold is too large or too small, the states in the RP change, which hinders the analysis of the system and thus affects the recognition results. This paper sought a more appropriate threshold that makes the diagonal and vertical recurrence structures of pathological voices easier to distinguish. Table 5 reports the pathological voice detection results of four machine learning algorithms (SVM, RF, BN and LWL) at different thresholds.

Table 5 shows that the accuracy of pathological voice detection varied with the recurrence threshold for all four machine learning classifiers. However, the overall accuracy was high, which shows that multi-scale recurrence quantification measurements are effective for pathological voice detection over a wide range of thresholds. The average accuracy showed an upward trend from 1%rr to 40%rr and a downward trend from 40%rr to 99%rr. At 40%rr, the average accuracy of pathological voice detection over the four machine learning classifiers reached 99.45%, the best result across the thresholds. Therefore, this article selected 40%rr as the recurrence threshold.

To examine further why the 40%rr recurrence threshold is more suitable for pathological voice detection than the other thresholds, the top three multi-scale recurrence quantification measurements ranked by the FDR algorithm were selected to draw three-dimensional scatter plots. Figure 4 shows that these three measurements separate the two types of samples well. The blue circles represent samples of normal voices, and the red crosses represent samples of pathological voices. When the recurrence threshold was 40%rr, the samples overlapped only slightly. However, when the recurrence threshold was 1%rr, 80%rr or 99%rr, the degree of overlap was significantly greater than at 40%rr. This indicates that the method proposed in this article can effectively separate the normal samples from the pathological samples, with superior binary classification performance.

The RP structure also illustrates the difference when the threshold is 40%rr. Figure 5 shows the RPs of the normal and pathological glottal signals in the tenth frequency band. The diagonal recurrence of the normal glottal source signal was obvious. The diagonals were parallel, with no vertical or horizontal lines and no scattered points, indicating that the signal was periodic. The recurrence points of the pathological glottal source signal mainly lay parallel to the diagonal, but there were some isolated scattered points. The many recurrence points on the diagonals indicate that the pathological voice still had a certain periodicity, but its vibration was abnormal. In addition, the isolated points on the RP of the pathological glottal signal indicated that the pathological voice signal was not stable, with anti-correlation processes and strong fluctuations.

For comparison with the method proposed in this article, features from other recent works were applied to pathological voice detection. All used 10-fold cross-validation on the MEEI database with the same classifier, BN. The results are listed in Table 6. The closer an evaluation indicator in the table is to 1, the better the effect and robustness of the feature. Refs. [27,28] both showed an accuracy of nearly 90%, indicating that RQA has strong potential in pathological voice detection. Nevertheless, they did not make any improvements to RQA; they only used and explored combinations of RQMs. The method proposed in this article analyzed the signal itself and selected the best recurrence threshold to draw a more accurate RP structure, using 13 RQMs to quantify recurrence features. The accuracy also showed that the method in this article was better than directly extracting RQMs for pathological voice detection. The accuracy was improved by 8% and 12.45% relative to the RQMs in [27] and the RQMs in [28], respectively.

MFCCs are typical characteristic parameters in acoustic tasks. Extracted from raw voice signals, they achieved a recognition accuracy of 92.00%. Traditional cepstral parameters can characterize differential changes of the vocal tract system, but they cannot directly describe the changes in the vocal fold vibration mechanism caused by voice diseases, so they cannot capture all pathological voice information. The accuracy of the nonlinear feature combination (LLE&CD&HE&SE in Table 6) was only 75.56%, far below that of the traditional MFCCs. The short-term non-stationarity of speech signals limits the application of traditional nonlinear time series analysis methods. The method proposed in this paper can quantitatively analyze the dynamic nonlinear behavior of different aspects of the speech dynamic system by using RQMs. MRQMs combined with machine learning achieved the best results in terms of accuracy, kappa value, precision, recall, F-Measure and ROC area. MRQMs analyzed the glottal signal from the perspective of vocal fold vibration, filtering out the influence of lip radiation on pathological voice detection, and quantified the recurrence characteristics of different frequency bands on the ERB scale. MRQMs therefore showed excellent detection performance and good application prospects.

In this paper, the multi-classification experiments used normal voice samples, vocal fold paralysis samples and vocal fold polyp samples. The confusion matrix in Figure 6 gives further detail on the multi-classification experiment. The true positive rate of both normal voice and vocal fold paralysis was above 90%, whereas vocal fold polyps were confused with the normal voice and vocal fold paralysis categories. The other indicators of the multi-classification results are listed in Table 7.
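To make the link between a confusion matrix and the reported indicators concrete, the sketch below derives accuracy, Cohen's kappa and per-class precision and recall from a three-class confusion matrix. The counts are made up for illustration and are not the values in Figure 6; the function name `classification_summary` is hypothetical, and it is an assumption that the kappa reported in the tables is Cohen's kappa.

```python
def classification_summary(cm):
    """Summarize a square confusion matrix (rows: true class, columns: predicted)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(cm[i]) for i in range(k)]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    accuracy = sum(cm[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    recall = [cm[i][i] / row_tot[i] for i in range(k)]
    precision = [cm[j][j] / col_tot[j] for j in range(k)]
    return {"accuracy": accuracy, "kappa": kappa,
            "recall": recall, "precision": precision}

# Hypothetical counts for (normal, paralysis, polyp); NOT the data of Figure 6.
example_cm = [
    [45, 3, 2],
    [4, 40, 6],
    [5, 10, 35],
]
summary = classification_summary(example_cm)
print(summary)
```

When the true classes are equally sized, the chance agreement is 1/3 and kappa reduces to (accuracy − 1/3)/(2/3). An accuracy of 0.8905 then gives a kappa of about 0.836, which agrees with the RF row of Table 7, although whether the classes were balanced in that experiment is an inference rather than something stated here.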

Compared with the original voice signal, the glottal signal removes the effects of lip radiation and similar factors, so it better reflects the characteristics of vocal fold vibration. MRQMs used the glottal signal as the raw signal and used the characteristics of human auditory perception to extract multi-scale recurrence features. From the point of view of nonlinear dynamic recurrence, although the pathological voice was a relatively deterministic signal, there were breaks in the diagonals of its RP. The relatively complete diagonal recurrence structure of the RP of the pathological voice in each channel indicates that, although the vocal folds of the patient had lesions, they still maintained approximately periodic vibration. From the density of the diagonals, it can be seen that the vibration cycle of the diseased vocal folds in this case was longer than that of the normal voice, which may be caused by the slow vibration of the diseased vocal folds. In addition, the diagonals of the RP of the pathological voice were burred, indicating that vertical or horizontal recurrence, as distinct from diagonal recurrence, existed near the diagonals. Compared with normal vocal fold vibration, rapid collision and separation occurred in patients with pathological voice. Therefore, the pathological voice had a denser diagonal recurrence structure than the normal voice, which reflected a vibration period shorter than that of the normal voice. Periodic tensioning of the vocal folds determines the fundamental frequency of voice signals. For different vocal fold diseases, the air pressure and muscle tension at the glottal diaphragm and the instability of the vocal fold state result in different voice signal waveforms. However, different vocal fold diseases also had similar recurrence characteristics, so the classification of vocal fold diseases did not achieve ideal results.
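The role of diagonal lines and isolated points in this discussion can be made explicit with one standard recurrence quantification measure, determinism (DET): the fraction of recurrence points that lie on diagonal lines of at least a minimum length. The following is a minimal sketch on two tiny hand-made matrices; the function names, the minimum line length of 2 and the example matrices are assumptions for illustration, and this is not the implementation used to compute the 13 RQMs in this study.

```python
import numpy as np

def diagonal_line_lengths(rp):
    """Lengths of all diagonal runs of ones above the main diagonal."""
    n = rp.shape[0]
    lengths = []
    for k in range(1, n):  # offset 0 (the line of identity) is skipped
        run = 0
        for v in np.diagonal(rp, offset=k):
            if v:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths

def determinism(rp, lmin=2):
    """Share of off-diagonal recurrence points on diagonal lines of length >= lmin."""
    lengths = diagonal_line_lengths(rp)
    total = sum(lengths)
    return sum(l for l in lengths if l >= lmin) / total if total else 0.0

# A purely "periodic" toy RP: one unbroken diagonal line next to the main diagonal.
periodic = np.array([[1, 1, 0, 0],
                     [1, 1, 1, 0],
                     [0, 1, 1, 1],
                     [0, 0, 1, 1]])

# The same RP with one isolated recurrence point added in the corner.
with_isolated_point = periodic.copy()
with_isolated_point[0, 3] = with_isolated_point[3, 0] = 1

print(determinism(periodic), determinism(with_isolated_point))
```

Because an RP built from a single signal is symmetric, scanning only the upper triangle gives the same ratio as scanning both triangles. In the toy example, adding an isolated point lowers DET, which mirrors how scattered points reduce the apparent periodicity of an RP.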

5. Conclusions

To accurately quantify the nonlinear features of speech signals and thereby detect and classify pathological voices effectively, MRQMs were proposed. A recognition rate of 99.56% was achieved, and every evaluation index was the best among the compared features, demonstrating the effectiveness of the proposed method. Because different vocal fold diseases are highly similar, the method could not separate pathological voice types as well; nevertheless, it achieved a three-class recognition rate of 89.05%. This article shows that nonlinear analysis methods can effectively detect pathological voices.

Cross-database training and multi-class recognition remain challenges for pathological voice detection. Considering the diversity of pathological voice databases, future work will carry out pathological voice detection on different databases. In addition, we will continue to study MRQMs or other complexity measures combined with the occurrence model, exploring the phenomenon of turbulent noise in pathological voices in order to improve the multi-class recognition rate. With the development of the Internet of Things and artificial intelligence, Wise Information Technology of 120 (WIT120) is gradually becoming part of medical practice. This work can provide ideas for the self-diagnosis of patients.

Author Contributions

Conceptualization, Z.T.; Validation, X.-J.Z.; Visualization, Y.-H.Z.; Writing—original draft, X.-C.Z.; Writing—review & editing, X.-C.Z. and D.-H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 61271359.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The MEEI database is commercialized and not publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Cohen, S.M.; Kim, J.; Roy, N.; Asche, C.; Courey, M. Prevalence and causes of dysphonia in a large treatment-seeking population. Laryngoscope 2012, 122, 343–348.
2. Hegde, S.; Shetty, S.; Rai, S.; Dodderi, T. A survey on machine learning approaches for automatic detection of voice disorders. J. Voice 2019, 33, 947.e11–947.e33.
3. Kadiri, S.R.; Alku, P. Analysis and detection of pathological voice using glottal source features. IEEE J. Sel. Top. Signal Process. 2019, 14, 367–379.
4. Wu, Y.; Zhou, C.; Fan, Z.; Wu, D.; Zhang, X.; Tao, Z. Investigation and Evaluation of Glottal Flow Waveform for Voice Pathology Detection. IEEE Access 2020, 9, 30–44.
5. Meghraoui, D.; Boudraa, B.; Merazi, T.; Vilda, P.G. A novel pre-processing technique in pathologic voice detection: Application to parkinsons disease phonation. Biomed. Signal Process. Control 2021, 68, 102604.
6. Little, M.; McSharry, P.; Hunter, E.; Spielman, J.; Ramig, L. Suitability of Dysphonia Measurements for Telemonitoring of Parkinson’s Disease. IEEE Trans. Bio-Med. Eng. 2009, 56, 1015.
7. Gomez-Garcia, J.A.; Moro-Velazquez, L.; Godino-Llorente, J.I. On the design of automatic voice condition analysis systems. part ii: Review of speaker recognition techniques and study on the effects of different variability factors. Biomed. Signal Process. Control 2019, 48, 128–143.
8. Teixeira, J.P.; Fernandes, P.O.; Alves, N. Vocal acoustic analysis classification of dysphonic voices with artificial neural networks. Procedia Comput. Sci. 2017, 121, 19–26.
9. Ding, H.; Gu, Z.; Dai, P.; Zhou, Z.; Wang, L.; Wu, X. Deep connected attention (dca) resnet for robust voice pathology detection and classification. Biomed. Signal Process. Control 2021, 70, 102973.
10. Thompson, C.; Mulpur, A.; Mehta, V.; Chandra, K. Transition to chaos in acoustically driven flows. J. Acoust. Soc. Am. 1991, 90, 2097–2108.
11. Thyssen, J.; Nielsen, H.; Hansen, S.D. Non-linear short-term prediction in speech coding. In Proceedings of the ICASSP’94 IEEE International Conference on Acoustics, Speech and Signal Processing, Adelaide, SA, Australia, 19–22 April 1994; Volume 1, pp. 1–185.
12. Erath, B.D.; Plesniak, M.W. Three-dimensional laryngeal flowfields induced by a model vocal fold polyp. Int. J. Heat Fluid Flow 2012, 35, 93–101.
13. Sarvestani, A.B.; Rad, E.G.; Iravani, K. Numerical analysis and comparison of flowfields in normal larynx and larynx with unilateral vocal fold paralysis. Comput. Methods Biomech. Biomed. Eng. 2018, 21, 532–540.
14. Jiang, J.J.; Zhang, Y.; McGilligan, C. Chaos in voice, from modeling to measurement. J. Voice 2006, 20, 2–17.
15. Jiang, J.J.; Zhang, Y.; Ford, C.N. Nonlinear dynamics of phonations in excised larynx experiments. J. Acoust. Soc. Am. 2003, 114, 2198–2205.
16. Arias-Londono, J.D.; Godino-Llorente, J.I. Entropies from markov models as complexity measures of embedded attractors. Entropy 2015, 17, 3595–3620.
17. Little, M.A.; Costello, D.; Harries, M.L. Objective dysphonia quantification in vocal fold paralysis: Comparing nonlinear with classical measures. J. Voice 2011, 25, 21–31.
18. Tsanas, A.; Little, M.; McSharry, P.; Ramig, L. Accurate telemonitoring of parkinsons disease progression by non-invasive speech tests. Nat. Preced. 2009, 1.
19. Vaziri, G.; Almasganj, F.; Behroozmand, R. Pathological assessment of patients speech signals using nonlinear dynamical analysis. Comput. Biol. Med. 2010, 40, 54–63.
20. Eckmann, J.-P.; Kamphorst, S.O.; Ruelle, D. Recurrence plots of dynamical systems. Europhys. Lett. 1987, 4, 973–977.
21. Zbilut, J.P.; Webber, C.L., Jr. Embeddings and delays as derived from quantification of recurrence plots. Phys. Lett. A 1992, 171, 199–203.
22. Marwan, N.; Romano, M.C.; Thiel, M.; Kurths, J. Recurrence plots for the analysis of complex systems. Phys. Rep. 2007, 438, 237–329.
23. Acharya, U.R.; Sree, S.V.; Chattopadhyay, S.; Yu, W.; Ang, P.C.A. Application of recurrence quantification analysis for the automated identification of epileptic eeg signals. Int. J. Neural Syst. 2011, 21, 199–211.
24. He, Q.; Huang, J. Multiwavelet scale multidimensional recurrence quantification analysis. Chaos Interdiscip. J. Nonlinear Sci. 2020, 30, 123109.
25. Qian, Y.; Yan, R.; Hu, S. Bearing degradation evaluation using recurrence quantification analysis and Kalman filter. IEEE Trans. Instrum. Meas. 2014, 63, 2599–2610.
26. Marwan, N.; Wessel, N.; Meyerfeldt, U.; Schirdewan, A.; Kurths, J. Recurrence-plot-based measures of complexity and their application to heart-rate-variability data. Phys. Rev. E 2002, 66, 026702.
27. Vieira, V.J.; Costa, S.C.; Correia, S.L.; Lopes, L.W.; Costa, W.C.D.A.; de Assis, F.M. Exploiting nonlinearity of the speech production system for voice disorder assessment by recurrence quantification analysis. Chaos Interdiscip. J. Nonlinear Sci. 2018, 28, 085709.
28. Lopes, L.W.; Vieira, V.J.D.; Costa, S.L.d.N.C.; Correia, S.É.N.; Behlau, M. Effectiveness of recurrence quantification measures in discriminating subjects with and without voice disorders. J. Voice 2020, 34, 208–220.
29. Al-Nasheri, A.; Muhammad, G.; Alsulaiman, M.; Ali, Z. Investigation of voice pathology detection and classification on different frequency regions using correlation functions. J. Voice 2017, 31, 3–15.
30. Zhang, X.-J.; Zhu, X.-C.; Wu, D.; Xiao, Z.-Z.; Tao, Z.; Zhao, H.-M. Nonlinear features of bark wavelet sub-band filtering for pathological voice recognition. Eng. Lett. 2021, 29, 1–12.
31. Zhou, C.; Wu, Y.; Fan, Z.; Zhang, X.; Wu, D.; Tao, Z. Gammatone spectral latitude features extraction for pathological voice detection and classification. Appl. Acoust. 2022, 185, 108417.
32. Hohmann, V. Frequency analysis and synthesis using a gammatone filter bank. Acta Acust. United Acust. 2002, 88, 433–442.
33. Smith, J.O.; Abel, J.S. Bark and erb bilinear transforms. IEEE Trans. Speech Audio Process. 1999, 7, 697–708.
34. Patterson, R.D.; Holdsworth, J. A functional model of neural activity patterns and auditory images. Adv. Speech Hear. Lang. Process. 1996, 3 Pt B, 547–563.
35. Marwan, N.; Webber, C.L. Mathematical and computational foundations of recurrence quantifications. In Recurrence Quantification Analysis; Springer: Amsterdam, The Netherlands, 2015; pp. 3–43.
36. Takens, F. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence; Springer: Amsterdam, The Netherlands, 1981; pp. 366–381.
37. Fraser, A.M.; Swinney, H.L. Independent coordinates for strange attractors from mutual information. Phys. Rev. A 1986, 33, 1134.
38. Kennel, M.B.; Brown, R.; Abarbanel, H.D. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Phys. Rev. A 1992, 45, 3403.
39. Mindlin, G.M.; Gilmore, R. Topological analysis and synthesis of chaotic time series. Phys. D Nonlinear Phenom. 1992, 58, 229–242.
40. Thiel, M.; Romano, M.C.; Kurths, J.; Meucci, R.; Allaria, E.; Arecchi, F.T. Influence of observational noise on the recurrence quantification analysis. Phys. D Nonlinear Phenom. 2002, 171, 138–152.
41. Webber, C.L., Jr.; Zbilut, J.P. Recurrence quantification analysis of nonlinear dynamical systems. Tutor. Contemp. Nonlinear Methods Behav. Sci. 2005, 94, 26–94.
42. Gao, J.; Cai, H. On the structures and quantification of recurrence plots. Phys. Lett. A 2000, 270, 75–87.
43. Massachusetts Eye and Ear Infirmary. Voice Disorders Database; Version 1.03 (CD-ROM); Kay Elemetrics Corporation: Lincoln Park, NJ, USA, 1994.
44. Saenz-Lechon, N.; Godino-Llorente, J.I.; Osma-Ruiz, V.; Gomez-Vilda, P. Methodological issues in the development of automatic systems for voice pathology detection. Biomed. Signal Process. Control 2006, 1, 120–128.
45. Marwan, N. Cross Recurrence Plot Toolbox for MATLAB, Version 5.24 (r34), Last Mod. Available online: https://tocsy.pik-potsdam.de/crp.php (accessed on 9 March 2022).
46. Fan, Z.; Wu, Y.; Zhou, C.; Zhang, X.; Tao, Z. Class-imbalanced voice pathology detection and classification using fuzzy cluster oversampling method. Appl. Sci. 2021, 11, 3450.

Figure 1. The framework of MRQMs extraction for pathological voice detection and classification.

Figure 2. A 24-channel gammatone magnitude response.

Figure 3. Phase spaces and RPs of normal glottal signal and pathological glottal signal.

Figure 4. Three-dimensional scatter plots of MRQMs.

Figure 5. RPs of normal glottal signal and pathological glottal signal.

Figure 6. The confusion matrix for pathological voice multiclass classification with 10-fold cross validation.

Table 1. The center frequency $f_{i}$ and bandwidth $Bw_{i}$ of the gammatone filter bank.

Channel	$f_{i}$ (Hz)	$Bw_{i}$ (Hz)	Channel	$f_{i}$ (Hz)	$Bw_{i}$ (Hz)
1	0	25	13	1634	205
2	44	30	14	1990	244
3	96	36	15	2413	291
4	158	43	16	2917	346
5	232	51	17	3518	412
6	319	60	18	4233	491
7	424	72	19	5085	584
8	549	86	20	6096	696
9	697	102	21	7308	829
10	874	121	22	8746	987
11	1085	144	23	10,460	1176
12	1335	172	24	12,500	1400
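The bandwidths in Table 1 appear to follow a common gammatone convention in which the filter bandwidth is a scaled equivalent rectangular bandwidth (ERB). The sketch below checks this against the tabulated values. It assumes the Glasberg and Moore ERB formula and a scale factor of 1.019; both are assumptions made for this check rather than parameters confirmed by the table, and the function names are hypothetical.

```python
def erb_hz(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore form, assumed here)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone_bandwidth_hz(f_hz, scale=1.019):
    """Gammatone bandwidth as a scaled ERB; the 1.019 factor is an assumption."""
    return scale * erb_hz(f_hz)

# (center frequency, bandwidth) pairs copied from Table 1, both in Hz.
TABLE_1 = [
    (0, 25), (44, 30), (96, 36), (158, 43), (232, 51), (319, 60),
    (424, 72), (549, 86), (697, 102), (874, 121), (1085, 144), (1335, 172),
    (1634, 205), (1990, 244), (2413, 291), (2917, 346), (3518, 412),
    (4233, 491), (5085, 584), (6096, 696), (7308, 829), (8746, 987),
    (10460, 1176), (12500, 1400),
]

# Largest gap between the assumed formula and the tabulated bandwidths.
max_dev = max(abs(gammatone_bandwidth_hz(f) - bw) for f, bw in TABLE_1)
print(f"largest deviation from Table 1: {max_dev:.2f} Hz")
```

Under these assumptions the formula stays within about 1 Hz of every tabulated bandwidth; small differences are expected because the center frequencies in the table are rounded to whole hertz.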

Table 2. Interpretation of recurrence plot structures.

RP Structure	Interpretation
Diagonal parallel lines	Deterministic process, periodic aspect
Dispersed isolated points	Anti-correlated process, heavy fluctuations
Long diagonal lines	Periodic/quasi-periodic patterns
White bands	Non-stationarity, abrupt transitions
Vertical and horizontal lines	Laminar states, unchanging or slowly changing states
Long bowed lines	The dynamics of the system may be changing

Table 3. Gender and age distribution of the recordings in the chosen subset of MEEI database.

	Subjects		Age Range (Years)		Mean (Years)		Standard Deviation (Years)
	Male	Female	Male	Female	Male	Female	Male	Female
Normal	21	32	26–58	22–52	38.8	34.2	8.49	7.87
Pathological	70	103	26–58	21–51	41.7	37.6	9.38	8.19

Table 4. All voice disorders in detail.

Disorder	Vocal cord lesions	Sulcus vocalis	Vocal fold paralysis	Vocal fatigue	Vocal cords removed	Vocal cord oedema
Number	1	3	67	2	3	43
Disorder	Vocal palsy	Vocal polyp	Vocal nodule	Increased thickness of vocal cords		Vocal tremor
Number	1	20	18	1		13

Table 5. The accuracy (%) of different thresholds $ε$ (rr) under the four types of machine learning.

Threshold $ε$ (rr)	SVM	RF	BN	LWL	Average
1%	95.56	98.22	97.33	98.67	97.45
10%	96.44	98.22	99.11	97.33	97.78
20%	99.11	97.78	99.56	98.22	98.67
30%	98.22	99.11	99.11	98.67	98.78
40%	99.11	99.56	99.56	99.56	99.45
50%	99.11	99.11	99.11	99.11	99.11
60%	98.22	99.11	99.56	98.67	98.89
70%	98.22	99.56	99.56	99.56	99.23
80%	98.67	98.22	99.56	99.11	98.89
90%	96.00	99.11	99.11	98.67	98.22
99%	97.78	98.22	99.11	97.78	98.22
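The Average column of Table 5 and the choice of the 40% rr threshold can be reproduced with a few lines of arithmetic. In the sketch below the accuracies are copied from Table 5; the variable names are hypothetical.

```python
# Accuracy (%) per threshold and classifier, copied from Table 5.
CLASSIFIERS = ("SVM", "RF", "BN", "LWL")
ACCURACY = {
    "1%":  (95.56, 98.22, 97.33, 98.67),
    "10%": (96.44, 98.22, 99.11, 97.33),
    "20%": (99.11, 97.78, 99.56, 98.22),
    "30%": (98.22, 99.11, 99.11, 98.67),
    "40%": (99.11, 99.56, 99.56, 99.56),
    "50%": (99.11, 99.11, 99.11, 99.11),
    "60%": (98.22, 99.11, 99.56, 98.67),
    "70%": (98.22, 99.56, 99.56, 99.56),
    "80%": (98.67, 98.22, 99.56, 99.11),
    "90%": (96.00, 99.11, 99.11, 98.67),
    "99%": (97.78, 98.22, 99.11, 97.78),
}

# Mean accuracy over the four classifiers for each threshold.
averages = {thr: sum(acc) / len(acc) for thr, acc in ACCURACY.items()}

# Threshold with the highest mean accuracy.
best_threshold = max(averages, key=averages.get)
print(best_threshold, f"{averages[best_threshold]:.2f}")
```

The threshold with the highest mean accuracy is 40% rr, with 70% rr a close second, which is consistent with the 40% rr setting used for the RPs in Figure 5.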

Table 6. Summary of selected studies conducted for detection of pathological voice in MEEI database with 10-fold cross-validation.

	Accuracy (%)	Kappa	Precision	Recall	F-Measure	ROC Area
RQMs [27]	91.56	0.773	0.919	0.916	0.917	0.953
RQMs [28]	87.11	0.658	0.878	0.971	0.874	0.921
MFCC	92.00	0.772	0.919	0.920	0.919	0.963
LLE&CD&HE&SE	75.56	0.334	0.760	0.756	0.758	0.786
MRQMs (proposed method)	99.56	0.988	0.996	0.996	0.996	0.995

Table 7. Multiclass classification of MRQMs.

Classifiers	Accuracy	Kappa	Precision	Recall	F-Measure	ROC Area
BN	86.57%	0.799	0.884	0.866	0.866	0.972
SVM	87.56%	0.813	0.899	0.876	0.878	0.929
RF	89.05%	0.836	0.899	0.891	0.890	0.962
LWL	87.56%	0.813	0.890	0.876	0.876	0.961

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

