1. Introduction
As society develops, a growing number of factors increase the likelihood that people will suffer from voice diseases. Voice diseases directly affect people’s phonation function and psychological health [
1], especially for professionals who need to communicate frequently. Timely detection and early prevention of voice diseases provide practical help in treating patients. Compared with traditional medical diagnosis methods, such as laryngeal electromyography, an automatic voice pathology detection (AVPD) system has the advantages of non-invasiveness, objectivity and portability. It helps doctors confirm the effectiveness of treatment plans and makes it easier for patients to self-diagnose. AVPD usually consists of two stages: the first extracts parameters that characterize the acoustic properties of the voice signal, and the second uses machine learning algorithms to decide whether the voice is pathological or healthy [
2]. The focus of the current work is the first procedure.
The acoustic features in speech signals can be roughly grouped into three categories: perturbation features, cepstral features and complexity measures [
3,
4]. Perturbation features, such as jitter and shimmer, describe the aspiration noise generated by the irregular vibration of the vocal folds that voice diseases cause [
5,
6]. Jitter represents the short-term perturbations of the fundamental frequency (F0), and shimmer represents the short-term perturbations in amplitude [
7]. Various statistics of jitter and shimmer, such as relative jitter, relative jitter average perturbation, absolute shimmer and relative shimmer, have been utilized in AVPD [
8]. However, computing perturbation features depends on selecting an appropriate window length and on accurately estimating F0, which is difficult for pathological voices. Cepstral features, in contrast, effectively capture the characteristics of the vocal tract system [
3], such as mel-frequency cepstral coefficients (MFCC) [
9]. MFCCs are the coefficients of a linear (cosine) transform of the logarithmic energy spectrum computed on the nonlinear mel frequency scale.
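As a rough illustration of the perturbation features described above, the following sketch computes relative jitter and relative shimmer from per-cycle measurements. It assumes one common definition (the mean absolute difference between consecutive cycles divided by the mean value); the function name and the sample values are hypothetical and are not taken from the cited works.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive cycles, divided by the mean.

    Applied to pitch periods this gives a relative jitter; applied to
    peak amplitudes it gives a relative shimmer (assumed definition).
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


# Hypothetical per-cycle measurements (illustrative only).
periods_ms = [5.0, 5.1, 4.9, 5.0, 5.2]        # pitch period of each cycle
amplitudes = [0.80, 0.82, 0.79, 0.81, 0.80]   # peak amplitude of each cycle

relative_jitter = relative_perturbation(periods_ms)
relative_shimmer = relative_perturbation(amplitudes)
print(f"relative jitter:  {relative_jitter:.4f}")
print(f"relative shimmer: {relative_shimmer:.4f}")
```

Both measures require the cycle boundaries to be known in advance, which is exactly the F0 estimation step that becomes unreliable for pathological voices.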
In the past, perturbation features and cepstral features have been widely used in pathological voice detection. Both are based on linear acoustic theory: the linear source/filter model assumes speech production with planar sound propagation. With the development of nonlinear dynamics, Vazir et al. pointed out that nonlinear phenomena are widespread in the process of speech production. Thompson et al. used aerodynamic theory to study voice signals and found that voice production is neither a deterministic linear process nor a random process, but a nonlinear process [
10]. The research of Thyssen et al. showed that the turbulence formed in the vocal tract during the speech production process is the reason for the chaotic characteristics of speech signals [
11]. The physical model of the throat [
12] and the fluid dynamics model [
13] also showed that describing sound with nonlinear fluid dynamics is more realistic. Unlike perturbation, spectral and cepstral features, complexity measures represent the aperiodic, non-stationary and nonlinear characteristics of the speech signal [
14]. Voice diseases directly affect the vibration of the vocal folds, and non-linear dynamic analysis can effectively capture this change. Complexity measures [
15,
16] have been used in previous works, including the largest Lyapunov exponent (LLE), Hurst exponent (HE), sample entropy (SE), correlation dimension (CD), Shannon entropy, etc. These popular features tend to describe the signal’s dynamics, periodicity and regularity [
17,
18].
However, the non-stationary nature of speech causes significant fluctuations in the nonlinear representation obtained by calculation [
19]. Traditional speech signal processing methods use short-term analysis to avoid the nonlinear and non-stationary problems of speech signals. Eckmann et al. [20] proposed recurrence plots (RPs) for nonlinear data analysis, which represent the global state correlation of a signal over its full time scale. Recurrence is a fundamental feature of many dynamical systems: at certain times the system returns to states similar to earlier ones. Recurrence occurs in both nonlinear and chaotic systems. An RP makes it possible to study the periodicity of a high-dimensional phase space trajectory through a two-dimensional representation, so RPs give deeper insight into how the system evolves over time. The recurrence quantification analysis (RQA) technique developed by Zbilut and Webber [
21] is a widely used tool in non-stationary time series analysis. It obtains quantitative information on nonlinear dynamical systems by analyzing the distribution of structures in RPs, such as diagonal lines and vertical lines. Compared with traditional nonlinear analysis methods, RQA has clear advantages in its sensitivity to changing dynamic characteristics, including non-stationarity. In recent years, RQA has been widely used in various fields, such as life sciences, earth sciences, finance and physics [
22,
23,
24,
25,
26]. In pathological voice detection, Vieira and Costa et al. extracted recurrence quantification measurements (RQMs) directly from the original voice signal to identify normal and distinct pathological voices, with an average recognition rate of over 90% [
27]. Lopes et al. [
28] introduced the embedding dimension and delay time into the RQMs and analyzed the accuracy of different combinations of RQMs in distinguishing individuals with and without voice disorders. The results show that RQA has substantial discriminative potential in pathological voice detection.
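The recurrence plot and two typical RQMs can be sketched as follows. This is a minimal illustration, not the procedure of the cited works: it assumes time-delay embedding, a Euclidean norm, a fixed distance threshold `eps`, and the usual definitions of recurrence rate (share of recurrent points, main diagonal included here) and determinism (share of off-diagonal recurrent points lying on diagonal lines of length at least `lmin`). All function names and parameter values are hypothetical.

```python
import math


def embed(signal, dim, delay):
    """Time-delay embedding: each state is (x[i], x[i+delay], ...)."""
    n = len(signal) - (dim - 1) * delay
    return [[signal[i + k * delay] for k in range(dim)] for i in range(n)]


def recurrence_matrix(states, eps):
    """R[i][j] = 1 when states i and j are closer than eps."""
    n = len(states)
    return [[1 if math.dist(states[i], states[j]) <= eps else 0
             for j in range(n)] for i in range(n)]


def recurrence_rate(rp):
    """Fraction of recurrent points in the plot."""
    n = len(rp)
    return sum(map(sum, rp)) / (n * n)


def diagonal_line_lengths(rp):
    """Lengths of all diagonal segments, excluding the main diagonal."""
    n = len(rp)
    lengths = []
    for offset in range(-(n - 1), n):
        if offset == 0:
            continue
        run = 0
        for i in range(n):
            j = i + offset
            if 0 <= j < n and rp[i][j]:
                run += 1
            else:
                if run:
                    lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths


def determinism(rp, lmin=2):
    """Share of off-diagonal recurrent points on lines of length >= lmin."""
    lengths = diagonal_line_lengths(rp)
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= lmin) / total


# Illustrative periodic test signal (period of 20 samples).
signal = [math.sin(2 * math.pi * i / 20) for i in range(100)]
rp = recurrence_matrix(embed(signal, dim=2, delay=5), eps=0.2)
rr = recurrence_rate(rp)
det = determinism(rp)
print(f"recurrence rate = {rr:.4f}, determinism = {det:.2f}")
```

For a strictly periodic signal such as this one, all off-diagonal recurrent points fall on long diagonal lines; irregular vibration would break these lines and add isolated points.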
In addition, some studies have pointed out the effectiveness of multi-scale features in pathological voice detection [
29]. The perceptual characteristics of the auditory system prompt researchers to explore auditory features for pathological voice detection. Typically, the extracted auditory feature is MFCC, which meets the principles of human auditory perception in the mel scale. According to pathological voice energy distribution characteristics, Zhang [
30] used the frequency division method to improve the accuracy rate in the bark scale. Zhou [
31] extracted multi-scale nonlinear features GTSLs in the equivalent rectangular bandwidth scale. These multi-scale features provide better resolution on frequency scales.
In this paper, we propose Multi-scale Recurrence Quantification Measurements (MRQMs). Through multi-scale auditory analysis, the proposed features decompose a non-stationary, nonlinear complex sequence into a set of frequency subsequence features. MRQMs mitigate the significant fluctuations in nonlinear representations caused by the non-stationarity of speech. Furthermore, they exploit the acoustic characteristics of the auditory system to improve the accuracy of pathological voice detection.
3. Material and Evaluation
Pathological voice contains a wealth of information, including not only the original voice information to be conveyed but also the speaker’s age, gender and voiceprint. In order to evaluate the method proposed in this work, we utilized the Massachusetts Eye and Ear Infirmary (MEEI) database [
43], which contains the normal and pathological voice /a:/ with the patient’s expert diagnosis results, gender and smoking status. The vowel /a:/ sound will be more affected by acoustic characteristics.
From this database, 53 normal voice samples and 173 pathological voice samples with various disorders were selected as a data subset. The subset was chosen by considering the different voice disorders, gender, and the mean and standard deviation of age in the normal and pathological voice databases [44]. The statistics of the voice samples selected from the MEEI database in this work are shown in Table 3. All voice disorders are listed in Table 4. Among them, 53 normal voice samples, 20 vocal fold polyp samples and 67 vocal fold paralysis samples were selected as multi-class samples. All voice signals were sampled with 16-bit resolution. The sampling rate of healthy voice signals was 50 kHz, and that of pathological voice signals was 25 kHz or 50 kHz. Before any processing, we down-sampled all the voice signals to 25 kHz.
This work presents multi-scale recurrence quantification measurements for voice disorder detection and classification. We used MATLAB 2020b to extract this feature. The cross recurrence plot toolbox was used to build recurrence plots and extract recurrence quantification measurements [
45]. All machine learning training and testing were done in the Waikato Environment for Knowledge Analysis (WEKA) software. Four traditional machine learning algorithms were used for the pathological voice detection experiments. After repeated parameter adjustment and optimization, the final model parameters were set as follows: (1) The support vector machine (SVM) used the PolyKernel kernel function with a penalty factor of one. (2) The random forest (RF) used two hundred trees with a maximum depth of four. (3) The Bayesian network (BN) used the tree augmented naive Bayes algorithm. (4) Locally weighted learning (LWL) used a random forest classifier and FilteredNeighbourSearch. The same four algorithms were used for the pathological voice classification experiments, with the following parameters: (1) The support vector machine used the PUK kernel function with a penalty factor of two. (2) The random forest used fifty trees with a maximum depth of twenty. (3) The Bayesian network used the K2 search algorithm. (4) Locally weighted learning used a random forest classifier and FilteredNeighbourSearch. Because the pathological samples were unbalanced, the FC-SMOTE algorithm was adopted to balance the samples in the multi-classification experiments [
46]. Ten-fold cross-validation was used to test the accuracy of the algorithm. All voice samples were divided into ten equal parts. In each round, nine folds were used as the training set and the remaining fold as the test set. The process was repeated ten times, so that every fold served once as the test set.
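The fold rotation described above can be sketched in a few lines. This is only an illustration of the splitting logic and not the WEKA implementation: the function name is hypothetical, the shuffle seed is arbitrary, and the sketch does not stratify the folds by class. The sample count of 226 corresponds to the 53 normal and 173 pathological samples of the detection subset.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in indices if i not in held_out]
        yield train, test_fold


splits = list(k_fold_indices(226, k=10))
for round_no, (train, test) in enumerate(splits, start=1):
    print(f"round {round_no}: {len(train)} training samples, {len(test)} test samples")
```

Because 226 is not divisible by ten, the folds here contain 22 or 23 samples each.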
4. Results and Discussion
Although recurrence quantification analysis captures some recurrence states of pathological voice, it cannot comprehensively describe the dynamic recurrence characteristics of voice signals. By the definition of the recurrence plot, the choice of recurrence threshold plays a vital role in extracting MRQMs. A threshold that is too large or too small changes the states shown in the RP, which hinders practical analysis of the system and thus affects the recognition results. We therefore sought a threshold that makes the diagonal and vertical recurrence structures of pathological voices easier to distinguish.
Table 5 lists the pathological voice detection results of four machine learning algorithms (SVM, RF, BN and LWL) at different thresholds.
The table shows that the detection accuracy of the four classifiers varied with the recurrence threshold. Nevertheless, the overall accuracy remained high, which shows that multi-scale recurrence quantification measurements are effective for pathological voice detection over a wide range of thresholds. The average accuracy rose from 1%rr to 40%rr and fell from 40%rr to 99%rr. At 40%rr, the average accuracy of pathological voice detection over the four classifiers reached 99.45%, the best result among the thresholds tested. Therefore, 40%rr was selected as the recurrence threshold.
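One way to read a threshold such as 40%rr is as a fixed recurrence rate: the distance threshold is chosen so that about 40% of the points in the recurrence plot are recurrent. The sketch below works under that assumption, which the text does not state explicitly, and picks the threshold as a quantile of all pairwise state distances. The function names and the toy trajectory are hypothetical.

```python
import math


def threshold_for_recurrence_rate(states, target_rate):
    """Smallest distance threshold whose recurrence rate is >= target_rate."""
    n = len(states)
    distances = sorted(math.dist(states[i], states[j])
                       for i in range(n) for j in range(n))
    rank = max(1, math.ceil(target_rate * len(distances)))
    return distances[rank - 1]


def achieved_rate(states, eps):
    """Recurrence rate actually obtained with threshold eps."""
    n = len(states)
    hits = sum(1 for i in range(n) for j in range(n)
               if math.dist(states[i], states[j]) <= eps)
    return hits / (n * n)


# Toy one-dimensional trajectory (illustrative only).
states = [[float(i)] for i in range(10)]
eps = threshold_for_recurrence_rate(states, target_rate=0.40)
rate = achieved_rate(states, eps)
print(f"eps = {eps}, achieved recurrence rate = {rate:.2f}")
```

When several pairs share the same distance, the achieved rate can exceed the target slightly, as it does in this toy example.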
To examine in more detail why a recurrence threshold of 40%rr suits pathological voice detection better than the other thresholds, the three multi-scale recurrence quantification measurements ranked highest by the FDR algorithm were used to draw three-dimensional scatter plots. Figure 4 shows that these three features separate the two types of samples well. The blue circles represent normal voice samples, and the red crosses represent pathological voice samples. At 40%rr, the two classes overlapped only slightly, whereas at 1%rr, 80%rr or 99%rr the overlap was considerably greater. This shows that the proposed method effectively separates normal samples from pathological samples and achieves superior binary classification performance.
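The feature ranking step can be illustrated with a short sketch. It assumes that FDR denotes the Fisher discriminant ratio, computed per feature as the squared difference of the class means divided by the sum of the class variances; the text does not spell out the formula, so this is an assumption. The function names and the tiny feature matrices are hypothetical.

```python
from statistics import mean, pvariance


def fisher_discriminant_ratio(a, b):
    """(mean_a - mean_b)^2 / (var_a + var_b) for one feature (assumed FDR)."""
    numerator = (mean(a) - mean(b)) ** 2
    denominator = pvariance(a) + pvariance(b)
    if denominator == 0:
        # Degenerate case: no spread in either class.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def rank_features(normal, pathological):
    """Return feature indices ordered from most to least discriminative."""
    n_features = len(normal[0])
    scores = []
    for f in range(n_features):
        a = [row[f] for row in normal]
        b = [row[f] for row in pathological]
        scores.append((fisher_discriminant_ratio(a, b), f))
    return [f for _, f in sorted(scores, reverse=True)]


# Illustrative data: rows are samples, columns are features.
normal = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0]]
pathological = [[3.0, 4.5], [3.2, 3.5], [2.8, 4.0]]

ranking = rank_features(normal, pathological)
print("features ranked by FDR:", ranking)
```

Taking the first three entries of such a ranking would give the axes of a three-dimensional scatter plot like the one discussed above.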
The RP structure underlying the multi-scale recurrence quantification measurements also illustrates the difference between the two classes when the threshold is 40%rr.
Figure 5 shows the RPs of the normal and pathological glottal signals in the tenth frequency band. The diagonal recurrence of the normal glottal source signal was clear: the diagonals were parallel, with no vertical or horizontal lines and no scattered points, indicating that the signal was periodic. The RP of the pathological glottal source signal consisted mainly of lines parallel to the diagonal, but it also contained some isolated scattered points. The many recurrence points on the diagonals indicate that the pathological voice still had a certain periodicity, although its vibration was abnormal. The isolated points indicate that the pathological voice signal was not stable and contained anti-correlated processes and strong fluctuations.
For comparison with the proposed method, features from other recent works were also applied to pathological voice detection. All of them used 10-fold cross-validation, and all experiments on the MEEI database used the same classifier, BN. The results are listed in Table 6. The closer an evaluation indicator is to 1, the better the effectiveness and robustness of the feature. Refs. [
27,
28] both showed an accuracy of nearly 90%, indicating that RQA has strong potential in pathological voice detection. However, they did not improve on RQA itself and only explored combinations of RQMs. The method proposed in this article analyzed the signal itself and selected the best recurrence threshold to obtain a more accurate RP structure, using 13 RQMs to quantify recurrence features. Its accuracy was also higher than that of directly extracting RQMs for pathological voice detection, improving by 8% and 12.45%, respectively, over the RQMs in [
27] and RQMs in [
28].
MFCCs are typical feature parameters for acoustic tasks. They achieved a recognition rate of 92% on raw voice signals. Traditional cepstral parameters can characterize changes in the vocal tract system, but they cannot describe the changes in the vocal fold vibration mechanism that voice diseases cause directly, so they cannot capture all pathological voice information. The accuracy of the nonlinear feature combination was only 75.56%, far below that of traditional MFCCs. The short-term non-stationary characteristics of speech signals limit the application of traditional nonlinear time series analysis methods. The method proposed in this paper uses RQMs to quantitatively analyze the dynamic nonlinear behavior of different aspects of the speech dynamic system. MRQMs combined with machine learning achieved the best results in accuracy, kappa value, precision, recall, F-measure and ROC area. MRQMs analyze the glottal signal from the perspective of vocal fold vibration, filtering out the influence of lip radiation on pathological voice detection, and quantify the recurrence characteristics of different frequency bands on the ERB scale. MRQMs therefore showed excellent detection performance and promising application prospects.
In this paper, the multi-classification experiments used normal voice samples, vocal fold paralysis samples and vocal fold polyp samples. The confusion matrix in Figure 6 gives further detail on the multi-classification experiment. The true positive rate of normal voice and vocal fold paralysis was more than 90%. Vocal fold polyps were confused with the normal voice and vocal fold paralysis categories. Other indicators of the multi-classification results are listed in Table 7.
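For readers less familiar with confusion matrices, the sketch below shows how per-class true positive rates and overall accuracy are read from one. The counts are invented purely for illustration and are not the values of Figure 6 or Table 7; the class order and function names are likewise hypothetical.

```python
def per_class_recall(confusion):
    """True positive rate of each class; rows are true classes."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def overall_accuracy(confusion):
    """Share of all samples that lie on the main diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Invented counts (NOT from the paper). Rows: true class, columns: predicted.
labels = ["normal", "polyp", "paralysis"]
confusion = [
    [9, 1, 0],   # true normal
    [2, 6, 2],   # true polyp
    [0, 1, 9],   # true paralysis
]

recalls = per_class_recall(confusion)
accuracy = overall_accuracy(confusion)
for label, recall in zip(labels, recalls):
    print(f"{label}: true positive rate = {recall:.2f}")
print(f"overall accuracy = {accuracy:.2f}")
```

In this made-up example the polyp row spreads its errors over both other classes, which is the kind of pattern the text describes for vocal fold polyps.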
Compared with the original voice signal, the glottal signal removes effects such as lip radiation, so it better reflects the characteristics of vocal fold vibration. MRQMs used the glottal signal as the raw signal and used the characteristics of human auditory perception to extract multi-scale recurrence features. From the point of view of nonlinear dynamic recurrence, although the pathological voice was a relatively deterministic signal, there were breaks in the diagonals of its RP. The relatively complete diagonal recurrence structure of the RP of the pathological voice in each channel indicates that, although the patient’s vocal folds had lesions, they still maintained approximately periodic vibration. According to the density of the diagonals, the vibration cycle of the diseased vocal folds in this case was longer than that of the normal voice, which may be caused by the slow vibration of the diseased vocal folds. In addition, the diagonals of the RP of the pathological voice were burred, indicating that vertical or horizontal recurrence, as distinct from diagonal recurrence, existed near the diagonals. Compared with normal vocal fold vibration, rapid collision and separation occurred in patients with pathological voice. Therefore, pathological voice had a denser diagonal recurrence structure than normal voice, which reflected a vibration period shorter than that of normal voice. Periodic tensioning of the vocal folds determines the fundamental frequency of voice signals. For different vocal fold diseases, the air pressure and muscle tension at the glottal diaphragm and the instability of the vocal fold state result in different voice signal waveforms. However, different vocal fold diseases also had similar recurrence characteristics, so the classification of vocal fold diseases did not achieve ideal results.
5. Conclusions
MRQMs were proposed to quantify the nonlinear features in the speech signal accurately and thereby detect and classify pathological voice effectively. A recognition rate of 99.56% was achieved, and each evaluation index was the best, demonstrating the effectiveness of the proposed method. Because of the high similarity among different vocal fold diseases, the method could not classify different pathological voice types well, although it still achieved a three-class recognition rate of 89.05%. This article shows that the nonlinear analysis method can effectively detect pathological voices.
Cross-database training and multi-class recognition remain challenges for pathological voice detection. Considering the diversity of pathological voice databases, we will carry out pathological voice detection on different databases in the future. In addition, we will continue to study MRQMs or other complexity measures combined with the occurrence model to explore the phenomenon of turbulent noise in pathological voices and improve the multi-class recognition rate. With the development of the Internet of Things and artificial intelligence, Wise Information Technology of 120 is gradually entering medical practice. This work can provide ideas for the self-diagnosis of patients.