Article

Unmasking Nasality to Assess Hypernasality

by Ignacio Moreno-Torres 1,*, Andrés Lozano 2, Rosa Bermúdez 3, Josué Pino 4, María Dolores García Méndez 1 and Enrique Nava 2

1 Department of Spanish Philology, University of Málaga, 29071 Málaga, Spain
2 Department of Communication Engineering, University of Málaga, 29071 Málaga, Spain
3 Department of Personality, Evaluation and Psychological Treatments, University of Málaga, 29071 Málaga, Spain
4 Department of Speech-Language and Hearing Science, University of Chile, Santiago de Chile 9170022, Chile
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12606; https://doi.org/10.3390/app132312606
Submission received: 2 October 2023 / Revised: 15 November 2023 / Accepted: 21 November 2023 / Published: 23 November 2023
(This article belongs to the Special Issue Advances in Speech and Language Processing)

Featured Application

The results of this study provide key information, both linguistic and technical, for using signals recorded close to the nose to evaluate hypernasality automatically. They also suggest that the accuracy of hypernasality assessment tools may be improved by analyzing nose signals.

Abstract

Automatic evaluation of hypernasality has traditionally been based on monophonic signals (i.e., signals that combine the nose and mouth components). Here, we aimed to examine whether nose signals serve to increase the accuracy of hypernasality evaluation. Using a conventional microphone and a Nasometer, we recorded monophonic, mouth, and nose signals. Three main analyses were performed: (1) comparing the spectral distance between oral and nasalized vowels in monophonic, nose, and mouth signals; (2) assessing the accuracy of Deep Neural Network (DNN) models trained with nose, mouth, and monophonic signals in classifying oral/nasal and vowel/consonant sounds; and (3) analyzing the correlation between DNN-derived nasality scores and expert-rated hypernasality scores. The distance between oral and nasalized vowels was highest in the nose signals. Moreover, DNN models trained on nose signals performed best in nasal/oral classification (accuracy: 0.90) but were slightly less precise in vowel/consonant differentiation (accuracy: 0.86) than models trained on the other signals. A strong Pearson correlation (0.83) was observed between the nasality scores of DNNs trained with nose signals and human expert ratings, whereas DNNs trained on mouth signals showed a weaker correlation (0.36). We conclude that mouth signals partially mask the nasality information carried by nose signals. Significance: the accuracy of hypernasality assessment tools may improve by analyzing nose signals.

1. Introduction

Hypernasality is a speech condition in which the speaker produces nasal sounds while attempting to produce oral ones [1]. Hypernasality may occur due to anatomic malformations (e.g., cleft palate) that let the air in the mouth cavity escape towards the nose cavity, or due to a lack of precise motor skills that results in patients failing to raise the velum [2]. Evaluating hypernasality is highly relevant because it may guide the rehabilitation process, particularly in the case of cleft palate patients [3].
Traditionally, hypernasality evaluation has been carried out by human experts, who produce a subjective perceptual measure of the degree of nasality. Despite its subjective, and hence variable, nature, this approach is still considered the gold standard for evaluating hypernasality [4]. In the last quarter of the 20th century, an acoustic instrument, the Nasometer, was proposed to supplement human judgments [5]. The Nasometer records the nose and mouth acoustic signals separately, which makes it possible to compute the nasalance, defined as the ratio of nasal acoustic energy to the sum of nasal and oral acoustic energy. The Nasometer has had relative success in clinical practice [4,6]. However, the reliability of this instrument is unclear; studies comparing nasalance scores and perceptual scores have obtained correlations ranging from non-significant to strong (for a review, see [7]).
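As a concrete illustration of this definition (a minimal sketch, not the Nasometer's own implementation), the following Python fragment computes a frame-wise nasalance score from an aligned two-channel nose/mouth recording; the frame length and hop size (in samples) are arbitrary choices for the example.

```python
# Minimal sketch (not the Nasometer's implementation): nasalance as the ratio
# of nasal acoustic energy to the sum of nasal and oral acoustic energy,
# computed over short frames of an aligned two-channel (nose/mouth) recording.
import numpy as np

def nasalance(nose: np.ndarray, mouth: np.ndarray,
              frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Return frame-wise nasalance scores (in %) for aligned nose/mouth signals."""
    scores = []
    n_samples = min(len(nose), len(mouth))
    for start in range(0, n_samples - frame_len + 1, hop):
        n = nose[start:start + frame_len].astype(np.float64)
        m = mouth[start:start + frame_len].astype(np.float64)
        e_nose = np.sqrt(np.mean(n ** 2))    # RMS energy, nose channel
        e_mouth = np.sqrt(np.mean(m ** 2))   # RMS energy, mouth channel
        if e_nose + e_mouth > 0:
            scores.append(100.0 * e_nose / (e_nose + e_mouth))
    return np.array(scores)
```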
In the last two decades, motivated by advances in speech processing technology and artificial intelligence, several proposals have been made that combine novel acoustic features and machine learning algorithms [8]. One issue that has received little attention is the type of target signal for these analyses. To our knowledge, all previous studies have used standard monophonic signals, which result from combining (coupling) the two signals produced by the speaker (i.e., the mouth and nose signals). However, as nasalance studies show, it is also possible to analyze these signals separately.
In the present study, we examine to what extent using either the mouth or the nose signal exclusively might be a better alternative for evaluating hypernasality than the traditional monophonic signal. If this possibility is confirmed, it would have practical implications. On the one hand, it might serve as an approach to increase the accuracy of advanced speech processing algorithms. On the other hand, it might point to a potential alternative to the Nasometer, a device that is notably expensive (>1500 USD in many countries) and hence unaffordable for many audiologists and speech therapists around the world.
There are some indications that nose signals might be an optimal target to explore nasality. For instance, a recent study [9] recorded the nose and mouth signals with a Nasometer, and simultaneously obtained a high-speed video of nasopharyngoscopy; then, the author correlated the two acoustic signals with the video signal and the correlation was significant only in the case of the nose acoustic signal. Also, experiments in our lab have shown that in the case of nasalized vowels, naïve listeners detect nasality easily when the sounds are recorded with a nose microphone, but not when the same sounds are recorded with a mouth microphone or a monophonic one. One potential explanation for these results is that the nose signal is the main carrier of nasality information [10], but this information is partially masked by the mouth signal. From a practical perspective, this means that hypernasality might be optimally detected using this signal. However, to our knowledge, no study analyzing automatic detection of hypernasality has explored this possibility.
The main aims of this study are to analyze to what extent the oral/nasal sound contrast is encoded optimally in nose signals, and to explore if hypernasality can be assessed using nose signals. To this end, the differences between different signal types will be analyzed from three perspectives: (1) a low-level acoustic perspective (i.e., the spectral contrast between oral and nasal sounds); (2) a phonetic perspective (i.e., information transmission); and finally, (3) a clinical perspective (i.e., the impact on the accuracy of one hypernasality assessment algorithm).

1.1. Acoustic Contrast between Oral and Nasal Sounds

Phonetic and speech processing textbooks [11,12] typically describe nasal sounds (e.g., /m, ẽ…/) by showing how the addition of the nasal resonance modifies the resonance of the corresponding oral sound (e.g., /b, e/). These modifications include, among others, the addition of an extra nasal formant around 250 Hz, increased spectral flatness, and reduced first formant amplitude.
While nose and mouth signals have been recorded in many nasalance studies in the past, to our knowledge, not much attention has been paid to the nasal/oral contrast in each signal. However, as noted above, perceptual experiments carried out in our lab indicate that the contrast is more pronounced in the nose signals than in the mouth signals. These perceptual impressions can be confirmed through spectrographic analysis. As an example, we present in Figure 1 the Long-Term Average Spectrum (LTAS) of an oral vowel /e/ and its nasalized counterpart /ẽ/. The two vowels were recorded simultaneously with a standard monophonic microphone and a Nasometer device (with a nose and a mouth microphone). Note that, in Figure 1, the black and red lines are notably close to each other in the case of the mouth and monophonic microphones; in contrast, the two lines do not overlap for the nose microphone. This suggests that the nasal/oral contrast might be easier to recognize in the nose signal. However, this is just one example, so a more detailed analysis is needed. One possible approach to comparing pairs of speech sounds (e.g., /e/–/ẽ/) is to compute the Euclidean distance between acoustic feature vectors. Here, we will compute the distance between Mel-Frequency Cepstral Coefficients (MFCCs), as this measure tends to correlate with perceptual discriminability [11].
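As an informal illustration of the LTAS comparison in Figure 1 (a sketch under our own assumptions, not the exact analysis script; the file names are hypothetical and mono WAV input is assumed), the spectrum of each vowel token can be estimated with Welch averaging and the oral and nasalized tokens compared directly:

```python
# Illustrative LTAS comparison (assumed helper; file names are hypothetical
# and the analysis script used for Figure 1 may differ). Mono WAV input assumed.
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

def ltas_db(path: str, nperseg: int = 1024):
    """Long-Term Average Spectrum in dB, estimated with Welch averaging."""
    fs, x = wavfile.read(path)
    x = x.astype(np.float64)
    if x.ndim > 1:                       # keep only the first channel if stereo
        x = x[:, 0]
    x /= np.max(np.abs(x)) + 1e-12       # amplitude normalization
    freqs, psd = welch(x, fs=fs, nperseg=nperseg)
    return freqs, 10.0 * np.log10(psd + 1e-12)

# Compare an oral /e/ token with its nasalized counterpart (nose-channel recording).
f, oral_db = ltas_db("e_oral_nose.wav")
_, nasal_db = ltas_db("e_nasal_nose.wav")
print("Mean level difference (nasal - oral): %.1f dB" % np.mean(nasal_db - oral_db))
```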

1.2. Phonetics

While the spectral distance may provide valuable information, it is not sufficient to confirm that two signals differ in their effectiveness in encoding phonetically relevant information (e.g., nasality). This is because speech simultaneously encodes multiple types of phonetically relevant information (e.g., nasality, voicing, place of articulation, intonation, sociolinguistic information, and emotional content [12]). Thus, it is necessary to examine the extent to which listeners (humans or machines) can discriminate sound types given different signals.
In order to clarify the phonetic content of speech signals, researchers have analyzed the errors produced by listeners (humans or computers) when attempting to classify sounds [13,14]. Commonly, the target sounds and the listeners' responses are tabulated in confusion matrices, from which various kinds of information can be obtained (e.g., error biases). Here, we use this approach to examine oral versus nasal and vowel versus consonant errors.

1.3. Automatic Assessment of Hypernasal Speech

Many proposals have been made to evaluate hypernasality automatically, both in terms of the acoustic features used to process the speech signal and in terms of the automatic classification method. The features include spectral poles and zeros, increased spectral flatness, formant amplitude, the voice low-to-high tone ratio, and MFCCs [15,16,17,18,19,20], while the classifiers include Support Vector Machines (SVMs), Gaussian Mixture Models (GMMs), and Deep Neural Networks (DNNs) [18,19,21,22]. Most studies have trained their classifiers using samples of healthy and hypernasal speech. However, this approach has two important limitations: (1) the size of the training databases is relatively small (i.e., in most cases up to one or two hundred speakers producing a small number of utterances each), and (2) most studies processed only sustained vowels [23,24] or vowel fragments that had been manually annotated in words or sentences [25,26,27,28]; this approach is not compatible with clinical protocols that emphasize the need to use diverse speech samples (e.g., running speech [29]).
Recently, Mathad et al. [30] proposed an innovative approach (see Figure 2) that overcomes the two above-mentioned limitations. In their model, training was carried out with speech samples from healthy speakers, which allowed them to use one freely available large database to train a DNN model. Specifically, they used 100 h from the Librispeech corpus [31]. To train the system, the speech samples were divided into short windows from which MFCCs were computed and, based on phonetic transcriptions, classified into one of four classes: oral consonant (OC), oral vowel (OV), nasal consonant (NC), and nasal vowel (NV). During the testing phase, the system was fed new MFCC vectors, which were classified as belonging to one of the four classes. This allows the severity of nasality to be computed as the number of nasal frames divided by the total number of frames (per speaker or utterance). The results showed that the correlation between the DNN scores and those produced by Speech and Language Pathology (SLP) experts was very high (r = 0.80).
One practical limitation of this approach is that the authors used a large amount of training data (>100 h) to train their DNN, and such large databases are available only for a very small number of languages. Despite this limitation, it is relevant to compare their results with nasalance studies (as these also evaluate hypernasality based on running speech). Selecting only nasalance studies with enough participants to compute the Pearson correlation, we find that one study obtained a very high correlation of 0.88 [32], but in all other studies, the correlation is notably lower: 0.74 [33], 0.59 [34], and 0.55 [35]. Thus, the above-described DNN model might be a good alternative to the Nasometer. However, more studies are needed to confirm its reliability. It is also relevant to explore whether the results are optimal using nose signals.

1.4. Experiments Carried Out in This Study

In this study, we address three questions. First, we ask to what extent the spectral differences between oral and nasalized vowels vary as a function of the target signal. Three types of signals were used for this analysis: nose, mouth, and monophonic. As the DNN model selected for the experiments is trained with MFCCs, we computed the Euclidean distance using these coefficients. We expected the difference between oral and nasal sounds to be larger in the nose signal than in the mouth or monophonic signals.
Next, we asked to what degree nasality information is more optimally encoded in the nose signal than in the mouth or monophonic signal. To this end, we conducted an experiment in which multiple DNN models were created using a cross-validation approach. The models were trained, in each case, with one signal type. Based on the output of the DNN models, confusion matrices were created with the four phonetic classes: OV, OC, NV, and NC. The confusion matrices were then used to compute the nasality errors (i.e., confusions between oral and nasal sounds) and the vowel/consonant errors (i.e., confusions between vowel and consonant sounds). We expected nasality errors to be less frequent in the nose signal than in the mouth or monophonic signal.
Finally, we ran a second experiment to compare the effectiveness of nose- and mouth-trained DNN models in evaluating hypernasal speech. The DNN models were used to obtain a hypernasality score for two groups of children, healthy and hypernasal. The output of each DNN evaluation was used to compute the Pearson correlation with the scores produced by SLP experts. Note that the DNN models were always trained and tested with only one signal type; the monophonic data were excluded from this experiment because the test database had been recorded exclusively with the Nasometer. We expected the scores of the human experts to correlate more strongly with the output of the nose-trained DNN models than with that of the mouth-trained models.

2. Databases

The present study analyzed speech samples from a large number of healthy and hypernasal speakers. All the participants, or the parents/tutors in the case of minors, signed an informed consent form. The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of the University of Málaga (protocol code 67-2023-H, 13 June 2023). The Python scripts, as well as the databases used for the present study, can be downloaded from https://github.com/Caliope-SpeechProcessingLab/Unmasking-Nasality.git (accessed on 20 November 2023).

2.1. Training Databases for Experiment One

The speech data used for this experiment consisted of recordings from two groups of volunteers reading aloud different Spanish texts. All the speakers were young female university students aged between 17 and 25 years. All of them lived in southern Spain and were native speakers of Spanish. This age group was selected because their fundamental frequency is not too different from that of the younger children recorded for experiment two.
One group of speakers (N = 55) was recorded with an icSpeech Nasometer (Rose Medical Solutions Ltd., Canterbury, UK) and the other (N = 189) was recorded with a Shure WH20XLR microphone (Shure Incorporated, Niles, IL, USA) connected to a Zoom H4n recorder (Zoom Corporation, Tokyo, Japan) (see Figure 3). Each speaker read aloud between one and three phonetically balanced texts [36,37,38], as well as a text containing multiple nasal sounds that were created for the present study.
Based on the Nasometer recordings, two databases were obtained by selecting one channel each from the two-channel audio (Figure 3). A third database was created from the monophonic recordings. Before feature extraction, each full signal (corresponding to one speaker) was normalized to 65 dB.
Once the speech databases had been collected, the speech samples were annotated using Praat v6.1.53 [39], and the Montreal Forced Aligner v1.0.0 tool [40] was used to obtain a phonetic transcription. Next, each signal was split into 25 ms frames (with a 10 ms step), and the MFCCs for each frame were computed. Following [30], frames were classified into four major classes: OV, OC, NC, and NV. In the case of nasal consonants, oral consonants, and oral vowels, classification was automatic (because the Spanish phonemic inventory, like that of English, includes these sound types). As for nasalized vowels, these are merely phonetic variants that occur when a vowel is in contact with a nasal consonant (as in the first vowel of “masa”). We classified the section of each vowel closest to the nasal consonant as nasal. As vowel length in Spanish is highly variable and nasality is more evident in the sections closest to the nasal consonant, the nasalized fragment was selected as follows (see the sketch below): 30% of the vowel when the vowel was at least 100 ms long; 50% when the vowel was between 60 and 100 ms long; and the full vowel when it was shorter than 60 ms. As the size of the database might affect the results, the same number of speech frames (~184 k) was selected in each database and equally distributed among the classes. This is equivalent to approximately 2 h of speech data.
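The fragment-selection rule can be summarized in a few lines of Python; this is an illustrative helper written for this description (the function name and interval representation are ours), not the exact annotation script from the repository.

```python
# Illustrative helper (names and interval representation are ours, not the
# project's annotation script): select the vowel fragment, adjacent to the
# nasal consonant, that is labelled as a nasalized vowel (NV).
def nasalized_portion(v_start: float, v_end: float, nasal_precedes: bool):
    """Return (start, end) in seconds of the fragment to label as nasalized.

    nasal_precedes is True when the nasal consonant comes before the vowel,
    so the nasalized fragment starts at the vowel onset.
    """
    duration = v_end - v_start
    if duration < 0.060:       # shorter than 60 ms: the whole vowel
        frac = 1.0
    elif duration <= 0.100:    # between 60 and 100 ms: half of the vowel
        frac = 0.5
    else:                      # 100 ms or longer: 30% of the vowel
        frac = 0.3
    span = frac * duration
    return (v_start, v_start + span) if nasal_precedes else (v_end - span, v_end)
```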

2.2. Training Databases for Experiment Two

For this experiment, we created a larger database by combining the training database described in experiment one (N = 55) with a database recorded at the University of Chile (N = 16). The Chilean speech data were recorded using the same procedure described for experiment one, and the participants were of the same age range and gender. The total number of frames in the extended database was 263 k (approximately 3 h of speech data).

2.3. Test Databases for Experiment Two

Speech data from one group of hypernasal children (N = 34) and one group of healthy children (N = 21) were recorded to create a test database. This speech database was then processed to obtain two new test databases, as shown in Figure 4.
All the children were aged between 5 and 15 years old. In the case of male participants, the speech samples were included in the database only if the fundamental frequency was 180 Hz or higher. The patients were children with hypernasal speech due to craniofacial abnormalities such as cleft palate or short velum. They belonged to two registered charity family-supporting organizations (ASAFiLAP, the Andalusian Cleft Palate Association, and the 22q11.2 Deletion Syndrome Andalusian Association). The healthy children were selected from multiple sources, including relatives or friends of the patients and staff from the University of Málaga.
The children’s recordings were obtained as part of an auditory repetition task that is routinely carried out in our laboratory to evaluate hypernasal speech. The task consists of repeating 12 words without nasal sounds: boca, pie, llave, dedo, gafas, silla, cuchara, sol, casa, pez, jaula, and zapatos.
Each of the words in the test database was scored by two experienced SLPs using the following scale:
  • 0: No evidence of hypernasal speech.
  • 1: Evidence of at least one vowel nasalized.
  • 2: Evidence of at least one consonant nasalized.
In case of disagreement between the two SLPs, a third expert scored the utterance. Then, a score was obtained per child by averaging the 12-word scores.

3. Methods

3.1. Spectral Distances

The distance between oral and nasal sounds was computed as the Euclidean distance between MFCCs. The targets of these acoustic analyses were two series of nasal and oral vowel tokens recorded simultaneously with the monophonic microphone and the Nasometer. The nasal tokens were six instances of each of the five Spanish vowel types (/a, e, i, o, u/) produced in the vicinity of a nasal sound (e.g., the “e” in the word “ten”). The oral tokens were six instances of vowels that were not in contact with nasal sounds (e.g., the vowel “e” in the word “peto”). Thus, 30 nasal tokens and 30 oral tokens were used. For each vowel type (a, e, i, o, u), the six oral tokens were compared to each of the six corresponding nasal tokens, which resulted in 36 distances per vowel type and a total of 180 distances per microphone. The MFCCs were computed using the same method described for experiments one and two, and means were calculated per vowel token.
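A minimal sketch of this distance computation is shown below. It assumes librosa is available and that each vowel token is stored in its own WAV file; for brevity it uses 13 MFCCs per frame rather than the 39-dimensional vectors used elsewhere in the study.

```python
# Sketch of the oral/nasal MFCC distance (assumes librosa; each vowel token in
# its own WAV file; 13 MFCCs used here for brevity).
import itertools
import numpy as np
import librosa

def mean_mfcc(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Token-level MFCC vector: 25 ms frames, 10 ms hop, averaged over frames."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    return mfcc.mean(axis=1)

def oral_nasal_distances(oral_paths, nasal_paths):
    """All pairwise Euclidean distances (6 oral x 6 nasal = 36 per vowel type)."""
    return [np.linalg.norm(mean_mfcc(o) - mean_mfcc(n))
            for o, n in itertools.product(oral_paths, nasal_paths)]
```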

3.2. DNN Nasality Model

The architecture of the DNN nasality model is the same as that described by Mathad and colleagues [30]. The input layer has 39 nodes, corresponding to the 39-dimensional MFCC input vector. The model comprises three hidden layers of 1024 neurons each with rectified linear unit (ReLU) activation. The output layer consists of five softmax nodes, each interpreted as a posterior probability for one of the five classes: NC, OC, NV, OV, and silence (S). The network was trained with a batch size of 128 frames, for 100 epochs, with a learning rate of 0.00001.
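The following Keras sketch reproduces this architecture as described; the framework choice and compilation details are ours, and the original implementation (available in the project repository) may differ.

```python
# Keras sketch of the architecture described above; the framework choice and
# compilation details are ours, and the authors' implementation may differ.
import tensorflow as tf

def build_nasality_dnn(n_features: int = 39, n_classes: int = 5) -> tf.keras.Model:
    """39 MFCC inputs -> 3 x (1024, ReLU) -> 5-way softmax (NC, OC, NV, OV, S)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training as described in the text: batches of 128 frames for 100 epochs, e.g.
# model = build_nasality_dnn()
# model.fit(x_train, y_train, batch_size=128, epochs=100)
```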

3.3. Design of Experiment One

A (k = 5)-fold cross-validation approach was used to train and test the DNN models. The same analysis was repeated for the three signal types: mouth, nose, and monophonic. Note that cross-validation yields five models per condition (one per fold). As the experiment was run twice, the total number of trained models was 10 per condition.
Next, we computed the confusion matrices for each model using four categories (NV, NC, OV, OC). The 10 confusion matrices of each condition were averaged to obtain three confusion matrices (for the mouth, nose, and monophonic signals). These confusion matrices were used to examine accuracy and error biases. Accuracy was examined in three ways (see the sketch below): (1) considering the four phonetic classes (NV, NC, OV, OC); (2) considering the superclasses oral and nasal; and (3) considering the superclasses vowel and consonant. As for the error biases, it is relevant that hypernasality assessment tests typically use oral sounds and count the number of nasalization errors; this means that a relatively small number of false positives (i.e., items wrongly classified as nasal) may lead to wrong results, whereas a small percentage of false negatives may have little impact. Accordingly, special attention was paid to the percentage of frames wrongly classified as nasal.
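The superclass accuracies can be obtained by collapsing the four-class confusion matrix, as in this illustrative sketch (the class ordering and helper names are assumptions made for the example):

```python
# Illustrative sketch: collapse a 4-class confusion matrix (rows = true class,
# columns = predicted class, ordered NV, NC, OV, OC) into the oral/nasal and
# vowel/consonant superclasses and compute the accuracy of each matrix.
import numpy as np

CLASSES = ["NV", "NC", "OV", "OC"]

def collapse(cm: np.ndarray, groups: dict) -> np.ndarray:
    """Sum rows and columns of cm according to a class -> superclass mapping."""
    labels = sorted(set(groups.values()))
    out = np.zeros((len(labels), len(labels)))
    for i, true_c in enumerate(CLASSES):
        for j, pred_c in enumerate(CLASSES):
            out[labels.index(groups[true_c]), labels.index(groups[pred_c])] += cm[i, j]
    return out

def accuracy(cm: np.ndarray) -> float:
    return float(np.trace(cm) / cm.sum())

nasality = {"NV": "nasal", "NC": "nasal", "OV": "oral", "OC": "oral"}
manner = {"NV": "vowel", "NC": "consonant", "OV": "vowel", "OC": "consonant"}
# acc_four       = accuracy(cm)                      # four phonetic classes
# acc_oral_nasal = accuracy(collapse(cm, nasality))  # oral vs. nasal superclasses
# acc_vow_cons   = accuracy(collapse(cm, manner))    # vowel vs. consonant superclasses
```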

3.4. Design of Experiment Two

For experiment two, DNN models trained with the adult data were evaluated with the test database described above. Following [30], the posterior probabilities produced by the DNN were post-processed to classify each frame as either nasal or oral. A frame was classified as nasal when either the posterior probability of being a nasal consonant was at least ten times larger than that of being an oral consonant, or the posterior probability of being a nasal vowel was at least ten times larger than that of being an oral vowel, as shown in Equation (1):
log10(P(NC)/P(OC)) > 1 or log10(P(NV)/P(OV)) > 1    (1)
The nasal ratio for a given speaker was computed as the number of nasal frames divided by the total number of oral and nasal frames. The Pearson correlation coefficient was then computed between the nasal ratio obtained from the DNN models and the nasality score produced by the expert SLPs. Note that in the study by Mathad et al. [30], the authors did not include the scores of healthy children in the Pearson correlations. As this information may be relevant (e.g., to identify false positives), we computed the Pearson correlation using two criteria: (1) exclusively for the hypernasal speakers in the test database; and (2) for all speakers in the test database.
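An illustrative post-processing sketch is given below; the column ordering of the posteriors and the exclusion of silence frames before computing the ratio are our assumptions, not details taken from [30].

```python
# Illustrative post-processing (the posterior column order [NC, OC, NV, OV, S]
# and the removal of silence frames are assumptions made for this sketch).
import numpy as np
from scipy.stats import pearsonr

def nasal_ratio(posteriors: np.ndarray, eps: float = 1e-12) -> float:
    """Nasal frames divided by all oral + nasal (i.e., non-silence) frames."""
    speech = posteriors[np.argmax(posteriors, axis=1) != 4]  # drop silence frames
    if len(speech) == 0:
        return 0.0
    p_nc, p_oc, p_nv, p_ov = (speech[:, i] for i in range(4))
    nasal = (np.log10((p_nc + eps) / (p_oc + eps)) > 1) | \
            (np.log10((p_nv + eps) / (p_ov + eps)) > 1)      # Equation (1)
    return float(nasal.mean())

# ratios = np.array([nasal_ratio(p) for p in per_speaker_posteriors])
# r, p_value = pearsonr(ratios, slp_scores)   # correlation with expert ratings
```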

4. Results

4.1. Euclidean Distances between Oral and Nasal Vowels

Figure 5 shows the Euclidean distances between the MFCC vectors computed for oral and nasal vowel tokens as a function of the signal type. For four of the vowels (/a, e, i, o/), the distance is the highest for the nose signal, and the distances in the mouth and monophonic signals are very similar. In the case of vowel /u/, the results are very similar in the three conditions.
To facilitate the interpretation of these results, Figure 6 illustrates how the spectra of nasal and oral vowels vary as a function of the recording device. Note that in the case of the nose signal, the nasal (red) line is above the oral line all along the spectrum. In contrast, in the case of the mouth and monophonic signals, the two lines are very close to each other.

4.2. Experiment One

Figure 7 shows the confusion matrices for the three recording conditions. Note that in each case, the results are the mean of 10 DNN models. The percentages of correct responses are almost identical in the three conditions (0.78–0.79). However, the errors differ, particularly the ratio of oral vowels wrongly classified as nasal vowels; the error is only 0.07 in the case of the nose-trained DNNs, but three times larger in the other two conditions. It might be relevant that, in the three conditions, the mean accuracy for vowels is lower than the accuracy for consonants: the vowel and consonant accuracies for the nose, mouth, and monophonic conditions are 0.74 and 0.83, 0.72 and 0.86, and 0.73 and 0.84, respectively.
Figure 8 summarizes the accuracy when considering the four classes and when considering only two of them (oral versus nasal or consonant versus vowel). When the four classes are considered, the accuracy is almost identical in all three conditions. However, with just two classes, the error patterns are not the same: for the oral/nasal class, the accuracy is higher in the nose condition than in the other two; while for the vowel/consonant class, the accuracy is higher in the mouth and monophonic conditions than in the nose condition.

4.3. Experiment Two

Figure 9 shows the Pearson correlation between the perceptual scores of the SLPs and the nasality ratios computed by two sets of DNN simulations (trained with nose and mouth signals, respectively). When only the data from patients were used, the correlation was very high and significant for the nose signal (r = 0.82) and low and non-significant for the mouth signal (r = 0.33). When the scores of both patients and healthy speakers were entered into the correlation analyses, the results were almost identical (r = 0.83 and r = 0.36). Note that the DNN models scored some healthy speakers relatively high (i.e., some of them had scores similar to or above those of patients with perceptual scores in the range of 0.5–1.3).

5. Discussion

The present study aimed to clarify the potential interest of nose signals to evaluate hypernasality automatically. To this end, nose signals were compared to mouth and monophonic signals from three perspectives: (1) the spectral distance between oral and nasal sounds; (2) the success in transmitting nasality information; and (3) the effectiveness of evaluating hypernasality automatically.

5.1. Acoustic Differences

We began by computing the Euclidean distances in MFCCs obtained from oral vowels and their nasalized counterparts in three conditions: monophonic, nose, and mouth signals. The results showed that the distances were generally larger for the nose signal than for the other two signals. The only exception to this general rule was the vowel /u/. Below we propose an explanation for this result.
An explanation for the increased distance in the nose signal can be found in the LTAS shown in Figure 6. In the nose signal, the nasal sound has notably more energy than the corresponding oral sound. In contrast, in the other two conditions, the spectral envelopes of the nasal and oral sounds are notably similar. These patterns agree with the known articulatory differences between oral and nasal sounds: oral sounds are (mainly) oral, while nasal sounds are oronasal. This means that, for the nose microphone, there will be a clear contrast between very weak oral sounds and strong nasal sounds, whereas for the mouth microphone the differences will be relatively small (because both oral and nasal sounds are easily audible). Finally, the fact that the distance is so small in the monophonic signal indicates that the mouth signal, the stronger one, has the largest impact on the characteristics of the coupled signal. In other words, it confirms that the oral signal masks the weak nasal signal.

5.2. Phonetic Differences

The results of experiment one show that, when considering four classes, the accuracy of the monophonic, mouth, and nose DNN models was almost identical. However, the models differed notably in the precise error types. Specifically, in the nose condition, nasality errors were clearly less frequent than vowel/consonant errors. In the other two conditions, the nasality errors were more frequent than the vowel/consonant errors. One explanation for the increased accuracy of the oral/nasal classification in the nose signal is that the classifier improves when the spectral distance between the classes is sufficiently clear. As shown above, the spectral distance for oral/nasal vowels is the highest in the nose signal, which would explain why accuracy increases in this case.
Another relevant result is that the ratio of oral vowels wrongly classified as nasal was three times larger in the monophonic and mouth conditions than in the nose condition. This result is relevant because, in the clinical context, hypernasal patients are typically evaluated with oral sounds; this means that false negatives may have a relatively small impact when computing hypernasality, but false positives may result in healthy speakers (and in patients who have been treated successfully) being wrongly classified as hypernasal. Unfortunately, we do not have an explanation for the observed bias.
It is also relevant that, in the three conditions, consonants were better classified than vowels. This result might be due to the criteria used to identify nasal sounds. Mathad et al. [30] classified 30% of each vowel in contact with a nasal consonant as nasal. In this study, this criterion was adapted slightly to vowel length (by increasing that percentage for shorter vowels). However, phonetic research has shown that vowel nasalization varies as a function of multiple factors, including dialect, individual speaker, syllable stress, and word position [41,42]. This means that our criterion may have led to misclassifying many vowel frames, which might explain the relatively poor results for vowel classification.
Again, the monophonic signal appears more similar to the mouth signal than to the corresponding nose signal. This result is compatible with, and explained by, the MFCC distance results: as the spectral distance between oral and nasal sounds is reduced, nasality information is transmitted less effectively. However, we are not saying that the information is absent from the monophonic signal; we merely consider that this information is better encoded in the nose signal, which implies that it is partially masked in the coupling process. This conclusion is compatible with evidence on the role of different acoustic cues in speech communication [10]. It is important to remember that the speech signal carries multiple types of phonetic information. Apart from the nasal/oral contrast, these include the consonant/vowel contrast, manner of articulation, place of articulation, voicing, and many other intonational and rhythmic characteristics. It seems reasonable to assume that the speech signal has evolved to give preference to the developmentally or communicatively most relevant information. For instance, a fundamental step in speech development consists of differentiating vowel and consonant sounds; it helps children break the speech code by separating the signal into syllables, which are a keystone for phonological and language development [43]. Also, most acoustic cues are encoded in the mouth signal. To conclude, the oral/nasal contrast is relevant, but it does not play as important a role as other cues, so it seems more efficient for humans to have a dominant mouth signal. In other words, masking the nose signal might help ensure optimal speech communication.

5.3. Automatic Assessment of Hypernasality

Experiment two aimed to clarify whether the nose signal can be used for hypernasality assessment. The results showed a high correlation between the scores obtained from DNN models trained with nose signals and the scores of human experts (r = 0.82). The correlation was low and not significant in the case of the DNN models trained with mouth signals (r = 0.33). Thus, these results confirm that the nose signal is a better candidate than the mouth signal for evaluating hypernasality. Given the results presented above, this is not surprising: if the spectral contrast improves, the DNN may learn to classify oral and nasal sounds more effectively and, for the same reason, the accuracy in classifying healthy and hypernasal speakers increases.
Given these results, it seems relevant to compare them with those of speech processing studies using monophonic signals and with Nasometer studies. As to monophonic speech processing studies, the closest example is the proposal by Mathad et al. [30]. It is relevant that Mathad et al. and the experiment described here used the same DNN model; however, there were two differences between the studies: (1) Mathad et al. trained the DNN with 100 h of monophonic speech recordings, while we used only 3 h (i.e., our speech database was about 30 times smaller); and (2) Mathad et al. used monophonic signals, while we used nose (and mouth) signals. Despite using a notably smaller training database, the correlation with human experts in experiment two was slightly higher than that in the Mathad et al. [30] study (r = 0.82 versus r = 0.80). This suggests that nose signals might be a good alternative to monophonic-based speech processing algorithms.
As for nasalance data, among the studies that report a Pearson correlation (i.e., those with a large number of participants), we find only one [32] with an r value higher than ours (r = 0.88 versus r = 0.82). All other studies obtained poorer results: 0.74 [33], 0.59 [34], and 0.55 [35]. Given the variability of nasalance data, more studies comparing different instrumental techniques are needed. However, our results suggest that analyzing nose signals might be an alternative to Nasometer measures.

6. Conclusions and Future Work

Altogether, the results of this study show the following: (1) the nose signals carry phonetic nasality information that is partially masked by the oral signal; (2) nose signals alone may serve to evaluate hypernasality. The results of this study also point to two aspects that should be further explored in the future, one related to the signal processing approach, and another related to the identification of subclasses during training. As to signal processing, it seems that in the case of nose signals, there is a clear contrast in intensity between nasal and oral sounds. It seems reasonable to assume that this may result in different patterns of temporal envelopes, particularly in the speech samples used in clinical contexts (i.e., with just oral sounds). In the case of moderately hypernasal speakers, the envelopes may show clear peaks and valleys (i.e., due to occasional nasalization); in the case of healthy speakers (i.e., no nasalization) and severe hypernasal speakers (i.e., systematic nasalization), variability might be minimized. Thus, a temporal analysis of these signals may help to separate healthy speakers from those moderately hypernasal, which are the ones for which the model used in this study was the least successful (see Figure 9).
As regards the annotation of training data, this study used relatively simple criteria to identify nasal and oral vowels, and this might explain why, independently of the signal type used, the accuracy for vowels was relatively poor. To improve these results, it might be worth exploring more sophisticated approaches to identifying nasalized vowel fragments (e.g., Nasometer data).
From a different perspective, given that the amount of data necessary to train a DNN with a nose signal is relatively small, it seems that the same approach might be replicated for multiple languages (or dialects); this would serve to clarify whether nasality can be tested with a language-universal device or, alternatively, language- (or dialect-) specific DNNs are needed.
Finally, this study may be relevant from a social perspective. To date, the instrumental method considered most reliable for evaluating hypernasality is the Nasometer. Unfortunately, this device is notably expensive and cannot be afforded by many small clinics and speech therapists around the world. Note that the high cost is partly due to the need for two balanced microphones. If it is confirmed that one microphone is sufficient to achieve optimal results in hypernasality assessment, the cost of evaluating hypernasality might be reduced significantly, making it affordable for a larger number of SLPs and clinics around the world.

Author Contributions

Conceptualization, I.M.-T. and A.L.; methodology, I.M.-T.; software, A.L.; validation, I.M.-T., A.L. and E.N.; formal analysis, I.M.-T. and E.N.; investigation, I.M.-T.; data acquisition, M.D.G.M., R.B. and J.P.; data curation, I.M.-T. and R.B.; writing—original draft preparation, I.M.-T. and A.L.; writing—review and editing, R.B.; funding acquisition, I.M.-T. and E.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Spanish Ministerio de Ciencia, Innovación y Universidades, grant number PID2021-126366OB-I00.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of the University of Málaga, protocol code 67-2023-H, 13 June 2023.

Informed Consent Statement

All the participants, or the parents/tutors in the case of minors, signed an informed consent form.

Data Availability Statement

The Python scripts as well as the databases used for the present study can be downloaded from https://github.com/Caliope-SpeechProcessingLab/Unmasking-Nasality.git (accessed on 20 November 2023).

Acknowledgments

The authors would like to thank all the participants and families for taking part in this study. We would like to acknowledge Wanda Meschian Coretti (Málaga, Spain) for her support in collecting the data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mossey, P.A.; Catilla, E.E. Global Registry and Database on Craniofacial Anomalies: Report of a WHO Registry Meeting on Craniofacial Anomalies; WHO: Geneva, Switzerland, 2003. [Google Scholar]
  2. Kummer, A.W. Disorders of Resonance and Airflow Secondary to Cleft Palate and/or Velopharyngeal Dysfunction. In Seminars in Speech and Language; Thieme Medical Publishers: New York, NY, USA, 2011. [Google Scholar]
  3. Howard, S.; Lohmander, A. Cleft Palate Speech: Assessment and Intervention; John Wiley & Sons: Hoboken, NJ, USA, 2011. [Google Scholar]
  4. Bettens, K.; Wuyts, F.L.; Van Lierde, K.M. Instrumental assessment of velopharyngeal function and resonance: A review. J. Commun. Disord. 2014, 52, 170–183. [Google Scholar] [CrossRef]
  5. Fletcher, S.G.; Sooudi, I.; Frost, S.D. Quantitative and graphic analysis of prosthetic treatment for “nasalance” in speech. J. Prosthet. Dent. 1974, 32, 284–291. [Google Scholar] [CrossRef]
  6. Gildersleeve-Neumann, C.E.; Dalston, R.M. Nasalance scores in noncleft individuals: Why not zero? Cleft Palate-Craniofacial J. 2001, 38, 106–111. [Google Scholar] [CrossRef]
  7. Liu, Y.; Lee, S.A.S.; Chen, W. The correlation between perceptual ratings and nasalance scores in resonance disorders: A systematic review. J. Speech Lang. Hear. Res. 2022, 65, 2215–2234. [Google Scholar] [CrossRef] [PubMed]
  8. Dhillon, H.; Chaudhari, P.K.; Dhingra, K.; Kuo, R.-F.; Sokhi, R.K.; Alam, M.K.; Ahmad, S. Current applications of artificial intelligence in cleft care: A scoping review. Front. Med. 2021, 8, 676490. [Google Scholar] [CrossRef] [PubMed]
  9. Oren, L.; Rollins, M.; Padakanti, S.; Kummer, A.; Gutmark, E.; Boyce, S. Using high-speed nasopharyngoscopy to quantify the bubbling above the velopharyngeal valve in cases of nasal rustle. Cleft Palate-Craniofacial J. 2020, 57, 637–645. [Google Scholar] [CrossRef]
  10. Rosen, S. Temporal information in speech: Acoustic, auditory and linguistic aspects. Philos. Trans. R. Soc. London Ser. B Biol. Sci. 1992, 336, 367–373. [Google Scholar]
  11. Rabiner, L.; Schafer, R. Theory and Applications of Digital Speech Processing; Prentice Hall Press: Hoboken, NJ, USA, 2010. [Google Scholar]
  12. Stevens, K.N. Acoustic Phonetics; MIT Press: Cambridge, MA, USA, 2000; Volume 30. [Google Scholar]
  13. Miller, G.A.; Nicely, P.E. An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Am. 1955, 27, 338–352. [Google Scholar] [CrossRef]
  14. Tejedor-García, C.; Cardeñoso-Payo, V.; Escudero-Mancebo, D. Automatic speech recognition (ASR) systems applied to pronunciation assessment of L2 Spanish for Japanese speakers. Appl. Sci. 2021, 11, 6695. [Google Scholar] [CrossRef]
  15. Kummer, A.W. Evaluation of Speech and Resonance for Children with Craniofacial Anomalies. Facial Plast. Surg. Clin. N. Am. 2016, 24, 445–451. [Google Scholar] [CrossRef] [PubMed]
  16. Grunwell, P.; Brondsted, K.; Henningsson, G.; Jansonius, K.; Karling, J.; Meijer, M.; Ording, U.; Wyatt, R.; Vermeij-Zieverink, E.; Sell, D. A six-centre international study of the outcome of treatment in patients with clefts of the lip and palate: The results of a cross-linguistic investigation of cleft palate speech. Scand. J. Plast. Reconstr. Surg. Hand Surg. 2000, 34, 219–229. [Google Scholar] [PubMed]
  17. Henningsson, G.; Kuehn, D.P.; Sell, D.; Sweeney, T.; Trost-Cardamone, J.E.; Whitehill, T.L. Universal parameters for reporting speech outcomes in individuals with cleft palate. Cleft Palate-Craniofacial J. 2008, 45, 1–17. [Google Scholar] [CrossRef] [PubMed]
  18. Sell, D.; John, A.; Harding-Bell, A.; Sweeney, T.; Hegarty, F.; Freeman, J. Cleft Audit Protocol for Speech (CAPS-A): A comprehensive training package for speech analysis. Int. J. Lang. Commun. Disord. 2009, 44, 529–548. [Google Scholar] [CrossRef] [PubMed]
  19. Spruijt, N.E.; Beenakker, M.B.; Verbeek, M.B.; Heinze, Z.C.B.; Breugem, C.C.; van der Molen, A.B.M. Reliability of the Dutch cleft speech evaluation test and conversion to the proposed universal scale. J. Craniofacial Surg. 2018, 29, 390–395. [Google Scholar] [CrossRef] [PubMed]
  20. Orozco-Arroyave, J.R.; Arias-Londoño, J.D.; Vargas-Bonilla, J.F.; Nöth, E. Automatic detection of hypernasal speech signals using nonlinear and entropy measurements. In Proceedings of the Thirteenth Annual Conference of the International Speech Communication Association, Portland, OR, USA, 9–13 September 2012. [Google Scholar]
  21. Golabbakhsh, M.; Abnavi, F.; Elyaderani, M.K.; Derakhshandeh, F.; Khanlar, F.; Rong, P.; Kuehn, D.P. Automatic identification of hypernasality in normal and cleft lip and palate patients with acoustic analysis of speech. J. Acoust. Soc. Am. 2017, 141, 929–935. [Google Scholar] [CrossRef]
  22. Vikram, C.M.; Tripathi, A.; Kalita, S.; Prasanna, S.R.M. Estimation of Hypernasality Scores from Cleft Lip and Palate Speech. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018. [Google Scholar]
  23. Lee, G.-S.; Wang, C.-P.; Yang, C.; Kuo, T. Voice low tone to high tone ratio: A potential quantitative index for vowel [a:] and its nasalization. IEEE Trans. Biomed. Eng. 2006, 53, 1437–1439. [Google Scholar] [CrossRef]
  24. He, L.; Zhang, J.; Liu, Q.; Yin, H.; Lech, M.; Huang, Y. Automatic evaluation of hypernasality based on a cleft palate speech database. J. Med. Syst. 2015, 39, 61. [Google Scholar] [CrossRef]
  25. Vali, M.; Akafi, E.; Moradi, N.; Baghban, K. Assessment of hypernasality for children with cleft palate based on cepstrum analysis. J. Med. Signals Sens. 2013, 3, 209. [Google Scholar] [CrossRef]
  26. Mirzaei, A.; Vali, M. Detection of hypernasality from speech signal using group delay and wavelet transform. In Proceedings of the 2016 6th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 13–14 October 2016; IEEE: Piscataway, NJ, USA, 2016. [Google Scholar]
  27. Dubey, A.K.; Tripathi, A.; Prasanna, S.R.M.; Dandapat, S. Detection of hypernasality based on vowel space area. J. Acoust. Soc. Am. 2018, 143, EL412–EL417. [Google Scholar] [CrossRef]
  28. Wang, X.; Yang, S.; Tang, M.; Yin, H.; Huang, H.; He, L. HypernasalityNet: Deep recurrent neural network for automatic hypernasality detection. Int. J. Med. Inform. 2019, 129, 1–12. [Google Scholar] [CrossRef]
  29. John, A.; Sell, D.; Sweeney, T.; Harding-Bell, A.; Williams, A. The cleft audit protocol for speech—Augmented: A validated and reliable measure for auditing cleft speech. Cleft Palate-Craniofacial J. 2006, 43, 272–288. [Google Scholar] [CrossRef] [PubMed]
  30. Mathad, V.C.; Scherer, N.; Chapman, K.; Liss, J.M.; Berisha, V. A deep learning algorithm for objective assessment of hypernasality in children with cleft palate. IEEE Trans. Biomed. Eng. 2021, 68, 2986–2996. [Google Scholar] [CrossRef] [PubMed]
  31. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA, 2015. [Google Scholar]
  32. Khwaileh, F.A.; Alfwaress, F.S.D.; Kummer, A.W.; Alrawashdeh, M. Validity of test stimuli for nasalance measurement in speakers of Jordanian Arabic. Logop. Phoniatr. Vocology 2018, 43, 93–100. [Google Scholar] [CrossRef] [PubMed]
  33. Sweeney, T.; Sell, D. Relationship between perceptual ratings of nasality and nasometry in children/adolescents with cleft palate and/or velopharyngeal dysfunction. Int. J. Lang. Commun. Disord. 2008, 43, 265–282. [Google Scholar] [CrossRef]
  34. Brancamp, T.U.; Lewis, K.E.; Watterson, T. The relationship between nasalance scores and nasality ratings obtained with equal appearing interval and direct magnitude estimation scaling methods. Cleft Palate-Craniofacial J. 2010, 47, 631–637. [Google Scholar] [CrossRef]
  35. Keuning, K.H.; Wieneke, G.H.; Van Wijngaarden, H.A.; Dejonckere, P.H. The correlation between nasalance and a differentiated perceptual rating of speech in Dutch patients with velopharyngeal insufficiency. Cleft Palate-Craniofacial J. 2002, 39, 277–284. [Google Scholar] [CrossRef]
  36. Bruyninckx, M.; Harmegnies, B.; Llisterri, J.; Poch-Olivé, D. Language-induced voice quality variability in bilinguals. J. Phon. 1994, 22, 19–31. [Google Scholar] [CrossRef]
  37. Ortega-Garcia, J.; Gonzalez-Rodriguez, J.; Marrero-Aguiar, V. AHUMADA: A large speech corpus in Spanish for speaker characterization and identification. Speech Commun. 2000, 31, 255–264. [Google Scholar] [CrossRef]
  38. Martínez-Celdrán, E.; Fernández-Planas, A.M.; Carrera-Sabaté, J. Castilian Spanish. J. Int. Phon. Assoc. 2003, 33, 255–259. [Google Scholar] [CrossRef]
  39. Boersma, P. Praat, a system for doing phonetics by computer. Glot Int. 2001, 5, 341–345. [Google Scholar]
  40. McAuliffe, M.; Socolof, M.; Mihuc, S.; Wagner, M.; Sonderegger, M. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017. [Google Scholar]
  41. Bongiovanni, S. Acoustic investigation of anticipatory vowel nasalization in a Caribbean and a non-Caribbean dialect of Spanish. Linguist. Vanguard 2021, 7, 20200008. [Google Scholar] [CrossRef]
  42. Planas, A.M.F. A study of contextual vowel nasalization in standard peninsular Spanish. Onomázein 2020, 49, 225–256. [Google Scholar] [CrossRef]
  43. Vihman, M.M. Phonological Development: The Origins of Language in the Child; Blackwell Publishing: Hoboken, NJ, USA, 1996. [Google Scholar]
Figure 1. Long-term Average Spectrum of nasalized vowel /ẽ/ (red lines) and oral vowel /e/ (black lines), recorded simultaneously with three microphones.
Figure 2. Overview of DNN model used to assess hypernasal speech (see [30]).
Figure 3. Training databases used in experiment one. The Nasometer (top left) records two-channel signals that were used to create three train databases. A conventional monophonic microphone was used to create the fourth database.
Figure 4. Test databases for experiment two.
Figure 5. Euclidean distances between MFCCs of oral and nasal vowels recorded with nose, mouth, and monophonic microphones.
Figure 6. Long-Term Average Spectrum (LTAS) of nasal (red) and oral (black) vowels /a/ produced by a young female adult. Each line (red and black) corresponds to the average of six vowel tokens.
Figure 7. Confusion matrices for the three recording conditions and global accuracy (darker colour means higher value). Each confusion matrix is obtained from ten train/test simulations with the same speech data but a different recording device.
Figure 8. Global accuracy by considering four classes (oral consonant, oral vowel, nasal consonant, and nasal vowel), and accuracy for only two classes (nasal versus oral and consonant versus vowel). Each simulation was run 10 times.
Figure 9. Pearson correlation between perceptual ratings and DNN scores computed with nose (left) and mouth (right) signals (***: Nose signal has the correlation r = 0.83 with p < 0.00001). Blue crosses: healthy children. Red dots: hypernasal patients.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
