Article

Estimation of the Underlying F0 Range of a Speaker from the Spectral Features of a Brief Speech Input

1 Department of Linguistics, McGill University, Montreal, QC H3A 1A7, Canada
2 Department of Information Science, Beijing Language and Culture University, 15 Xueyuan Road, Haidian District, Beijing 100083, China
3 Smart Platform Product Department, Tencent Technology Co., Ltd., Beijing 100193, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(13), 6494; https://doi.org/10.3390/app12136494
Submission received: 20 May 2022 / Revised: 20 June 2022 / Accepted: 24 June 2022 / Published: 27 June 2022
(This article belongs to the Special Issue Machine Learning for Language and Signal Processing)

Featured Application

This work could be used in phonetic analysis from a speaker’s limited speech. For example, in helping with the speech analysis of new users for a language learning application.

Abstract

From a very brief stretch of speech, human listeners can estimate the pitch range of the speaker and normalize pitch perception. Spectral features, which inherently involve both articulatory and phonatory characteristics, have been speculated to play roles in this process, but few have been reported to correlate directly with a speaker’s F0 range. To mimic this human auditory capability and validate the speculation, in a preliminary study we proposed an LSTM-based method to estimate a speaker’s F0 range from a 300 ms-long speech input, which outperformed the conventional method. Through two further experiments, this study improves the method and verifies its validity in estimating the speaker-specific underlying F0 range. After incorporating a novel measurement of F0 range and a multi-task training approach, Experiment 1 showed that the refined model gave more accurate estimates than the initial model. Based on a Japanese-Chinese bilingual parallel speech corpus, Experiment 2 found that the F0 ranges estimated with the model from the Chinese speech and from the Japanese speech produced by the same set of speakers showed no significant difference, whereas the conventional method showed a significant difference. The results indicate that the proposed spectrum-based method captures the speaker-specific underlying F0 range, which is independent of the linguistic content.

1. Introduction

In speech communication, pitch, defined as the auditory sensation of a sound on a scale from low to high [1], plays key roles in conveying various linguistic, paralinguistic and nonlinguistic information. Although fundamental frequency (F0) is the primary acoustic correlate of pitch, pitch perception is determined not simply by the absolute F0 value, but by the relative height of the F0 value in the entire F0 range of the particular speaker. For example, a given F0 value may correspond to a lower pitch target in the speech of a female speaker, or a higher pitch target in the speech of a male speaker. In other words, pitch perception involves a normalization within the underlying F0 range of the particular speaker [2].
The normalization process involving F0 range estimation is automatic in the human auditory system. Many recent studies have shown that human listeners have a high capability for pitch normalization by estimating the underlying F0 range of the speaker from a very brief speech [3,4,5,6,7,8]. A detailed discussion of these studies is included in Section 1.3.
Likewise, in speech technologies such as automatic speech recognition/understanding and automatic assessment of pronunciation, F0 normalization is very important and hence an automatic estimation of the speaker-specific F0 range is necessary, especially for tone languages where F0 plays more important roles than in non-tone languages [9,10]. However, in most previous studies a calculation of F0 range is accomplished by a direct statistical analysis of F0 variation from a lengthy speech input [10,11,12,13], which apparently does not coincide with the way human listeners estimate F0 range from a very brief speech input. To mimic the auditory mechanism of human listeners, a recent study of ours [14] proposed a model for estimating F0 range from spectral features. As a follow-up study, this paper proposes a refined model and further investigates the validity of the method in estimating the speaker-specific underlying F0 range.

1.1. F0 Range and F0 Normalization

When we refer to the ‘F0 range’ of a particular speaker, there can be diverse meanings. In clinical studies, there is a need to detect the F0 range of a given speaker to diagnose whether the patient has a voice disorder. To this end, [15] defined two types of F0 range: (a) The maximum phonational F0 range (MPFR), indicating the maximum F0 range that a speaker is capable of producing and sustaining with normal phonation (including modal and falsetto registers); (b) The range of speaking F0 (SF0), reflecting a speaker’s F0 range adopted in his/her running speech. The SF0 is expected to be at a roughly predictable location in the MPFR of the same speaker. Using whichever type of F0 range, pathologists are able to diagnose vocal fold lesions if the F0 range deviates from its typical setting.
Of these two types of F0 range, the range of speaking F0 (SF0) is of greater concern in linguistic and paralinguistic studies of running speech. [15] did not further clarify the definitions of speaking F0 range, but as a matter of fact there are two ways to define the term. In many cases, the term represents the surface F0 range in the present sample of the speaker’s running speech, and thus it can simply be measured by the difference between the minimum and maximum F0 values observed in the speech sample. In other cases, the term refers to the underlying speaking F0 range of a speaker, which is usually not fully displayed in the present speech sample, but is instead a characteristic of the speaker-specific use of vocal vibration [16,17].
In comparison, the surface F0 range is determined simultaneously by the linguistic factors associated with the content of the speech (e.g., in Mandarin, a syllable of the high-falling tone T4 has a wider F0 range than one of the high-level tone T1), the paralinguistic factors associated with the emotional/attitudinal state or speaking style of the speaker, and the nonlinguistic factors associated with the speaker-specific physiological traits (e.g., gender, age, and health condition) [9,18,19,20,21,22,23,24,25], whereas the underlying speaking F0 range is attributed only to the speaker-specific nonlinguistic (mostly physiological) factors, regardless of what content the speaker says and in what state/style he or she says it. Apparently, the larger and the more variable the speech sample (i.e., the more linguistic and paralinguistic variation it contains), the better the measured surface F0 range approximates the underlying speaking F0 range.
In linguistic/paralinguistic studies and speech technologies, there is often a need to estimate the underlying speaking F0 range of a given speaker for the purpose of pitch normalization. Because F0 contours of speech convey not only linguistic but also para/non-linguistic information, a decoding of the phonological pitch targets as well as a decoding of the emotional/attitudinal states needs to rule out the F0 variation attributed to individual differences. This process of normalization relies on an effective estimation of the speaker-specific underlying speaking F0 range. Although pitch normalization is effortless for human listeners, the internal perceptual mechanism has yet to be disclosed, and an efficient algorithm for F0 range estimation has yet to be developed for speech technologies.

1.2. Traditional Methods of F0 Range Estimation

There are different methods of estimating a speaker’s F0 range, depending on the type of definition of F0 range, and on the phonetic or phonological point of view.
The maximum phonational F0 range (MPFR), which is often used in clinical studies, is usually obtained by a ‘tone sweep’ method that starts from the speaker’s comfortable F0 level and then sweeps up or down to reach his/her highest or lowest sustainable F0s [3,15,26].
In linguistic/paralinguistic studies and speech technologies, estimation of the underlying speaking F0 range (for short, henceforth the underlying F0 range) is more meaningful, and thus is the aim of the present study. The simplest method to estimate the underlying F0 range, however, is a direct analysis of the surface F0 range. There are both acoustic-phonetic and phonological approaches to do this.
The most common acoustic-phonetic approach is the Long-Term Distributional (LTD) method, which calculates the robust maximum and minimum F0 values from a lengthy speech sample—the surface F0 range derived this way is then used to estimate the underlying F0 range. As for the length of speech needed to validate the LTD method, a review in [15] suggested that an English speech sample of 14 s, or two to four sentences, resulted in an error of estimate below 3 Hz, while a longer speech sample did not give significantly higher accuracy.
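As a concrete illustration of the LTD idea, the following sketch computes a robust surface F0 range from a lengthy sample by trimming the extremes with percentile cutoffs; the particular cutoff values and the semitone span measure are our own illustrative choices, not parameters from [15].

```python
import numpy as np

def ltd_f0_range(f0_hz, lo_pct=2.0, hi_pct=98.0):
    """Robust surface F0 range from a lengthy sample, in the LTD spirit.

    f0_hz: array of frame-level F0 values in Hz (0 or NaN for unvoiced frames).
    The percentile cutoffs are illustrative; they trim octave errors and
    tracking glitches instead of taking the raw min/max.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz[np.isfinite(f0_hz) & (f0_hz > 0)]   # drop unvoiced frames
    f0_min = np.percentile(voiced, lo_pct)
    f0_max = np.percentile(voiced, hi_pct)
    span_semitones = 12.0 * np.log2(f0_max / f0_min)   # span on a log scale
    return f0_min, f0_max, span_semitones
```

With enough voiced frames, the trimmed extrema are far less sensitive to single mis-tracked frames than the raw minimum and maximum.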
The phonological approach associates the maximum and minimum F0s with the phonological pitch targets, including sentence-initial peaks (H), accent peaks (M), post-accent valleys (L) and sentence-final lows (Fmin) [27,28,29]. The F0 range is then derived from the F0 values of these phonological targets: F0 level is represented by the F0 value of Fmin, and F0 span is denoted by the F0 difference between H and L.
Although the phonological approach was reported to correlate better with human perceptual characteristics than the LTD method [28], it requires annotations based on intonational phonology, and hence is more difficult to implement than the LTD method. Therefore, LTD is more widely used in linguistic/paralinguistic and clinical studies, and various algorithms have been proposed to improve the robustness of LTD [10,11,12].

1.3. Pitch Range Perception from a Brief Speech Input

Over the years, there have been ample studies to explore how human listeners identify the pitch range of a given speaker. Similarly to the speaker-normalization mechanism in vowel perception [30,31], pitch perception also involves speaker normalization [32,33,34]. That is, a sufficient exposure to the speech samples of a given speaker (i.e., with a precursor phase) leads listeners to estimate the F0 range of the speaker so as to facilitate pitch perception in subsequent speech input. All the above-reviewed methods of F0 range estimation coincide with the general assumption that a precursor phase is necessary for estimating the underlying F0 range of a particular speaker.
Nevertheless, studies on tone recognition found that lexical tones in tone languages could be well recognized without any context information or familiarity with the speaker. This tendency has been reported both in Mandarin [6,35,36] and in Cantonese studies [8,34]. Specifically, tone recognition can even be accomplished from a part of the syllable without the presence of a complete F0 profile. For example, [37] showed that from a “silent-center” Mandarin syllable in which the majority of F0 contour was missing, tone could be recognized as accurately as from a complete syllable. [6] found that from a fricative and the initial six glottal voicing periods of a Mandarin syllable, native listeners could still distinguish high-onset tones from low-onset tones accurately.
Similar results were also observed in non-tone languages. For example, [3] reported that given an English speech sample as short as 500 ms, listeners were able to identify the pitch level within the speaker’s F0 range with some accuracy.
These studies indicate that the pitch range of an unfamiliar speaker can be roughly inferred by human listeners from a brief speech input. Because a brief speech sample cannot cover the speaker’s full range of F0 variation, it is speculated that speech signals carry cues to pitch range other than F0. Speech signals encode the characteristics of vocal fold vibration (source) and vocal tract articulation (filter) [38], both of which reflect speaker traits such as gender, health condition, height and weight [9,18,19,20,21,22,23,24,25]. The source and filter characteristics are not fully independent of each other but are correlated to some extent on a physiological basis. For example, due to the general gender difference in body size, male speakers tend to have both longer, thicker vocal folds and longer vocal tracts than female speakers; the former results in a generally lower F0 in males and the latter leads to generally lower formant frequencies in males [9,38]. Thus, when F0 is not adequately accessible, listeners may estimate or infer pitch range partly from the filter information. When exposed to a brief speech sample perceived as coming from a female speaker, listeners would expect a higher F0 range.
In the literature on speech and singing studies, modulation of spectral features has shown effects on pitch perception [39,40,41,42]. Among others, spectral locus [40], spectral centroid [42], and voice quality defined as the variability in the spectrum [3,7,43] have proven effects. Especially, the covariation between F0 and voice quality has been reported, e.g., a creaky voice or vocal fry is usually associated with the lower F0 range of a speaker, whereas a tense voice or falsetto is usually associated with the higher F0 range [7] (p. 201). Higher levels of jitter and shimmer are sometimes associated with the lower F0 [44,45].
Although phonation type or voice quality facilitates pitch perception, only a few specific parameters have been found to correlate directly with the perceptual judgment of pitch range. [7] reported that spectral balance tilt was an effective cue for pitch range, whereas other studies [3,4,6] showed that H1-H2, H1-A3, H1-H3 and F1 bandwidth had at most modest effects on the perceptual judgment of pitch range. This makes it hardly possible to estimate F0 range directly from voice quality.
Apparently, both F0 profiles and spectral features are signal-intrinsic cues of pitch range. Apart from these, the signal-extrinsic factors that contribute to pitch range perception have also been discussed [3,4,6,8]. Listeners may apply their prior knowledge about human voice to facilitate speech perception. It was speculated that listeners calibrate the perceived pitch according to the prototypes of F0 ranges that they have heard throughout their lives. It could also be possible that a set of ‘group ranges’, each corresponding to a group of speakers with a particular demographic nature (including gender, age, and so on), has been established in the listeners’ memory, and pitch normalization is conducted within the group that the unfamiliar speaker belongs to [3].

1.4. A Method for F0 Range Estimation from a Brief Speech Input

Unlike human listeners who are able to estimate the underlying pitch range of a speaker from a very brief speech, the traditional method of automatic estimation of the underlying F0 range in terms of a direct analysis of F0 profiles works only on a lengthy speech input from which the underlying F0 range can be well approximated by the surface F0 range (e.g., at least 14 s in English, according to [15]).
Based on the previous finding that spectral features contributed to human listeners’ pitch range perception, [14] proposed a method for automatic estimation of a speaker’s F0 range from the spectral features of a very brief speech. The method used the long short-term memory (LSTM) technique to map a set of FBANKs (logarithmic mel-filter bank coefficients) from a brief speech to two parameters, i.e., mean and standard deviation of F0, corresponding respectively to the average height and the span of F0 variation as defined in previous studies [27]. Due to its high ability of online adaptive estimation, LSTM was chosen to mimic the adaptive process of human auditory perception—pitch range estimation gets more accurate when more speech is fed in. The duration of the speech input was encoded in the net depth of the LSTM model. Experiments showed that an accurate estimate could be obtained when the speech input was 300 ms long, and the accuracy did not increase significantly with a longer speech input [14]. Most importantly, the proposed model gave a more accurate estimate than the traditional LTD method in the condition of a brief speech input.

1.5. The Motivation of the Current Study

Ref. [14] proposed for the first time that the F0 range of a speaker could be estimated from a set of spectral parameters of a very brief speech signal using the LSTM-based model. The model can in a sense mimic the human auditory mechanism of pitch perception. Nevertheless, there are still questions or flaws about the model, which motivate the current study.
For one thing, the measurement of F0 range with two parameters (the mean and standard deviation of F0) adopted in [14] is still debatable. This measurement is inherently associated with the assumption that the distribution of F0 is symmetric about the mean F0, as illustrated in Figure 1a, which however does not always coincide with the actual situation.
Previous studies found that the distribution of F0 above and below the mean was not really symmetric, with the upper interval wider than the lower interval [44]. In [26], results from a passage reading showed that the distance between mean F0 and floor F0 was smaller than that between ceiling F0 and mean F0. Moreover, the spectral characteristics for the upper and lower limits of F0 range were different. As reviewed in [7], falsetto and tense voice are associated with the perceptual upper limit of pitch, whereas creaky voice is associated with the perceptual lower limit of pitch. These suggest that quantifying the F0 span simply with a standard deviation is not adequate. Therefore, we further modified the model by substituting the estimation of the ceiling and floor F0s as shown in Figure 1b for the estimation of the standard deviation as shown in Figure 1a. In this way, there are three parameters in total which need to be estimated.
For another, it is unclear whether the model inherently estimated the surface F0 range of the present sample of running speech, or the speaker-specific underlying F0 range regardless of the linguistic and paralinguistic information conveyed in the present speech sample. In other words, it needs to be tested whether the estimated F0 range is closer to the surface F0 range or to the underlying F0 range.
Since the underlying F0 range reflects the physiological traits of the speaker, it is deemed to be independent of the language being spoken. In contrast, the literature has shown that the surface F0 range of speech may vary systematically with language, depending on the phonological nature of the spoken language. For example, [26] found that the native speakers of Mandarin and English differed significantly in the speaking F0 range. [17] reported that English speech had a wider speaking F0 range than German speech. Moreover, [46,47] revealed that a speaker tended to have differing F0 manifestations when speaking different languages.
Therefore, a way to answer the question is to test the model on bilingual speech data from the same speaker to find out whether the estimated F0 ranges vary with the language being spoken.
With the above two motivations, we designed two experiments: the first to refine the model setting, and the second to test whether the model estimates the underlying F0 range.

2. Experiment 1: The Refined Model

2.1. Corpora

As reviewed in the Introduction, signal-extrinsic factors can also affect pitch range perception. Specifically, listeners’ hearing experience or the templates in their memory are supposed to facilitate pitch perception. Taking the model as a ‘listener’, one way to enhance its ability is to give it more ‘experience’ or ‘templates’. Thus, the data of more speakers should be included in the training stage. To increase the diversity of the training speech data, we used the following three corpora, collected on different platforms:
  • Chinese National Hi-Tech Project 863 Corpus, with 110 h of speech data from 166 speakers [48].
  • Open-source Mandarin Speech AISHELL Corpus, with 178 h of speech data from 400 speakers [49].
  • Open-source Mandarin Speech THCHS-30 Corpus, with 30 h of speech data from 60 speakers [50].
To reduce computational complexity, we randomly selected a quarter of data from each speaker in the three corpora to train the model. We took 90% of the data as the training set, and the remaining 10% as the test set.

2.2. Refined Model Setup

We used three parameters, i.e., mean, ceiling, and floor, to describe a speaker’s F0 range. The ceiling and floor were defined, respectively, as the 95th and 5th percentiles of the logF0 values over all speech data of a given speaker. The raw F0s were extracted using Praat [51].
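The three target parameters can be computed from a speaker’s pooled F0 data as in the minimal sketch below; whether the mean is also taken in the log domain is our assumption here, made so that all three parameters live on the same scale.

```python
import numpy as np

def f0_range_params(f0_hz):
    """Mean, ceiling, and floor of a speaker's F0 range.

    f0_hz: all voiced frame-level F0 values (Hz) pooled over the speaker's
    speech data. Ceiling and floor are the 95th and 5th percentiles of
    logF0, as defined in Section 2.2; computing the mean in the log
    domain as well is an assumption of this sketch.
    """
    logf0 = np.log(np.asarray(f0_hz, dtype=float))
    return {
        "mean": logf0.mean(),
        "ceiling": np.percentile(logf0, 95.0),
        "floor": np.percentile(logf0, 5.0),
    }
```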
Previous studies have shown that for multiple related tasks the Multi-Task Learning (MTL) algorithm works better than training each task separately. The multi-task learning structure contains one or more shared hidden layers that are connected to all tasks [52]. Thus, the commonalities across the tasks can be captured by the shared hidden layers, while the differences among the tasks can be captured by the task-specific layers. Moreover, training multiple related tasks jointly in one model is more efficient than training them in isolation. The MTL method has been commonly used in speech technologies [53,54]. In this refined model, we adopted the MTL-LSTM, whose structure is shown in Figure 2, to estimate the three parameters (i.e., mean, ceiling, and floor) of the F0 range jointly. There was one shared LSTM layer and one task-specific LSTM layer per task (each consisting of 64 units), after which a Dense layer (using linear activation) was added.
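A minimal Keras sketch of this architecture might look as follows; the input shape (30 time steps × 43 dimensions, i.e., 40 FBANKs + 3 F0 features) follows the setup described in this section, and details the paper does not state (e.g., `return_sequences` on the shared layer) are our assumptions.

```python
# Sketch of the MTL-LSTM in Figure 2: one shared LSTM layer feeding three
# task-specific LSTM layers (64 units each), each topped by a linear Dense
# output for mean, ceiling, and floor. Layer wiring beyond what Section 2.2
# states is assumed, not taken from the paper.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(30, 43))              # 30 x 10 ms frames, 40 FBANK + 3 F0
shared = layers.LSTM(64, return_sequences=True)(inputs)   # shared hidden layer

outputs = []
for name in ("mean", "ceiling", "floor"):
    task = layers.LSTM(64)(shared)                # task-specific LSTM layer
    outputs.append(layers.Dense(1, activation="linear", name=name)(task))

model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mape")   # MAPE criterion, as in Section 2.2
```

The shared layer captures commonalities across the three estimation tasks, while each task-specific branch can specialize, e.g., toward the tense-voice cues of the ceiling versus the creaky-voice cues of the floor.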
The model was implemented on the Keras platform [55]. Because [14] showed that the FBANKs + F0 features attained the best estimate of F0 range, in this experiment we again used 40-dimensional FBANKs and 3-dimensional F0 parameters as the input features, all extracted using the Kaldi toolkit [56]. Following [14], the net depth of the MTL-LSTM was set at 30 time steps, corresponding to the critical speech length of 300 ms. The model was fit with a batch size of 32 over 50 epochs. Thirty percent of the training data were split off for validation. The optimizer was rmsprop. For the error criterion, we used the mean absolute percentage error (MAPE), defined as MAPE = (1/N) ∑_{i=1}^{N} |(y_i − ŷ_i)/y_i| × 100, where N is the total number of speech samples, y_i is the actual value, and ŷ_i is the estimated value.
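For reference, the MAPE criterion above amounts to a few lines of NumPy:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, as defined in Section 2.2:
    the mean of |(y_i - yhat_i) / y_i| over all samples, times 100."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0
```

Because the error is relative to each actual value, a fixed absolute error counts for less on higher targets, which suits parameters spanning different magnitudes.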

2.3. Results

Table 1 shows the results of F0 range estimation with the refined model. The reported best MAPE of the mean F0 in [14] was 2.3%. A comparison with that of the refined model, i.e., 2.15%, indicates that the refined model works better than the initial model.

3. Experiment 2: F0 Range Estimation from a Bilingual Parallel Corpus

To verify whether the F0 range estimated with the proposed model is closer to the surface F0 range or to the underlying F0 range, we employed a Japanese-Chinese bilingual parallel speech corpus to compare the parameters estimated from the Japanese (native language, henceforth L1) and Chinese (second language, henceforth L2) utterances produced by the same set of speakers. If the estimates from the L1 and L2 speech data differ significantly, the surface F0 range is favored; otherwise, the underlying F0 range is supported.

3.1. Speech Data

The Conversational Chinese 301 Corpus contained 301 Chinese sentences designed for the purpose of learning Chinese as a second language (Kang and Lai, 2007). After the Chinese texts were translated (with slight adaptation) into Japanese, a parallel Japanese-Chinese speech corpus was collected from 18 native speakers of Japanese (nine female, nine male) who learned Chinese as a second language (Cao et al., 2010). Each speaker was asked to produce the 301 Japanese sentences and the 301 Chinese sentences.

3.2. Estimation from the L1 and L2 Speech by the Spectrum-Based Model and Direct F0 Analysis

For each speaker, we randomly chose 20 pairs of Japanese and Chinese parallel utterances longer than 10 s. Each utterance was divided into 300 ms-long speech chunks and then was tested with the refined model. Thus, for each speaker, the refined model gave two F0 range estimations (ceiling, mean, and floor, averaged from all 300 ms results). One was from the Chinese input, and the other from the Japanese input.
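The per-utterance procedure can be sketched as follows; `estimate_chunk` is a hypothetical stand-in for the trained model’s prediction call, and dropping trailing frames short of a full 300 ms chunk is our assumption about how partial chunks are handled.

```python
import numpy as np

def estimate_utterance(features, estimate_chunk, chunk_len=30):
    """Average per-chunk F0 range estimates over one utterance.

    features: (n_frames, n_dims) array of 10 ms frame-level features.
    estimate_chunk: stand-in for the trained model's predict call
    (hypothetical name); maps a (chunk_len, n_dims) chunk to a
    (mean, ceiling, floor) triple. Trailing frames that do not fill
    a whole chunk are dropped.
    """
    n_chunks = features.shape[0] // chunk_len
    ests = [estimate_chunk(features[i * chunk_len:(i + 1) * chunk_len])
            for i in range(n_chunks)]
    return np.mean(np.asarray(ests), axis=0)   # averaged (mean, ceiling, floor)
```

Averaging over chunks mirrors the adaptive idea in Section 1.4: each 300 ms chunk yields an estimate, and pooling them stabilizes the per-speaker result.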
To find out whether the surface F0 ranges differed between the Japanese and Chinese speech of these speakers, we also calculated two F0 range estimations (ceiling, mean, and floor) for each speaker by direct F0 analysis: one from the 20 L1 Japanese utterances and the other from the 20 L2 Chinese utterances. The direct F0 analysis was conducted with Praat [51]. If the speakers used different surface F0 ranges in their Japanese and Chinese utterances, the F0 ranges derived by direct F0 analysis in the two languages should differ.
Meanwhile, for each speaker we calculated (or approximated) the ‘actual values’ of the three parameters by a direct F0 analysis from all the 40 utterances (including 20 Japanese and 20 Chinese).

3.3. Results and Discussion

The results of estimates from the bilingual parallel speech corpus using the refined spectrum-based model and direct F0 analysis are compared in Table 2. Both the means and the standard deviations of the estimates over all speakers are shown.
A comparison between DFA-J and DFA-C indicates that these speakers used different surface F0 ranges in their Japanese and Chinese speech. Linear mixed effect regression (LMER) models were used to compare the estimations from DFA-J and DFA-C (LMER syntax (taking ceiling as an example): predicted_ceiling ~ Language + (1|speaker). Significance level: 95% (for all LMER models in this study)). Results showed that both the ceiling and the mean F0 were significantly higher in the L1 Japanese utterances than in the L2 Mandarin utterances (ceiling: estimate = 0.028, SE = 0.012, p = 0.036; mean: estimate = 0.027, SE = 0.008, p = 0.003), while the difference in the floor F0 was marginally significant (estimate = 0.012, SE = 0.006, p = 0.06). This finding coincides with reports in previous studies that bilinguals often show narrower SF0 ranges in their L2 [46,47].
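For readers who wish to reproduce this style of analysis in Python, a random-intercept LMER like predicted_ceiling ~ Language + (1|speaker) can be fit with statsmodels’ MixedLM, where the (1|speaker) term corresponds to the `groups` argument; the paper does not state which software was used, and the data below are synthetic placeholders, not the study’s measurements.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: 18 speakers, each with one ceiling estimate
# per language; a small language effect plus noise is injected.
rng = np.random.default_rng(0)
speakers = np.repeat(np.arange(18), 2)
language = np.tile(["Japanese", "Chinese"], 18)
ceiling = 5.6 + 0.03 * (language == "Japanese") + rng.normal(0, 0.05, 36)

df = pd.DataFrame({"speaker": speakers, "Language": language, "ceiling": ceiling})

# Fixed effect of Language, random intercept per speaker (the (1|speaker) term).
fit = smf.mixedlm("ceiling ~ Language", df, groups=df["speaker"]).fit()
# fit.params["Language[T.Japanese]"] is the estimated language effect;
# fit.pvalues holds the corresponding p-values.
```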
In contrast, LMER models on SBM-J and SBM-C showed that there was no significant difference in any of the three F0 parameters: mean (estimate = 0.007, SE = 0.004, p = 0.147), ceiling (estimate = 0.006, SE = 0.004, p = 0.18), and floor (estimate = 0.008, SE = 0.005, p = 0.111). These results suggest that the estimation from the spectrum-based model is independent of the language being spoken, whether it is L1 Japanese or L2 Chinese.
We further examined whether the two methods differ statistically in terms of independence from the speech content, using another LMER model. For both methods and for each speaker, we calculated the estimation differences between their L1 and L2 speech. The estimation difference was the response in the LMER model, and Method was used as the predictor (LMER syntax (taking ceiling as an example): ceiling_difference ~ Method + (1|speaker)). Results revealed that SBM showed significantly smaller estimation differences between the L1 and L2 speech inputs than DFA on the ceiling (estimate = −0.435, SE = 0.137, p = 0.006) and mean (estimate = −0.446, SE = 0.124, p = 0.002) parameters. The estimation difference on the floor parameter was non-significant (estimate = −0.049, SE = 0.123, p = 0.696). Overall, these results indicate that SBM is less affected by the speech content than DFA in estimating a speaker’s F0 range.
Moreover, the results of the refined model for each of the 18 speakers are shown in Figure 3, where for most individuals the estimates from the L1 Japanese utterances (in dots) and from the L2 Chinese utterances (in triangles) are both close to the actual values (in black bars).
Figure 3 also shows that the estimates are more accurate in male than in female speakers. In particular, the estimated F0 values of the female speakers are generally lower than the actual values. Since the speech data we used to train the model were almost gender-balanced, it could be because the mapping between F0 and spectral features is gender dependent. Future studies are needed to better explore this gender difference in model-based estimation.
A comparison of the results verifies that the estimate of F0 range from a very brief speech using the spectrum-based model is less influenced by surface F0 variation (especially, by the language being spoken) than direct F0 analysis. This suggests that the proposed model is capable of capturing the underlying F0 range of a given speaker, regardless of the surface F0 manifestation and the linguistic content of the speech sample.

4. Conclusions

Conventionally, automatic estimation of the speaking F0 range of a given speaker is conducted simply by a direct F0 analysis, which is effective only when a lengthy speech input is available. In contrast, human listeners are able to estimate the underlying pitch range of a given speaker from even a very brief speech and hence to normalize their pitch perception.
Although there was an earlier speculation that the spectrum plays a role in pitch perception, it was not clear which spectral parameters could be used directly for the automatic estimation of F0 range until a recent study of ours proposed an LSTM-based model to estimate F0 range using the FBANKs from a brief speech input as short as 300 ms. This model can in a sense mimic the human auditory mechanism of pitch perception.
The present study further refined the model by using a modified measurement of F0 range (with three parameters: ceiling, mean and floor) which can better describe the distribution of F0. Experimental results showed a more accurate estimate than the initial model.
Through another experiment conducted on a Japanese-Chinese parallel speech corpus, we found that the F0 range parameters estimated from the same speaker’s L1 and L2 speech data were nearly identical, which verified that the estimated F0 range is closer to the speaker-specific underlying F0 range instead of the surface F0 range of the present speech sample.
It was also found that the accuracy of estimation was lower in female speakers than in male speakers, though the training data were almost gender-balanced. Thus, in future studies we may set different model configurations for male and female speakers and train separate models for them to further improve the estimation accuracy.
In sum, the present study has demonstrated that the proposed spectrum-based model is capable of capturing the underlying F0 range of a speaker from a 300 ms-long speech input. In principle, this method can be applied to speech technologies where F0 normalization is needed but no lengthy speech data are available; for example, to help a Mandarin learning application with F0 normalization for its first-time users. F0 normalization is vital for the evaluation of tone productions, but for a new user there is no previous speech that can be used for F0 range estimation.

Author Contributions

Conceptualization, W.Z., Y.X. and J.Z.; methodology, W.Z., Y.X., B.L. and L.W.; formal analysis, W.Z.; writing—original draft preparation, W.Z.; writing—review and editing, W.Z., Y.X., B.L., L.W. and J.Z.; supervision, J.Z.; resources, Y.X. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Science Foundation and Special Program for Key Basic Research fund of Beijing Language and Culture University (the Fundamental Research Funds for the Central Universities) (20YJ04002), the Advanced Innovation Center for Language Resource and Intelligence (KYR17005), and the Wutong Innovation Platform of Beijing Language and Culture University (19PT04).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The open-access AISHELL corpus can be found here: https://www.openslr.org/33/ (accessed on 24 June 2022), and the THCHS-30 corpus here: https://arxiv.org/abs/1512.01882 (accessed on 24 June 2022).

Acknowledgments

Part of this work was presented at the virtual conference of the 10th International Conference on Speech Prosody, Tokyo, 2020. We thank the audience there for their discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. The measurement of F0 range: (a) in the initial model; (b) in the refined model.
Figure 2. The network structure of the refined model.
Figure 3. Results of the refined model for the 18 speakers (on the abscissa, where JF and JM indicate female and male Japanese speakers, respectively), with estimates from the Japanese utterances (in triangles) and from the Chinese utterances (in dots). The actual values (in bars) are calculated by direct F0 analysis from all speech data. The three panels from top to bottom indicate the estimates of ceiling, mean and floor, respectively.
Table 1. MAPE for F0 range estimation with the refined model.

F0 Range Parameter    MAPE (%)
Ceiling               2.19
Mean                  2.15
Floor                 2.52
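The MAPE figures in Table 1 follow the standard mean-absolute-percentage-error definition; a minimal sketch, using made-up toy values rather than the paper's data:

```python
# Minimal sketch of the MAPE metric reported in Table 1: mean absolute
# percentage error between estimated and actual F0-range parameters.
import numpy as np

def mape(estimated, actual):
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return 100.0 * np.mean(np.abs((estimated - actual) / actual))

# Toy example: three estimated vs. actual log-F0 parameters
err = mape([2.36, 2.21, 2.10], [2.41, 2.24, 2.08])
```

Because the parameters are measured on a logarithmic scale, a MAPE near 2% corresponds to a small deviation in log-F0.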
Table 2. Means and standard deviations (in parentheses) of F0 range parameters (in the logarithmic scale) estimated from the Japanese-Chinese parallel speech corpus. SBM-J and SBM-C represent the results of the refined spectrum-based model estimated from the 20 Japanese utterances and from the 20 Chinese utterances, respectively. DFA-J and DFA-C indicate the results of direct F0 analysis from the Japanese and Chinese utterances, respectively, while DFA-JC indicates the result of direct F0 analysis from all 40 utterances, which is here regarded as the actual value of the underlying F0 range.

          DFA-JC (Actual)  SBM-J         SBM-C         DFA-J         DFA-C
Ceiling   2.410 (0.12)     2.387 (0.09)  2.380 (0.09)  2.402 (0.12)  2.374 (0.13)
Mean      2.243 (0.12)     2.206 (0.08)  2.199 (0.09)  2.256 (0.12)  2.229 (0.12)
Floor     2.084 (0.13)     2.056 (0.08)  2.048 (0.08)  2.100 (0.13)  2.088 (0.13)
Zhang, W.; Xie, Y.; Lin, B.; Wang, L.; Zhang, J. Estimation of the Underlying F0 Range of a Speaker from the Spectral Features of a Brief Speech Input. Appl. Sci. 2022, 12, 6494. https://doi.org/10.3390/app12136494