Voice Maps as a Tool for Understanding and Dealing with Variability in the Voice

Ternström, Sten; Pabon, Peter

doi:10.3390/app122211353

Open Access Perspective

by Sten Ternström ¹,* and Peter Pabon ²

¹ Division of Speech, Music and Hearing, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 100 44 Stockholm, Sweden

² Institute of Sonology, Royal Conservatoire, Turfhaven 7, 2511 DK The Hague, The Netherlands

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(22), 11353; https://doi.org/10.3390/app122211353

Submission received: 30 September 2022 / Revised: 28 October 2022 / Accepted: 2 November 2022 / Published: 9 November 2022

(This article belongs to the Special Issue Current Trends and Future Directions in Voice Acoustics Measurement)

Featured Application

To “map the voice” means to collate and visualize hundreds of voice measurements per second over a dense grid in fundamental frequency and sound level. This makes it possible to account for the strong influences of these parameters on most voice metrics, influences which otherwise may obscure other effects under investigation. The measurement perspective is changed from a legacy “keyhole” view to a “landscape” view, extending the scientific, clinical and pedagogical reach of voice measurements.

Abstract

Individual acoustic and other physical metrics of vocal status have long struggled to prove their worth as clinical evidence. While combinations of metrics or “features” are now being intensely explored using data analytics methods, there is a risk that explainability and insight will suffer. The voice mapping paradigm discards the temporal dimension of vocal productions and uses fundamental frequency (f_o) and sound pressure level (SPL) as independent control variables to implement a dense grid of measurement points over a relevant voice range. Such mapping visualizes how most physical voice metrics are greatly affected by f_o and SPL, and more so individually than has been generally recognized. It is demonstrated that if f_o and SPL are not controlled for during task elicitation, repeated measurements will generate “elicitation noise”, which can easily be large enough to obscure the effect of an intervention. It is observed that, although a given metric’s dependencies on f_o and SPL often are complex and/or non-linear, they tend to be systematic and reproducible in any given individual. Once such personal trends are accounted for, ordinary voice metrics can be used to assess vocal status. The momentary value of any given metric needs to be interpreted in the context of the individual’s voice range, and voice mapping makes this possible. Examples are given of how voice mapping can be used to quantify voice variability, to eliminate elicitation noise, to improve the reproducibility and representativeness of already established metrics of the voice, and to assess reliably even subtle effects of interventions. Understanding variability at this level of detail will shed more light on the interdependent mechanisms of voice production, and facilitate progress toward more reliable objective assessments of voices across therapy or training.

Keywords:

voice analysis; voice range profile; voice mapping; variability; reproducibility; representativeness; electroglottography; elicitation; real-time voice analysis

1. Introduction

1.1. Background

When we talk about voice and voice production, as distinct from speech, we refer to the biological apparatus in our body that generates sound. Voice clinicians, voice pedagogues and voice scientists are concerned with caring for, training, and understanding this instrument. Compared to machines, humans are still much better at understanding not only what is meant (language) and what is said (speech), but also how the sounds are produced (voice).

The voice has a wide range of different behaviours, for instance, from soft to loud and from low notes to high notes. These behaviours originate in a biomechanical system with fluctuating properties whose function depends on such diverse circumstances as geometries, material properties of tissues, muscular forces, aerodynamics, and acoustics, to name only the primary ones. Furthermore, as individuals we have vocal folds and vocal tracts that are just as unique to the individual as our faces. These unique differences leave imprints on the sound and also on other signals that we can get from a voice, and this large variability makes the voice a rich and wondrous channel of communication.

As any voice clinician or researcher knows, this complexity also poses great challenges to the objective, quantitative assessment of vocal status and function. So far, subjective assessment has remained very important, but to be reliable it requires very careful staff training and considerable clinical experience. A great deal of work has therefore been invested in the search for objective metrics that could be reliable indicators of a person’s vocal status. Such work has shown that, when measuring physical entities from vocal signals and images, the clinical evidence obtained from individual metrics tends to be weak [1,2,3].

1.2. The Voice Range Profile

A special case of voice maps has been around for a very long time, namely, the voice range profile (VRP; also known as the phonetogram) [4,5] (Chapter 3). A VRP is a voice map that is acquired in order to determine the widest possible range that a person’s voice can produce: softest, loudest, lowest, and highest. In the VRP paradigm, f_o and SPL are dependent variables, and changes in their distribution statistics (minimum, mean, maximum, range etc.) are hypothesized to be related to the outcome of a clinical intervention or training. Typically, one then assesses mainly the contour or “coastline” of this “reachable region”, and the metric that one observes is simply the presence or absence of phonation. In many cases, a vocal problem will result in a reduced range, and there is a large body of literature that is concerned with eliciting and interpreting such contours. However, reliable elicitation of the extremes takes a fair amount of time in the clinic. Furthermore, using one’s voice at its extremes is in itself an acquired skill, which can increase even during one session, especially with the interactive real-time feedback afforded by modern systems. Clinicians are sometimes sceptical of the VRP, because of the time-with-patient and the operator training that it requires. Close adherence to a standard protocol is necessary for good reproducibility [6,7,8]. Cutting corners will incur confounding variability and inconsistent results. On the other hand, many clinical assessments do not require the full voice range to be mapped. For instance, if only the contour is sought, a shorter protocol may suffice [9]. In daily speech, the voice traverses only a very small part of the full range. Such a range can be quantified by the speech range profile (SRP) [10], so if the main interest is to assess ecologically valid productions in the speech range only, then reading a text for a few minutes may be enough [11].

As far back as 1988 [12], author P.P. and co-workers realized that a signal-processing system that extracts SPL and f_o for making VRPs can also extract other metrics at the same time, and render them in that frame of reference, using a gray scale or colour mapping. This was the birth of voice-mapping, if under a different name. It was commercialized by author P.P. and attracted a substantial user base among speech-language pathologists. It also inspired author S.T. to include the VRP in the voice analysis software that he was developing in the 1990s, and which has seen extensive use in Swedish voice clinics. In 2011, S.T. and students began exploring how the acoustic and electroglottographic (EGG) signals might be combined so as to give a more complete picture of how the voice operates over its range, using non-invasive measurements. This ongoing work has persuaded both of us that the variability in voices needs more research attention than it has so far received, which is why we have written this paper. Specifically, the object of the present article is the challenge presented by variability in the human voice to the task of making meaningful measurements on meaningful scales. The primary purpose of the article is to illustrate how the voice mapping paradigm, by coordinating voice measurements in a non-temporal frame, can rise to that challenge. The subject of research is the use of two-dimensional scalar fields to represent variations in voice metrics as functions of two primary independent variables: fundamental frequency (f_o) and voice sound pressure level (SPL). 
The objectives of the article are: to illustrate how individual metrics can vary across the voice range, or particular parts of it; to illustrate how individual voices when portrayed on voice maps can be very different yet tend to be consistent with themselves; to point out some implications of these observations for the acquisition of measurements; and to suggest several applications of voice mapping for more reliable quantitative assessment of voice interventions. The main contribution of this article is to show why many quantitative measurements of voice-derived signals often fail to supply much evidential value, and to give some examples of how resorting to a voice-mapping paradigm can improve on this situation. The long-term goal of this line of research is to improve the reliability of objective voice measurements in general.

1.3. Voice Map Definitions

Any map needs a coordinate system. For voice maps, we propose that the coordinate system be called the voice field, with fundamental frequency f_o given in semitones on the horizontal axis, and sound pressure given as the sound pressure level SPL (dB) on the vertical axis [13] (Chapter 1). (The German term Stimmfeld has sometimes been used as a name for the voice range profile proper, rather than for its coordinate system. We hope that this will not cause confusion.) Over this plane, we can arrange values derived from a scalar property of the sound, according to the f_o and SPL at any given moment. Such a representation is often called a scalar field. On a topographical map of a geographical area, the scalar value is often the height above sea level, encoded using a colour scale. Conventionally, the metric values are averaged in bins or “cells” that are one semitone wide and one decibel high. In effect, the observations in each and every cell will have a statistical distribution, with a mean, standard deviation, etc. On a voice map, it is usually the per-cell distribution mean value that is encoded on the colour scale. The graphics are plotted in real time, so both the clinician and the patient can see the “land” on the map growing as the patient’s voice is exercised over the range of the task. In practice, we are not content to measure just one metric; rather, several selected physical entities are measured many times per second during phonation, each using a well-defined metric. Each metric is then represented on a separate “layer” of the voice map.
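The binning described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The semitone reference (the MIDI convention, A4 = 440 Hz = 69), the rounding of each observation to its nearest cell, the function names, and the sample observations are all assumptions made for the example.

```python
import math
from collections import defaultdict


def hz_to_semitones(f_hz):
    """Convert Hz to semitones; the MIDI reference (A4 = 440 Hz = 69) is an assumption."""
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)


def build_voice_map(observations):
    """Average a metric in cells one semitone wide and one decibel high.

    observations: iterable of (fo_hz, spl_db, metric_value), e.g. one per cycle.
    Returns {(semitone, spl_db): (cell_mean, cell_count)} with integer cell keys.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fo_hz, spl_db, value in observations:
        # Assign the observation to its nearest cell (an assumed convention).
        cell = (round(hz_to_semitones(fo_hz)), round(spl_db))
        sums[cell] += value
        counts[cell] += 1
    return {cell: (sums[cell] / counts[cell], counts[cell]) for cell in counts}


# Hypothetical observations: (fo in Hz, SPL in dB, metric value).
obs = [(220.0, 70.2, 0.40), (221.0, 69.8, 0.44), (440.0, 80.1, 0.60)]
voice_map = build_voice_map(obs)
```

Here the first two observations land in the same cell and are averaged, while the third populates a cell of its own. Keeping the per-cell count next to the mean makes it possible to judge later how well each cell has been sampled.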

Although many voice metrics will be mentioned in this article, it is not the details of how they are defined, measured, or interpreted that are at issue, but rather how they are collected, coordinated, and presented. Fundamentally, semitones and decibels are both logarithmic representations, and in most cases, a logarithmic representation of each metric is also preferred [13] (pp. 40–70). Log-log-log systems have useful mathematical properties, such as invariance under scaling, and they have a practical similarity to human sensory perception scales. They also raise mathematical issues, such as how to define a standard deviation when observations are made on a log scale, but those issues will be considered to be out of scope for this article.

In Figure 1, we see maps of four colour-coded metrics of the EGG and audio signals, all from the same recording of the same person, a vocally healthy male amateur singer singing in the modal register (M1) only [14]. The task here was to sing soft-loud-soft on an /a/ vowel only, over a limited range. Without going into the specifics of each metric [15], it is clear that all of them vary over this range: considerably so in the vertical direction (which represents sound level) and to some extent also horizontally (which represents the fundamental frequency f_o). It is also clear that while the four metrics exhibit some co-variation, they report on different aspects of this voice. This example was chosen for the relative simplicity of its trends over SPL and f_o. Typically, a map over the full range of a voice would be more complex.

Imagine now that any of the maps in Figure 1 is a topographical map of a geographical area, with height mapped to colour. An intriguing thing happens when we ask the participant to repeat the same task many times: the features on the map gradually become clearer and smoother; but they do not change, so long as the voice remains the same. The “landscape” that is being “surveyed” only becomes more well-defined. This means that the topographical features on the map reflect such properties of voice production that are specific, not to sounds being produced in the moment, but to physiological aspects of the mechanism that is producing them. The maps do not just show pretty colours: each metric represents hard facts about what is going on inside the body and in the air. There is a physiological reason for the observed co-variation: the metrics reflect the state of an underlying mechanism. The metrics have been chosen for good reasons, based on what is known of how vocal function manifests itself in physical measurements. In this article, we do not attempt to interpret the mechanism from the data, although that of course is the greater purpose. The objective here is to demonstrate the localizing principle, by which the voice map representation, relative to the independent variables of SPL and f_o, reveals a consistency that stems from the functioning of a particular voice at the time of acquisition. This notion is central to the rationale for voice mapping.

1.4. People Are Different, but Consistent

Not only do practically all voice metrics depend a great deal on SPL and on f_o [16,17], but these dependencies can also be quite different from person to person. Fortunately, the variation in a metric over the voice field (not over time) tends to be systematic and reproducible, so long as we are making repeated observations of the same individual. As an example, Figure 2 shows three repeated takes of running speech by four adult normophonic (vocally healthy) males, recorded for Patel and Ternström [18].

From Figure 2 it will be clear that, while the map can be very different from that of another similar person performing the same task in the same conditions, people can consistently reproduce their own voice map for a given task. Another example, from eight trained female singers singing a nursery rhyme many times, can be found in [19] (Supplement B), which shows how within participants, EGG waveforms change very consistently on replications over the range of the song, but different singers exhibit very different regions for their typical waveforms. This means that the maps are showing relationships of metrics to SPL and f_o that are specific to a particular voice, i.e., that are properties of the voice production mechanism, and that would not readily emerge from point-wise sampling of single vowels. It means also that, in general, the effects of interventions can be assessed directly only within persons. Still, that possibility is of great practical significance. We will show how the effects of interventions can be visualized by making maps pre- and post-intervention, and then constructing a new map of the differences.
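A map of pre/post differences can be sketched as follows. This is an illustrative assumption about how such a map might be built, not the authors' procedure: it takes two maps in a hypothetical `{(semitone, dB): (mean, count)}` format, keeps only the cells populated in both, and subtracts the cell means. The `min_count` threshold and all the numbers are made up for the example.

```python
def difference_map(pre, post, min_count=1):
    """Per-cell difference (post minus pre) for cells populated in both maps.

    pre, post: {(semitone, spl_db): (cell_mean, cell_count)}  (assumed format).
    Cells with fewer than min_count observations in either map are skipped.
    """
    diff = {}
    for cell in pre.keys() & post.keys():
        pre_mean, pre_n = pre[cell]
        post_mean, post_n = post[cell]
        if pre_n >= min_count and post_n >= min_count:
            diff[cell] = post_mean - pre_mean
    return diff


# Hypothetical pre- and post-intervention maps of one metric.
pre = {(57, 70): (0.42, 12), (58, 72): (0.45, 9), (60, 75): (0.50, 4)}
post = {(57, 70): (0.47, 10), (58, 72): (0.44, 11), (62, 78): (0.55, 6)}

diff = difference_map(pre, post)
diff_strict = difference_map(pre, post, min_count=10)
```

Because the subtraction is done cell by cell, each difference compares productions at the same f_o and SPL. Cells visited in only one of the two sessions contribute nothing, rather than being compared against values taken from elsewhere in the voice field.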

1.5. The Need for Dense Sampling

Conventionally, for clinical voice assessment one tries to eliminate as much as possible of the variability in the voice, by eliciting only a few sustained vowels at “comfortable” (i.e., unspecified) loudness and pitch. While this may be clinically expedient, we submit that it may amount to a gross under-sampling, with a high risk of being poorly representative of the person’s vocal status. It is rather like trying to assess a person’s face from only a few scattered pixels, at ill-defined locations in a photograph. In effect, we are taking sparse samples from an unknown function of SPL and f_o, when it is actually the function itself that carries the sought information. Measurements that are based on a few sustained vowel productions are susceptible to differences in elicited SPL and f_o. These differences can easily be so large that the associated trends in the observed metrics will mask the effect that one is trying to measure. Some real-life examples will be given later in this article. We believe that this is one important reason why the evidential value of many individual physical voice metrics appears to be weak, as found by numerous studies [1,2,3]. It is often not because the metrics are unsuitable as indicators of vocal status, but rather because we have not been collecting and collating them effectively enough.

1.6. Combining Multiple Metrics and Data Analytics

The large variability observed in metrics of the voice implies, for instance, that we should not expect to find any single metric whose quantity will discriminate accurately between normal and pathological voice, nor indeed between various types of normal phonation. Regardless of whether we are talking about the cepstral peak prominence (CPP), the spectrum balance (SB), or the EGG contact quotient (Q_ci)—to name but a few—any one of these, on its own, is unlikely to give reliable evidence as to whether or not a particular voice is fully functional [1]. Many researchers have found that, when taken in combination, a set of several metrics can be more informative [20]. This insight has resulted in a number of proposed indexes that combine several metrics, such as the Acoustic Voice Quality Index [21], the Göttingen hoarseness diagram [22], and also in combination displays, such as the Multi-Dimensional Voice Program (MDVP) [23]. Typically, though, the constituent metrics of such indices are themselves highly dependent on SPL and/or f_o. The AVQI, for instance, has six such components. One may also expect such indexes to exhibit considerable intra-subject variation over the voice field, and so the issue of appropriate sampling remains.

The potential of combined metrics has lately triggered an avalanche of investigations into using machine-learning methods for classifying voices and vocal status on the basis of multiple “features”. For an in-depth appraisal of such methods, see the separate article by Gomez et al. in this issue [24]. As Gomez et al. point out, the data sets hitherto used in such studies are rarely, if ever, calibrated for SPL and f_o. For an example of a novel method that combines several metrics in voice maps using only statistical clustering, see the article by Cai et al. in this issue [25].

2. Aspects of Variability in Voice

In this section we first consider the voice field, being the frame of reference with the independent variables on its two axes. We then describe several sources of variations: system noise, variability in the studied voices including vocal “registers”, and variability in participant responses to the conventional elicitation of vocal tasks. We also consider how an elicited response can be deemed representative of the participant’s normal vocal behaviour.

2.1. The Independent Variables: SPL and f_o

In voice signals, SPL and f_o are the two variables that are the most readily observable, while at the same time being under the volitional control of the speaker or singer. Controlling SPL and f_o adequately is essential, where “adequately” implies different criteria for different styles of speech and of singing. This degree of control makes SPL and f_o useable as the two primary independent variables for voice mapping, i.e., for observing a dependent metric over a relevant range.

As noted above, many quantitative studies of the voice, including those that use the VRP, take f_o and SPL as the dependent variables, and in such cases, the primary interest lies in the observed distribution shapes and statistics of those two variables. In voice mapping, f_o and SPL are instead seen as control parameters, i.e., independent variables. The distributions of these two variables are then important only for ensuring that an adequate number of observations of the metric(s) under study are collected at all relevant fundamental frequencies and voice sound levels. In practice, this means that the total time spent phonating in any particular cell on the map is not important. Further on, we will elaborate on the criteria for an “adequate number of observations”.

In speech, f_o and SPL co-vary mainly because (1) both increase with subglottal pressure, and (2) acoustically, the voice becomes a more efficient sound source as f_o increases [26]. This co-variation is so familiar that we hardly think about it. It is also the reason for the generally upward-slanting, oblique appearance of phonated areas in voice maps and VRPs. In a singer, the softest usable performance voice at high pitch can even have a higher sound level than the loudest possible voice at low pitch [5] (p. 43). Perceptually, the SPL is much more a correlate of distance than of vocal effort [27]. The SPL depends also on the size of the mouth opening by some ±5 dB [28], and there are other not-so-direct relationships between vocal effort and vocal tract articulation that play into the co-variation of f_o and SPL as well. It would seem, then, that SPL is not an ideal metric to use as a coordinate on a voice map.

Nonetheless, we normally employ the SPL as an indirect proxy for a more informative metric: the acoustical power (in watts) radiated by a voice, which is hard to measure directly. The source power can however be estimated from the SPL, if we account carefully for the mouth-to-microphone distance. There is an intuitive optical analogy here: the radiated acoustic power is analogous to the actual height of a person, while the SPL corresponds to the imaged full-figure height that one measures on a photograph. The latter decreases with distance to the camera, and increases with the magnification of the lens, which corresponds to the audio gain. One of several reasons for choosing a standard microphone distance of 0.3 m is that, at this distance, the sound pressure level in dB and the power level in dB radiated by the mouth, which does not depend on the distance, happen to be approximately the same [29]. Because many voice metrics are highly dependent on the voice output power, accurate calibration of the SPL is crucial when making voice maps that are to be compared to each other. As for the fundamental frequency, fortunately, the temporal precision and stability of digital audio interfaces is very high, and hence calibration of f_o measurements is rarely necessary, so long as all relevant sampling rates are properly matched.
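The dependence of SPL on microphone distance can be made concrete with a small sketch. It assumes ideal free-field spherical spreading from a point source, in which the level falls by about 6 dB per doubling of distance. Real rooms and near-field effects close to the mouth will deviate from this, so the function below (a hypothetical helper, not part of any cited system) is only a first-order correction.

```python
import math

REFERENCE_DISTANCE_M = 0.3  # standard microphone distance named in the text


def spl_at_reference(spl_db, mic_distance_m, ref_distance_m=REFERENCE_DISTANCE_M):
    """Refer an SPL measured at mic_distance_m to the reference distance.

    Assumes free-field spherical spreading (inverse distance law); this is an
    idealization that ignores room reflections and near-field effects.
    """
    return spl_db + 20.0 * math.log10(mic_distance_m / ref_distance_m)


# A level of 70 dB measured at 0.6 m corresponds to about 76 dB at 0.3 m.
normalized = spl_at_reference(70.0, 0.6)
```

A distance error of this size shifts every observation by six cells vertically on the map, which is why maps that are to be compared need a common, calibrated reference.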

Once a voice map has been made, it is static. The temporal context is discarded, which reinforces the distinction between voice (the machine) and speech (its output). By design, then, voice maps are helpful for analyzing voices rather than speech.

2.2. Types of Phonation

The pair of human vocal folds is a non-linear dynamic oscillating system that can vibrate in several distinct ways, which are often called phonation types, registers, or mechanisms. The two most commonly observed are modal/chest voice (M1) [30], which is normally used in speech and low-pitch singing, and head/falsetto voice (M2), which is typically used in high-pitch singing. There is usually a bistable pitch region within which a person can, with a little training, choose to phonate in M1 or M2. This choice will switch the voice into another regime entirely. The skilled control of such switching is considered desirable in singing, while frequent involuntary switching, such as in voice mutation during male puberty, is considered to be undesirable in speech. Several voice metrics will present different scalar fields depending on whether phonation was done in M1 or M2 [31,32]. In practice, this means that we have a third important independent variable, namely the mechanism. This variable can assume the discrete values M1, M2 and in unusual cases also M0 (“glottal fry”) or M3 (“whistle”). Since this third dimension represents a small number of discrete categories, it can be (and has been) implemented using a separate 2D map layer for each category. So as not to make the present overview unduly complicated, this third “control parameter”, which can take any of the values {M0 | M1 | M2 | M3}, will be disregarded here, and we will assume that the voice stays in M1, unless otherwise noted.

2.3. Variations: Noise or Data?

The abundant dependencies on f_o and SPL are only two aspects of the many sources of variation in metrics of the voice. Other aspects include the physics (turbulence, hysteresis, chaotic systems), the biomechanics (glottal posturing, lung volume, heart rate, etc.) [33], the neuromotor control (reaction times, intrinsic feedback loops, etc.), the phonetics (dependency on phonemes, speech rate, etc.), and the representation of the measured signal (artefacts of quantization and sampling). With all this variability, voice metrics appear to be inherently more or less noisy. However, not all of the variation is stochastic noise; much of it is simply due to factors that have not been controlled for. Often, the distribution of a metric can be informative as to which type of variation we are dealing with. Briefly, a distribution that does not approach a normal distribution, as the central limit theorem would lead one to expect of purely stochastic variation, is unlikely to be stochastic; it is instead likely to indicate that there is a (re-)distributing mechanism that is producing the observed variation. Such analysis of the distribution of a metric along its own scale is unfortunately beyond the scope of this article, even though the data acquisition framework offered by the voice map paradigm will support it.

To obtain good information on the shape of the scalar field of a metric, we need to sample the metric as densely as necessary over the voice field, at small intervals in f_o and SPL. In fact, the standard resolution of one semitone and one decibel is purely pragmatic; whether or not it is sufficient has not been shown. In the following sections, however, we will show that it is a great improvement on sparse sampling at a few effort levels. Importantly, we do not need to make many repeated measurements in each point (cell), so long as the metric in question is well defined for each observation, nor do the data points have to be collected in any particular temporal sequence. The shape of the field will not depend on the order of acquisition, so long as the observed voice does not change appreciably over the course of the recording.

Most scalar voice metrics are defined in relation to the phonatory cycle, which constitutes an elemental time step for new information [34]. Hence f_o is usually the maximum data update rate, provided that observations are made synchronously with the phonatory cycles. Only rarely will it be meaningful to make more than one observation per cycle. This implies a data rate of one or two hundred observations per second for adult speech, and up to over a thousand observations per second in female singing. When analysing period-to-period variations, which are known to be of clinical interest, the more contiguous cycles we can record, the better, since such metrics require a comparison of multiple cycles.

In this context, the rate of observations should not be confused with the sampling rate of the signal. When we sample an acoustic or other signal at the consumer audio standard rate of 44,100 times per second, it is not the values of all those samples that interest us, but rather the shapes of the waveforms that the samples describe. Similarly, in high-speed video endoscopy, with image rates of 2–20 kHz, it is usually not the individual images that provide the interesting information, but rather the temporal behaviour in those image sequences of features that typically are evaluated for each phonatory cycle. The more data we can collect, the better such features can be resolved. Once the features of the voice map have emerged clearly enough, it is redundant to have the informant prolong the task. A key issue in the design of voice map protocols is therefore how to acquire an adequate density of phonatory cycles across the relevant voice range. We will return to this issue, further on.
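One simple way to monitor the density of phonatory cycles across the range is to count observations per cell and flag the cells that are still thinly populated. The sketch below assumes that each cycle has already been reduced to a (semitone, dB) pair. The threshold `min_cycles` is a placeholder, not a recommended value, and the data are hypothetical.

```python
from collections import Counter


def cell_counts(cycles):
    """Count observed cycles per one-semitone by one-decibel cell.

    cycles: iterable of (fo_semitones, spl_db), one entry per phonatory cycle.
    """
    return Counter((round(st), round(db)) for st, db in cycles)


def sparse_cells(counts, min_cycles):
    """Return, in sorted order, the visited cells with fewer than min_cycles."""
    return sorted(cell for cell, n in counts.items() if n < min_cycles)


# Hypothetical recording: three visited cells with very different dwell times.
cycles = [(57.1, 70.2)] * 40 + [(57.9, 71.6)] * 3 + [(60.0, 75.4)] * 25
counts = cell_counts(cycles)
needs_more = sparse_cells(counts, min_cycles=20)  # placeholder threshold
```

Feedback of this kind could tell the operator which regions to revisit, and when further repetition of the task adds little. Cells that were never visited do not appear in the counts at all, so a check against the intended target range would have to be made separately.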

2.4. Variability over the Voice Range

In the vocological literature we often encounter mathematical models that serve to represent some aspect of how the voice behaves as its driving conditions are varied. Such models are valuable and necessary, and, if well formulated, they can provide important insights and predictions [35]. However, they tend also to impose a mind-set according to which every voice can be expected to follow similar rules, on the whole. This is of course true, but the complete set of rules and constraints operating on any given voice is as yet unknown, and—as any professional voice user will attest—it is more complex and diverse than is often appreciated or covered by any particular model. Therefore, it is important to observe reality as it is, and try to describe it without immediately imposing a model. Voice maps are a useful tool for doing so.

In numerous studies, first on voice range profiles [4], or phonetograms, and more recently on voice maps in general [5,13,18,19], it has been demonstrated that the variability within and between individual voices is surprisingly large. Presenting the full extent of this variability here would require a prohibitive amount of space. We strongly encourage the reader instead to browse the extensive albums of voice maps that are contained in data sets supplementary to articles already published. Direct links are provided with the references [13,18,19] at the end of this article.

2.5. Reproducibility and Elicitation

Legacy assessment paradigms tend to rely on the measurement of properties of individual sustained vowel sounds. It is common practice to elicit a vowel at “comfortable pitch and loudness”, without noting the SPL or f_o that results. If this is repeated, perhaps weeks later, without any intervention, how similar will the second production be to the first? Surprisingly, this appears to have been studied only sparingly. Brown et al. [36] studied the repeatability in elicited “comfortable” SPL and f_o of 16 speakers across several days and found it to be poor, with the standard deviation in “relative” SPL being 4.7 dB within speakers and about 10 dB between speakers, while spontaneous f_o exhibited a standard deviation of some 1–2 semitones. They recommended controlling SPL and f_o more tightly when eliciting phonation, and including them as experimental factors, but few have since taken notice. A later article [37] on test-retest reliability at “comfortable loudness and pitch” after one week (30 ♂, 30 ♀) found that reliability was very good at the group level; however, the authors noted that the group level is not relevant for assessing a change in a given patient. Another study, on the test-retest reliability of aerodynamic measures [38], found that the intra-subject minimum difference MD (the smallest observed difference to be regarded as real for a given subject, given the variability in elicitation) was 5 dB in SPL and 3 semitones in f_o. Voice mapping reveals that even such apparently modest differences can significantly influence metrics that trend with SPL or f_o, and most of them do. In other words, test-retest reliability cannot be assumed to be good. Awan et al. [39] examined changes in the CPPs as related to SPL and f_o at three levels of elicitation. Some of their SPL results are shown in Figure 3 (round markers).
For the present article, we have looked also at a database that was collected for other purposes, of isolated sustained vowel sounds elicited at “soft”, “comfortable” and “loud” effort levels. That data collection was done according to best practice. The raw data have kindly been made available to us by the researchers, who have published several peer-reviewed studies based on that data set [16,17]. The collection was done pre-intervention (therapy or surgery), and then post-intervention. Looking at a subset of this database (85 female voice patients, /a/ vowels only, pre-intervention), we consider first the means and standard deviations in SPL for the three elicitations (Figure 3, square markers).

The two cohorts shown in Figure 3 responded similarly to the template elicitations, with a 6 dB or 8 dB increment between the steps, although with rather large standard deviations, resulting in considerable SPL overlap between the effort levels. Many of the patients in cohort 2 had an intervention and then returned for another recording session. Figure 4 shows the correlation plots of what 58 such participants did in the pre- and post-intervention recording sessions #1 and #2, for the same elicitations, in SPL and in f_o.

The SPL produced by any given participant showed a mean absolute difference of 4.3 dB between the two occasions, while the mean absolute difference in f_o was 2.4 semitones. There was no systematic difference between sessions #1 and #2 in f_o (+0.37 semitones, p = 0.53) and only a weak one in SPL (−0.76 dB, p = 0.049). Given that the task was the same in the two sessions, one might expect some correlation between the responses, but these correlations are low, especially in SPL. It can be seen in Figure 4 that repeated productions of a participant could differ by as much as ±10 dB or ±10 semitones between sessions, for the same elicitation. Given that most voice metrics depend considerably on SPL and/or f_o, these very large and random differences in the independent variables, uncorrelated with the intervention, will render pre-post comparisons for effects of intervention practically meaningless, unless the number of observations is increased by orders of magnitude. Also, there is a considerable overlap between the elicitation categories, even within participants, a few of whom produced, for instance, a lower SPL for “loud” than for “comfortable”. A forthcoming retrospective study [41] of a subset of these data showed that the elicitation instructions (soft/comfortable/loud) produced robust and meaningful variations in the phonatory metrics, while any effects of intervention (surgery or therapy) were undetectable. The problem here is that no attempt was made to control for physical SPL and f_o during elicitation. The resulting excess variability has greater consequences than is typically recognized, as will be further elaborated upon below. In summary, several studies indicate that the phenomenon, which we might call test-retest “elicitation noise”, is on the order of ±5 dB and ±2–3 semitones. When this variability is viewed in the context offered by voice maps, we must conclude that it is a problematic aspect of the “comfortable loudness and pitch” data collection paradigm.
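The distinction drawn above between the mean difference and the mean absolute difference across sessions can be illustrated with a minimal sketch. The function name `retest_summary` and the four SPL values per session are hypothetical and serve only to show the arithmetic; they are not taken from the database discussed here.

```python
import numpy as np

def retest_summary(session1, session2):
    """Summarize test-retest agreement for one elicited quantity (SPL or f_o)."""
    s1 = np.asarray(session1, dtype=float)
    s2 = np.asarray(session2, dtype=float)
    diff = s2 - s1
    return {
        "mean_diff": diff.mean(),              # systematic shift between sessions
        "mean_abs_diff": np.abs(diff).mean(),  # typical size of the elicitation noise
        "r": np.corrcoef(s1, s2)[0, 1],        # correlation between the two sessions
    }

# Hypothetical "comfortable" SPL values (dB) for four participants
spl_session1 = [60.0, 65.0, 70.0, 75.0]
spl_session2 = [62.0, 61.0, 74.0, 73.0]
summary = retest_summary(spl_session1, spl_session2)
```

In this toy case the individual differences cancel in the mean, while the mean absolute difference remains several dB. A small mean shift between sessions therefore says little about how far apart the two productions of any one participant were.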

2.6. Representativeness and Elicitation

It is rarely a good idea to reduce any given aspect of a voice to a single number. Still, if we wish to do so, such as for making group comparisons, then it is necessary to minimize the noise from as many sources as possible, and to ensure that the number obtained really is representative of the given person’s current vocal status. So, to what degree is an elicited vowel representative of the person’s habitual voice? This question was addressed by Murry et al. [40], who found that sustained vowels tended to be produced at a somewhat different f_o than that of habitual speech, with age and sex effects as well (−2.6 to +1.1 semitones difference). SPL was not reported in that study. The authors ventured that there may be a “performance” aspect to sustaining a vowel that might bias its production away from that in habitual speech. Others have found that test-retest reliability appears to be greater for a reading task than for elicited sustained vowels, as shown for f_o by Fitch [42].

A reading task is more like normal speech, and so we may expect it to be more ecologically valid than isolated sustained vowels. So, is it possible to measure things from running speech, even though its inherent variability is a concern? The speech range profile (SRP), of course, does this as regards the limits and ranges in f_o and SPL. The long-time average spectrum (LTAS) can be used for assessing some voice properties, if the averaging is performed over a sufficiently long and homogeneous task. Some studies indicate that the CPP is applicable to the assessment of voice problems using reading tasks, although problems remain with separating voiced from unvoiced segments [43]. For singing, it has been found [19] that EGG waveforms of trained female singers were distributed similarly over the entire range of a song, regardless of whether the singer’s task was to sing on sustained /a:/ vowels, on /pa:/ syllables, or on lyrics, although the EGG waveforms were somewhat more variable for the latter, as would be expected.

3. Voice Maps in Practice

In this section we show how creating difference maps coordinates the observations of a scalar metric in a way that accounts for individual trends and elicitation noise over SPL and f_o, thereby greatly clarifying the effects of interventions. We consider some practical aspects of acquisition duration, as well as interpolation and smoothing of maps. We demonstrate how maps can be used to find a metric value that is known to be representative of a participant’s speech, and why this is important. We then consider local gradients of metrics, and how they contribute to variability in measurements. Finally, we briefly bring up a number of ancillary points of practical interest.

3.1. Mapping Effects of Interventions

By using voice mapping, and ensuring that productions elicited pre- and post-intervention will overlap each other in the voice field, at least within a relevant range, it is possible to compare outcomes within subjects at matched SPL and matched f_o. In so doing, we can elegantly eliminate the influence of “elicitation noise”, and at the same time account for the person-specific trends in the given voice.

An example is given in Figure 5. A transgender patient read a standard text for about two minutes, before feminization therapy, and then was recorded again 3 months post-therapy. It is very important when making such recordings to keep the setup, gain calibration, protocol and/or task identical, pre- and post-therapy. The top row shows the density, i.e., the number of cycles phonated in each cell. It can be seen that her speech range and its distribution mode were translated upwards, in both sound level and f_o. The difference map (c) at the right shows the cell-by-cell differences in the overlapping region, i.e., in those cells that were populated in both the pre- and post-therapy maps. Here red signifies fewer cycles; green signifies more cycles. Clearly, after therapy, the patient spent more of the task time at high f_o; and we would not have needed a voice map to establish that. However, consider now the bottom row in Figure 5, which shows the average spectrum balance of the voiced segments only. In the difference map (f), the consistency of the green area signifies that there has been a physiological change, resulting in more energy above 2 kHz, in the post-therapy condition. The average difference in all the cells in (f) is +4 dB. Such a change could have been brought on by a regrouping of formant frequencies, and/or by a somewhat more pressed use of the voice. Thanks to the mapping, however, we know that the cause of this increase is not a higher SPL nor a higher f_o (both of which are known also to affect the spectrum balance), since the SB in each cell of the “post” map (e) is compared to the corresponding cell in the “pre” map (d). Also, measuring only a few sustained vowels would not have given such a clear outcome.
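The cell-by-cell comparison described above can be sketched as follows. This is a minimal illustration, assuming that each map is stored as a 2D array indexed by (SPL row, f_o column), with NaN marking cells that were not visited. The function name `difference_map`, the array layout and the toy values are our assumptions for this sketch, and do not describe the internals of any particular mapping system.

```python
import numpy as np

def difference_map(pre, post):
    """Post-minus-pre difference, defined only in cells populated in both maps."""
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    both = ~np.isnan(pre) & ~np.isnan(post)   # the overlapping region
    diff = np.full(pre.shape, np.nan)
    diff[both] = post[both] - pre[both]
    return diff

# Toy 3 x 3 maps of a metric (say, spectrum balance in dB); NaN = empty cell
pre = np.array([[np.nan, -30.0, -28.0],
                [-32.0,  -29.0, np.nan],
                [np.nan, np.nan, np.nan]])
post = np.array([[np.nan, -26.0, -24.0],
                 [np.nan, -25.0, -22.0],
                 [np.nan, np.nan, -20.0]])

diff = difference_map(pre, post)
mean_change = np.nanmean(diff)   # average over the overlapping cells only
```

Because each difference is taken at matched SPL and matched f_o, a shift of the speech range between sessions changes only which cells overlap, not the values that are compared within them.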

Another application of difference maps is to investigate subtle changes in highly controlled tasks, as with professional singers. In this situation, there is little or no elicitation noise, and the effect sizes may be very small, yet of interest. An example can be found in a recent study on using the flow-ball as a feedback device for learning breath control [44]. Trained singers did a soft-loud-soft exercise on the vowel /a:/, first normally, then while phonating into a flow-ball (the intervention), and then immediately again without the flow-ball. This was repeated in whole-tone steps increasing over 1½ octaves, so there was one intervention for every tone in the scale. The full voice range was not explored. The top row of Figure 6 shows maps of the EGG contact quotient Q_ci on a colour scale, for a trained baritone. The pre- and post-intervention maps look quite similar, but the third map reveals the differences after the intervention. Here, green means an increase and red means a decrease, in this case in Q_ci. We see that Q_ci mostly decreased (red) except on the loudest tones around 220 Hz (green). Returning to the analogy with a landscape, we see that while the overall topography pre and post is similar, the elevation has changed somewhat, and differently so in different places. The circumstance that many adjacent cells show the same direction of change gives us confidence that this is a systematic effect. Again, this effect would have been very difficult to demonstrate convincingly with just a few sustained vowels. They would have been vulnerable both to elicitation noise and to the small fluctuations from various incidental sources of variation in the voice production mechanism, as mentioned in the introduction. By sampling this mechanism over a dense grid across SPL and f_o, the mapping operation helps to expose trends in what are inherently rather noisy data.

The issue now becomes to develop and validate a methodology for using difference maps such as these, since they appear to hold promise for making clinical assessments. In particular, more applied experience will be needed in order to establish the effect sizes needed to indicate clinically relevant changes, in any given metric. This is of course always the case, also for more conventional assessment methods, but mapping shows us that it might not be meaningful to reduce such effect sizes to a single number, for comparison with a given threshold. An intervention is more likely than not to have different effects in different parts of the range of the voice. One may hypothesize that interventions should result in features on the voice map changing position, for instance a lower SPL for the threshold of vocal fold collision in soft voice, or a raised SPL for the onset of pressed phonation in loud voice. New techniques for assessing temporal changes in scalar fields (such as pre-post intervention) have recently been presented by Köpp et al. [45], and we plan to test these techniques in future studies.

3.2. Time Needed for Collection

Given that mapping the voice necessarily takes time, it would be useful to estimate how short a measurement time can be, yet still make a voice map that appears smooth to the eye. This will depend on whether the given metric represents some aspect based on individual cycles, or on sequences of several cycles, as for perturbations. A rule-of-thumb can be derived that about 10 observations (phonatory cycles) per cell will suffice for metrics that are defined on individual cycles. For metrics that quantify cycle-to-cycle variability, more cycles may be needed. To be on the safe side, we might adopt a conservative minimum of 20 cycles. Even so, that means that the informant needs to spend a total of only 0.2 s in each cell at f_o = 100 Hz, and proportionately less at higher f_o.
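The arithmetic behind this rule of thumb is simple enough to sketch. The first function below restates the relation used above (time per cell equals the number of cycles divided by f_o). The second gives an idealized lower bound for a rectangular block of cells; the grid of one semitone by one dB and the example range are assumptions made for illustration only.

```python
def seconds_per_cell(f_o_hz, min_cycles=20):
    """Phonation time (s) needed to collect min_cycles cycles at a given f_o."""
    return min_cycles / f_o_hz

def lower_bound_seconds(midi_low, midi_high, n_spl_rows, min_cycles=20):
    """Idealized minimum time to fill a rectangular block of cells, each one
    semitone wide and one dB high (an assumed grid), with no wasted cycles."""
    total = 0.0
    for midi in range(midi_low, midi_high + 1):
        f_o_hz = 440.0 * 2.0 ** ((midi - 69) / 12.0)  # equal-tempered pitch
        total += n_spl_rows * seconds_per_cell(f_o_hz, min_cycles)
    return total

t_100 = seconds_per_cell(100.0)            # 20 cycles at 100 Hz
t_block = lower_bound_seconds(45, 69, 30)  # 110-440 Hz over 30 dB (hypothetical)
```

Such a figure is only a floor. It ignores the time spent moving between cells and revisiting cells that are already full.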

We conclude that it is expedient to be as mobile as possible in f_o and SPL. The bottleneck is not acquiring sufficient data points, but rather getting the informant to control their voice such that the relevant part of the voice range is traversed in a short time. In principle, 2–3 min could be enough. Phonating on stationary vowels will only take unnecessary time, without providing more information.

3.3. Interpolation and Smoothing

Should we require all relevant cells to be visited in a finished map? It does take a fair amount of time to get participants to cover the entire interior of their voice range without gaps; typically 10 min if they are trained, 20 min or more if they lack prior vocal training. If the map does not have to be perfectly filled, acquisition time can be saved. Also, cells containing fewer than the nominal 20 observations (cycles) will exhibit small incidental cell-to-cell variations that add noise to the general picture, as discussed in the preceding section. The time saved in acquisition may justify a modest amount of smoothing and interpolation across vacant cells. The bottom row in Figure 6 illustrates the effect of such an interpolate-and-smooth operation, using a convolution kernel of 3 × 3 cells.
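One way to realize such an interpolate-and-smooth operation is a NaN-aware 3 × 3 averaging filter, sketched below. The uniform kernel weights, the rule that a vacant cell is filled only if at least `min_neighbours` cells in its neighbourhood hold data, and the function name `smooth_and_fill` are all assumptions made for this illustration; the operation used for Figure 6 may differ in these details.

```python
import numpy as np

def smooth_and_fill(grid, min_neighbours=3):
    """NaN-aware 3 x 3 box filter over a voice map layer.

    Each output cell is the mean of the populated cells in its 3 x 3
    neighbourhood. A vacant cell is filled only if at least min_neighbours
    cells in that neighbourhood hold data; otherwise it stays NaN.
    """
    grid = np.asarray(grid, dtype=float)
    valid = ~np.isnan(grid)
    values = np.pad(np.where(valid, grid, 0.0), 1)   # zero-padded values
    weights = np.pad(valid.astype(float), 1)         # 1 where data exist
    rows, cols = grid.shape
    total = np.zeros(grid.shape)
    count = np.zeros(grid.shape)
    for dr in range(3):                              # sum the nine shifted copies
        for dc in range(3):
            total += values[dr:dr + rows, dc:dc + cols]
            count += weights[dr:dr + rows, dc:dc + cols]
    keep = (valid | (count >= min_neighbours)) & (count > 0)
    out = np.full(grid.shape, np.nan)
    out[keep] = total[keep] / count[keep]
    return out

demo = np.array([[1.0, 1.0, 1.0],
                 [1.0, np.nan, 1.0],
                 [1.0, 1.0, 4.0]])
smoothed = smooth_and_fill(demo)   # the vacant centre cell is interpolated
```

Requiring a minimum number of populated neighbours keeps the filter from extending the map beyond the range that the participant actually produced.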

3.4. Finding a Representative Value

Often a value that is intended to be characteristic of an individual is constructed by measuring on several sustained vowels and reporting the average. But such vowels are often elicited in a way that might not be representative, as discussed above. Instead, voice mapping of connected speech or some other relevant task can be completed. The “density” layer provided by most voice mapping systems is in effect a 2D histogram of SPL and f_o. Using the bespoke voice-mapping system FonaDyn (Appendix A, and [46]) that analyzed only the voiced segments, Patel and Ternström [18] recorded multiple readings of the Rainbow Passage. In the resulting 2D histograms they defined, as the most representative speech region, the 50% of phonations around the main peak of the distribution, and arbitrarily named it the “Γ region”. A metric mean within this rather small and most visited region should be somewhat more representative than the mean of the entire reading. An example is shown in Figure 7.
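A region of this kind can be approximated from the density layer as sketched below. The sketch simply takes the most visited cells, in descending order of cycle count, until they hold the requested fraction of all cycles. This highest-density rule is our assumption for illustration; the exact construction of Γ in [18] may differ, for instance in how contiguity around the main peak is enforced. The names `gamma_region`, `density` and `metric`, and the toy values, are hypothetical.

```python
import numpy as np

def gamma_region(density, fraction=0.5):
    """Boolean mask of the most visited cells that together hold at least
    the given fraction of all phonatory cycles."""
    density = np.asarray(density, dtype=float)
    flat = density.ravel()
    order = np.argsort(flat)[::-1]                 # densest cells first
    cumulative = np.cumsum(flat[order])
    n_cells = int(np.searchsorted(cumulative, fraction * flat.sum())) + 1
    mask = np.zeros(flat.shape, dtype=bool)
    mask[order[:n_cells]] = True
    return mask.reshape(density.shape)

# Toy density layer (cycles per cell) and a metric layer (per-cell means)
density = np.array([[0, 10, 5],
                    [10, 30, 20],
                    [5, 15, 5]])
metric = np.array([[np.nan, 0.30, 0.35],
                   [0.32, 0.40, 0.50],
                   [0.45, 0.48, 0.55]])

mask = gamma_region(density)
representative = metric[mask].mean()   # mean of the per-cell means within the region
```

The mean is taken here over per-cell means, so that every cell in the region counts equally, regardless of how many cycles it contains.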

But why go to such lengths, just to get one representative value? For one thing, the mapping reveals whether or not the independent variables f_o and SPL have a unimodal distribution. We stated earlier that the distribution of f_o and SPL, as shown by the “density” layer in Figure 7a,b, has little effect on the topography of the metric’s scalar field. However, if we are asking which values of the metric can be considered typical or representative in connected speech, then this distribution will matter. Especially, if the distribution has several modes, then any descriptive statistics will be called into question; which of the modes is the relevant one? This must be an informed, manual judgment.

Assuming that the distribution is indeed unimodal, will using a distribution mean from a voice map make a difference, compared to the legacy sampling of a small number of sustained vowels? Well, it can, and a follow-up example shows how. In [18], recordings such as the one shown in Figure 7 were made of 26 adults (13 ♂, 13 ♀, group A) and 22 prepubertal children (group C). Figure 8 shows the descriptive statistics of the EGG contact quotient Q_ci over Γ for all those participants, based on the subset within their respective Γ only, of /a:/ productions sustained for the VRP. For each participant, these statistics are based on the contents of on average 22 cells in their own Γ region, which contained data from 1300–3400 phonatory cycles per person (cf. the black grid in Figure 7c).

From Figure 8 it is very clear that there is considerable inter-individual variation. This is typical for any metric that we measure from the voice. Note that the standard deviations in Q_ci in Figure 8 were computed from the per-cell means in the Γ region only—not over the entire SRP. We see also that there seems to be a difference in Q_ci between the groups, but we would like to subject this to a robust statistical test. We now have estimates of the mean and standard deviation for each participant. Let us assume a normal distribution within participants, even though this is overly simplistic; this assumption makes our example a best-case scenario, while reality is even worse. We can now draw one sample at random from each of the 48 participant distributions, as if we had analyzed only one properly elicited sustained vowel from each participant. We can then test for group differences on the basis of the outcome of these random draws, using a two-sided t-test, and compute the p-value for that test. By repeating such a draw, say, a thousand times, we can estimate the likelihood of detecting the real group difference based on the p value. Figure 9 shows the outcome of such a Monte Carlo simulation, using this data set.

For this particular case, the 1000 averages from the random draws confirm our initial impression that there is a difference between the groups: in Figure 9, the ensembles of outcomes indicated by (A, blue) and (C, red) are visually distinct. However, the right panel in that figure shows that if we apply a criterion of p < 0.05, the probability of confirming a difference between these groups would be only 51.7%, if based on only one observation per participant. Using instead the per-participant means over their Γ region gave a corresponding p of 0.002 and a probability of confirming the difference of 93.8%. What we have done is in effect to increase the number of observations, not by sustaining more vowels, but by finding the metric’s distribution within a relevant voice range for each participant and taking the mean of that distribution as the sole observation of that participant. This illustrates why obtaining a truly representative value per participant can be very important, especially if the intent is to test for group differences on the basis of p values. We recall here that such a test alone tells us nothing about the practical importance or clinical relevance of any observed difference. For a discussion of the interpretation of statistics, see the tutorial article by Anders Sand in this special issue [47].
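The random-draw procedure described above can be sketched as follows. The group sizes (26 and 22) follow the text, but the per-participant means and standard deviations below are synthetic placeholders, not the data from [18], so the resulting numbers will not reproduce Figure 9. The function name `detection_rate` is hypothetical, and `scipy.stats.ttest_ind` is used with its default settings (two-sided, equal variances assumed), which is our choice for this sketch.

```python
import numpy as np
from scipy import stats

def detection_rate(means_a, sds_a, means_c, sds_c,
                   n_draws=1000, alpha=0.05, seed=0):
    """Fraction of random draws in which a two-sided t-test yields p < alpha,
    when every participant contributes one normally distributed observation."""
    rng = np.random.default_rng(seed)
    means_a, sds_a = np.asarray(means_a, float), np.asarray(sds_a, float)
    means_c, sds_c = np.asarray(means_c, float), np.asarray(sds_c, float)
    hits = 0
    for _ in range(n_draws):
        draw_a = rng.normal(means_a, sds_a)   # one sample per participant
        draw_c = rng.normal(means_c, sds_c)
        hits += int(stats.ttest_ind(draw_a, draw_c).pvalue < alpha)
    return hits / n_draws

# Synthetic per-participant statistics (placeholders, not measured values)
rng = np.random.default_rng(1)
means_a, sds_a = rng.normal(0.45, 0.04, 26), np.full(26, 0.05)
means_c, sds_c = rng.normal(0.40, 0.04, 22), np.full(22, 0.05)

rate_single = detection_rate(means_a, sds_a, means_c, sds_c)
p_means = stats.ttest_ind(means_a, means_c).pvalue   # test on per-participant means
```

Comparing `rate_single` with the single test on the per-participant means shows, for any chosen set of inputs, how much statistical power is lost when each participant is represented by one noisy observation.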

3.5. Local Gradients

Provided that the Γ region extends across at least a few cells along both axes in the map, we can perform linear regression to obtain the gradients along the axes of f_o and SPL across this region, for any metric. Performing linear regression across an entire voice map is rarely meaningful, because the gradient function of a metric is typically not linear over large ranges. For a small region such as Γ, however, a linear trend can be expected to be a fair characterization of a voice change within the speech range. Taking again an example from the data set collected for Patel and Ternström [18], Figure 10 shows the group averages of such local gradients for three EGG metrics, for three groups of 13 adult males, 13 adult females and 20 children aged 5–7 of both sexes. (Two of the twenty-two children were excluded, because their Γ regions had very small extents on the f_o axis.)
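Such local gradients can be estimated by fitting a plane to the per-cell means within the region, as in the minimal sketch below. It assumes a metric layer stored as a 2D array indexed by (SPL row, f_o column), a Boolean mask for the region, and axis vectors in semitones and dB. The function name `local_gradients` and the toy data are hypothetical, and the ordinary least-squares fit is one reasonable choice rather than a description of the analysis in [18]. The function also returns the share of variance that the fitted plane explains.

```python
import numpy as np

def local_gradients(metric, mask, fo_axis, spl_axis):
    """Fit metric ~ a + b*f_o + c*SPL over the masked cells.

    Returns (b, c, r2): the slope per unit of f_o, the slope per unit of SPL,
    and the fraction of variance explained by the fitted plane.
    """
    metric = np.asarray(metric, dtype=float)
    use = np.asarray(mask, dtype=bool) & ~np.isnan(metric)
    rows, cols = np.nonzero(use)
    fo = np.asarray(fo_axis, dtype=float)[cols]
    spl = np.asarray(spl_axis, dtype=float)[rows]
    X = np.column_stack([np.ones(rows.size), fo, spl])
    y = metric[rows, cols]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    ss_total = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(residual ** 2) / ss_total if ss_total > 0 else float("nan")
    return coef[1], coef[2], r2

# Toy region: 4 SPL rows x 5 f_o columns with an exactly planar metric
fo_axis = np.arange(48.0, 53.0)     # semitones (MIDI numbers)
spl_axis = np.arange(60.0, 64.0)    # dB
metric = 0.3 + 0.01 * fo_axis[None, :] - 0.02 * spl_axis[:, None]
mask = np.ones(metric.shape, dtype=bool)

slope_fo, slope_spl, r2 = local_gradients(metric, mask, fo_axis, spl_axis)
```

With real data the explained variance will be well below one, and the fit is meaningful only if the region spans several cells along both axes, as noted above.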

In Figure 10 it can be seen that the three exemplary metrics all exhibit different trends across Γ, when their slopes are averaged by group. Furthermore, the individual slopes were extremely variable, but there is not room here to report them. Note in Figure 10c the especially dramatic SPL dependency of the inverse perturbation metric CSE, whose steep descent indicates that phonation rapidly stabilizes as vocal output power increases. As pointed out by many [12,16,17], perturbation metrics (jitter, shimmer, CPP, etc.) are sensitive to SPL, and especially so in the rather narrow transition band between breathy and modal voice, in which VF contact is established. For the cohort of normophonic, untrained participants examined here, the habitual speech region of the voice range tended to sit right across that band. When such is the case, even small pre-post differences in elicited SPL or f_o could easily mask the sought effects of an intervention. We conclude that the existence of steep and personal gradients in some metrics again speaks strongly in favour of making voice maps, and of comparing them only within individuals.

For the same study [18], Patel and Ternström had participants make full voice range profiles on /a:/, and then extracted the small subsets of those VRPs that were enclosed by the Γ regions that had been obtained from the speech task. This allowed a comparison of the variances explained by the trends, in connected speech and in productions on one vowel only.

It can be seen in Figure 11 that in the vowel-only condition, much more of the total variance over Γ was explained by the local trends. Given that connected speech gives rise to several additional sources of variation, this is of course exactly what might be expected. The figure also shows that, in the task of traversing a range on the /a/ vowel, fully 30–50% of the total variance could be attributed to local dependencies on SPL and f_o, even in the small Γ region. Again, these are group averages; individuals exhibit different personal trends. Ultimately, one’s research question must dictate whether or not vocalization on /a:/ is preferable to a connected speech task.

At this point, we are not recommending clinicians actually measure gradients of voice metrics. Rather, the point is that analyses of voice map data reveal that such gradients can be locally steep, and highly individual, even within the “most habitual” speech range. This exacerbates the issue of the variable outcomes of elicitations. It might seem reasonable to assume that things will remain pretty much the same within the rather small range of habitual speech; but unfortunately this is not true in the general case. Svante Granqvist (personal communication) has pointed out that it could be optimal for the efficiency of communication to speak in a voice range such that the smallest change in effort gives the largest dividend in modified output. It is not unlikely that this is what we do.

3.6. Mapping of Group Data

When working with voice maps, one is struck by the inter-individual variation, as in many other clinical disciplines. Still, the need is clear for making group comparisons, either pre- and post-intervention, or of two groups following different regimes. Only if one has a reasonably homogenous group of participants is it meaningful to make voice maps that represent the average of a group [18,31,48]. This can be done in several ways. If participants should be weighted equally, one may compile a map of all participants, where the per-cell mean of each participant counts as one observation in each cell, and the number of cycles observed for each participant is discarded. If the participants should be weighted according to the time they spent, then all cycle observations can simply be accumulated per cell across participants, and new means computed for all metrics.
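The two weighting schemes can be sketched as follows, assuming that the per-participant maps are stacked into arrays of shape (participants, SPL rows, f_o columns), with NaN where a participant has no data. The function names, the array layout and the toy values are assumptions for illustration.

```python
import numpy as np

def group_map_equal_weight(cell_means):
    """Each participant's per-cell mean counts as one observation per cell."""
    cell_means = np.asarray(cell_means, dtype=float)
    valid = ~np.isnan(cell_means)
    n = valid.sum(axis=0).astype(float)
    total = np.where(valid, cell_means, 0.0).sum(axis=0)
    return np.divide(total, n, out=np.full(n.shape, np.nan), where=n > 0)

def group_map_cycle_weight(cell_means, cell_counts):
    """Cycles are pooled across participants, so that those who spent more
    time in a cell contribute more to its mean."""
    cell_means = np.asarray(cell_means, dtype=float)
    counts = np.where(np.isnan(cell_means), 0.0,
                      np.asarray(cell_counts, dtype=float))
    total = (np.where(counts > 0, cell_means, 0.0) * counts).sum(axis=0)
    n = counts.sum(axis=0)
    return np.divide(total, n, out=np.full(n.shape, np.nan), where=n > 0)

# Two participants, a 1 x 2 map; participant 2 never visited the second cell
means = np.array([[[0.40, 0.50]],
                  [[0.60, np.nan]]])
counts = np.array([[[100, 50]],
                   [[300, 0]]])

equal = group_map_equal_weight(means)
pooled = group_map_cycle_weight(means, counts)
```

In the first cell the two schemes give different results, because the second participant contributed three times as many cycles as the first.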

Although it might be tempting to adopt such an averaged map as being somehow normative for a group, we caution against it. A map that is accumulated from many, without any warping to align local features, will tend to smooth out the individuals’ local gradients, which actually reflect the essence of the underlying physiological mechanisms.

3.7. Mapping of Other Statistics

So far we have considered mostly the mean of a metric, per voice map cell. However, any other per-cell statistic, such as standard deviations, metric gradients, or inter-metric correlations can in themselves be taken as scalar fields that can also be represented on maps. For instance, the standard deviation of the EGG contact quotient often has a local maximum in the transition region, where VF contacting sets in, due both to oscillatory instability and to hysteresis. Maps can be made also of the correlation between metrics of phonation and other physiological signals acquired in parallel, such as vertical larynx height, lung volume, subglottal pressure, etc. Examples can be found in [19].
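How several per-cell statistics can be accumulated from the same cycle-by-cycle data is sketched below, for the cycle count, the mean and the standard deviation. The cell size of one semitone by one dB, the function name `cell_statistics` and the four example cycles are assumptions made for illustration.

```python
import numpy as np

def cell_statistics(fo_st, spl_db, metric, fo_min, spl_min, shape):
    """Accumulate per-cycle observations into cells of 1 semitone x 1 dB.

    Returns (count, mean, sd) as 2D arrays indexed by (SPL row, f_o column);
    mean and sd are NaN in cells that were never visited.
    """
    rows = np.floor(np.asarray(spl_db, dtype=float) - spl_min).astype(int)
    cols = np.floor(np.asarray(fo_st, dtype=float) - fo_min).astype(int)
    metric = np.asarray(metric, dtype=float)
    inside = (rows >= 0) & (rows < shape[0]) & (cols >= 0) & (cols < shape[1])
    rows, cols, m = rows[inside], cols[inside], metric[inside]

    count = np.zeros(shape)
    sum1 = np.zeros(shape)
    sum2 = np.zeros(shape)
    np.add.at(count, (rows, cols), 1.0)    # unbuffered accumulation per cell
    np.add.at(sum1, (rows, cols), m)
    np.add.at(sum2, (rows, cols), m ** 2)

    mean = np.divide(sum1, count, out=np.full(shape, np.nan), where=count > 0)
    mean_sq = np.divide(sum2, count, out=np.full(shape, np.nan), where=count > 0)
    sd = np.sqrt(np.clip(mean_sq - mean ** 2, 0.0, None))   # population SD
    return count, mean, sd

# Four hypothetical cycles: three fall in one cell, one in another
fo_st = [48.2, 48.7, 48.9, 49.5]      # semitones
spl_db = [60.1, 60.5, 60.9, 61.3]     # dB
values = [1.0, 2.0, 3.0, 5.0]
count, mean, sd = cell_statistics(fo_st, spl_db, values,
                                  fo_min=48.0, spl_min=60.0, shape=(2, 2))
```

Any of the three returned layers can then be displayed as a map in its own right.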

3.8. Colour Scales

Standardized colour scales can be presumed to be important for achieving wide uptake of voice maps. In our work, we have experimented with making the mapping of colours specific to certain metrics, so that one at a glance can infer which metric is being displayed. Clearly, it will be important to standardize how metric values should be mapped to colours, or users will have to re-learn from one system to another. However, this effort will have to be deferred until some consensus is reached on which metrics are proven to be useful, and which are not.

Existing voice mapping systems tend to use a red-yellow-green or red-green-blue colour space for mapping metric values, which can be troublesome for people with limited colour vision. Exploiting the full colour space maximizes resolution for those with the colour vision of the majority, but is actually detrimental for a minority. For practical systems, it may be necessary to develop a universal set of colour scales, or at least to provide the option of selecting alternative scales. Still, some colour is invaluable; the eye detects intensity contrasts rather than absolute intensities and is easily fooled by monochromatic scales based on intensity only.

3.9. Hearing-Seeing-Feeling in Real Time

When one is recording a voice map, a moving cursor indicates the current position in f_o and SPL in real time. More than the exact scale values, it is the momentary f_o/SPL location on the map that exemplifies a characteristic physiological setting, such as a particular laryngeal posturing, and with it the extent of the region over which a distinct quality manifests itself. Even if the local topography of the mapped metric shows no change, a mere enlargement of the mapped area may still report that an increased control over the voice has been gained. Patients generally appreciate such immediate feedback, which can be included in a therapeutic regime [49].

The same voice may have a different quality, be it desired or not, at different locations within the voice range. This notion needs no explanation to singers or voice patients. They see, or expect to see, such qualities revealed on the map during an interactive recording session. Many voice users already have a mental map of how their voice works. During recording with real-time feedback, they want to connect the visuals on the screen with their proprioception. To them, a goal of reducing all information to a single representative numerical value, detached from its production settings, is counterintuitive, as it implies a gross reduction of the recorded information. A voice that is considered to be pathological may exhibit an aberration over a part of the voice range, yet still function well over much of the remaining range. Highly trained singers may encounter specific phonatory problems only when forced to go to certain points in their range, and even then may skilfully avoid them.

It is fascinating and instructive to watch a voice map emerge, while listening at the same time. It extends one’s mental model of the voice, not only for the clinician/pedagogue but also for the patient/student. Real-time visual feedback makes the patients/students more aware of their vocal control, while creating a record of their performance and progress [49]. This in turn can provide an incentive for adhering to a training regime, for instance. It is actually fun to “paint with one’s voice”.

The FonaDyn system [46] allows the user to listen post hoc to the sounds that contributed to the map in a mouse-clicked location, regardless of when in the recording those sounds occurred. This can help to reinforce one’s understanding of the map representation, and of the roles of the constituent metrics. It can also be used to conduct auditory-perceptual validations of what the maps are showing.

3.10. Assessing Pathological Voices

When a new patient presents with a voice problem, the clinician usually does not have baseline data with which to compare the current vocal status. Because voices can be so different, it is not realistic to have as baselines normative, “reference” voice maps based on a population. Only a comparison with the same person in good voice will be truly informative, especially for the often subtle vocal problems of voice professionals. One possibility could be to incorporate the collection of a full-range voice map as part of a regular health check-up, if only for persons that depend on their voice for their livelihood. Should they encounter voice problems later, then such a desirable reference would be available.

While the legacy voice range profiles have proven their usefulness in assessing voice problems that affect the vocal range, the focus of most recent voice mapping studies, on the interior of the contour, has so far been on healthy voices. This has given many valuable insights into the mechanisms of voice production, but we are still at the beginning. The time is now ripe for applying these techniques to the assessment of impaired voices. We know from the literature on VRPs that voice pathologies manifest themselves in very diverse ways on the contours. With vocal pathologies present, the diversity of phenomena portrayed by (filled) voice maps can only be expected to increase.

3.11. Subtle Voice Changes

Voice maps can reveal changes that are local to a small sub-range of the voice, which might otherwise have eluded a more conventional assessment. This can be of particular relevance to singers [5], with voice problems or performance targets that can be very subtle. To the practicing singer, voice maps provide a useful personal frame of reference for monitoring one’s own vocal status and progress [50,51], even in a small part of the voice range.

4. Discussion and Conclusions

4.1. Choosing the Metrics for Voice Maps

Intentionally, we have largely avoided the topic of the interpretation of the metric scalar fields, asserting simply that “voices are more different than you think”. The central topic of the present article has been understanding and dealing with variability in the voice. This is a necessary but hitherto rather neglected prerequisite to measuring and understanding the voice itself. Following that, one will need to focus on selecting those metrics that are of the greatest relevance to the domain of study and to the chosen interpreting models. The domain could be clinical, pedagogical or research-oriented. For instance, when assessing vocal pathologies, the literature shows that metrics of irregularity of phonation will be needed, while for assessing elite singing, metrics of the resonance properties of the voice may be more relevant. For addressing specific research questions about voice production or voice perception, again the selection of metrics will be guided by the choice of models. For instance, for production studies, classical metrics such as contact quotient, mfdr, P_sub, etc., are perfectly useable, while for perceptual studies one could map metrics such as loudness [52] or other psychoacoustic entities. The common principle, however, is to find a small set of relevant metrics with a low degree of mutual residual information, once the universally strong covariations that are linked to SPL and f_o have been accounted for. This search is greatly facilitated by voice mapping, which does precisely this accounting, for each individual.

Additionally, if metrics are chosen such that they can act also as synthesis parameters [13] (Chapter 2.4), then their role and relative contributions can be evaluated by listening to the output of a synthesis model. If the re-synthesized voice sounds correct in all relevant respects, then this is a validation that the employed set of metrics is complete, at least perceptually so. This may seem to be a trivial assertion, but the voice map format facilitates the auditory exploration of how the perceptual roles of different parameters also change across the voice range.

One may then ask if any metric exists that is directly comparable across voices; one that would represent a basic physiologic aspect of phonation, such that all voices show a similar variation over f_o and SPL. In our experience, the best candidate for such a “reference” metric is the power of the fundamental partial, relative to the summed power of the remaining partials. Several variants occur in the literature, such as HRF for Harmonic Richness Factor [53], when referring to the spectrum of the glottal flow, or L_Rest/H1 [31], referring to the radiated spectrum. L_Rest/H1 is distributed in a similar way for all voices phonating in the same mechanism (M1 or M2). The audiophile would say that it represents the amount of harmonic distortion in a sound system. This metric represents a fundamental aspect of the voice physiology and the acoustic voice production mechanism, namely how much the voice source deviates from being a linear, time-invariant system, in the formal sense of the term. There is, however, no link of absolute metric values to f_o and SPL that will apply to all individuals.
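
As a rough illustration of such a reference metric, the following Python sketch computes the level of the summed higher partials relative to the fundamental, in dB. The function name `rest_to_h1_db`, the input format (a list of linear partial powers, fundamental first) and the sign convention are assumptions made here for illustration only; the exact definitions of HRF [53] and L_Rest/H1 [31] should be taken from those sources.

```python
import math


def rest_to_h1_db(partial_powers):
    """Level of partials 2..N relative to the fundamental, in dB.

    partial_powers: linear power of each harmonic partial, fundamental first.
    Illustrative definition only (assumption), not necessarily that of [31] or [53].
    """
    if len(partial_powers) < 2:
        raise ValueError("need the fundamental and at least one higher partial")
    h1 = partial_powers[0]            # power of the fundamental partial
    rest = sum(partial_powers[1:])    # summed power of the remaining partials
    if h1 <= 0 or rest <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(rest / h1)
```

With this sign convention, a value of 0 dB means that the higher partials together carry as much power as the fundamental, and negative values indicate a source spectrum dominated by the fundamental.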

4.2. Limitations of Voice Mapping

A basic prerequisite for making a voice map at all is that the phonation must be periodic enough for a fundamental frequency to be estimated. This would seem to be at odds with the fact that severely pathological voices are often characterized by low or even absent periodicity. There will of course be cases in which it is not possible to make a map, and then it is not relevant to do so, either. Once a minimum range of phonation has been regained, however, mapping can document its extent and its character. As yet, we have had too little experience of mapping severely pathological voices to say more about this.

Often, when first encountering voice maps, clinicians will ask, “well that’s all very nice, but how do I get a number from all these visualizations?” That will depend entirely on the research question; and with rich visual and auditory feedback of a given phenomenon or relationship, one is in a much better position to pose the right research questions. A voice map is made entirely of numbers, arranged so as to supply the pixel values in an image. As the radiologist examines a scanned image of some body part, so the voice clinician can examine and interpret the images from the voice map, and this of course takes experience. With increased experience, it will be possible to design numerical scores that summarize some aspects of maps that are especially relevant in a particular clinical or pedagogical context. However, this remains to be seen.

Making a voice map necessarily takes more time than sustaining a few vowels. For several kinds of clinical assessment, making a full voice map would certainly be overkill. It will be important to develop protocols that are expedient, yet still yield enough data to resolve the intervention effects in the face of the copious variability in the voice. This work will continue, and is likely to require separate approaches for different pathologies, for instance.

4.3. Data Analytics?

Exploring the voice with maps does tend to generate large amounts of data and images that can easily become overwhelming. Voice maps can of course be processed as images, for which an expanding plethora of data-driven techniques is available. In future work, we hope to try out, for instance, the feasibility of applying techniques similar to facial expression recognition, for the identification and localization of features in the maps. Data reduction techniques such as bottleneck hidden layers in neural networks are potentially helpful for identifying how underlying voice production mechanisms manifest themselves in the maps. This is a task that seems to be especially well suited to ML/AI techniques. A major challenge here is that a physiological ground truth is often difficult to establish from externally acquired signals.

Some will argue that with deep learning techniques, an end-to-end system is possible in which such feature detection and even voice maps would be unnecessary intermediate steps. However, we believe that explainability is key to acceptance and to furthering our knowledge of the human voice. Explainability is a key advantage of voice mapping, but it is not yet well supported by data analytics. At this stage, we therefore prioritize the rearranging of multiple time-series of metric values into the non-temporal voice map representation, in the conviction that it will both promote human understanding and facilitate learning by machines. We have high hopes for how data analytics might be applied to the distillation of insights from voice maps. However, we must first have a good understanding of how the maps best can be made, to avoid a “garbage in, garbage out” situation.
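
The rearrangement of time series into a non-temporal map can be sketched as follows, in Python. This is a minimal illustration under stated assumptions, not the FonaDyn implementation: the function name `build_voice_map` and its arguments are hypothetical, the cell size of 1 semitone by 1 dB and the five-cycle minimum are taken from the captions of Figures 1 and 2, and the semitone conversion assumes the MIDI scale with A4 = 440 Hz.

```python
import math
from collections import defaultdict


def build_voice_map(fo_hz, spl_db, metric, min_cycles=5):
    """Average a per-cycle metric within cells of 1 semitone x 1 dB.

    fo_hz, spl_db, metric: equally long sequences, one value per phonatory cycle.
    Returns {(midi_semitone, spl_cell): (mean_metric, cycle_count)}.
    The temporal order of the cycles is deliberately discarded.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for f, level, m in zip(fo_hz, spl_db, metric):
        if f <= 0:
            continue  # skip cycles without a valid f_o estimate
        semitone = round(69 + 12 * math.log2(f / 440.0))  # MIDI number (assumed A4 = 440 Hz)
        cell = (semitone, round(level))
        sums[cell] += m
        counts[cell] += 1
    # Keep only cells visited often enough to give a meaningful average
    return {c: (sums[c] / n, n) for c, n in counts.items() if n >= min_cycles}
```

Each metric mapped in this way yields one “layer” of the voice map; cells with too few cycles are simply left empty rather than filled with unreliable averages.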

4.4. Concluding Remarks

Voice mapping is not in itself a kind of measurement. It is a paradigm for the coordinated acquisition of data and visualization of multiple concurrent measurements, in which the momentary SPL and f_o are continuously accounted for, much more data is collected for every individual, and the co-variation of different metrics can be made explicit. Thus, voice mapping offers opportunities for accounting for many of the sources of variation in the voice, thereby improving on more conventional data collection paradigms. We hope to have shown that the potential of voice mapping goes well beyond that of the legacy voice range profile, including:

offering the clinician, researcher, and informant a richer yet equally compact presentation of vocal status, in real time;
mapping even small or local changes across interventions, while accounting for their trends over SPL and f_o, even if those trends are non-linear and personal;
finding the metric values that are the most representative of the individual’s vocal status, in ecologically valid tasks;
assessing the range of variation in a metric, in a given task;
assessing the sensitivity of a metric to changes in SPL and f_o.

Although these are all improvements, it is clear that a number of tasks remain if we are to make substantial progress towards the long-term goal of improving the general reliability of non-invasive objective voice measurements. Some of these tasks are practical, others theoretical, and they will iteratively inform each other, so concurrent progress will be needed on all fronts. A list of tasks for the next few years could read as follows:

Testing statistical clustering of several metrics, applied in real time so as to automatically and quantitatively identify regions of different types of phonation [25]. This should facilitate making the connection to perceptual assessment and ultimately to a deeper functional assessment of voices. Even though we strive for objective rather than subjective assessment, a grounding in current clinical perceptual methodology is very important both for understanding and for uptake;
Identifying compact sets of voice metrics that are especially relevant for assessing voice signals in the context of particular vocal pathologies and/or training objectives, in other words, what to measure in which situation;
Outreach: voice scientists presenting and discussing voice mapping with clinicians and voice pedagogues in workshops and courses, and learning more about their priorities [54];
In-the-field applications: Solving the technical problem of SPL calibration in poorly controlled situations such as telehealth, which could increase the influx of research data; implementations for mobile devices;
Performing pre-post intervention studies, so as to gain experience with clinically or pedagogically relevant effect sizes, for the interpretation of difference maps;
Making quantitative voice assessment more functionally oriented and clinically relevant by mapping the location and extent of phonation type regions across the voice field, rather than by assessing voice quality using statistically derived cut-off values for metrics and composite indexes.

Most importantly, voice mapping reveals things about the voice that we did not already know. This is our incentive to continue the pursuit. Each new observation is an opportunity for questioning our models, and for making more complete models that offer better explanations. Voice mapping provides a fertile ground for making new observations, as well as for placing both the expected and the unexpected into a frame of reference that can be shared and built upon.

Author Contributions

Conceptualization and methodology, P.P. and S.T.; software, S.T.; writing—original draft preparation, S.T. and P.P.; writing—review and editing, S.T. and P.P.; visualization, S.T. All authors have read and agreed to the published version of the manuscript.

Funding

The writing of this overview article received no external funding.

Institutional Review Board Statement

Not applicable. The writing and preparation of examples for this overview paper did not involve humans or animals.

Informed Consent Statement

Informed consent is not applicable for this overview paper. No persons participating in the quoted studies can be identified here.

Data Availability Statement

Extensive albums of voice maps made for different purposes can be found in the supplementary files of previously published studies [18,19,31].

Acknowledgments

Numerous studies have been published that use voice mapping, such that we now have the incentive to present this overview. We are grateful to our collaborators, doctoral students and master’s students for their engagement and enthusiasm. In particular, we are indebted to Rita Patel, Filipa Lã and Meike Brockmann-Bauser, who were lead investigators in recent collaborative studies, from which many of the present examples were taken. Maria Södersten, Ulrika Nygren and Svante Granqvist kindly provided the example for Figure 5. Naomi Iob and Huanchen Cai post-processed the data from [16] for making Figure 3 and Figure 4. Sara D’Amario collaborated in [19], on the first application of voice maps of differences and correlations. David House gave valuable feedback on the manuscript.

Conflicts of Interest

Author P.P. is the sole owner of Voice Quality Systems, a company that produced the Voice Profiler system in co-operation with re-sellers. This product is no longer commercially available. Author S.T. was involved in the development of the Phog system (see Appendix A), but since 2005 he has had no financial interests in any commercial products.

Appendix A. Implementations

At the time of writing, there are several commercially available systems that construct voice range profiles, most of which render and analyze the contour only, although a few also map the density and a small selection of other voice metrics, but without deeper analysis such as that outlined in the present article. They can be found with an internet search for “voice range profile” or “phonetogram” and have not been listed here.

Most of the studies used here as examples were made using the then-current versions of the freeware FonaDyn [46], which was first written by Dennis Johansson at KTH in 2015 for his M.Sc. project in computer science [55]. Author S.T. has continued to develop FonaDyn, with incidental contributions from students. FonaDyn is released to the public domain (https://www.kth.se/profile/stern, accessed on 1 November 2022), as an open-source, proof-of-concept workbench for analyses based on voice mapping; not as a commercial, supported product. It is in continuing development.

For clinical use, Svante Granqvist has written RecVox, which is freely downloadable from www.tolvan.com (accessed on 1 November 2022). It maps six metrics and is currently the standard acquisition system for clinical voice recording at the Karolinska University Hospital at Huddinge, Stockholm, Sweden.

Commercial systems that perform mapping of metrics over the voice field include Voice Profiler (by author P.P.), which is no longer commercially available but still in use, especially in The Netherlands. Voice Profiler maps from 4 to 26 metrics, depending on the version. Another commercial system is Phog, which is part of the VoiceJournal system from Neovius Signalsystem, www.neovius.se (accessed on 1 November 2022) and reportedly is still available. Phog maps only the metrics density, crest factor and a perturbation metric.

References

1. Carding, P.N.; Wilson, J.A.; MacKenzie, K.; Deary, I.J. Measuring voice outcomes: State of the science review. J. Laryngol. Otol. 2009, 123, 823–829.
2. Roy, N.; Barkmeier-Kraemer, J.; Eadie, T.; Sivasankar, M.P.; Mehta, D.; Paul, D.; Hillman, R. Evidence-Based Clinical Voice Assessment: A Systematic Review. Am. J. Speech Lang. Pathol. 2013, 22, 212–226.
3. Lopes, L.W.; da Silva, J.D.; Simões, L.B.; da Silva Evangelista, D.; Silva, P.O.C.; Almeida, A.A.; de Lima-Silva, M.F.B. Relationship Between Acoustic Measurements and Self-evaluation in Patients with Voice Disorders. J. Voice 2017, 31, 119.e1–119.e10.
4. Ternström, S.; Pabon, P.; Södersten, M. The voice range profile: Its function, applications, pitfalls and potential. Acta Acust. United Acust. 2016, 102, 268–283.
5. Lamarche, A. Putting the Singing Voice on the Map: Towards Improving the Quantitative Evaluation of Voice Status in Professional Female Singers. Ph.D. Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2009. Available online: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-9976 (accessed on 1 November 2022).
6. Sanchez, K.; Oates, J.; Dacakis, G.; Holmberg, E.B. Speech and Voice Range Profiles of Adults with Untrained Normal Voices: Methodological Implications. Logop. Phoniatr. Vocology 2014, 39, 62–71.
7. Printz, T.; Rosenberg, T.; Godballe, C.; Dyrvig, A.K.; Grøntved, Å.M. Reproducibility of Automated Voice Range Profiles, a Systematic Literature Review. J. Voice 2018, 32, 273–280.
8. Printz, T.; Sorensen, J.R.; Godballe, C.; Grøntved, Å.M. Test-Retest Reliability of the Dual-Microphone Voice Range Profile. J. Voice 2018, 32, 32–37.
9. Rychel, A.K.; van Mersbergen, M. The Voice Range Profile-A Shortened Protocol Pilot Study. J. Voice 2021.
10. D’Alatri, L.; Marchese, M.R. The speech range profile (SRP): An easy and useful tool to assess vocal limits. Acta Otorhinolaryngol. Ital. 2014, 34, 253–258.
11. Ma, E.; Robertson, J.; Radford, C.; Vagne, S.; El-Halabi, R.; Yiu, E. Reliability of Speaking and Maximum Voice Range Measures in Screening for Dysphonia. J. Voice 2007, 21, 397–406.
12. Pabon, J.P.H.; Plomp, R. Automatic Phonetogram Recording Supplemented with Acoustical Voice-Quality Parameters. J. Speech Hear. Res. 1988, 31, 710–722.
13. Pabon, P. Mapping Individual Voice Quality over the Voice Range: The Measurement Paradigm of the Voice Range Profile. Ph.D. Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2018. Available online: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-235824 (accessed on 1 November 2022).
14. Selamtzis, A.; Ternström, S. Investigation of the Relationship between Electroglottogram Waveform, Fundamental Frequency, and Sound Pressure Level Using Clustering. J. Voice 2017, 31, 393–400.
15. Ternström, S. Normalized Time-Domain Parameters for Electroglottographic Waveforms. J. Acoust. Soc. Am. 2019, 146, EL65–EL70.
16. Brockmann-Bauser, M.; Bohlender, J.E.; Mehta, D.D. Acoustic Perturbation Measures Improve with Increasing Vocal Intensity in Individuals with and without Voice Disorders. J. Voice 2018, 32, 162–168.
17. Brockmann-Bauser, M.; Van Stan, J.H.; Carvalho Sampaio, M.; Bohlender, J.E.; Hillman, R.E.; Mehta, D.D. Effects of Vocal Intensity and Fundamental Frequency on Cepstral Peak Prominence in Patients with Voice Disorders and Vocally Healthy Controls. J. Voice 2021, 35, 411–417.
18. Patel, R.R.; Ternström, S. Quantitative and Qualitative Electroglottographic Wave Shape Differences in Children and Adults Using Voice Map–Based Analysis. J. Speech Lang. Hear. Res. 2021, 64, 2977–2995. Available online: https://asha.figshare.com/articles/journal_contribution/EGG_wave_shape_differences_in_children_and_adults_Patel_Ternstr_m_2021_/15057345 (accessed on 1 November 2022).
19. Ternström, S.; D’Amario, S.; Selamtzis, A. Effects of the Lung Volume on the Electroglottographic Waveform in Trained Female Singers. J. Voice 2018, 34, 485.e1–485.e21. Available online: https://data.mendeley.com/datasets/txfn8ng4cc/1 (accessed on 1 November 2022).
20. Ma, E.P.M.; Yiu, E.M.L. Multiparametric Evaluation of Dysphonic Severity. J. Voice 2006, 20, 380–390.
21. Maryn, Y.; De Bodt, M.; Roy, N. The Acoustic Voice Quality Index: Toward Improved Treatment Outcomes Assessment in Voice Disorders. J. Commun. Disord. 2010, 43, 161–174.
22. Fröhlich, M.; Michaelis, D.; Strube, H.W.; Kruse, E. Acoustic Voice Analysis by Means of the Hoarseness Diagram. J. Speech Lang. Hear. Res. 2000, 43, 706–720.
23. Deliyski, D.D. Acoustic model and evaluation of pathological voice production. In Proceedings of the 3rd European Conference on Speech Communication and Technology (Eurospeech 1993), Berlin, Germany, 19–23 September 1993; pp. 1969–1972.
24. Gómez-Vilda, P.; Gómez-Rodellar, A.; Palacios-Alonso, D.; Rodellar-Biarge, V.; Álvarez-Marquina, A. The Role of Data Analytics in the Assessment of Pathological Speech—A Critical Appraisal. Appl. Sci. 2022, 12, 11095.
25. Cai, H.; Ternström, S. Mapping Phonation Types by Clustering Multiple Metrics. Appl. Sci. 2022, submitted.
26. Titze, I.R. Vocal Intensity in Speakers and Singers. J. Acoust. Soc. Am. 1992, 91, 2936–2946.
27. Eriksson, A.; Traunmüller, H. Perception of Vocal Effort and Distance from the Speaker on the Basis of Vowel Utterances. Percept. Psychophys. 2002, 1, 131–139.
28. Gramming, P.; Sundberg, J. Spectrum Factors Relevant to Phonetogram Measurement. J. Acoust. Soc. Am. 1988, 83, 2352–2360.
29. Švec, J.G.; Granqvist, S. Tutorial and Guidelines on Measurement of Sound Pressure Level in Voice and Speech. J. Speech Lang. Hear. Res. 2018, 61, 441–461.
30. Roubeau, B.; Henrich, N.; Castellengo, M. Laryngeal Vibratory Mechanisms: The Notion of Vocal Register Revisited. J. Voice 2009, 23, 425–438.
31. Pabon, P.; Ternström, S. Feature Maps of the Acoustic Spectrum of the Voice. J. Voice 2020, 34, 161.e1–161.e26. Available online: https://data.mendeley.com/datasets/j9t57ks24n/1 (accessed on 1 November 2022).
32. Lamesch, S.; Doval, B.; Castellengo, M. Toward a More Informative Voice Range Profile: The Role of Laryngeal Vibratory Mechanisms on Vowels Dynamic Range. J. Voice 2012, 26, 672.e9–672.e18.
33. Zhang, Z. Mechanics of Human Voice Production and Control. J. Acoust. Soc. Am. 2016, 140, 2614–2635.
34. Chen, C.J. Elements of Human Voice; World Scientific Publishing: Singapore, 2016; 214p, ISBN 9789814733892.
35. Titze, I.R.; Lucero, J.C. Voice Simulation: The Next Generation. Appl. Sci. 2022, submitted.
36. Brown, W.S.; Murry, T.; Hughes, D. Comfortable Effort Level: An Experimental Variable. J. Acoust. Soc. Am. 1976, 60, 696–699.
37. Brown, W.S.; Morris, R.J.; Murry, T. Comfortable Effort Level Revisited. J. Voice 1996, 10, 299–305.
38. Awan, S.N.; Novaleski, C.K.; Yingling, J.R. Test-Retest Reliability for Aerodynamic Measures of Voice. J. Voice 2013, 27, 674–684.
39. Awan, S.N.; Giovinco, A.; Owens, J. Effects of Vocal Intensity and Vowel Type on Cepstral Analysis of Voice. J. Voice 2012, 26, 670.e15–670.e20.
40. Murry, T.; Brown, W.S.; Morris, R.J. Patterns of Fundamental Frequency for Three Types of Voice Samples. J. Voice 1995, 9, 282–289.
41. Iob, N.A.; He, L.; Ternström, S.; Cai, H.; Mehta, D.; Brockmann-Bauser, M. Effects of Speech Characteristics on Electroglottographic and Acoustic Voice Analysis Parameters in Women with Structural Dysphonia Before and After Treatment. University of Zürich: Zürich, Switzerland, 2022; manuscript in preparation.
42. Fitch, J.L. Consistency of Fundamental Frequency and Perturbation in Repeated Phonations of Sustained Vowels, Reading, and Connected Speech. J. Speech Hear. Disord. 1990, 55, 360–363.
43. Murton, O.; Hillman, R.; Mehta, D. Cepstral Peak Prominence Values for Clinical Voice Evaluation. Am. J. Speech Lang. Pathol. 2020, 29, 1596–1607.
44. Lã, F.M.B.; Ternström, S. Flow Ball-Assisted Voice Training: Immediate Effects on Vocal Fold Contacting. Biomed. Signal Process. Control 2020, 62, 102064.
45. Köpp, W.; Weinkauf, T. Temporal Merge Tree Maps: A Topology-Based Static Visualization for Temporal Scalar Data. IEEE Trans. Vis. Comput. Graph. 2022, 1–11.
46. Ternström, S.; Johansson, D.; Selamtzis, A. FonaDyn—A System for Real-Time Analysis of the Electroglottogram, over the Voice Range. SoftwareX 2018, 7, 74–80. Available online: https://www.kth.se/profile/stern/ (accessed on 1 November 2022).
47. Sand, A. Inferential Statistics Is an Unfit Tool for Interpreting Data. Appl. Sci. 2022, 12, 7691.
48. Pabon, P.; Stallinga, R.; Södersten, M.; Ternström, S. Effects on Vocal Range and Voice Quality of Singing Voice Training: The Classically Trained Female Voice. J. Voice 2014, 28, 36–51.
49. Holmberg, E.; Ihre, E.; Södersten, M. Phonetograms as a Tool in the Voice Clinic: Changes across Voice Therapy for Patients with Vocal Fatigue. Logop. Phoniatr. Vocol. 2007, 32, 113–127.
50. Mürbe, D.; Sundberg, J.; Iwarsson, J.; Pabst, F.; Hofmann, G. Longitudinal Study of Solo Singer Education Effects on Maximum SPL and Level in the Singers’ Formant Range. Logop. Phoniatr. Vocol. 1999, 24, 178–186.
51. Lã, F.M.B.; Fiuza, M.B. Real-Time Visual Feedback in Singing Pedagogy: Current Trends and Future Directions. Appl. Sci. 2022, 12, 10781.
52. Titze, I.R.; Palaparthi, A. Vocal Loudness Variation with Spectral Slope. J. Speech Lang. Hear. Res. 2020, 63, 74–82.
53. Childers, D.G.; Lee, C.K. Vocal quality factors: Analysis, synthesis, and perception. J. Acoust. Soc. Am. 1991, 90, 2394–2410.
54. Brockmann-Bauser, M.; de Paula Soares, M.F. Do we get what we need from clinical acoustic voice measurements? Appl. Sci. 2022, submitted.
55. Johansson, D. Real-Time Analysis, in SuperCollider, of Spectral Features of Electroglottographic Signals. Master’s Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2016. Available online: https://www.diva-portal.org/smash/get/diva2:945805/FULLTEXT01.pdf (accessed on 1 November 2022).

Figure 1. A healthy male amateur choir singer sang repeated messa di voce exercises (soft-loud-soft on constant pitch) over a limited range of one octave, and four voice metrics were mapped. (a) EGG contact quotient Q_ci [15], (b) normalized peak EGG derivative Q_Δ, (c) EGG cycle-rate sample entropy, and (d) audio spectrum balance. The resulting images can be thought of as four different “layers” of one composite voice map. This voice map shows averaged data from over 60,000 phonatory cycles and took about six minutes to record. For these graphs, a minimum of five phonatory cycles was required in each cell. For semitones, the MIDI scale is used, in which middle C4 (261 Hz) is assigned the scale number 60. (Previously unpublished data, collected for [14].)

Figure 2. Voice maps of a given task tend to be similar within persons, but can be very different from one individual to another. As an example, these maps illustrate the variation with SPL, f_o, and individual in the EGG quotient of contact by integration Q_ci [15], over the habitual speech range. Four vocally healthy adult males (columns) performed three consecutive readings (rows) of the full Rainbow Passage. Each map results from about two minutes of reading, in a quiet recording booth. The Q_ci was measured for every phonatory cycle above a regularity threshold and averaged within each cell of 1 dB and 1 semitone on the map, in real time. The per-cell average Q_ci is colour-coded as in the horizontal bars (Previously unpublished data, collected for [18], where the acquisition is described in detail).

Figure 3. Means and standard deviations in SPL, by elicitation of sustained /a/ vowels in two cohorts. Round markers: cohort 1, male (N = 42) and female (N = 50) healthy young speakers, from [40]. Square markers: cohort 2, female voice patients (N = 85) at baseline prior to intervention [16]. The mean SPL of elicited productions differed by 6–8 dB per level, with a standard deviation of ±5–7 dB.

Figure 4. Correlation plots of sustained vowel /a/ productions pre- and post-intervention (N = 58) for the same elicitations; (left panel) SPL, (right panel) f_o, both averaged over the central part of the vowel. Each participant is represented by three markers ■▲●.

Figure 5. Examples of maps across an intervention. A patient read for two minutes from The North Wind and the Sun, in Swedish, before and after feminization therapy. Top row: the density, i.e., the number of cycles phonated (a) before therapy, (b) after therapy, and (c) the difference. Bottom row, the spectrum balance SB (level > 2 kHz – level < 1.5 kHz), (d) before therapy, (e) after therapy, and (f) the difference in the overlapping region. The ovals approximating the overlap region are all in the same place, given for visual reference only. In that region, each cell contained an average count of 22 phonatory cycles. After therapy (f), the mean SB had increased by 4 dB in the region of the voice range that was visited both pre- and post-intervention.

Figure 6. Top row: maps of the EGG contact quotient Q_ci (a) pre-intervention, (b) post-intervention and (c) of the difference, to facilitate a comparison across the intervention. The horizontal color bars show the color scale for Q_ci (left, center) or for the change in Q_ci (right). In (c), the color red signifies a reduction in Q_ci after the intervention. Bottom row: the same data, after 2D interpolation and smoothing with a convolution kernel of 3 × 3 cells. Interpolation into empty cells is done only if there are non-empty neighbouring cells on opposing sides or corners. Although this does not add more information, the graph is easier to interpret, and the difference map (f) from the two interpolated maps (d,e) is less vulnerable to missing data in either of (a) or (b). Data adapted from [44].

Figure 7. Finding the “most habitual” speech: a vocally healthy adult male read the Rainbow Passage 3 times, in a quiet environment. Panels (a,b) show the number of cycles phonated in each cell, on a log gray scale. This is the SRP, in the form of a 2D histogram. In (b), a 5-cycle minimum threshold per cell has been imposed, to suppress outliers. The black grid in (b) marks the Γ (gamma) region that delimits the central quartiles, that is, the most populated cells that contain 50% of the total cycle count. Panel (c) shows a voice map of the same person’s full voice range on the vowel /a:/ (note the change of scale), with Γ overlaid, and colour representing the metric Q_Δ (normalized peak dEGG). Here, the full green colour represents absence or near-absence of vocal fold contacting. We note in passing how Γ, the “core” of the habitual speech, lies just above the green area. Presentation adapted from [18] (Supplement 1).

Figure 8. The descriptive statistics of the EGG contact quotient Q_ci over Γ for group A (adults) and group C (children), taken from sustained /a:/ productions within each participant’s Γ only. Filled circles are the means, vertical bars are the standard deviations. (Previously unpublished data, collected for [18].)

Figure 9. Left: Monte Carlo simulation of one thousand possible outcomes of randomly sampling two groups, of which the individual participant distributions are known, assuming normality (from Figure 8). To the right, the same data, but sorted by increasing p. On the left-hand vertical axis is the p value for the group difference observed for each draw; on the right-hand vertical axis are the sample averages (A) and (C) of groups A and C. The solid horizontal line indicates p = 0.05, while the dashed one indicates the p = 0.002 obtained when using the participant means instead of a random sample from each participant’s distribution.

Figure 10. Gradients and other group-level statistics of three EGG metrics Q_ci, Q_∆ and CSE, from VRPs on sustained /a:/, but evaluated only for the small Γ region that was derived from connected speech. Three groups of vocally healthy participants are represented: ■ men, ● women, ◊ children. The marker locations represent the group means. Each sloping line represents the gradient across Γ, averaged within the group. (a–c) gradients versus SPL (C-weighted, re. 20 µPa @ 30 cm); (d–f) gradients versus f_o (semitones re. 110 Hz). (Previously unpublished data, collected for [18].)

Figure 11. A comparison of variance explained by linear trends over Γ, for running speech (horizontal) and single vowel /a/ (vertical). Each marker represents a pair of group averages for one of the three participant groups in Figure 10. The aim of the figure is to show that for all the four EGG metrics shown, trends explain a much greater proportion of the total variance when producing only the /a/ vowel than in running speech. (Q_si is a metric of EGG pulse asymmetry.) (Previously unpublished data, collected for [18].)

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

2.1. The Independent Variables: SPL and f_o