1. Introduction
The concept of phonemes is well developed in speech recognition and derives from a definition in phonetics as “the smallest sound one can articulate” [
1]. Phonemes are analogous to atoms—they are the building blocks of speech. While they are an approximation, in practice that approximation has been remarkably robust [
2]. Not only are phonemes used by linguists and audiologists to describe speech, they are widely used in large-vocabulary speech recognition as the acoustic classes, or ‘units’, to be recognized [
2,
3,
4]. Sequences of unit estimates can be strung together to infer words and sentences.
Comprehending visual speech, or lipreading, is much less well developed [
5]. The units considered to be equivalent to phonemes are called visemes [
6] but, even in English, there is no clear agreement on the visemes [
7], and in [
8] for example, it is noted that there are at least 120 proposed viseme sets. This large number arises because some authors take vowels [
9], and others consonants [
10], but also because, of the proposed sets, some are derived from linguistic principles [
11,
12], some are the results of human lipreading experiments [
13,
14], others are data-derived [
8,
15], and others still are hybrids of these approaches [
16].
Despite the challenges, a number of lipreading systems have been built using visemes ([
17,
18] for example). When building a viseme recognizer a complication is that multiple phonemes will map onto a single viseme [
8]. A common example is the /p/, /b/, and /m/ bilabial sounds, which are often grouped into one viseme [
19,
20,
21]. Attempts to draw mappings between the phonemes and visemes have been tested [
8,
22] but, to date, these mappings have not been shown to improve machine lipreading significantly.
On the other hand, there is an emerging body of work [
23,
24] that, despite the caveats above, is demonstrating that phoneme lipreading systems can outperform viseme recognizers. In essence it is a tradeoff: does one use viseme units which are tuned to the shape of the lips but suffer with inaccuracies caused by visual confusions between words that sound different but look identical [
23]; or does one stick to phonetic units knowing that many of the phonemes are difficult to distinguish on the lips?
These visual confusions are called homophenes [
25]. We demonstrate the homophenous word difficulty with some examples in
Table 1 from [
23]. In this example, Jeffers visemes [
26] have been used to translate the phonemes into viseme strings.
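The translation from phonemes to viseme strings, and the homophenes it creates, can be illustrated with a minimal sketch. The phoneme-to-viseme map and the pronunciations below are hypothetical toy values chosen for illustration only; they are not the Jeffers visemes or the entries of Table 1.

```python
from collections import defaultdict

# Hypothetical toy phoneme-to-viseme (P2V) map; NOT the Jeffers set.
P2V = {
    "p": "V1", "b": "V1", "m": "V1",  # bilabials share one viseme
    "ae": "V2",
    "t": "V3", "d": "V3",
}

# Hypothetical toy pronunciation dictionary.
PRONUNCIATIONS = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "tap": ["t", "ae", "p"],
}


def to_visemes(phonemes, p2v):
    """Translate a phoneme sequence into its viseme sequence."""
    return tuple(p2v[p] for p in phonemes)


def find_homophenes(pronunciations, p2v):
    """Group words that share a viseme string; keep groups of two or more."""
    groups = defaultdict(list)
    for word, phonemes in pronunciations.items():
        groups[to_visemes(phonemes, p2v)].append(word)
    return {v: sorted(w) for v, w in groups.items() if len(w) > 1}


homophenes = find_homophenes(PRONUNCIATIONS, P2V)
```

In this toy example, four words that sound different collapse onto the same viseme string, while 'tap' remains distinct because the order of its visemes differs.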
However, as we shall show in this paper, it need not be an either/or approach to phonemes or visemes; we develop a novel method that allows us to vary the number of classes/visual units. This means we can tune the visual units as an intermediary state between the visual and audio spaces and we can also optimize against the competing trends of homopheneiosity [
27,
28] and accuracy [
29]. Thus, in this work, we use the term visemes for the traditional visemes, and the term visual units for our new intermediary units which we propose will improve phoneme classifiers.
Our motivation is that lipreading is a difficult speech classification challenge. Speech signals are bimodal (that is, they have two channels of information, audio and visual) and significant prior work uses both. For example, [
30] uses audio-visual speech recognition to demonstrate cross-modality learning. In our case, however, lipreading classifies speech from the visual information channel alone, which is useful for understanding speech when the audio is too noisy to recognize easily. Thus, as we shall present, we use a novel training method which uses new visual units and phonemes in a complementary fashion.
This paper is an extended version of our prior work [
5,
31]. This work is relevant to all classifiers since the choice of visual unit matters and is made before the classifier is trained. In other words, the choice of visual units must be made early in the design process and a non-optimal choice can be very expensive in terms of performance.
The rest of this paper is structured as follows. In a background section, we summarize prior viseme research for lipreading by both humans and machines, and describe the state-of-the-art approaches for lipreading systems. Then we present an experiment in which we demonstrate how we can find the optimal number of visual units within a set; this is an essential preliminary test to define the scope of the second task. We present the data for all experiments within this section. The preliminary test includes phoneme classification and clustering for new visual unit generation before analyzing the results to find the optimal visual unit sets.
These optimal visual unit sets are used to test our novel method for training phoneme-labeled classifiers by using these sets as an initialization stage in the training phase of a conventional lipreading system. As part of this second task, we also present a side task of deducing the right units for lipreading language models used in the lipreading system. Finally, we present the results of the new training method and draw conclusions before suggesting future work. Thus, we have three main contributions:
a method for finding optimal visual units,
a review of language model units for lipreading systems,
a new training paradigm for lipreading systems.
2. Background
Table 2 summarizes the most common viseme sets in the literature used for both human and machine lipreading. The range of set sizes is from four (Woodward [
12]) to 21 (Nichie [
21]). Note that not all viseme sets represent the same number of phonemes. Furthermore, some of these use American English and others British English, so there are minor variations in the phoneme sets. (American English phonemes tend to use diacritics [
32].)
Lipreading systems can be built with a range of architectures. Conventional systems are adopted from acoustic methods, often using Hidden Markov Models (HMMs), for example as in [
36]. More modern systems exploit deep learning methods [
37,
38]. Deep learning has been deployed in two configurations: (i) as a replacement for the Gaussian mixture model (GMM) in the HMM and (ii) in a configuration known as end-to-end learning.
However, the high-level architectures have similarities: first the face of the speaker must be tracked or located; then some form of features are extracted; then a classification model is trained and tested on unseen data, optionally using a language model to improve the classification output (e.g., [
39]). Throughout this process one must translate between the words spoken (and captured in the training videos), to their phonetic pronunciation, to their visual representation on the lips, and back again for a useful transcript.
3. Finding a Robust Range of Intermediate Visual Units
In our first example we use the RMAV dataset [
40] and the BEEP pronunciation dictionary [
41].
Figure 1 shows a high-level overview of the first task. We begin with classification using phoneme-labeled classifiers. The output of this task is a set of speaker-dependent confusion matrices. The data in these are used to cluster together single phonemes (monophones) into subgroups of visual units, based upon confusions.
However, in contrast to the approach in [
8], we implement an alternative phoneme clustering process (described in detail in
Section 4). The key difference between the ad-hoc viseme choices compared in [
8] and our new clustering approach is our ability to choose the number of visual units, whereas in prior viseme sets this number is fixed.
With our new algorithm, we create a new phoneme-to-viseme (P2V) mapping every time a pair of classes is re-classified into a new class, thus reducing the number of classes in a set by one each time. In the phonetic transcripts of our 12 speakers, there is a maximum of 45 phonemes, therefore we can create at most 45 P2V maps for each speaker. We note that the real number of maps we can derive depends upon the number of phonemes classified during step one of
Figure 1. During this preliminary phoneme classification, should a phoneme not be classified, either incorrectly or correctly, then it is an omission in the confusion matrix from which our visual units are created. Thus, we have
up to 45 sets of visual unit labels per speaker with which to label our classifiers.
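The idea of deriving one P2V map per merge, so that each map has one class fewer than the last, can be sketched as follows. This is only an illustration: the merge criterion used here (greedily merging the two classes with the highest mutual confusion count) is an assumption made for the sketch and is not necessarily the clustering process the paper describes in Section 4. The function names and the toy confusion counts are hypothetical.

```python
from itertools import combinations


def mutual_confusion(class_a, class_b, confusion):
    """Sum confusion counts in both directions between two phoneme classes."""
    return sum(
        confusion.get((p, q), 0) + confusion.get((q, p), 0)
        for p in class_a
        for q in class_b
    )


def merge_most_confused(classes, confusion):
    """Merge the pair of classes with the highest mutual confusion (assumed criterion)."""
    a, b = max(
        combinations(classes, 2),
        key=lambda pair: mutual_confusion(pair[0], pair[1], confusion),
    )
    return [c for c in classes if c != a and c != b] + [a | b]


def build_p2v_maps(phonemes, confusion):
    """Return one class partition per set size, from len(phonemes) down to 1."""
    classes = [frozenset([p]) for p in phonemes]
    maps = [classes]
    while len(classes) > 1:
        classes = merge_most_confused(classes, confusion)
        maps.append(classes)
    return maps


# Hypothetical toy confusion counts keyed by (reference, recognized) phoneme.
toy_confusion = {("p", "b"): 5, ("b", "p"): 4, ("t", "d"): 3, ("p", "m"): 2, ("m", "t"): 1}
maps = build_p2v_maps(["p", "b", "m", "t", "d"], toy_confusion)
```

Starting from N classified phonemes, the loop yields N partitions (sizes N, N-1, ..., 1), which mirrors the statement that at most 45 maps can be derived from 45 phonemes. A phoneme absent from the confusion matrix would simply never enter `phonemes`, reducing the number of maps.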
There is the option to measure performance using phoneme, viseme, or word error. Here we choose word error [
42] for two reasons: viseme error varies as the number of visemes varies, which leads to unfair comparisons, and phoneme error is further from what we believe to be of interest to users, which is transcript error.
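As a point of reference, word-level scores of this kind are commonly computed from alignment counts. The sketch below assumes the definitions used by the HTK toolkit's scoring tool, where correctness ignores insertions and accuracy penalizes them; whether the scores in this paper follow exactly these definitions is an assumption, and the function names are illustrative.

```python
def word_correctness(n_ref, deletions, substitutions):
    """Correctness = (N - D - S) / N, assuming HTK-style scoring."""
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    return (n_ref - deletions - substitutions) / n_ref


def word_accuracy(n_ref, deletions, substitutions, insertions):
    """Accuracy = (N - D - S - I) / N, assuming HTK-style scoring."""
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    return (n_ref - deletions - substitutions - insertions) / n_ref


# Hypothetical counts: 100 reference words, 10 deleted, 30 substituted, 5 inserted.
example_correctness = word_correctness(100, 10, 30)
example_accuracy = word_accuracy(100, 10, 30, 5)
```

Because both scores are normalized by the number of reference words rather than by the number of classes, they remain comparable across unit sets of different sizes.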
3.1. Data
The RMAV dataset (formerly known as LiLIR) consists of 20 British English speakers (we use the 12 speakers who had tracked features available; seven male and five female) and up to 200 utterances per speaker of the Resource Management (RM) sentences which totals between 1362 and 1802 words each. The sentences selected for the RMAV speakers are a subset of the full RM dataset [
43] transcripts. They were selected to maintain as much coverage of all phonemes as possible, as shown in
Figure 2, and to be realistic of English conversation [
40]. The original videos were recorded in high definition (
) and in a full-frontal position at 25 frames per second. Individual speakers are tracked using Linear Predictors [
44], and Active Appearance Model [
45] features of concatenated shape and appearance information have been extracted.
3.2. Linear Predictor Tracking
Linear Predictors (LP) are a person-specific and data-driven facial tracking method. Devised primarily for observing visual changes in the face during speech, these make it possible to cope with facial feature configurations not present in the training data by treating each feature independently.
The linear predictor is the central point around which support pixels are used to identify the change in position of the central point over time. The central point is observed as a landmark on the outline of a feature. In this method both the shape (comprised of landmarks) and the pixel information surrounding the linear predictor position are intrinsically linked. Linear predictors have been successfully used to track objects in motion, for example [
46].
3.3. Active Appearance Model Features
AAM features [
45] of concatenated shape and appearance information have been extracted. We track using a full-face model (
Figure 3(left)) but the final features are reduced to information from the lip area alone (
Figure 3(right)). Shape features (1) are based solely upon the lip shape and positioning while the speaker is speaking. The landmark positions can be compactly represented using a linear model of the form:
$$s = s_0 + \sum_{i=1}^{m} s_i p_i \qquad (1)$$
where $s_0$ is the mean shape, $s_i$ are the modes and $p_i$ are the shape parameters. The appearance features are computed over pixels, the original images having been warped to the mean shape. So $A_0(x)$ is the mean appearance and appearance is described as a sum over modal appearances, weighted by appearance parameters $\lambda_i$:
$$A(x) = A_0(x) + \sum_{i=1}^{l} \lambda_i A_i(x) \qquad (2)$$
Combined features are the concatenation of shape and appearance after PCA has been applied to each independently. The AAM parameters for each speaker are in
Table 3 (MATLAB files containing the extracted features can be downloaded from
http://zenodo.org/record/2576567).
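The linear model used for both shape and appearance (a mean plus a weighted sum of modes) can be sketched in a few lines. This is a minimal illustration, assuming orthonormal modes such as those produced by PCA; the function names and the toy four-dimensional "shape" are hypothetical and unrelated to the released MATLAB features.

```python
def synthesize(mean, modes, params):
    """Return mean + sum_i params[i] * modes[i], computed element-wise."""
    out = list(mean)
    for weight, mode in zip(params, modes):
        for k, value in enumerate(mode):
            out[k] += weight * value
    return out


def project(sample, mean, modes):
    """Recover the parameters of a sample, assuming the modes are orthonormal."""
    centred = [x - m for x, m in zip(sample, mean)]
    return [sum(a * b for a, b in zip(mode, centred)) for mode in modes]


# Hypothetical toy model: two landmarks stored as (x1, y1, x2, y2).
mean_shape = [0.0, 0.0, 1.0, 0.0]
shape_modes = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # orthonormal by construction

shape = synthesize(mean_shape, shape_modes, [0.5, -0.25])
recovered = project(shape, mean_shape, shape_modes)
```

The recovered parameter vector is what serves as the feature: a handful of mode weights per frame rather than the full set of landmark coordinates or pixels.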
5. Optimal Visual Unit Set Sizes
It is important in this case to weight the chance of guessing by visual homophenes as these vary with the size of the visual unit set. Visual unit sets which contain fewer visual units produce sequences of visual units which represent more than one word. These are homophenes. The effect of homophenes can be seen on the left side of
Figure 5 and the graphs in
Appendix A: in visual unit sets with fewer than 11 visual units, homophenes become noticeable and the language model can no longer correct these confusions.
An example of a homophene in the RMAV data is the pair of words ‘tonnes’ and ‘since’. If one uses Speaker 1’s 10-visual unit P2V map, both words transcribe into visual units as ‘
’. In practice a language model, or word lattice, will tend to reduce such confusions since the lattice models the probability of word
N-grams which means that probable combinations such as “metric tonnes” will be favored over “metric since” [
23].
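How a bigram model resolves such a confusion can be shown with a minimal sketch. The training sentences below are hypothetical and the selection rule (pick the candidate with the highest bigram count after the previous word) is a simplification of what a real word lattice decoder does.

```python
from collections import Counter

# Hypothetical training sentences; not taken from the RMAV transcripts.
TRAINING_SENTENCES = [
    "ten metric tonnes",
    "metric tonnes of cargo",
    "since last week",
]


def bigram_counts(sentences):
    """Count adjacent word pairs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_candidate(previous_word, candidates, counts):
    """Choose the homophenous candidate seen most often after previous_word."""
    return max(candidates, key=lambda word: counts[(previous_word, word)])


counts = bigram_counts(TRAINING_SENTENCES)
best = pick_candidate("metric", ["since", "tonnes"], counts)
```

The visual evidence for the two candidates is identical, so the decision rests entirely on the language model. With very small visual unit sets the number of competing candidates per viseme string grows, and this kind of correction becomes less reliable.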
We see all our word correctness scores are significantly above guessing, albeit still low. There is variation between speakers, but there is a clear overall trend. Superior performance is to be found with larger numbers of visual units. An important point is that some authors report viseme accuracy instead of word correctness [
42]. This is unhelpful as it masks the effect of homophenous words on performance. Had we reported this then the positive effect of larger visual unit sets would not be visible.
In
Figure 5 we highlight in red the class sets which, for any speaker, have shown a significant classification improvement (with non-overlapping error bars) over the adjacent set of units on its right side along the
x-axis. Error bars overlap once the correctness is averaged so
Table 5 lists these combinations for each speaker. These red points show where we can identify the pairs of classes which, when merged into one class, significantly improve classification. If we refer to speaker demographic factors such as gender or age, we find no apparent pattern through these visual unit combinations. So, we have further evidence to reinforce the idea that all speakers have a unique visual speech signal [
52]. In [
53] this is suggested to be due to how the trajectory between visual units varies by speaker, because of such things as rate of speech [
54]. This illustrates how difficult it can be to find a set of cross-speaker visual units when phonemes need alternative groupings for each individual [
27].
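The significance criterion used here, non-overlapping one-standard-error bars between adjacent set sizes, can be sketched as below. The per-fold scores are hypothetical values for illustration, and the function names are not from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(fold_scores):
    """Return the mean and one standard error of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores) / sqrt(len(fold_scores))


def significantly_better(scores_a, scores_b):
    """True if A's lower error bar lies above B's upper error bar."""
    mean_a, se_a = mean_and_standard_error(scores_a)
    mean_b, se_b = mean_and_standard_error(scores_b)
    return mean_a - se_a > mean_b + se_b


# Hypothetical word correctness over ten folds for two adjacent set sizes.
merged_set = [0.10, 0.11, 0.12, 0.11, 0.10, 0.12, 0.11, 0.10, 0.12, 0.11]
unmerged_set = [0.07, 0.08, 0.07, 0.08, 0.07, 0.08, 0.07, 0.08, 0.07, 0.08]

improved = significantly_better(merged_set, unmerged_set)
```

Applying this test to every adjacent pair of set sizes for one speaker would flag the merges worth highlighting. Averaging over speakers first widens the error bars, which is consistent with the observation that the bars overlap once correctness is averaged.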
6. Discussion
In
Figure 5 we have plotted mean word correctness,
C, over all 12 speakers and weighted guessing (
) in green. Here we see that, within one standard error, there is a monotonic trend. Small numbers of units perform worse than phonemes, which supports the claim that phonemes are preferred to visemes. However, it would be an oversimplification to assert that higher-accuracy lipreading can be achieved with phonemes, as our results do not show this with significance. Rather, we say that, generally, visual unit sets with higher numbers of visual unit classes outperform the smaller sets. In [
8] the authors reviewed 120 previous phoneme-to-viseme (P2V) maps; typically these consist of between 10 and 35 visual units [
55]. For example, the Lee set consists of six consonant visemes and five vowel visemes [
15], and Jeffers [
26] groups phonemes into eight vowel and three consonant visemes.
In
Figure A1,
Figure A2,
Figure A3,
Figure A4,
Figure A5,
Figure A6 and
Figure 5 we see a definite, rapid decrease in lipreading word correctness for viseme sets containing fewer than ten visemes. More positively, the region of viseme set sizes between 11 and 20 contains the optimum viseme set for three out of the 12 speakers, which is more than random chance. This means that, for each speaker, we have found and presented an optimal number of visual units (shown by the best performing results in
Figure A1,
Figure A2,
Figure A3,
Figure A4,
Figure A5 and
Figure A6) but the optimal number is not related to any of the conventional viseme definitions, nor is it consistent across speakers.
Table 6 shows the word correctness,
C, of each speaker’s phoneme classification.
7. Hierarchical Training for Weak-Learned Visual Units
Figure 5 showed our first results derived using an adapted version of the algorithm described in [
55].
Table 5 also shows us, for each of our 12 speakers, the significantly improving visual unit sets. These sets are those where one single change of visual unit grouping has resulted in a significant (greater than one standard error over ten folds) increase in word correctness. This tells us that there are some units, between the traditional visemes (for example [
13,
20,
21]) and phonemes, which are better for visual speech classification.
Table 5 ([
31]) shows us several significantly improving sets. We suggest these are interesting because of the tradeoff of homophenes against accuracy. It is possible these are the groupings where accuracy improves significantly despite the extra homophenes created as the number of visual units in the set decreases. Either the increase in homophenes is negligible or the number of training samples for two visually indistinguishable classes significantly increases when they are combined.
We propose a novel idea: to implement hierarchical classifier training using both visual units and phonemes in sequence. Some work in acoustic speech recognition has used this layered approach to model building with success, e.g., [
56]. Our intention is to use our new range of visemes to test whether our new training algorithm can improve phoneme classification without the need for more training data, as this approach shares training data across models. This premise avoids the negative effects of introducing more homophenes because the second layer of training discriminates between the sub-units within the first layer. This will assist the identification of the more subtle but important differences in visual gestures representing alternative phonemes. We note from [
22] that using the wrong clusters of phonemes is worse than using none, and also that this new approach aims to optimize performance within the scope of the datasets and system effects described previously in
Section 5 and
Section 6.
A bonus of our revised classification scheme is that, because we weakly train the classifier before phoneme training, we remove the need to consider post-processing methods (e.g., weighted finite state transducers [
24]) to reverse the P2V mapping in order to decode the real phoneme recognized.
In
Figure 5, the performance of classifiers with small numbers of visual units (fewer than 10) is poor. As described previously, we attributed this to the large number of homophenes. At the other side of our figure, sets containing large numbers of visual units (greater than 35) do not significantly, or even noticeably, improve the correctness. This is where many phonetic variations are visually indistinguishable on the lips. Also taking into account the set numbers printed in black (which are the significantly improving visual unit sets) we focus on sets of visual units in the size range 11 to 35 with the same 12 RMAV speakers for our experiments using hierarchical training of phoneme classifiers.
Here, we use our knowledge of visual speech to drive our novel redesign of the conventional training method.
Figure 6 shows how we apply this knowledge earlier in the process. The top of
Figure 6, in black boxes, shows the steps of a lipreading system, divided into phases where the units change from words, to phonemes, to visual units (where used). Flow 1 shows how we translate the word ground truth into phonemes using a pronunciation dictionary (e.g., [
41] or [
57]) for labeling the classifiers, before decoding with a word language model. Flow 2, below this, uses visual units instead. The variation in Flow 2 shows that we translate from visual-unit-trained classifiers back into words using the word network. Finally, Flow 3 shows our new approach, where we introduce an extra step into the training phase: classifiers are initialized as visual unit classifiers and then retrained into phoneme classifiers before word decoding. We describe this new process in detail now.
8. Classifier Adaptation Training
The basis of our new training algorithm is a hierarchical structure with the first level based on visual units, and the second level based on phonemes. In
Figure 7 we present an illustration based on a simple example using five phonemes (in reality there are up to 45 in the RMAV sentences) mapped to two visual units (in reality there will be between 11 and 35 as we have refined our experiment to only use sets of visual units in the optimal size range from the preliminary test results). Each phoneme is mapped to a visual unit as in [
5]; our example map is in
Table 7. The difference is that we now learn intermediate visual-unit-labeled HMMs before we create phoneme models.
In this example , and are associated with , so are initialized as duplicate copies of HMM . Likewise, phoneme models labeled and are initialized as replicas of . We now retrain the phoneme models using the same training data.
In full for each set of visual units of sizes from 11 to 35:
The big advantage of this approach is that the phoneme classifiers have seen mostly positive cases and therefore have good mode matching; the disadvantage is that they are limited in their exposure to negative cases, less so than the visual units.
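The initialization step of this hierarchical scheme, where each phoneme model starts as a copy of its parent visual unit model, can be sketched as follows. Everything here is a simplified stand-in: the models are plain dictionaries rather than HMMs, the unit names `p1` to `p5`, `v1` and `v2` and the parameter values are hypothetical, and the retraining step is represented by a single manual update rather than real re-estimation.

```python
import copy


def init_phoneme_models(visual_unit_models, p2v):
    """Give each phoneme an independent copy of its parent visual unit model."""
    return {
        phoneme: copy.deepcopy(visual_unit_models[unit])
        for phoneme, unit in p2v.items()
    }


# Hypothetical toy map: five phonemes onto two visual units.
P2V = {"p1": "v1", "p2": "v1", "p3": "v1", "p4": "v2", "p5": "v2"}

# Hypothetical stand-ins for trained visual unit models.
VISUAL_UNIT_MODELS = {
    "v1": {"means": [0.2, 0.4, 0.1]},
    "v2": {"means": [0.7, 0.3, 0.9]},
}

phoneme_models = init_phoneme_models(VISUAL_UNIT_MODELS, P2V)

# Stand-in for retraining p1 on its own phoneme-labeled samples.
phoneme_models["p1"]["means"][0] = 0.25
```

Deep copies matter in this sketch: sibling phoneme models must be free to diverge from each other and from their parent during retraining, which is how the second pass learns the distinctions that the visual unit grouping had merged.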
9. Language Network Units
Step five in our novel hierarchical training method requires a language network. It has been consistently observed that language models are very powerful in lipreading systems (e.g., in [
59]). Language models built upon the ground truth utterances of datasets learn grammar and structure rules of words and sentences (the latter in the case of continuous speech). However, visual co-articulation effects damage the performance of visual speech language models because, visually, people do not say what the language model expects. These types of network are commonplace. We note that higher-order
N-gram language models may improve classification rates, but the cost of such models is disproportionate to our goal of developing more accurate classifiers. Therefore, to decide which unit would best optimize our language model, we test three units (visemes, phonemes, and words) as bigram models in a second preliminary test.
In the first two columns of
Table 8 we list the possible pairs of classifier units and language model units. For each of these pairs we use the common process previously described for lipreading in HTK, where our phonemes are based on the International Phonetic Alphabet [
1], and our visemes are Bear’s speaker-dependent visemes [
8]. Word labels are from the RMAV dataset. We define
classifier units as the labels used to identify individual classification models and
language units as the label scheme used for building the decoding network used post classification.
Language Network Unit Analysis
In
Table 8 column four we have listed one standard error values for these tests. The phoneme units are the most robust. In
Figure 8 we have plotted word correctness (
x-axis) for each speaker along the
y-axis over three figures, one figure per language network unit. The viseme network is top, phoneme network middle, and word network at the bottom. The viseme network is the lowest performing score (
). On the face of it, the idea of viseme classifiers is a good one because they take visual co-articulation into account to some extent. However, as seen here, a language model of visemes is too complex because of homophenes. This leaves us with a choice of either phoneme or word units for our language model in step five of our new hierarchical training method.
In
Figure 8(middle) we have our phoneme language network performance with both viseme and phoneme trained classifiers. This is more exciting because for all speakers we see a statistically significant increase in
compared to the viseme network scores in
Figure 8(top). Looking more closely between speakers we see that for four speakers (2, 9, 10 and 12), the viseme classifiers outperform the phonemes, yet for all other speakers there is no significant difference between the two. On average they are identical with an all-speaker mean
of
compared to the viseme classifiers (
Table 8, column 3).
In
Figure 8(bottom) we show our
for all speakers with a word network paired with classifiers built on viseme, phoneme, and word units. Our first observation is that word classifiers perform very poorly. We attribute this to a low number of training samples per class due to the extra number of classes in the word space compared to the number of classes in the phoneme space, so we do not continue our work with word-based classifiers. Also shown in
Figure 8(bottom) are the phoneme and viseme classifiers (in green and red respectively) with a word network. This time we see that for five of our 12 speakers (3, 5, 7, 8, and 11), the phoneme classifiers outperform the visemes, and for our remaining speakers there is no significant difference once a word network is applied.
These results tell us that for some speakers viseme classifiers with phoneme networks are a better choice whereas others are easier to lipread with phoneme classifiers with a word network. Thus, we continue our work using both phoneme and word-based language networks.
10. Effects of Training Visual Units for Phoneme Classifiers
Here we present the results of our proposed hierarchical training method (described in
Section 4) with two different language models.
Figure 9 shows the mean correctness,
C, for all 12 speakers over 10 folds. We have plotted four symbols, one for each of the pairings of our HMM unit labels and the language network unit ({visual units and phonemes, visual units and words, phonemes and phonemes, phonemes and words}). Random guessing is plotted in orange.
The
x-axis of
Figure 9 is the size of the optimal visual unit sets from
Figure 5, from 11 to 36. This is the range of optimal number of visual units where phoneme label classifiers do not improve classification. The baseline of visual unit classification with a word network from [
31] is shown in blue and is not significantly different from conventionally learned phoneme classifiers. Based on our language network study in
Section 9, it is no surprise to see that, just by using a phoneme network instead of a word network to support visual unit classification, we significantly improve our mean correctness score for all visual unit set sizes for all speakers (shown in pink). We have plotted weighted guessing in orange.
More interesting is that our new weakly trained phoneme HMMs are significantly better than the visual unit HMMs. In the first part of our work here, phoneme HMMs gave an all-speaker mean
and was not significantly different from the best visual units. Here, regardless of the size of the original visual unit set,
C is almost double. Weakly learned phoneme classifiers with a word network gain
to
in mean
C, and when these phoneme classifiers are supported with a phoneme network we see a correctness gain range from
to
. These gains are supported by the all-speaker mean minimum and maximums listed in
Table 9. These gains are computed over all the potential P2V mappings and show that it makes little difference which P2V map supplies the set of visual units used to initialize our phoneme classifiers. All results are significantly better than guessing.
In
Figure 10,
Figure 11,
Figure 12 and
Figure 13, we have plotted for each of our 12 speakers non-aggregated results showing
one standard error. While not monotonic, these graphs are much smoother than the speaker-dependent graphs shown in
Appendix A. The significant differences between visual unit set sizes (in
Figure 5) have now disappeared because the learning of differences between visual units has been incorporated into the training of phoneme classifiers, which in turn are now better trained (plotted in red and green, which improve on blue and pink respectively).
An intriguing observation is comparing the use of a phoneme network for visual units and for weakly taught phonemes. For some speakers, the weakly learned phonemes are not always as important as having the right network unit. This is seen in
Figure 10(top,bottom),
Figure 11(middle),
Figure 12(middle), and
Figure 13(bottom) for Speakers 1, 3, 5, 8, and 12. By rewatching the original videos to estimate the age of our speakers, we categorize them by eye as either an ‘older’ or ‘younger’ speaker because the exact ages were not captured during filming. The speakers with a less significant difference in the effect of hierarchical training from visual to audio units are younger. This implies that to lipread a younger person we need more support from the language model than for an older speaker. We suggest this could be because young people show more co-articulation than older people, but this requires further investigation.
11. Conclusions
We have described a method that allows us to construct any number of visual units. The presence of an optimum is a result of two competing effects on a lipreading system. In the first, as the number of visual units shrinks the number of homophenes rises and it becomes more difficult to recognize words (correctness drops). In the second, as the number of visual units rises we run out of training data to learn the subtle differences in lip-shapes (if they exist), so again, correctness drops. Thus, the optimum number of visual units lies between one and 45. In practice we see this optimum is between the number of phonemes and eight (which is the size of one of the smaller visual unit sets).
The choice of visual units in lipreading has caused some debate. Some workers use visemes (for example Fisher [
13] in which visemes are a theoretical construct representing phonemes that should look identical on the lips [
60]). Others, e.g., [
24] have noted that lipreading using phonemes can give superior performance to visemes. Here, we supply further evidence to the more nuanced hypothesis first presented in [
31], that there are intermediate units, which for convenience we call visual units, that can provide superior performance provided they are derived by an analysis of the data. A good number of visual units in a set is higher than previously thought.
We have also presented a novel learning algorithm which shows improved performance for these new data-driven visual units by using them as an intermediate step in training phoneme classifiers. The essence of our method is to retrain the visual unit models in a fashion similar to hierarchical training. This two-pass approach on the same training data has improved the training of phoneme-labeled classifiers and increased the classification performance.
We have also investigated the relationship between the classifier unit choice and the unit choice for the supporting language network. We have shown that one can choose either phoneme or word units without significantly different accuracy, but we recommend a word network as this reduces the effect of homophene error and enables unbiased comparison of classifier performance.
In future work we would seek to test whether this hierarchical training method achieves the same benefit with other classification techniques, for example Restricted Boltzmann Machines (RBMs). This is inspired by the work in [
61,
62] and other recent hybrid HMM studies such as [
63].