Phoneme-to-viseme mappings: the good, the bad, and the ugly

Helen L Bear; Richard Harvey

doi:10.1016/j.specom.2017.07.001

Bear, Helen L; Harvey, Richard. 2018-05-08.

Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is "a set of phonemes which have identical appearance on the lips". Therefore a phoneme falls into one viseme class but a viseme may represent many phonemes: a many-to-one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguity damaging to the performance of audio-visual classifiers operating on real expressive speech, there is also considerable choice between possible mappings. In this paper we explore the issue of this choice of viseme-to-phoneme map. We show that there is a definite difference in performance between viseme-to-phoneme mappings and explore why some maps appear to work better than others. We also devise a new algorithm for constructing phoneme-to-viseme mappings from labeled speech data. These new visemes, 'Bear' visemes, are shown to perform better than previously known units.

Keywords: lipreading, speaker-dependent, viseme, phoneme, resolution, speech recognition, classification, visual speech, visual units.

1. Introduction

Recognition and synthesis of expressive audio-visual speech has proven to be a most challenging problem. When comparing audio-visual speech with acoustic recognition, one can identify several sources of difficulty. Firstly, the visual component of speech brings new problems such as pose, lighting, frame rate, resolution, and so on. Secondly, old problems in acoustic recognition, such as person specificity or the optimal recognition units, appear in new ways in the visual domain. While some of these aspects have been partially studied, progress has been hampered by very small datasets.
Furthermore, reliable tracking has eluded many researchers, which in turn has led to sub-optimal feature extraction, consequent poor performance and hence incorrect conclusions about the parts of the problem that are tractable or intractable. A further challenge is the lack of consensus on the recognition units, and it is commonplace to need to compare, say, word error rates with viseme error rates computed from a different set of visemes. Our contention is that progress in expressive audio-visual speech will remain stunted while this fundamental uncertainty remains. In this paper we review the choice of visual recognition units and provide a comprehensive set of evaluations of the competing phoneme-to-viseme mappings. We give guidance on what works well and provide explanations for the differences in performance. We also devise new algorithms for selecting optimal visual units should this be desired. We should note that while this paper tends to focus on visual-only recognition, or lipreading, this aspect is by far the most challenging, so progress on lipreading can be used to provide more useful audio-visual systems.

Corresponding author: Helen L Bear. Email address: h.l.bear@uel.ac.uk (Helen L Bear). URL: https://www.uea.ac.uk/computing/people/profile/r-w-harvey (Richard Harvey). Preprint submitted to Speech Communication, Special Issue on AV expressive speech and gesture, July 2017. arXiv:1805.02934v1 [cs.CV], 8 May 2018.

The rest of this paper is structured as follows: we discuss the current restrictions on a conventional lipreading system and identify the limitation of each upon the system. We then study the current sets of published visemes, before presenting a new speaker-dependent clustering algorithm for creating sets of visemes for individual speakers. We show that creating these speaker-dependent visemes follows from simple clustering and merge algorithms.
These new visemes are tested on both isolated words and continuous speech datasets before we evaluate the efficacy of the improved performance against the extra investment into a new lipreading system. Since it is computationally simple to develop these speaker-dependent visemes, we contend they are also a useful step in the analysis of speaker variability, which itself is one of the more challenging problems in general lipreading.

2. Limitations in lipreading systems

It is often said that lipreading is difficult because not all sounds appear on the lips. This is true, but in reality there are a number of problems that can corrupt the lipreading signal even before one reaches the problem of trying to decode the visual signal. Table 1 provides a taxonomy of the challenges in lipreading. Some of them relate to the problems of extracting useful information from the visual signal, whereas some appear later in the signal processing chain and relate to the coding and classification of the visual signal.

Motion is an important part of almost all realistic settings. It is therefore essential to have either some form of tracking or to devise features that are invariant to non-informational motions. An early dataset which captured speaker motion (not camera motion) is CUAVE [37]. Lipreading experiments on this dataset such as [38] examine two different features, one based on the Discrete Cosine Transform (DCT) and another on the Active Appearance Model (AAM). The AAM (which can be a shape-only, appearance-only, or shape and appearance model) [4] is sometimes preceded by Linear Predictors (LP) [2]. An AAM [4] is a model trained on a combination of shape and/or appearance information from a subset of video frames. The model is usually built from video frames manually labeled with landmarks, which are chosen to cover the full range of motion throughout the video. The authors of [38] prefer the DCT but note that there were implementation difficulties with the AAM which meant it was improperly tracked.
Further lipreading experiments on CUAVE [39] clarify how challenging comparing results is, because there is no agreed evaluation protocol which could account for the motion challenge/face alignment. This is attributed to their partial success with particular speakers.

Footnote: [1] compares the performance of a system that measures, via electromagnetic articulography, the hidden and visual parts of the mouth, so the extent of the statement that not all sounds appear on the lips can be quantified.

Table 1: Challenges to successful machine lipreading. Each challenge has some references.

Evaluation              Previously studied?
Motion                  Yes, [2–4]
Pose                    Yes, [5–11]
Expression              Yes, [6, 7]
Frame rate              Yes, [12, 13]
Video quality           Yes, [14–16]
Color                   Yes, [9]
Unit choice             Yes, [17–21]
Feature                 Yes, [3, 4, 22–24]
Classifier technology   Yes, [17, 25–28]
Multiple persons        Yes, [29–32]
Speaker identity        Yes, [33–35]
Rate of speech          Yes, [21, 36]

The majority of automatic lipreading systems use a frontal pose in which the speaker's facial plane is normal to the principal ray of the camera. However, in [7] for example, an improvement in expression recognition is seen by both computers and humans when the pose is rotated to 45°. Other work [8, 9] looks more specifically at visual speech recognition and suggests that a profile view of a speaker may not lead to catastrophically low accuracies. This observation is consistent with [10], which measures human sentence perception from three viewing angles: full-frontal view (0°), angled view (45°), and side view (90°). In this single-subject study a post-lingual deaf woman was tested to measure accuracy at the three angles independently. The three angles were randomly presented in every lipreading session. The results indicated that the side-view angle is most effective. A model for pose-mismatched lipreading is presented in [11], in which it is shown that without training data at the correct pose, the recognition accuracy falls dramatically.
However, the authors also show that this can be mitigated by projecting the features back to a canonical pose. This transformation principle is also used in [5], which presents a view-independent lipreading system. This investigation uses a continuous speech corpus compared to the small vocabulary dataset in [11]. This later study acknowledges human lipreaders' preference for a non-frontal view and suggests it could be attributed to lip protrusion. They show that the 45° angle is preferable. In short, when it comes to pose, there is evidence that it can be accounted for and need not be insurmountable. Therefore, for this work we stick to frontal pose.

Expression can be difficult to disentangle from the spoken word when lipreading natural speech. Smiling (a happy expression) has a known effect on lip motions during speech [40]. Effects on the inner lips, outer lips and lip protrusion have been measured in [41], which shows that smiling during speech (particularly vowels) places a restriction on lip motion, with greater demand placed on the inner lips as variation in outer lips and lip protrusion is reduced. This in turn creates a greater challenge when lipreading non-neutral speech as gestures become less distinct. Furthermore, expression also affects the temporal property of speech [42, 43]. When a particular phoneme is uttered, its duration can be shortened (for example when a speaker is angry, when vowels in particular become shorter) or elongated, for example when a speaker is sad.

To the best of our knowledge there is no systematic study which specifically investigates lipreading expressive speech. Rather, tasks focus on either synthesizing expression in faces [44–46] or expression recognition during speech [47–49].
Studies such as [12], on the effect of low video frame rate on human speech intelligibility during video communications, suggest that lower frame rates, if they are visible to the speaker, encourage humans to over-articulate to compensate for the reduced visual information available, akin to a visual Lombard effect. Accuracy is maximized when the same frame rate is used for both training and testing [13]. The authors of [13] further recommend that when the training data cannot be recorded at the same frame rate as the test data, it is best if the training data has a higher frame rate (for feature extraction) than the test data. A further observation is that word classification rates vary in a non-linear fashion as the frame rate is reduced.

When it comes to dependence of lipreading on video quality, an investigation into the effects of compression artifacts, visual noise (simulated with white noise) and localization errors in training is presented in [15], and in [16]. The authors undertake two experiments, of which the first includes some attention to spatial resolution (the number of pixels). However, here, resolution varies along with other parameters. Neither of these papers considers the simple removal of information from a smaller image compared to a larger one. A more systematic study of resolution can be found in [14], in which video of varying resolution is parameterized using AAMs [50]. This work shows that machines can lipread continuous speech with as little as two pixels per lip.

With regard to color, it has been surprisingly under-used. In [9] algorithms are derived which contain three key components: shape models, motion models, and focused color feature detectors. In early works it was common to use colored lipstick or markers to help track the lips (tracking remains challenging), but many authors convert the image to grayscale and use grayscale features.

Unit choice refers to the question of whether to use phonemes, visemes, words or something else.
Classifiers built on phonemes [18], visemes [19], and words [20] have all been previously presented. Sometimes the unit choice is linked to the problem: word classifiers often use word units, whereas continuous speech has to use phonemes or visemes. It is essentially a trade-off, since using phonemes means accepting that there will be units that do not appear on the lips (the words "bad", "pad", and "mad" are usually said to be visually indistinguishable), whereas using visemes leads to better unit accuracy but there is then the problem of homopheny (words that have identical visemic transcriptions but different spellings). One study has reviewed how the unit selection affects recognition in relation to the unit selection of the supporting language model [21] and has shown that phoneme networks work best for both phoneme and viseme classifiers. However, the practical reality is that many systems use visemes and there is a need to resolve which choice of visemes works best. Comparative studies such as [17] have attempted to compare some previous viseme sets, but these often only consider a few different sets rather than the wide range available.

Lan et al. present in [24] a comparison of different features first presented in [4]. Revisited in [3], AAM features are produced as either model-based (using shape information) or pixel-based (using appearance information). In [24] Lan et al. observed that state-of-the-art AAM features with appearance parameters outperform other feature types like sieve features, 2D DCT, and eigen-lip features, suggesting appearance is more informative than shape. Pixel methods also benefit from image normalization to remove shape and affine variation from the region of interest (in this example, the mouth and lips). The method in [24] classified words with an audio-visual dataset known as RMAV, but the authors recommended creating classifiers with viseme labels for lipreading in future, and advise that most information comes from the inner part of the mouth.
Some works have attempted to adapt features to address different problems, such as the motion described above. For example, in [51] the authors suggest altering HMM modeling to permit either frozen or occluded frames, and demonstrate that even low-level jitter will significantly affect the quality of lipreading features.

When it comes to the choice of classifier technology, it is the norm that machine lipreading systems adapt methods from acoustic recognition. This not only follows from the observation that visual and acoustic speech have the same origins but also from the practical observation that language models are expensive to create, so it makes sense to re-use the models across the two modalities. The conventional classifier process is: 1) data preparation (an acoustic example is creating MFCCs [27], whereas a visual example might be [17]), 2) build Hidden Markov Model classifiers, and 3) feed the classification outputs through a language network to produce a transcript.

Like feature selection, the choice of classifier is affected by the problem in hand. An optimal audio recognizer will not guarantee optimal performance in an audio-visual or visual-only domain. In [52], for example, it is noted that their audio-visual results should not be "read across" to lipreading.

More modern deep learning techniques for lipreading are an alternative approach which requires much more training data [28]. A key disadvantage of these methods is a lack of understanding about what exactly a neural network is learning in order for it to classify unseen gestures. So often the results from deep learning are good but the scientific insight can be poor. Thus recent work has begun to demonstrate performance of different deep learning approaches with a variety of neural network architectures. Convolutional neural networks (CNN) have been particularly prevalent for image classification ([53, 54]) and Long Short Term Memory networks (LSTM) are performing well on temporal problems (e.g.
language modeling [55] or scene labeling [56]). For lipreading, we have evidence that both of these achieve good recognition rates in end-to-end systems: in [57] a CNN achieves 61.1% top-1 accuracy and in [58] an LSTM achieves 79.6% top-1 accuracy on a small dataset. However, lipreading is a combination of these challenges, that is, a temporal-visual classification problem.

For lipreading multiple persons, [30, 31] detail human lipreading of multiple people: [30] addresses consonants and [31] visual vowels. [32] presents an audio-visual system for HCI which automatically detects a talking person (both spatially and temporally) using video and audio data from a single microphone. In summary, there is no reason to think that multi-person lipreading is any less viable than single-person lipreading, although the challenge of variability due to speaker identity is real.

Speaker identity is a major challenge in machine lipreading because visual speech is not consistent across individuals. Sometimes this can be advantageous, as in [33] where lipreading is used to identify speakers. With known speakers, lipreading recognition rates can be high, but with unknown speakers (referred to as speaker-independent lipreading) performance is not yet at the same standard as speaker-dependent lipreading. In [34] results show that classifiers trained and tested on distinct speakers differ statistically significantly from those trained and tested on the same speakers. This is supported in [35], where the authors strive to discriminate languages from visual speech and conclude that, to improve performance, one should move away from speaker-dependent features.

For acoustic speech it is acknowledged that people have different speaking styles, accents and rates of speech.
For visual speech there is the additional confusion of what we call a "visual accent", in which very similar sounds can be made by persons with very different mouth shapes; examples of visual accent effects include people who talk out of the side of their mouths, ventriloquists and mimics.

The rate of speech alters both an utterance's duration and the articulator positions. Therefore both the sounds produced and, particularly, the visible appearance are altered. In [36], the authors present an experiment which measures the effect of speech rate and shows the effect is significantly higher on visual speech than on acoustic speech. Anecdotal evidence suggests that speaker visual style can evolve as speakers age due to co-articulation reduction as a person travels/interacts with other adults [21].

In summary, while audio-visual speech processing has a great number of challenges, one of the pivotal ones is the question of the visual units and how they should be derived. Since all language models are defined in terms of phonemes, the practical question is the choice of the mapping from phonemes to visemes. The literature has presented a great number of these phoneme-to-viseme (P2V) mappings and few consistent comparisons between them, so this is the topic for the next section.

3. Comparison of phoneme-to-viseme mappings

A summary of published P2V maps is provided in [59], Tables 2.3 and 2.4. This list is not exhaustive and these mappings are motivated by: a focus on just consonants [60–63]; being speaker-dependent [64]; prioritizing particular visemes [65]; or a focus on vowels [66, 67]. These are useful starting points, but for the purpose of this study we would like the phoneme-to-viseme mappings to include all phonemes in the transcript of the dataset to accurately reflect the range of phonemes used in a full vocabulary. Therefore, some mappings used here are a pairing of two mappings suggested in literature, e.g. one map for the vowels and one map for the consonants.
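The pairing of a vowel map with a consonant map can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy maps, the map names, the `V01`/`C01` label scheme and the helper names `pair_maps` and `to_visemes` are all hypothetical. The only idea taken from the text is that one vowel map and one consonant map are merged into a single phoneme-to-viseme lookup, and that phonemes absent from a map share one catch-all label.

```python
from itertools import product

# Hypothetical toy maps for illustration; each map is a list of viseme classes.
vowel_maps = {
    "vowel_map_a": [{"iy", "ih"}, {"uw", "uh"}],
    "vowel_map_b": [{"iy", "ih", "uw", "uh"}],
}
consonant_maps = {
    "consonant_map_a": [{"p", "b", "m"}, {"f", "v"}],
    "consonant_map_b": [{"p", "b"}, {"m"}, {"f", "v"}],
}


def pair_maps(vowel_classes, consonant_classes):
    """Merge one vowel map and one consonant map into a phoneme -> viseme lookup."""
    lookup = {}
    for prefix, classes in (("V", vowel_classes), ("C", consonant_classes)):
        for index, phonemes in enumerate(classes, start=1):
            for phoneme in phonemes:
                # Label scheme (V01, C02, ...) is an assumption made for this sketch.
                lookup[phoneme] = f"{prefix}{index:02d}"
    return lookup


def to_visemes(phonemes, lookup, garbage="gar"):
    """Relabel a phoneme sequence; phonemes absent from the map share one label."""
    return [lookup.get(phoneme, garbage) for phoneme in phonemes]


# Every vowel map is paired with every consonant map (a Cartesian product).
p2v_maps = {
    (vowel_name, consonant_name): pair_maps(vowel_classes, consonant_classes)
    for (vowel_name, vowel_classes), (consonant_name, consonant_classes) in product(
        vowel_maps.items(), consonant_maps.items()
    )
}
```

With two toy vowel maps and three-class or two-class consonant maps this yields four paired maps; the same product over a larger collection of maps grows multiplicatively, which is why exhaustive pairing quickly produces a large number of candidate P2V maps.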
A full list of the mappings used is in Tables 2 and 3. Of these mappings, the most common are 'the Disney 12' [66], the 'lipreading 18' by Nichie [68], and Fisher's [61]. In total, eight vowel maps and fifteen consonant maps are identified here and all of these are paired with each other to provide 120 P2V maps to test.

Table 2: Vowel phoneme-to-viseme maps previously presented in literature.

Classification    Viseme phoneme sets
Bozkurt [69]      {/ei/ /2/} {/ei/ /e/ //} {/3/} {/i/ /I/ /@/ /y/} {/AU/} {/O/ /A/ /OI/ /@U/} {/u/ /U/ /w/}
Disney [66]       {/U/ /h/} {/E@/ /i/ /ai/ /e/ /2/} {/u/} {/U@/ /O/ /O@/}
Hazen [19]        {/AU/ /U/ /u/ /@U/ /O/ /w/ /OI/} {/2/ /A/} {// /e/ /ai/ /ei/} {/@/ /I/ /i/}
Jeffers [70]      {/A/ // /2/ /ai/ /e/ /ei/ /I/ /i/ /O/ /@/ /I/} {/OI/ /O/} {/AU/} {/3/ /@U/ /U/ /u/}
Lee [71]          {/i/ /I/} {/e/ /ei/ //} {/A/ /AU/ /ai/ /2/} {/O/ /OI/ /@U/} {/U/ /u/}
Montgomery [67]   {/i/ /I/} {/e/ // /ei/ /ai/} {/A/ /O/ /2/} {/U/ /3/ /@/} {/OI/} {/i/ /hh/} {/AU/ /@U/} {/u/ /u/}
Neti [72]         {/O/ /2/ /A/ /3/ /OI/ /AU/ /H/} {/u/ /U/ /@U/} {// /e/ /ei/ /ai/} {/I/ /i/ /@/}
Nichie [68]       {/uw/} {/U/ /@U/} {/AU/} {/i/ /2/ /ay/} {/2/} {/iy/ //} {/e/ /I@/} {/u/} {/@/ /ei/}

Recent comparisons between maps include [17] and part of [59]. In [59] the following reasons are given for discrepancies between classifier sets.

- Variation between speakers, i.e. speaker identity.
- Variation between viewers, indicating that lipreading ability varies by individual; those with more practice are better able to identify visemes.
- The context of the speech presented: context has an influence on how consonants appear on the lips. In real tasks the context will enable easier distinction between phonemes that are indistinguishable in syllable-only tests.
- Clustering criteria: the grouping methods vary between authors. For example, 'phonemes are said to belong to a viseme if, when clustered, the percent correct identification for the viseme is above some threshold, which is typically between 70 - 75% correct.
A stricter grouping criterion has a higher threshold, so more visemes are identified.' [59].

These last two points are reinforced by [17], who achieved highest accuracy with the phoneme-to-viseme map of Jeffers in an HMM-based lipreading system. They attribute this to the use of continuous speech, which encapsulates the same viseme in more contexts within the training data, and suggest that the Jeffers map has better clustering of consonant visemes for those contexts.

In Table 4 we have described the sources and derivation methods for all of the phoneme-to-viseme maps used in our comparison study. We see the majority are constructed using human testing with few test subjects; for example, Finn [73] used only one lipreader, and Kricos [64] twelve. Data-driven methods are most recent, e.g. Lee's [71] visemes were presented in 2002 and Hazen's [19] in 2004. The remaining visemes are based around linguistic/phonemic rules.

As an example, the clustering method of Hazen [19] involved bottom-up clustering using maximum Bhattacharyya distances [76] to measure similarity between the phoneme-labeled Gaussian models. Before clustering, some phonemes were manually merged: /em/ with /m/, /en/ with /n/, and /Z/ with /S/.

Table 3: Consonant phoneme-to-viseme maps previously presented in literature.
Classification    Viseme phoneme sets
Binnie [60]       {/p/ /b/ /m/} {/f/ /v/} {/T/ /D/} {/S/ /Z/} {/k/ /g/} {/w/} {/r/} {/l/ /n/} {/t/ /d/ /s/ /z/}
Bozkurt [69]      {/g/ /H/ /k/ /N/} {/l/ /d/ /n/ /t/} {/s/ /z/} {/tS/ /S/ /dZ/ /Z/} {/T/ /D/} {/r/} {/f/ /v/} {/p/ /b/ /m/}
Disney [66]       {/p/ /b/ /m/} {/w/} {/f/ /v/} {/T/} {/l/} {/d/ /t/ /z/ /s/ /r/ /n/} {/S/ /tS/ /j/} {/y/ /g/ /k/ /N/}
Finn [73]         {/p/ /b/ /m/} {/T/ /D/} {/w/ /s/} {/k/ /h/ /g/} {/S/ /Z/ /tS/ /j/} {/y/} {/z/} {/f/} {/v/} {/t/ /d/ /n/ /l/ /r/}
Fisher [61]       {/k/ /g/ /N/ /m/} {/p/ /b/} {/f/ /v/} {/S/ /Z/ /dZ/ /tS/} {/t/ /d/ /n/ /T/ /D/ /z/ /s/ /r/ /l/}
Franks [62]       {/p/ /b/ /m/} {/f/} {/r/ /w/} {/S/ /dZ/ /tS/}
Hazen [19]        {/l/} {/r/} {/y/} {/b/ /p/} {/m/} {/s/ /z/ /h/} {/tS/ /dZ/ /S/ /Z/} {/t/ /d/ /T/ /D/ /g/ /k/} {/N/} {/f/ /v/}
Heider [74]       {/p/ /b/ /m/} {/f/ /v/} {/k/ /g/} {/S/ /tS/ /dZ/} {/T/} {/n/ /t/ /d/} {/l/} {/r/}
Jeffers [70]      {/f/ /v/} {/r/ /q/ /w/} {/p/ /b/ /m/} {/T/ /D/} {/tS/ /dZ/ /S/ /Z/} {/s/ /z/} {/d/ /l/ /n/ /t/} {/g/ /k/ /N/}
Kricos [64]       {/p/ /b/ /m/} {/f/ /v/} {/w/ /r/} {/t/ /d/ /s/ /z/} {/k/ /n/ /j/ /h/ /N/ /g/} {/l/} {/T/ /D/} {/S/ /Z/ /tS/ /dZ/}
Lee [71]          {/d/ /t/ /s/ /z/ /T/ /D/} {/g/ /k/ /n/ /N/ /l/ /y/ /H/} {/dZ/ /tS/ /S/ /Z/} {/r/ /w/} {/f/ /v/} {/p/ /b/ /m/}
Neti [72]         {/l/ /r/ /y/} {/s/ /z/} {/t/ /d/ /n/} {/S/ /Z/ /dZ/ /tS/} {/p/ /b/ /m/} {/N/ /k/ /g/ /w/} {/f/ /v/} {/T/ /D/}
Nichie [68]       {/p/ /b/ /m/} {/f/ /v/} {/W/ /w/} {/r/} {/s/ /z/} {/S/ /Z/ /tS/ /j/} {/T/} {/l/} {/k/ /g/ /N/} {/H/} {/t/ /d/ /n/} {/y/}
Walden [63]       {/p/ /b/ /m/} {/f/ /v/} {/T/ /D/} {/S/ /Z/} {/w/} {/s/ /z/} {/r/} {/l/} {/t/ /d/ /n/ /k/ /g/ /j/}
Woodward [75]     {/p/ /b/ /m/} {/f/ /v/} {/w/ /r/ /W/} {/t/ /d/ /n/ /l/ /T/ /D/ /s/ /z/ /tS/ /dZ/ /S/ /Z/ /j/ /k/ /g/ /h/}

A P2V map may be summarized as a ratio we call the "compression factor", CF,

\[ CF = \frac{N_V}{N_P} \tag{1} \]

which is the ratio of the number of output visemes, $N_V$, to the number of input phonemes, $N_P$. The compression factors for the P2V maps are listed in Table 5.
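As a worked illustration of Eq. (1), the sketch below computes the compression factor for the Binnie consonant map as transcribed in Table 3. The function name `compression_factor` and the list-of-sets representation are assumptions made for this example, and it assumes each phoneme appears in exactly one viseme class, so that the phoneme count is the sum of the class sizes.

```python
def compression_factor(viseme_classes):
    """Return N_V / N_P for a map given as a list of phoneme sets."""
    n_visemes = len(viseme_classes)
    # Assumes the classes are disjoint, so class sizes sum to the phoneme count.
    n_phonemes = sum(len(phonemes) for phonemes in viseme_classes)
    return n_visemes / n_phonemes


# Binnie consonant map, copied from Table 3 (ASCII phoneme symbols as printed there).
binnie_consonants = [
    {"p", "b", "m"},
    {"f", "v"},
    {"T", "D"},
    {"S", "Z"},
    {"k", "g"},
    {"w"},
    {"r"},
    {"l", "n"},
    {"t", "d", "s", "z"},
]

cf_binnie = compression_factor(binnie_consonants)  # 9 visemes over 19 phonemes
```

Here nine viseme classes cover nineteen phonemes, so the ratio is 9/19, which rounds to 0.47.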
Silence and garbage visemes are not included in compression factors. Because we have a British English dataset and some works were formulated using American English diacritics [77], we omit the following phonemes from some mappings: /si/ (Disney [66]); /axr/ /en/ /el/ /em/ (Bozkurt [69]); /axr/ /em/ /epi/ /tcl/ /dcl/ /en/ /gcl/ /kcl/ (Hazen [19]); and /axr/ /em/ /el/ /nx/ /en/ /dx/ /eng/ /ux/ (Jeffers [70]).

Moreover, Kricos provides speaker-dependent visemes [64]. These have been generalized for our tests using the most common mixtures of phonemes. Where a viseme map does not include phonemes present in the ground truth transcript, these are grouped into one viseme denoted /gar/. Note that all phonemes in each P2V map are in the dataset but no mapping includes all 29 phonemes in the AVL2 vocabulary.

Table 4: A comparison of literature phoneme-to-viseme maps.

Author      Year  Inspiration             Description                                                          Test subjects
Binnie      1976  Human testing           Confusion patterns                                                   unknown
Bozkurt     2007  Subjective linguistics  Common tri-phones                                                    462
Disney      -     Speech synthesis        Observations                                                         unknown
Finn        1988  Human perception        Montgomery's visemes and /H/                                         1
Fisher      1986  Human testing           Multiple-choice intelligibility test                                 18
Franks      1972  Human perception        Confusions among sounds produced in similar articulatory positions   unknown 275
Hazen       2004  Data-driven             Bottom-up clustering                                                 223
Heider      1940  Human perception        Confusions post-training                                             unknown
Jeffers     1971  Linguistics             Sensory and cognitive correlates                                     unknown
Kricos      1982  Human testing           Hierarchical clustering                                              12
Lee         2002  Data-driven             Merging of Fisher visemes                                            unknown
Montgomery  1983  Human perception        Confusion patterns                                                   10
Neti        2000  Linguistics             Decision tree clusters                                               26
Nichie      1912  Human observations      Human observation of lip movements                                   unknown
Walden      1977  Human testing           Hierarchical clustering                                              31
Woodward    1960  Linguistics             Language rules and context                                           unknown

Table 5: Compression factors for viseme maps previously presented in literature.

Consonant map  V:P    CF      Vowel map   V:P   CF
Woodward       4:24   0.16    Jeffers     3:19  0.16
Disney         6:22   0.18    Neti        4:20  0.20
Fisher         5:21   0.23    Hazen       4:18  0.22
Lee            6:24   0.25    Disney      4:11  0.36
Franks         5:17   0.29    Lee         5:14  0.36
Kricos         8:24   0.33    Bozkurt     7:19  0.37
Jeffers        8:23   0.35    Montgomery  8:19  0.42
Neti           8:23   0.35    Nichie      9:15  0.60
Bozkurt        8:22   0.36    -           -     -
Finn           10:23  0.43    -           -     -
Walden         9:20   0.45    -           -     -
Binnie         9:19   0.47    -           -     -
Hazen          10:21  0.48    -           -     -
Heider         8:16   0.50    -           -     -
Nichie         18:33  0.54    -           -     -

3.1. Data preparation

The AVLetters2 (AVL2) dataset [78] is used to train and test HMM classifiers based upon our 120 P2V mappings with HTK [26]. AAM features (concatenated as in (4)) are used as they are known to outperform other feature methods in machine lipreading [17]. AVL2 [78] is an HD version of the AVLetters dataset [22]. It is a single-word dataset of five male British English speakers reciting the alphabet seven times. We use four of these speakers, as the fifth tracked too poorly to have confidence in lipreading accuracy. The speakers in this dataset are illustrated in [79]. AVL2 has 28 videos of between 1,169 and 1,499 frames, between 47 s and 58 s in duration. As the dataset provides isolated words of single letters, it lends itself to controlled experiments without needing to address matters such as varying co-articulation.

Table 6: The number of parameters in shape, appearance and combined shape & appearance AAM features for each speaker in the AVLetters2 dataset. Features retain 95% variance of facial information.

Speaker  Shape  Appearance  Combined
S1       11     27          38
S2       9      19          28
S3       9      17          25
S4       9      17          25

Table 6 describes the features extracted from the AVL2 videos.
These features have been derived after tracking a full-face Active Appearance Model throughout the video before extracting features containing only the lip area. Therefore, they contain information representing only the speaker's lips and none of the rest of the face. Speakers 2, 3 and 4 are similar in the number of parameters contained in the features. The combined features are the concatenation of the shape and appearance features [3]. All features retain 95% variance of facial shape and appearance information.

The RMAV dataset consists of 20 British English speakers (we use 12 speakers, seven male and five female, who have been tracked, to maintain comparability with earlier work), with 200 utterances per speaker of a subset of the Resource Management (RM) context-independent sentences from [80], which totals around 1000 words each. The sentences are selected to maintain good coverage of all phonemes [81] and to represent the coverage of phonemes in spoken speech. The original videos were recorded in high definition and in a full-frontal position. Individual speakers are tracked using Active Appearance Models [3], and AAM features of concatenated shape and appearance information have been extracted.

Figure 1 plots the frequency of all phonemes within the RMAV dataset over 200 sentences and Table 7 lists the number of parameters of shape, appearance, and combined shape and appearance AAM features where the features retain 95% variance of facial information.

Figure 1: Occurrence frequency of phonemes in the RMAV dataset (x-axis: phonemes; y-axis: phoneme count).

Table 7: The number of parameters of shape, appearance, and combined shape and appearance AAM features for the RMAV dataset speakers. Features retain 95% variance of facial information.

Speaker  Shape  Appearance  Combined
S1       13     46          59
S2       13     47          60
S3       13     43          56
S4       13     47          60
S5       13     45          58
S6       13     47          60
S7       13     37          50
S8       13     46          59
S9       13     45          58
S10      13     45          58
S11      14     72          86
S12      13     45          58

3.2. Classification method

The method for these speaker-dependent classification tests on our combined shape and appearance features uses HMM classifiers built with HTK [26]. The features selected are from the AVL2 and RMAV datasets. The videos are tracked with a full-face AAM (Figure 2 (left)) and the features extracted consist of only the lip information (Figure 2 (right)). This means that we obtain a robust tracking from the full-face model; then, using this fit information, we apply a sub-active appearance model of only the lips.

The HMM classifiers are based upon viseme labels within each P2V map. A ground truth for measuring correct classification is a viseme transcription produced using the BEEP British English pronunciation dictionary [82] and a word transcription. The phonetic transcript is converted to a viseme transcript assuming the visemes in the mapping being tested (Tables 2 and 3). We test using a leave-one-out seven-fold cross-validation. Seven folds are selected as we have seven utterances of the alphabet per speaker in AVL2; this is increased to 10-fold cross-validation for RMAV speakers. The HMMs are initialized using 'flat start' training and re-estimated eight times and then force-aligned using HTK's HVite. Training is completed by re-estimating the HMMs three more times with the force-aligned transcript.

3.3. Active appearance models

An example full-face shape model is shown in Figure 2, where there are 76 landmarks, 34 of which model the inner and outer lip contours.

Figure 2: Example Active Appearance Model shape mesh (left); a lips-only model is on the right.
The shape $s$ of an AAM is the collection of coordinates of the $v$ vertices (landmarks) which make up a mesh,

\[ s = (x_1, y_1, x_2, y_2, \ldots, x_v, y_v) \tag{2} \]

These landmarks are aligned and normalized via Procrustes analysis [83] and then analyzed via a Principal Component Analysis (PCA) to give

\[ s = s_0 + \sum_{i=1}^{n} p_i s_i \tag{3} \]

where $s_0$ is the mean shape, $p_i$ are coefficient shape parameters, and $s_i$ are the eigenvectors of the covariance matrix corresponding to the $n$ largest eigenvalues [3].

Having built an Active Shape Model, the next step is to augment it with appearance data and hence compute an Active Appearance Model (AAM). Each shape model is used to warp the image data back to the mean shape. The appearance of those warped images is now modeled, again using PCA [4],

\[ A(x) = A_0(x) + \sum_{i=1} \lambda_i A_i(x) \quad \forall x \in s_0 \tag{4} \]

where $\lambda_i$ are the appearance parameters, $A_0$ is the shape-free mean appearance, and $A_i(x)$ are the appearance image eigenvectors of the covariance matrix.

Usually the best results are obtained using both shape and appearance information combined within a single AAM [4, 25]. Therefore, unless explicitly stated otherwise, we use these. Once an AAM is built and trained, we fit the model using the Inverse Compositional algorithm [84] to all frames in the video sequence [3].

3.4. Comparison of current phoneme-to-viseme maps

Recognition performance of the HMMs can be measured by both correctness, $C$, and accuracy, $A$,

\[ C = \frac{N - D - S}{N} \tag{5} \]

\[ A = \frac{N - D - S - I}{N} \tag{6} \]

where $S$ is the number of substitution errors, $D$ is the number of deletion errors, $I$ is the number of insertion errors and $N$ the total number of labels in the reference transcriptions [26]. An insertion error (these are notoriously common in lipreading [85]) occurs when the recognizer output has extra words/visemes missing from the original transcript [26]. As an example one could say "Once upon a midnight dreary", but the recognizer outputs "Once upon upon midnight dreary dreary".
Here the recognizer has inserted two words which were never present and has deleted one. Once this utterance has been translated into viseme labels rather than words, for example using Montgomery's visemes, the sentence becomes "v09 v12 v04 v05 - v12 v01 v12 v04 - v12 - v01 v10 v04 v11 v04 - v04 v07 v16 v07 v16" (hyphens are included to show breaks between words). In this case, the same insertion errors would create a predicted output of "v09 v12 v04 v05 - v12 v01 v12 v04 - v12 v01 v12 v04 - v01 v10 v04 v11 v04 - v04 v07 v16 v07 v16 - v04 v07 v16 v07 v16".

In this experiment, classification performance of the HMMs is measured by correctness, C (Equation 5), as there are no insertion errors to consider [26]. It is acknowledged that word classification does not perform as well as viseme classification. However, as each viseme set being tested has a different number of phonemes and visemes, words are used so that we can compare different viseme sets. It is the difference between each set, rather than the individual performance, which is of interest in this investigation.

Figure 3 shows the correctness of each pair of viseme sets. On the top is the isolated word case (the AVL2 data) and on the bottom the continuous data (RMAV). Each diagram is ordered by the mean correctness over all speakers. For the isolated words the Lee vowel and consonant sets [71] are the best, with the Montgomery vowels [67] and Hazen consonants [19] close behind. The worst performers are the Disney vowels [66] and the Franks [62] and Woodward [75] consonants. For continuous speech the Disney vowels [66] are the best performer, as are the Woodward consonants [75]. It is notable that for continuous speech the viseme sets with a high compression factor work better than those with larger numbers of visemes. The most likely explanation is that continuous speech has additional variability due to co-articulation, so a few coarsely defined visemes are better than a greater number of finely defined ones.
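The correctness and accuracy measures of Equations (5) and (6) can be sketched in a few lines of Python. The function names are illustrative only, and the counts are taken from the description of the "Once upon a midnight dreary" example above (five reference words, one deletion, two insertions) rather than from an automatic alignment; HTK's own alignment of the two strings may count the errors differently.

```python
def correctness(n, d, s):
    """Equation (5): C = (N - D - S) / N."""
    return (n - d - s) / n


def accuracy(n, d, s, i):
    """Equation (6): A = (N - D - S - I) / N."""
    return (n - d - s - i) / n


# Counts as described in the text for the worked example:
# N = 5 reference words, D = 1 deletion, S = 0 substitutions, I = 2 insertions.
N, D, S, I = 5, 1, 0, 2

c = correctness(N, D, S)
a = accuracy(N, D, S, I)
print(f"C = {c:.2f}, A = {a:.2f}")
```

Because insertions are subtracted only in Equation (6), accuracy can never exceed correctness, and the gap between the two is exactly I/N.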
Figure 4 shows the mean word correctness, C, over all speakers, ± 1 s.e., for pairings of vowel and consonant maps ordered by correctness from left to right. Again, isolated word results (the AVL2 data) are at the top and continuous (RMAV) results are on the bottom. As previously, for isolated words, the Disney vowels are significantly worse than all others when paired with any of the consonant maps, a difference that holds over the whole group. The Lee [71], Montgomery [67] and Bozkurt [69] vowels are consistently above the mean and above the upper error bar for the Disney [66], Jeffers [70] and Hazen [19] vowels. In comparing the consonants, Lee [71] and Hazen [19] are the best whereas Woodward [75] and Franks [62] are the bottom performers. There is a significant difference between the 'best' visemes for individual speakers, which arises from the unique way in which everyone articulates their speech.

The continuous speech experiment results in Figure 4 (bottom) show that, for vowel visemes, the Disney set surpasses all others, whereas Woodward's consonants are now a better fit. This is interesting as neither viseme set is data-derived. We recall that Disney's visemes [66] were designed from human perception for synthesis of characters, and Woodward's [75] come from a pilot investigation into phoneme perception in lipreading using linguistic rules. As we move to more realistic data, that is, continuous speech, many of the data-driven approaches degrade, which implies that the data used to derive these visemes was unrealistic. For example, the Lee visemes [71] were derived without any use of video data at all, so it is hardly surprising that they are fragile when presented with more realistic data.

The idea that vowel and consonant visemes should be treated differently is no surprise.
The suggestion that vowel visemes are essentially mouth shapes and the consonants govern how we move in and out of them was first presented by Nichie in 1912 from human observations by a profoundly deaf educator [68], and is supported by results in [86] which show we should not mix vowel and consonant visemes for best results. Therefore, it is reassuring to see that the better speaker-independent phoneme-to-viseme mapping for continuous speech is a combination of two previous maps, where the two maps have differing derivation methods: perception and language rules.

Generally speaking, the continuous case (bottom of Figure 4) gives improved accuracies compared to the isolated word case (top of Figure 4). The first response to explain this is to suggest the increase is caused by better training of classifiers with the greater volume of training samples in RMAV than in AVL2. However, we should note that this effect is marginally countered by the co-articulation effects in continuous speech, so a set of classifiers trained on a larger isolated word dataset and compared to AVL2 would provide a greater increase in recognition.

Figure 5 shows critical difference plots between the viseme class sets based upon their classification performance [87] with isolated word training. Critical difference is a measure of the confidence intervals between different machine learning algorithms, derived from Friedman tests on the ranked scores (here p = 0.05). Two assumptions within critical difference are that all measured results are 'reliable', and that all algorithms are evaluated using the same random samples [87]. As we use the HTK standard metrics [88], and use results with consistent random sampling across folds, these assumptions are not a concern. We have selected critical differences here because they evaluate the performance of multiple classifiers on different datasets, whereas alternatives such as [89, 90] often require paired data or identical datasets. Figure 5 shows a significant difference between some sub-sets of visemes.
This is shown by the horizontal bars which do not overlap all viseme sets. Where the horizontal bars do overlap, the viseme sets are indistinguishable at 95% confidence. When comparing isolated words with continuous speech we see fewer significant differences with continuous speech despite there being more test data.

Table 8 summarises the best-performing visemes (consonants and vowels) for the isolated and continuous word data. The first column shows that the Lee consonants are the best performing for isolated words, but also that Hazen, Nichie, Neti, etc. are indistinguishable from Lee (they are within Lee's critical difference). For continuous speech, the Woodward consonant visemes are the best but Fisher, Franks, Disney, etc. are indistinguishable. In bold are the viseme sets that are common to both isolated words and continuous speech: Lee, Hazen, Finn and Fisher. For the vowels (second column) there are no common sets. However, if we look at best and second-best (the third column of Table 8) then Hazen and Neti emerge as common. Looking across all sets, the common method that performs near the top is that due to Hazen [19]. Interestingly, these visemes were derived using the most realistic data (an audio-visual corpus based on TIMIT) and formed by a tree-based clustering of phoneme-trained HMMs. Note that the Hazen visemes were derived from American English data whereas here we use British English speakers.

The effectiveness of each mapping as a function of compression factor is presented in Figure 6. The two plots representing continuous speech (bottom of Figure 6) show improving performance with decreasing compression factor; we speculated earlier that the coarser visemes were better able to handle co-articulation. For the isolated word case (top) there is little difference. Very roughly, the best performing methods appear to have around 2 to 4 phonemes per viseme.
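To relate "phonemes per viseme" to the compression factor, a minimal sketch is given here, taking the compression factor to be the ratio of visemes to phonemes (the definition given in the caption of Table 18). The function name and the toy map are hypothetical and are not one of the published P2V maps; whether silence or garbage classes are counted is left open, and here every class in the map is counted.

```python
def compression_factor(p2v_map):
    """Ratio of viseme classes to phonemes in a phoneme-to-viseme map."""
    n_visemes = len(p2v_map)
    n_phonemes = sum(len(phonemes) for phonemes in p2v_map.values())
    return n_visemes / n_phonemes


# Hypothetical toy map, for illustration only.
TOY_MAP = {
    "v1": ["p", "b", "m"],
    "v2": ["f", "v"],
    "v3": ["iy"],
}

cf = compression_factor(TOY_MAP)
phonemes_per_viseme = 1 / cf
print(f"CF = {cf:.2f}, phonemes per viseme = {phonemes_per_viseme:.1f}")
```

Under this definition, 2 to 4 phonemes per viseme corresponds to a compression factor between 0.5 and 0.25.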
So far we have seen that there are noticeable differences between classification performances associated with a variety of viseme sets in the literature. Given that quite a few of the viseme sets are incremental improvements on previous sets, it is good to see confirmation that these sets have rather similar performance. We have identified the best sets for the various conditions and have used critical difference plots to explain the similarity between methods. We have identified that the most robust methods seem to be based on clustering large amounts of data, but a question arises when it comes to individual speakers: is it viable to create viseme sets per speaker and, if so, how similar are they? This is the topic of the next section.

Table 8: Critically different viseme sets change with isolated word and continuous speech data. Sets are listed in the order they appear in Figure 5.

                    First Position   First Position   Second Position
                    Consonants       Vowels           Vowels
Isolated words      Lee              Lee              Montgomery
                    Hazen            Montgomery       Nichie
                    Nichie           Nichie           Bozkurt
                    Neti             Bozkurt          Hazen
                    Walden           Neti             Jeffers
                    Kricos
                    Binnie
                    Finn
                    Bozkurt
                    Fisher
Continuous speech   Woodward         Disney           Jeffers
                    Fisher           Jeffers          Hazen
                    Franks           Hazen            Neti
                    Disney
                    Lee
                    Heider
                    Hazen
                    Finn
Figure 3: Speaker-dependent all-speaker mean word classification, C, comparing viseme classes on isolated word speech (top) and continuous speech (bottom). [Axes: consonant visemes against vowel visemes.]

Figure 4: Speaker-independent all-speaker mean word classification, C ± 1 s.e. For a given mapping (x-axis) the performance is measured after pairing with all vowel mappings (left) and vice versa on the right, on AVL2 isolated words (top) and RMAV continuous speech (bottom). [y-axes: all-speaker mean word correctness, C%.]

Figure 5: Critical difference of all phoneme-to-viseme maps independent of phoneme-to-viseme pair partner. Vowel maps are on the left side, consonants on the right. Isolated words are in the top row, and continuous speech along the bottom row.

Figure 6: Scatter plot showing the relationship between compression factors, CF (x-axes), and word correctness, C, classification (y-axes) with consonant phoneme-to-viseme maps (left) and vowel phoneme-to-viseme maps (right). Isolated word results are at the top, and continuous speech along the bottom. [y-axes: all-speaker mean word correctness, C%.]

4. Encoding speaker-dependent visemes

In the second part of our phoneme-to-viseme mapping study, two approaches are used to find a better method of mapping phonemes to visemes. These approaches are both speaker-dependent and data-driven from phoneme classification. Two cases are considered:

1. a strictly coupled map, where a phoneme can be grouped into a viseme only if it has been confused with all the phonemes within the viseme, and
2. a relaxed coupled case, where a phoneme can be grouped into a viseme if it has been confused with any phoneme within the viseme.

With all new P2V mappings each phoneme can be allocated to only one viseme class. These new P2V maps are tested on the AVL2 dataset using the same classification method as described in Section 3.2. The results from the best performing P2V maps from our comparison study (Lee [71], or Woodward [75] and Disney [66]) are the benchmark against which improvements are measured with respect to the training data.

4.1. Viseme classes with strictly confusable phonemes

Our approaches for identifying visemes are speaker-dependent, data-driven and based on phoneme confusions within the classifier. The idea of speaker-dependent visemes is not new [31, 34] but our algorithm is, and in conjunction with the fixed outputs available from HTK it enables easy reuse. The first undertaking in this work is to complete classification using phoneme-labeled HMM classifiers. The classifiers are built in HTK with flat-start HMMs and force-aligned training data for each speaker. The HMMs are re-estimated 11 times in total over seven folds of leave-one-out cross-validation. This overall classification task does not perform well (see Table 9), particularly for an isolated word dataset. However, the HTK tool HResults is used to output a confusion matrix for each fold detailing which phoneme labels confuse with others and how often. For both data-driven speaker-dependent approaches, this first step of completing phoneme classification is essential to create the data from which the P2V maps are derived. This is completed for each speaker in both the AVL2 and RMAV datasets.

Table 9: Mean per speaker Correctness, C, of phoneme-labeled HMM classifiers.

             Speaker 1   Speaker 2   Speaker 3   Speaker 4
Phoneme C    24.72       23.63       57.69       43.41

Now, let us use a smaller seven-unit confusion matrix example, as in Table 10, to explain our clustering method.

For the 'strictly-confused' viseme set (remembering there is one per speaker), the second step of deriving the P2V map is to check for single-phoneme visemes. Any phonemes which have only been correctly recognized and have no false positive/negative classifications are permitted to be single-phoneme visemes. In Table 10 we have highlighted the true positive classifications in red and both false positive and false negative classifications in blue, which shows /p6/ is the only phoneme to fit our 'single-phoneme viseme' definition. /p6/ has a true positive value of +4 and zero false classifications.
Therefore this is our first viseme, /v1/ = {/p6/}. This action is followed by defining all combinations of remaining phonemes which can be grouped into visemes and identifying the grouping that contains the largest number of confusions by ordering all the viseme possibilities by descending size (Table 11).

Table 10: Demonstration confusion matrix showing confusions between phoneme-labeled classifiers to be used for clustering to create new speaker-dependent visemes. True positive classifications are shown in red, confusions of either false positives or false negatives are shown in blue. The estimated classes are listed horizontally and the real classes are vertical.

         /p1/  /p2/  /p3/  /p4/  /p5/  /p6/  /p7/
/p1/     1     0     0     0     0     0     4
/p2/     0     0     0     2     0     0     0
/p3/     1     0     0     0     0     0     1
/p4/     0     2     1     0     2     0     0
/p5/     3     0     1     1     1     0     0
/p6/     0     0     0     0     0     4     0
/p7/     1     0     3     0     0     0     1

Table 11: List of all possible subgroups of phonemes with an example set of seven phonemes.

{/p1/, /p2/, /p3/, /p4/, /p5/, /p7/}   {/p1/, /p2/, /p4/}   {/p1/, /p4/, /p7/}
{/p1/, /p2/, /p3/, /p4/, /p5/}         {/p1/, /p2/, /p5/}   {/p2/, /p4/, /p7/}
{/p1/, /p2/, /p3/, /p4/, /p7/}         {/p1/, /p2/, /p7/}   {/p1/, /p3/}
{/p1/, /p2/, /p3/, /p5/, /p7/}         {/p2/, /p3/, /p4/}   {/p1/, /p4/}
{/p1/, /p2/, /p4/, /p5/, /p7/}         {/p2/, /p3/, /p5/}   {/p1/, /p5/}
{/p1/, /p3/, /p4/, /p5/, /p7/}         {/p2/, /p3/, /p7/}   {/p1/, /p7/}
{/p2/, /p3/, /p4/, /p5/, /p7/}         {/p3/, /p4/, /p5/}   {/p2/, /p3/}
{/p1/, /p2/, /p3/, /p4/}               {/p3/, /p4/, /p7/}   {/p2/, /p4/}
{/p1/, /p2/, /p3/, /p5/}               {/p1/, /p3/, /p4/}   {/p2/, /p5/}
{/p1/, /p2/, /p3/, /p7/}               {/p4/, /p5/, /p7/}   {/p2/, /p7/}
{/p2/, /p3/, /p4/, /p5/}               {/p1/, /p4/, /p5/}   {/p3/, /p4/}
{/p2/, /p3/, /p4/, /p7/}               {/p2/, /p4/, /p5/}   {/p3/, /p5/}
{/p3/, /p4/, /p5/, /p7/}               {/p1/, /p5/, /p7/}   {/p4/, /p5/}
{/p1/, /p3/, /p4/, /p5/}               {/p2/, /p5/, /p7/}   {/p4/, /p7/}
{/p1/, /p4/, /p5/, /p7/}               {/p3/, /p5/, /p7/}   {/p5/, /p7/}
{/p2/, /p4/, /p5/, /p7/}               {/p1/, /p3/, /p5/}
{/p1/, /p2/, /p3/}                     {/p1/, /p3/, /p7/}

Our grouping rule states that phonemes can be grouped into a viseme class only if all of the phonemes within the candidate group are mutually confusable. This means each pair of phonemes within a viseme must have a total false positive and false negative classification count greater than zero. Once a phoneme has been assigned to a viseme class it can no longer be considered for grouping, and so any possible phoneme combinations that include this phoneme are discarded. This ensures phonemes can belong to only a single viseme.

By iterating through our list of all possibilities in order, we check if all the phonemes are mutually confused. This means all phonemes have a positive confusion value (a blue value in Table 10) with all others. The first phoneme combination in our list where this is true is {/p1/, /p3/, /p7/}. This is confirmed by the Table 10 values:

N{/p1/|/p3/} + N{/p3/|/p1/} = 0 + 1 = 1 > 0,
N{/p1/|/p7/} + N{/p7/|/p1/} = 4 + 1 = 5 > 0, and
N{/p3/|/p7/} + N{/p7/|/p3/} = 1 + 3 = 4 > 0.

This becomes our second viseme and thus our current viseme list looks like Table 12.

Table 12: Demonstration example 1: first iteration of clustering, a phoneme-to-viseme map for strictly-confused phonemes.

Viseme   Phonemes
/v1/     {/p6/}
/v2/     {/p1/, /p3/, /p7/}

We now have only three remaining phonemes to cluster: /p2/, /p4/ and /p5/. This reduces our list of possible combinations substantially, see Table 13.

Table 13: List of all possible subgroups of phonemes with an example set of seven phonemes after the first viseme is formed.

{/p2/, /p4/, /p5/}
{/p2/, /p4/}
{/p2/, /p5/}
{/p4/, /p5/}

The next iteration of our clustering algorithm identifies the combination of remaining phonemes which corresponds to the next largest number of confusions, and so on, until no phonemes can be merged. This leaves us with the final visemes in Table 14.

Table 14: Demonstration example 2: final phoneme-to-viseme map for strictly-confused phonemes.
Viseme   Phonemes
/v1/     {/p6/}
/v2/     {/p1/, /p3/, /p7/}
/v3/     {/p2/, /p4/}
/v4/     {/p5/}

Our original phoneme classification has produced confusion matrices which permit confusions between vowel and consonant phonemes. We can see from the previously presented P2V maps in Section 3.1 (Tables 2 and 3) that vowel and consonant phonemes are not commonly mixed within visemes. Therefore, we make two types of P2V maps: one which permits vowel and consonant phonemes to be mixed within the same viseme, and a second which restricts visemes to be vowel-only or consonant-only by adding an extra condition when checking for confusions greater than zero.

It should be remembered that not all phonemes present in the ground truth transcripts will have been recognized and included in the phoneme confusion matrix. Any of the remaining phonemes which have not been assigned to a viseme are grouped into a single garbage /gar/ viseme. This approach ensures any phonemes which have been confused are grouped into a viseme and we do not lose any of the 'rarer', less common visual phonemes. For example, /ea/, /oh/, /ao/, and /r/ are not in the original transcript and so can be placed into /gar/. But for Speaker 2, /gar/ also contains /ay/ and /p/, and for Speaker 4 /gar/ also contains /p/ and /z/, as these do not show up in the speaker's phoneme classification outputs. This task has been undertaken for all four speakers in our dataset. The final P2V maps are shown in Table 15.

Table 15: Strictly-confused phoneme speaker-dependent visemes. The score in brackets is the compression factor. B1 visemes are listed on top, B2 visemes are listed at the bottom.
Classification P2V mapping - permitting mixing of vowels and consonants

Speaker 1 (CF: 0.48): {/2/ /ai/ /i/ /n/ /@U/} {/b/ /e/ /ei/ /y/} {/d/ /s/} {/tS/ /l/} {/@/ /v/} {/w/} {/f/} {/k/} {/@/ /v/} {/dZ/ /z/} {/A/ /u/} {/t/}
Speaker 2 (CF: 0.44): {/@/ /ai/ /ei/ /i/ /s/} {/e/ /v/ /w/ /y/} {/l/ /m/ /n/} {/b/ /d/ /p/} {/z/} {/tS/} {/t/} {/A/} {/dZ/ /k/} {/2/ /f/} {/@U/ /u/}
Speaker 3 (CF: 0.68): {/ei/ /f/ /n/} {/d/ /t/ /p/} {/b/ /s/} {/l/ /m/} {/@/ /e/} {/i/} {/u/} {/A/} {/dZ/} {/@U/} {/z/} {/y/} {/tS/} {/ai/} {/2/} {/A/} {/dZ/} {/@U/} {/k/ /w/} {/v/} {/z/}
Speaker 4 (CF: 0.64): {/2/ /ai/ /i/ /ei/} {/m/ /n/} {/@/ /e/ /p/} {/k/ /w/} {/d/ /s/} {/dZ/ /t/} {/f/} {/v/} {/A/} {/z/} {/tS/} {/b/} {/@U/} {/@U/} {/l/} {/u/} {/b/}

Classification P2V mapping - restricting mixing of vowels and consonants

Speaker 1 (CF: 0.50): {/2/ /i/ /@U/ /u/} {/A/ /ei/} {/@/ /e/ /ei/} {/d/ /s/ /t/} {/tS/ /l/} {/k/} {/z/} {/w/} {/f/} {/m/ /n/} {/dZ/ /v/} {/b/ /y/}
Speaker 2 (CF: 0.58): {/ai/ /ei/ /i/ /u/} {/@U/} {/@/} {/e/} {/2/} {/A/} {/v/ /w/} {/dZ/ /p/ /y/} {/d/ /b/} {/t/} {/k/} {/tS/} {/l/ /m/ /n/} {/f/ /s/}
Speaker 3 (CF: 0.68): {/ei/ /i/} {/ai/} {/@/ /e/} {/2/} {/d/ /p/ /t/} {/l/ /m/} {/k/ /w/} {/v/} {/tS/} {/@U/} {/y/} {/u/} {/A/} {/z/} {/f/ /n/} {/b/ /s/} {/dZ/}
Speaker 4 (CF: 0.65): {/2/ /ai/ /i/ /ei/} {/@/ /e/} {/m/ /n/} {/k/ /l/} {/dZ/ /t/} {/d/ /s/} {/tS/} {/@U/} {/y/} {/u/} {/A/} {/w/} {/f/} {/v/} {/b/}

4.2. Viseme classes with relaxed confusions between phonemes

A disadvantage of the strictly confusable viseme set is that it contains some spurious single-phoneme visemes where the phoneme cannot be grouped because it is not confused with all other phonemes in the viseme. These types of phonemes are likely to be either borderline cases at the extremes of a viseme cluster, i.e. they have subtle visual similarities to more than one phoneme cluster, or phonemes that do not occur frequently enough in the training data to be differentiated from other phonemes.
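The strict procedure of Section 4.1, including the leftover single-phoneme viseme it produces, can be sketched on the demonstration matrix of Table 10. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, ties between equally sized groups are broken by total confusion count (one reading of "the largest number of confusions"), the vowel/consonant split and the /gar/ class are omitted, and the exhaustive enumeration of subsets is only practical for a small example such as this one.

```python
from itertools import combinations

PHONEMES = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
# Table 10: rows are the real classes, columns the estimated classes.
CONFUSION = [
    [1, 0, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 2, 1, 0, 2, 0, 0],
    [3, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 4, 0],
    [1, 0, 3, 0, 0, 0, 1],
]


def pair_confusion(m, a, b):
    """Total confusion between two classes, N{a|b} + N{b|a}."""
    return m[a][b] + m[b][a]


def group_confusion(m, group):
    """Sum of pairwise confusions within a candidate group."""
    return sum(pair_confusion(m, a, b) for a, b in combinations(group, 2))


def strict_visemes(m):
    n = len(m)
    # Single-phoneme visemes: recognized correctly and never confused.
    singles = [i for i in range(n)
               if m[i][i] > 0
               and all(pair_confusion(m, i, j) == 0 for j in range(n) if j != i)]
    visemes = [[i] for i in singles]
    remaining = [i for i in range(n) if i not in singles]
    while remaining:
        best = None
        # Try the largest groups first; every pair must be mutually confused.
        for size in range(len(remaining), 1, -1):
            candidates = [g for g in combinations(remaining, size)
                          if all(pair_confusion(m, a, b) > 0
                                 for a, b in combinations(g, 2))]
            if candidates:
                best = max(candidates, key=lambda g: group_confusion(m, g))
                break
        if best is None:
            break  # nothing left that can be merged
        visemes.append(list(best))
        remaining = [i for i in remaining if i not in best]
    # Phonemes that could not be merged become single-phoneme visemes.
    visemes.extend([i] for i in remaining)
    return visemes


result = [[PHONEMES[i] for i in v] for v in strict_visemes(CONFUSION)]
print(result)
```

On Table 10 this yields the four classes of Table 14, with /p5/ left on its own because it is not confused with every member of any other group.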
To address this we complete a second pass-through of the strictly-confused visemes listed in Table 14. We begin with the visemes as they currently stand (in our demonstration example containing four classes) and relax the condition requiring confusion with all of the phonemes. Now any single-phoneme viseme (in our demonstration, /v4/) can be allocated to a previously existing viseme if it has been confused with any phoneme in the viseme. In Table 10 we see /p5/ was confused with /p1/, /p3/, and /p4/. Because /p4/ is not in the same viseme as /p1/ and /p3/, we use the value of confusion to decide which viseme to allocate it to, as follows:

N{/p1/|/p5/} + N{/p5/|/p1/} = 0 + 3 = 3
N{/p3/|/p5/} + N{/p5/|/p3/} = 0 + 1 = 1
N{/p4/|/p5/} + N{/p5/|/p4/} = 2 + 1 = 3

Therefore, for /p5/ the total confusion with /v2/ is 3 + 1 = 4, whereas the total confusion with /v3/ is 3. We select the viseme with the most confusion to incorporate the unallocated phoneme /p5/. This reduces the number of viseme classes by merging single-phoneme visemes from Table 14 to form a second set, shown in Table 16. This has the added benefit that we have also increased the number of training samples for each classifier.

Table 16: Demonstration example 3: final phoneme-to-viseme map for relaxed-confused phonemes.

Viseme   Phonemes
/v1/     {/p6/}
/v2/     {/p1/, /p3/, /p5/, /p7/}
/v3/     {/p2/, /p4/}

Table 17: The four variations on speaker-dependent phoneme-to-viseme maps derived from phoneme confusion in phoneme classification.

Bear1 (B1): mixed vowels and consonants + strict confusion of phonemes
Bear2 (B2): split vowels and consonants + strict confusion of phonemes
Bear3 (B3): mixed vowels and consonants + relaxed confusion of phonemes
Bear4 (B4): split vowels and consonants + relaxed confusion of phonemes

Remember, as we have two versions of Table 14, one with mixed vowel and consonant phonemes and a second with divided vowel and consonant phonemes, the same still applies to our relaxed-confused viseme sets.
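The relaxed second pass can be sketched in the same way, starting from the strict visemes of Table 14 and the confusion counts of Table 10. As before this is an illustrative sketch under stated assumptions, not the authors' code: the names are hypothetical, the strict visemes are entered by hand, and a single-phoneme viseme is merged into whichever other viseme it has the largest total confusion with, provided that total is greater than zero.

```python
PHONEMES = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
# Table 10: rows are the real classes, columns the estimated classes.
CONFUSION = [
    [1, 0, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 2, 1, 0, 2, 0, 0],
    [3, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 4, 0],
    [1, 0, 3, 0, 0, 0, 1],
]
# Table 14 (strict visemes), written as indices into PHONEMES.
STRICT_VISEMES = [[5], [0, 2, 6], [1, 3], [4]]


def total_confusion(m, phoneme, viseme):
    """Sum of N{p|q} + N{q|p} over every phoneme q in the viseme."""
    return sum(m[phoneme][q] + m[q][phoneme] for q in viseme)


def relaxed_visemes(m, strict):
    visemes = [list(v) for v in strict]
    for single in [v for v in strict if len(v) == 1]:
        p = single[0]
        # Score every other viseme by its total confusion with p.
        scores = [(total_confusion(m, p, v), idx)
                  for idx, v in enumerate(visemes) if p not in v]
        if not scores:
            continue
        best_score, best_idx = max(scores, key=lambda s: s[0])
        if best_score > 0:
            visemes[best_idx].append(p)
            # Drop the now-empty single-phoneme class for p.
            visemes = [v for v in visemes if v != [p]]
    return [sorted(v) for v in visemes]


relaxed = [[PHONEMES[i] for i in v]
           for v in relaxed_visemes(CONFUSION, STRICT_VISEMES)]
print(relaxed)
```

For /p5/ the totals are 4 against {/p1/, /p3/, /p7/} and 3 against {/p2/, /p4/}, so it joins the former, reproducing Table 16; /p6/ has no confusions and stays a single-phoneme viseme.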
This means we end up with four types of speaker-dependent phoneme-to-viseme maps, described in Table 17. Our strictly-confused P2V maps in Table 15 become the relaxed P2V maps in Table 18. In Table 17 we have labeled each of the four variations B1, B2, B3 and B4 for ease of reference.

Now, and this is why these visemes are defined as relaxed, for any remaining phonemes which have confusions but are so far not assigned to a viseme, the phoneme-pair confusions are used to map each remaining phoneme to an appropriate viseme, even though it does not confuse with all phonemes already in it. Any remaining phonemes which are not assigned to a viseme are grouped into a new garbage /gar/ viseme. This approach ensures any phonemes which have been confused with any other are grouped into a viseme.

Table 18: Relaxed-confused phoneme speaker-dependent visemes. The score in brackets is the ratio of visemes to phonemes. B3 visemes are listed on top, and B4 visemes are listed below.

Classification P2V mapping - permitting mixing of vowels and consonants

Speaker 1 (CF: 0.28): {/b/ /e/ /ei/ /p/ /w/ /y/ /k/} {/2/ /ai/ /f/ /i/ /m/ /n/ /@U/} {/dZ/ /z/} {/A/ /u/} {/d/ /s/ /t/} {/tS/ /l/} {/@/ /v/} {/@/ /v/}
Speaker 2 (CF: 0.32): {/A/ /@/ /ai/ /ei/ /i/ /s/ /tS/} {/e/ /t/ /v/ /w/ /y/} {/l/ /m/ /n/} {/2/ /f/} {/z/} {/b/ /d/ /p/} {/@U/ /u/} {/dZ/ /k/}
Speaker 3 (CF: 0.40): {/2/ /ai/ /ei/ /f/ /i/ /n/} {/@/ /e/ /y/ /tS/} {/b/ /s/ /v/} {/l/ /m/ /u/} {/dZ/} {/@U/} {/z/} {/d/ /p/ /t/} {/k/ /w/} {/A/}
Speaker 4 (CF: 0.32): {/2/ /ai/ /tS/ /i/ /ei/} {/A/ /m/ /u/ /n/} {/@/ /e/ /p/ /v/ /y/} {/dZ/ /t/} {/k/ /l/ /w/} {/@U/} {/d/ /f/ /s/} {/b/}

Classification P2V mapping - restricting mixing of vowels and consonants

Speaker 1 (CF: 0.47): {/2/ /i/ /@U/ /u/} {/A/ /ai/} {/@/ /e/ /ei/} {/b/ /w/ /y/} {/d/ /f/ /s/ /t/} {/k/} {/z/} {/m/} {/l/} {/tS/} {/dZ/ /k/ /v/ /z/}
Speaker 2 (CF: 0.29): {/A/ /2/ /@/ /ai/ /ei/ /i/ /@U/ /u/} {/k/ /t/ /v/ /w/} {/tS/ /l/ /m/ /n/} {/f/ /s/} {/dZ/ /p/ /y/} {/b/ /d/} {/z/}
Speaker 3 (CF: 0.56): {/2/ /ai/ /i/ /ei/} {/@/ /e/} {/b/ /s/ /v/} {/d/ /p/ /t/} {/l/ /m/} {/y/} {/dZ/} {/@U/} {/z/} {/u/} {/@/ /e/} {/k/ /w/} {/f/ /n/} {/A/} {/tS/}
Speaker 4 (CF: 0.50): {/2/ /ai/ /i/ /ei/} {/tS/ /k/ /l/ /w/} {/d/ /f/ /s/ /v/} {/m/ /n/} {/f/} {/A/} {/dZ/ /t/} {/@U/} {/u/} {/y/} {/b/}

4.3. Results analysis

Figure 7 (top) compares the new speaker-dependent viseme method with the Lee visemes, which are the benchmark from the isolated word study. For Speaker 1 and Speaker 3, no new viseme map significantly improves upon Lee's performance, although we do see improvements for both Speaker 2 and Speaker 4. The strictly-confused and split viseme map improves upon Lee's previous best word classification.

The second set of our experiments, with continuous speech training data (RMAV), repeats our investigation with speaker-dependent visemes. These have been derived with the same methods described in Sections 4.1 and 4.2 and are listed in full for each speaker in Appendix A. Our classification method is identical to that used previously with HMMs. In the previous work of [86], we see limited improvement in word classification with viseme classes due to the size of the dataset.

In Figure 7 (bottom) we have plotted the word correctness achieved for each RMAV speaker using all four variants of the speaker-dependent visemes. Our first observation is that on this figure the correctness scores achieved range from 26.67% to 41.53%, whereas in Figure 7 (top) the values range from 20.60% to 36.53%. As before, this overall increase is attributed to the larger volume of training samples in RMAV compared to AVLetters2.

Compared to the benchmark of the Disney vowel and Woodward consonant visemes, which has been plotted in black on Figure 7 (bottom), we see that the comparison between speaker-dependent visemes and the best speaker-independent visemes is subject to the speaker. For three out of 12 speakers (sp01, sp03, sp05), the speaker-dependent visemes are all worse than our benchmark.
For another three of our 12 speakers (sp02, sp09, sp14) all of the speaker-dependent visemes out-perform the benchmark. For all six remaining speakers, the results are mixed. This suggests that speaker-dependent visemes could improve on speaker-independent ones, but that it is essential that they are exactly right for the individual; otherwise they become, at worst, detrimental, or a lot of effort for no significant improvement.

Careful observation of Figure 7 (top) shows that, when considering the performance of mixed or split visemes, split visemes significantly (> 1 s.e.) outperform mixed. When considering relaxed versus strict, the strict has a marginal advantage but it is not significant (< 1 s.e.).

The comparison of strict and split visemes for continuous speech (Figure 7, bottom) is consistent with the isolated word observations. The strictly-confused visemes perform better than those with a relaxed confusion, but not statistically significantly (< 1 s.e.). Again, we see that mixing vowel and consonant phonemes within individual viseme classes reduces the classification performance, but not significantly.

In Figure 8 we have plotted accuracy, A, and correctness, C, for our best performing speaker-dependent visemes (B1) on continuous speech. We also plot the accuracy scores of our benchmark from Woodward and Disney's visemes. These are compared with the correctness scores as a baseline to show the improvement. Whilst the improvement of speaker-dependent visemes is not significant when measured by correctness, by plotting the accuracy of the viseme classifiers we can see that they do have a positive influence in reducing insertion errors, which are a bugbear of lipreading.

Figure 7: Word classification correctness C ± 1 s.e., using all four new methods of deriving speaker-dependent visemes. AVL2 (top) and RMAV (bottom) speakers against Lee (top) and Woodward and Disney (bottom) benchmarks in black. [x-axes: speaker; y-axes: HTK correctness, C%.]

Figure 8: Comparing the accuracy change between strict and relaxed visemes to show the improvement in accuracy/reduction in insertion errors for all 12 speakers in continuous speech. The baseline is the correctness classification which ignores insertion error penalties. [x-axis: speaker; y-axis: HTK accuracy, %.]

5. Performance of individual visemes

In Figures 9 and 10, the contribution of each viseme has been listed in descending order along the x-axis for each speaker in AVL2. The contribution of each viseme is measured as the probability of each class, Pr{v | v̂}. These values have been calculated from the HResults confusion matrices. This analysis of visemes within a set is also used in [91], which proposes a threshold subject to the information in the features. The same viseme comparison analysis has been repeated for our continuous speech recognition experiments and the results are shown in Figures 11 and 12.

In the isolated word data (Figures 9 and 10) the difference between a high-performing speaker map and a poor one is striking. Speaker 3, for example, has at least five visemes for which Pr{v | v̂} = 1 (more in some configurations) whereas Speaker 1 has only one good viseme. Referring to Tables 15 and 18, there is no consistency in the best viseme, although generally visual silence appears to be easy to spot. This variation is to be expected: speaker variability is a very serious problem in lipreading.

Figures 11 and 12 show the same analysis for the continuous speech data. Now there is a shallower drop-off to the curve and there are certainly no visemes for which Pr{v | v̂} = 1.
Although there appears to be less variability among speakers, this is an illusion caused by the poorly-performing visemes being similar among speakers; within the top five visemes there are significant differences among speakers.

Figure 9: Individual viseme classification, Pr{v|v̂}, with speaker-dependent visemes for four speakers with isolated word training of classifiers. B1 visemes (top) and B2 visemes (bottom). [Axes: recognition Pr{v|v̂} against ordered viseme classes.]

Ordered viseme classes, B1 (top):
Sp1: v07 v02 v01 v04 v06 v03
Sp2: sil v10 v01 v02 v06 v08 v07 v05 v03
Sp3: v07 v08 v10 v12 sil v04 v01 v03 v06 v05 v11 v02 v09
Sp4: sil v01 v04 v12 v03 v06 v05 v02 v17

Ordered viseme classes, B2 (bottom):
Sp1: sil v05 v01 v03 v04 v02 v06 gar
Sp2: sil v09 v14 v01 v06 v07 v04 v02 v11 v08
Sp3: v04 v11 v12 v13 v14 v15 sil v01 v02 v05 v07 v03 v09 v08 v06
Sp4: v10 v11 v12 sil v06 v01 v05 v13 v04 v02 v14 v03

Figure 10: Individual viseme classification, Pr{v|v̂}, with speaker-dependent visemes for four speakers with isolated word training of classifiers. B3 visemes (top) and B4 visemes (bottom). [Axes: recognition Pr{v|v̂} against ordered viseme classes.]

Ordered viseme classes, B3 (top):
Sp1: sil v01 v03 v06 v12 v09 v11 v05 v02
Sp2: gar v01 v06 v03 v04 v02 v07 v05 v08
Sp3: v04 v07 v08 v10 v12 sil v03 v01 v06 v05 v11 v02 v09
Sp4: v04 sil v01 v12 v03 gar v06 v05 v02

Ordered viseme classes, B4 (bottom):
Sp1: v09 sil v03 v04 v01 v05 v02 gar
Sp2: sil v01 v05 v03 v04 v06 v02
Sp3: v04 v09 v10 v11 v12 sil v01 v14 v06 v02 v03 v07 v05
Sp4: v10 sil v04 v03 v06 v01 v09 v02 v05 v08 gar
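The per-viseme measure plotted in Figures 9 to 12 can be illustrated with a short Python sketch. It assumes a confusion matrix whose rows are reference labels and whose columns are recognised labels, and it reads Pr{v|v̂} as the fraction of segments recognised as a viseme that truly belong to it. Both conventions, the function names and the toy counts are illustrative assumptions rather than details taken from the experiments reported here.

```python
from typing import Dict, List, Sequence, Tuple


def per_viseme_probability(
    labels: Sequence[str], confusion: Sequence[Sequence[int]]
) -> Dict[str, float]:
    """Return Pr{v | v_hat} per class as diagonal count / column total.

    Assumed layout (hypothetical): confusion[i][j] counts segments whose
    reference label is labels[i] and whose recognised label is labels[j].
    """
    n = len(labels)
    result: Dict[str, float] = {}
    for j, name in enumerate(labels):
        column_total = sum(confusion[i][j] for i in range(n))
        # A class that is never recognised gets probability 0 by convention.
        result[name] = confusion[j][j] / column_total if column_total else 0.0
    return result


def ranked(probabilities: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order visemes from best to worst, as on the x-axis of the figures."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)


# Toy counts, invented purely for illustration.
labels = ["sil", "v01", "v02"]
confusion = [
    [10, 0, 0],  # reference sil
    [0, 6, 4],   # reference v01
    [0, 2, 4],   # reference v02
]

probs = per_viseme_probability(labels, confusion)
ordered = ranked(probs)
for name, value in ordered:
    print(f"{name}: {value:.2f}")
```

With these toy counts silence is always correct when it is recognised, while the two remaining classes are confused with each other, which mirrors the steep drop-off described for the weaker isolated-word speakers.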
Figure 11: Individual viseme classification, Pr{v|v̂}, with speaker-dependent visemes for twelve speakers with continuous speech training of classifiers. B1 visemes (top) and B2 visemes (bottom). [Axes: recognition Pr{v|v̂} against ordered viseme classes, 1 to 19.]

Figure 12: Individual viseme classification, Pr{v|v̂}, with speaker-dependent visemes for twelve speakers with continuous speech training of classifiers. B3 visemes (top) and B4 visemes (bottom). [Axes: recognition Pr{v|v̂} against ordered viseme classes, 1 to 19.]

6. Conclusions

While lipreading, and hence expressive audio-visual speech recognition, faces a number of challenges, one of the persistent difficulties has been the multiplicity of mappings between phonemes and visemes. This paper has described a study of previously suggested Phoneme-to-Viseme (P2V) maps. For isolated word classification, Lee's [71] is the best of the previously published maps. For continuous speech, a combination of Woodward's and Disney's visemes is better. The best performing viseme sets have, on average, between two and four phonemes per viseme.
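The phonemes-per-viseme ratio quoted above is easy to check for any candidate map. The following Python sketch is illustrative only: the three-class map is a hypothetical toy rather than one of the maps evaluated in this paper, and the helper names are assumptions, not part of any published toolkit.

```python
from typing import Dict, List


def invert_map(viseme_to_phonemes: Dict[str, List[str]]) -> Dict[str, str]:
    """Build the phoneme-to-viseme lookup, enforcing a many-to-one mapping."""
    phoneme_to_viseme: Dict[str, str] = {}
    for viseme, phonemes in viseme_to_phonemes.items():
        for phoneme in phonemes:
            if phoneme in phoneme_to_viseme:
                # A phoneme may belong to one viseme class only.
                raise ValueError(f"/{phoneme}/ appears in more than one viseme")
            phoneme_to_viseme[phoneme] = viseme
    return phoneme_to_viseme


def mean_phonemes_per_viseme(viseme_to_phonemes: Dict[str, List[str]]) -> float:
    """Average class size: total phonemes divided by number of visemes."""
    total = sum(len(phonemes) for phonemes in viseme_to_phonemes.values())
    return total / len(viseme_to_phonemes)


# Hypothetical toy map, for illustration only (not a map from this paper).
toy_map = {
    "v01": ["p", "b", "m"],
    "v02": ["f", "v"],
    "v03": ["S", "Z", "dZ", "tS"],
}

p2v = invert_map(toy_map)
ratio = mean_phonemes_per_viseme(toy_map)
print(f"{len(p2v)} phonemes in {len(toy_map)} visemes, mean {ratio:.2f} per viseme")
```

Checking the many-to-one property at the same time as the ratio guards against a map that silently assigns one phoneme to two classes, which would make the ratio misleading.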
When looking at speaker-independent visemes, most viseme sets show no difference in correctness between isolated and continuous speech. It is interesting to note, however, that Woodward consonant visemes are better for continuous speech and are linguistically derived, whereas Lee visemes are better for isolated words and are data-derived. This suggests that an optimal set of visemes for all speakers would need to consider both the visual speech gestures of the individual and the rules of language. This, in essence, is the dilemma for visemes: does one choose units that make sense in terms of likely visual gestures, or in terms of the linguistic problem to be solved?

Figure 13: A simple augmentation of the conventional lip-reading system to include speaker-dependent visemes.

We have also derived some new visemes, the `Bear' visemes. These new data-driven visemes respect speaker individuality in speech, and we use this property to demonstrate that our second data-driven method, a strictly-confused viseme derivation with split vowel and consonant phonemes, can improve word classification. The best of the Bear visemes are the strictly-confused phonemes with split vowels and consonants (B2), for both isolated and continuous speech.

Furthermore, a review of these speaker-dependent visemes (listed in Tables 15, 18, and Appendix A) shows that formerly `accepted' visemes such as {/p/ /b/ /m/} and {/S/ /Z/ /dZ/ /tS/} are no longer present. The same holds for the vowel-based visemes: six of our eight prior viseme sets pair /2/ with /A/ (albeit not as a complete viseme; other phonemes are also present), but in our best speaker-dependent visemes these two phonemes are not paired. This is an interesting insight because it suggests that formerly `accepted' strong visemes might not be so useful for all speakers, and that some adaptability, or further investigation into viseme variation, is still needed.
Our suggestion at this time is that linguistics, or co-articulation in continuous speech, is a strong influence causing this variation.

In practical terms, our new viseme derivation method is simple and can easily be included within a conventional lipreading system. This is demonstrated in Figure 13, where our clustering method is shown in dashed boxes. We recommend this approach for viseme classification since speaker-independent visemes are unlikely to perform well.

In general, speaker-dependent visemes reduce insertion errors when classifying continuous speech. This is thought to be because the phoneme confusions in speaker-dependent visemes are affected by speaker-specific visual co-articulation. For all viseme sets, not mixing vowel and consonant phonemes significantly improves classification.

7. Acknowledgments

We gratefully acknowledge the assistance of Dr Yuxuan Lan and Dr Barry-John Theobald, formerly of the University of East Anglia, for their help with HTK and general advice and guidance. This work was conducted while Helen L. Bear was in receipt of a studentship from the UK Engineering and Physical Sciences Research Council (EPSRC).

References

[1] B.-J. T. Jacob L. Newman, S. J. Cox, Limitations of visual speech recognition, in: Proceedings of the International Conference on Audio-Visual Speech Processing, 2010.
[2] E. Ong, R. Bowden, Robust lip-tracking using rigid flocks of selected linear predictors, in: 8th IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG2008), 2008, pp. 247–254.
[3] I. Matthews, S. Baker, Active appearance models revisited, International Journal of Computer Vision 60 (2) (2004) 135–164. URL http://www.springerlink.com/openurl.asp?id=doi:10.1023/B:VISI.0000029666.37597.d3
[4] T. Cootes, G. Edwards, C. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 681–685. doi:10.1109/34.927467.
[5] Y. Lan, B.-J. Theobald, R.
Harvey, View independent computer lip-reading, in: IEEE International Conference on Multimedia and Expo (ICME), 2012, pp. 432–437. doi:10.1109/ICME.2012.192.
[6] A. Pass, J. Zhang, D. Stewart, An investigation into features for multi-view lipreading, in: Image Processing (ICIP), 2010 17th IEEE International Conference on, IEEE, 2010, pp. 2417–2420.
[7] S. Moore, R. Bowden, Local binary patterns for multi-view facial expression recognition, Computer Vision and Image Understanding 115 (4) (2011) 541–558. doi:10.1016/j.cviu.2010.12.001.
[8] K. Kumar, T. Chen, R. Stern, Profile view lip reading, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 4, 2007, pp. IV-429–IV-432. doi:10.1109/ICASSP.2007.366941.
[9] R. Kaucic, A. Blake, Accurate, real-time, unadorned lip tracking, in: Computer Vision, 1998. Sixth International Conference on, IEEE, 1998, pp. 370–375.
[10] S. L. Bauman, G. Hambrecht, Analysis of view angle used in speechreading training of sentences, American Journal of Audiology 4 (3) (1995) 67–70. URL http://aja.asha.org/cgi/content/abstract/4/3/67
[11] P. Lucey, G. Potamianos, S. Sridharan, Visual speech recognition across multiple views, in: A. W.-C. Liew, S. Wang (Eds.), Visual Speech Recognition: Lip Segmentation and Mapping, 2009. doi:10.4018/978-1-60566-186-5.
[12] A. Blokland, A. H. Anderson, Effect of low frame-rate video on intelligibility of speech, Speech Communication 26 (1-2) (1998) 97–103. doi:10.1016/S0167-6393(98)00053-3.
[13] T. Saitoh, R. Konishi, A study of influence of word lip reading by change of frame rate, in: Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), 2010.
[14] H. Bear, R. W. Harvey, B.-J. Theobald, Y. Lan, Resolution limits on visual speech recognition, in: IEEE International Conference on Image Processing, 2014, pp. 2009–2013. doi:10.1109/ICIP.2014.7025274.
[15] M. Heckmann, F. Berthommier, C. Savariaux, K. Kroschel, Effects of image distortions on audio-visual speech recognition, in: AVSP 2003-International Conference on Audio-Visual Speech Processing, 2003, pp. 163–168.
[16] M. Vitkovitch, P. Barber, Visible speech as a function of image quality: Effects of display parameters on lipreading ability, Applied Cognitive Psychology 10 (2) (1996) 121–140. doi:10.1002/(SICI)1099-0720(199604)10:2<121::AID-ACP371>3.0.CO;2-V. URL http://dx.doi.org/10.1002/(SICI)1099-0720(199604)10:2<121::AID-ACP371>3.0.CO;2-V
[17] L. Cappelletta, N. Harte, Phoneme-to-viseme mapping for visual speech recognition, in: ICPRAM (2), 2012, pp. 322–329.
[18] D. Howell, B.-J. Theobald, S. J. Cox, Confusion modelling for automated lip-reading using weighted finite-state transducers, in: AVSP, 2013, pp. 197–202.
[19] T. J. Hazen, K. Saenko, C.-H. La, J. R. Glass, A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments, in: Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI '04, ACM, New York, NY, USA, 2004, pp. 235–242. doi:10.1145/1027933.1027972. URL http://doi.acm.org/10.1145/1027933.1027972
[20] J. Shin, J. Lee, D. Kim, Real-time lip reading system for isolated Korean word recognition, Pattern Recognition 44 (3) (2011) 559–571.
[21] H. L. Bear, R. Harvey, Decoding visemes: Improving machine lip-reading, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 2009–2013.
[22] I. Matthews, J. Bangham, R. Harvey, S. Cox, Non-linear scale decomposition based features for visual speech recognition, Proceedings of the IX European Signal Processing Conference (EUSIPCO) (1998) 303–305.
[23] Y. Lan, R. Harvey, B. Theobald, E.-J. Ong, R. Bowden, Comparing visual features for lipreading, in: International Conference on Auditory-Visual Speech Processing 2009, 2009, pp. 102–106.
[24] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, R. Bowden, Improving visual features for lip-reading, Proceedings of the International Conference on Audio-Visual Speech Processing (AVSP) 7 (3) (2010) 42–48.
[25] I. Matthews, T. Cootes, J. Bangham, S. Cox, R. Harvey, Extraction of visual features for lipreading, Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (2) (2002) 198–213. doi:10.1109/34.982900.
[26] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book (for HTK Version 3.4), Cambridge University Engineering Department, 2006. URL http://htk.eng.cam.ac.uk/docs/docs.shtml
[27] Q. Zhu, A. Alwan, On the use of variable frame rate analysis in speech recognition, in: Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on, Vol. 3, IEEE, 2000, pp. 1783–1786.
[28] K. Thangthai, R. Harvey, S. Cox, B.-J. Theobald, Improving lip-reading performance for robust audiovisual speech recognition using DNNs, in: Proc. FAAVSP, 1st Joint Conference on Facial Analysis, Animation and Audio-Visual Speech Processing, 2015.
[29] F. J. Huang, T. Chen, Tracking of multiple faces for human-computer interfaces and virtual environments, in: Proceedings of IEEE International Conference on Multimedia and Expo (ICME), Vol. 3, 2000, pp. 1563–1566. doi:10.1109/ICME.2000.871067.
[30] J. Jiang, A. Alwan, L. E. Bernstein, E. T. Auer, P. A. Keating, Similarity structure in perceptual and physical measures for visual consonants across talkers, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, 2002, pp. I-441–I-444. doi:10.1109/ICASSP.2002.5743749.
[31] S. Lesner, P. Kricos, Visual vowel and diphthong perception across speakers, Journal of the Academy of Rehabilitative Audiology 14 (1981) 252–258.
[32] R. Cutler, L. Davis, Look who's talking: speaker detection using video and audio correlation, in: IEEE International Conference on Multimedia and Expo (ICME), Vol. 3, 2000, pp. 1589–1592. doi:10.1109/ICME.2000.871073.
[33] J. Luettin, N. Thacker, S. Beet, Speaker identification by lipreading, in: Proceedings of the Fourth International Conference on Spoken Language (ICSLP), Vol. 1, 1996, pp. 62–65. doi:10.1109/ICSLP.1996.607030.
[34] H. L. Bear, S. J. Cox, R. W. Harvey, Speaker-independent machine lip-reading with speaker-dependent viseme classifiers, Facial Animation and Audio-Visual Speech Processing (FAAVSP) 2015 (2015) 190–195. URL http://www.isca-speech.org/archive/avsp15/papers/av15_190.pdf
[35] J. L. Newman, S. J. Cox, Speaker independent visual-only language identification, in: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, IEEE, 2010, pp. 5026–5029.
[36] S. Taylor, B.-J. Theobald, I. Matthews, The effect of speaking rate on audio and visual speech, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3037–3041. doi:10.1109/ICASSP.2014.6854158.
[37] E. K. Patterson, S. Gurbuz, Z. Tufekci, J. N. Gowdy, CUAVE: A new audio-visual database for multimodal human-computer interface research, in: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, 2002, pp. II-2017–II-2020. doi:10.1109/ICASSP.2002.
[38] J. F. G. Perez, A. F. Frangi, E. L. Solano, K. Lukas, Lip reading for robust speech recognition on embedded devices, in: Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., Vol. 1, 2005, pp. 473–476. doi:10.1109/ICASSP.2005.
[39] K. Paleček, Lipreading using spatiotemporal histogram of oriented gradients, in: 2016 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 1882–1885. doi:10.1109/EUSIPCO.2016.
[40] R. E. Shor, The production and judgment of smile magnitude, The Journal of General Psychology 98 (1) (1978) 79–96.
[41] S. Fagel, Effects of smiling on articulation: Lips, larynx and acoustics, in: Development of multimodal interfaces: active listening and synchrony, Springer, 2010, pp. 294–303.
[42] M. Kienast, A. Paeschke, W. Sendlmeier, Articulatory reduction in emotional speech, in: Sixth European Conference on Speech Communication and Technology, 1999.
[43] M. Kienast, W. F. Sendlmeier, Acoustical analysis of spectral and temporal changes in emotional speech, in: ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000.
[44] F. Shaw, B.-J. Theobald, Expressive modulation of neutral visual speech, IEEE MultiMedia 23 (4) (2016) 68–78.
[45] W. Hamza, E. Eide, R. Bakis, M. Picheny, J. Pitrelli, The IBM expressive speech synthesis system, in: Eighth International Conference on Spoken Language Processing, 2004.
[46] N. N. Khatri, Z. H. Shah, S. A. Patel, Facial expression recognition: A survey, International Journal of Computer Science and Information Technologies (IJCSIT) 5 (1) (2014) 149–152.
[47] S. Happy, A. Routray, Automatic facial expression recognition using features of salient facial patches, IEEE Transactions on Affective Computing 6 (1) (2015) 1–12.
[48] J. Yan, W. Zheng, Q. Xu, G. Lu, H. Li, B. Wang, Sparse kernel reduced-rank regression for bimodal emotion recognition from facial expression and speech, IEEE Transactions on Multimedia 18 (7) (2016) 1319–1329.
[49] S. Zhang, X. Wang, G. Zhang, X. Zhao, Multimodal emotion recognition integrating affective speech with facial expression, WSEAS Transactions on Signal Processing 10 (2014) (2014) 526–537.
[50] T. Cootes, G. Edwards, C. Taylor, Active appearance models, Pattern Analysis and Machine Intelligence, IEEE Transactions on 23 (6) (2001) 681–685. doi:10.1109/34.927467.
[51] R. Seymour, D. Stewart, J. Ming, Comparison of image transform-based features for visual speech recognition in clean and corrupted videos, Journal on Image and Video Processing 2008 (2008) 14.
[52] G. Potamianos, C. Neti, J. Luettin, I. Matthews, Audio-visual automatic speech recognition: An overview, Issues in Visual and Audio-Visual Speech Processing 22 (2004) 23.
[53] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.
[54] Z. Yan, V. Jagadeesh, D. DeCoste, W. Di, R. Piramuthu, HD-CNN: Hierarchical deep convolutional neural network for image classification, in: International Conference on Computer Vision (ICCV), Vol. 2, 2015.
[55] M. Sundermeyer, R. Schluter, H. Ney, LSTM neural networks for language modeling, in: Interspeech, 2012, pp. 194–197.
[56] W. Byeon, T. M. Breuel, F. Raue, M. Liwicki, Scene labeling with LSTM recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3547–3555.
[57] J. S. Chung, A. Zisserman, Lip Reading in the Wild, Springer International Publishing, Cham, 2017, pp. 87–103. doi:10.1007/978-3-319-54184-6_6. URL http://dx.doi.org/10.1007/978-3-319-54184-6_6
[58] M. Wand, J. Koutník, J. Schmidhuber, Lipreading with long short-term memory, in: Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, IEEE, 2016, pp. 6115–6119.
[59] B.-J. Theobald, Visual speech synthesis using shape and appearance models, Ph.D. thesis, University of East Anglia (2003).
[60] C. A. Binnie, P. L. Jackson, A. A. Montgomery, Visual intelligibility of consonants: A lipreading screening test with implications for aural rehabilitation, Journal of Speech and Hearing Disorders 41 (4) (1976) 530.
[61] C. G. Fisher, Confusions among visually perceived consonants, Journal of Speech, Language and Hearing Research 11 (4) (1968) 796.
[62] J. R. Franks, J. Kimble, The confusion of English consonant clusters in lipreading, Journal of Speech, Language and Hearing Research 15 (3) (1972) 474.
[63] B. E. Walden, R. A. Prosek, A. A. Montgomery, C. K. Scherr, C. J. Jones, Effects of training on the visual recognition of consonants, Journal of Speech, Language and Hearing Research 20 (1) (1977)
[64] P. B. Kricos, S. A. Lesner, Differences in visual intelligibility across talkers, The Volta Review 82 (1982) 219–226.
[65] E. Owens, B. Blazek, Visemes observed by hearing-impaired and normal-hearing adult viewers, Journal of Speech and Hearing Research 28 (3) (1985) 381.
[66] J. Lander, Read my lips: Facial animation techniques, http://www.gamasutra.com/view/feature/131587/read_my_lips_facial_animation_.php, accessed: 2014-01-28 (2014).
[67] A. A. Montgomery, P. L. Jackson, Physical characteristics of the lips underlying vowel lipreading performance, The Journal of the Acoustical Society of America 73 (1983) 2134–2144.
[68] E. B. Nitchie, Lip-Reading, principles and practise: A handbook for teaching and self-practise, Frederick A Stokes Co, New York, 1912.
[69] E. Bozkurt, C. Erdem, E. Erzin, T. Erdem, M. Ozkan, Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation, in: 3DTV Conference, IEEE, 2007, pp. 1–4.
[70] J. Jeffers, M. Barley, Speechreading (lipreading), Thomas, Springfield, IL, 1971.
[71] S. Lee, D. Yook, Audio-to-visual conversion using Hidden Markov Models, in: PRICAI 2002: Trends in Artificial Intelligence, Springer, 2002, pp. 563–570.
[72] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, J. Zhou, Audio-visual speech recognition, in: Final Workshop 2000 Report, Vol. 764, 2000.
[73] K. E. Finn, A. A. Montgomery, Automatic optically-based recognition of speech, Pattern Recognition Letters 8 (3) (1988) 159–164.
[74] F. Heider, G. M. Heider, An experimental investigation of lipreading, Psychological Monographs 52 (232) (1940) 124–153.
[75] M. F. Woodward, C. G. Barber, Phoneme perception in lipreading, Journal of Speech, Language and Hearing Research 3 (3) (1960) 212.
[76] A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions, Bulletin Calcutta Mathematical Society 35 (1943) 99–109.
[77] K. Wilson, The Columbia guide to standard American English, New York: Columbia University Press, 1993.
[78] S. Cox, R. Harvey, Y. Lan, J. Newman, B. Theobald, The challenge of multispeaker lip-reading, in: International Conference on Auditory-Visual Speech Processing, 2008, pp. 179–184.
[79] H. L. Bear, Decoding visemes: improving machine lip-reading, Ph.D. thesis, University of East Anglia (2016).
[80] W. M. Fisher, G. R. Doddington, K. M. Goudie-Marshall, The DARPA speech recognition research database: specifications and status, in: Proceedings of the DARPA Workshop on speech recognition, 1986, pp. 93–99.
[81] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, R. Bowden, Improving visual features for lip-reading, in: Proceedings of International Conference on Auditory-Visual Speech Processing, Vol. 201, 2010.
[82] Cambridge University, UK. BEEP pronunciation dictionary [online] (1997) [cited Jan 2013].
[83] J. Gower, Generalized procrustes analysis, Psychometrika 40 (1) (1975) 33–51. doi:10.1007/BF02291478. URL http://dx.doi.org/10.1007/BF02291478
[84] S. Baker, Inverse compositional algorithm, in: K. Ikeuchi (Ed.), Computer Vision, Springer US, 2014, pp. 426–428. doi:10.1007/978-0-387-31439-6_759. URL http://dx.doi.org/10.1007/978-0-387-31439-6_759
[85] T. J. Hazen, Automatic alignment and error correction of human generated transcripts for long speech recordings, in: INTERSPEECH, Vol. 2006, 2006, pp. 1606–1609.
[86] H. L. Bear, R. W. Harvey, B.-J. Theobald, Y. Lan, Which phoneme-to-viseme maps best improve visual-only computer lip-reading?, in: Advances in Visual Computing, Springer, 2014, pp. 230–239. doi:10.1007/978-3-319-14364-4_22.
[87] J. Demsar, Statistical comparisons of classifiers over multiple datasets, Journal of Machine Learning Research 7 (2006) 1–30.
[88] S. J. Young, G. Evermann, M. Gales, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK book version 3.4 (2006).
[89] R. R. Bouckaert, E. Frank, Evaluating the replicability of significance tests for comparing learning algorithms, in: Advances in knowledge discovery and data mining, Springer, 2004, pp. 3–12.
[90] Y. Bengio, Y. Grandvalet, No unbiased estimator of the variance of k-fold cross-validation, The Journal of Machine Learning Research 5 (2004) 1089–1105.
[91] H. L. Bear, G. Owen, R. Harvey, B.-J. Theobald, Some observations on computer lip-reading: moving from the dream to the reality, in: SPIE Security + Defence, International Society for Optics and Photonics, 2014, pp. 92530G–92530G. doi:10.1117/12.2067464.

Appendix A.
RMAV Speaker-dependent P2V maps Table A.19: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp01 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /dZ/ /m/ /v01/ // /2/ /@/ /ay/ /v01/ /I@/ /t/ /T/ /uw/ /v01/ // /2/ /@/ /ay/ /v02/ /3/ /I/ /iy/ /k/ /eh/ /I@/ /I/ /iy/ /z/ /eh/ /I@/ /I/ /iy/ /n/ /N/ /r/ /s/ /v02/ /6/ /@U/ /v02/ /3/ /I/ /iy/ /k/ /v02/ /S/ /T/ /v/ /w/ /v03/ /ey/ /v03/ /O/ /3/ /ey/ /n/ /N/ /r/ /s/ /z/ /v04/ /@/ /D/ /E/ /eh/ /v04/ /A/ /sil/ /sil/ /sil/ /sil/ /sp/ /v03/ /b/ /d/ /f/ /k/ /U/ /v05/ /uw/ /gar/ /gar/ /A/ // /2/ /O/ /m/ /n/ /N/ /p/ /r/ /v05/ /A/ /v06/ /U/ /@/ /ay/ /@/ /b/ /tS/ /r/ /s/ /t/ /v06/ /I@/ /t/ /T/ /uw/ /v07/ /O@/ /tS/ /d/ /D/ /E/ /eh/ /sil/ /sil/ /sil/ /sp/ /z/ /v08/ /OI/ /eh/ /ey/ /f/ /g/ /H/ /gar/ /gar/ /A/ /O/ /AU/ /@/ /v07/ /6/ /@U/ /p/ /w/ /v09/ /@/ /H/ /dZ/ /m/ /6/ /@U/ /D/ /3/ /ey/ /g/ /H/ /v08/ /S/ /v10/ /AU/ /@U/ /OI/ /p/ /S/ /O@/ /H/ /dZ/ /6/ /@U/ /OI/ /v09/ /O/ /v11/ /b/ /d/ /f/ /k/ /O@/ /U/ /w/ /y/ /Z/ /OI/ /O@/ /U/ /uw/ /Z/ sp01 /v10/ // /m/ /n/ /N/ /p/ /r/ /Z/ /Z/ /v11/ /d/ /g/ /H/ /r/ /s/ /t/ /v12/ /b/ /v12/ /D/ /dZ/ /v13/ /y/ /v13/ /S/ /T/ /v/ /w/ /v14/ /2/ /ay/ /z/ /v15/ /Z/ /v14/ /g/ /v16/ /O@/ /v15/ /tS/ /H/ /v17/ /sil/ /v16/ /Z/ /v18/ /OI/ /sil/ /sil/ /sil/ /sp/ /v19/ /tS/ /v20/ /@/ /v21/ /AU/ /gar/ /gar/ /sp/ Table A.20: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp02 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /l/ /m/ /n/ /p/ /v01/ /@/ /ay/ /E/ /eh/ /v01/ /@/ /ay/ /b/ /d/ /v01/ /@/ /ay/ /E/ /eh/ /s/ /S/ /t/ /v/ /w/ /ey/ /I/ /iy/ /eh/ /ey/ /dZ/ /ey/ /I/ /iy/ /w/ /v02/ /O/ /I@/ /6/ /@U/ /v02/ /l/ /m/ /n/ /p/ /v02/ /b/ /m/ /n/ /N/ /v02/ /g/ /H/ /I@/ /I/ /v03/ // /2/ /AU/ /OI/ /s/ /S/ /t/ /v/ /w/ /r/ /s/ /S/ /t/ /v/ /k/ /v04/ /U/ /uw/ /w/ /v/ /w/ /y/ /z/ /v03/ /@/ /ay/ /b/ /d/ /v05/ 
/O@/ /sil/ /sil/ /sil/ /sp/ /sil/ /sil/ /sil/ /sp/ /eh/ /ey/ /dZ/ /v06/ /sil/ /gar/ /gar/ /A/ // /2/ /O/ /gar/ /gar/ /A/ // /2/ /O/ /v04/ /A/ /O/ /v07/ /A/ /@/ /tS/ /E/ /3/ /f/ /@/ /tS/ /d/ /D/ /f/ /v05/ /3/ /uw/ /y/ /z/ /v08/ /b/ /m/ /n/ /N/ /f/ /g/ /H/ /I@/ /I/ /f/ /g/ /H/ /I@/ /dZ/ sp02 /v06/ /6/ /@U/ /r/ /s/ /S/ /t/ /v/ /I/ /iy/ /k/ /N/ /6/ /dZ/ /k/ /l/ /6/ /@U/ /v07/ // /2/ /AU/ /OI/ /v/ /w/ /y/ /z/ /6/ /@U/ /OI/ /T/ /O@/ /@U/ /OI/ /T/ /O@/ /U/ /v08/ /f/ /N/ /O@/ /v09/ /dZ/ /O@/ /U/ /uw/ /y/ /z/ /U/ /uw/ /Z/ /v09/ /E/ /v10/ /d/ /D/ /f/ /g/ /z/ /Z/ /v10/ /tS/ /T/ /k/ /l/ /v11/ /Z/ /v11/ /tS/ /T/ /v12/ /U/ /v12/ /Z/ /v13/ /sil/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /@/ /sp/ /gar/ /gar/ /@/ 41 Table A.21: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp03 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /ey/ /f/ /I/ /iy/ /v01/ /E/ /3/ /sil/ /uw/ /v01/ /ey/ /f/ /I/ /iy/ /v01/ /ay/ /eh/ /ey/ /I@/ /k/ /l/ /m/ /n/ /S/ /v02/ /U/ /k/ /l/ /m/ /n/ /S/ /iy/ /6/ /@U/ /S/ /v03/ /ay/ /eh/ /ey/ /I@/ /S/ /v02/ /g/ /k/ /l/ /m/ /v02/ /D/ /g/ /iy/ /6/ /@U/ /v02/ /E/ /r/ /s/ /t/ /p/ /r/ /s/ /t/ /T/ /v03/ /E/ /r/ /s/ /sil/ /v04/ /O/ /z/ /T/ /uw/ /z/ /v05/ /O@/ /sil/ /sil/ /sil/ /sp/ /sil/ /sil/ /sil/ /sp/ /v04/ /d/ /T/ /v/ /w/ /v06/ // /2/ /@/ /gar/ /gar/ /A/ // /2/ /O/ /gar/ /gar/ /A/ // /2/ /O/ /v05/ /O/ /@U/ /p/ /v07/ /@/ /@/ /ay/ /@/ /b/ /tS/ /@/ /@/ /b/ /tS/ /d/ /v06/ // /v08/ /AU/ /tS/ /d/ /D/ /eh/ /3/ /d/ /D/ /E/ /3/ /f/ /v07/ /@/ /ay/ /b/ /tS/ /v09/ /A/ /3/ /g/ /H/ /I@/ /N/ /f/ /H/ /dZ/ /N/ /OI/ sp03 /v08/ /N/ /v10/ /g/ /k/ /l/ /m/ /N/ /6/ /@U/ /OI/ /p/ /OI/ /S/ /O@/ /U/ /uw/ /v09/ /H/ /p/ /r/ /s/ /t/ /T/ /p/ /T/ /O@/ /U/ /v/ /uw/ /v/ /w/ /y/ /z/ /v10/ /A/ /eh/ /3/ /T/ /v/ /w/ /y/ /Z/ /z/ /Z/ /v11/ /O@/ /U/ /v11/ /tS/ /d/ /D/ /f/ /v12/ /2/ /I@/ /v12/ /dZ/ /v/ /w/ /z/ /v13/ /Z/ /v13/ /b/ /v14/ /@/ /v14/ /S/ /Z/ /v15/ /AU/ /v15/ /H/ /N/ /gar/ /gar/ /OI/ /sp/ /sil/ 
/sil/ /sil/ /sp/ /gar/ /gar/ /OI/ Table A.22: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp05 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ // /O/ /@/ /D/ /v01/ // /O/ /@/ /eh/ /v01/ /ay/ /b/ /d/ /w/ /v01/ /ay/ /uw/ /3/ /ey/ /I/ /iy/ /k/ /ey/ /I/ /iy/ /6/ /@U/ /v02/ // /O/ /@/ /D/ /v02/ /d/ /D/ /f/ /dZ/ /k/ /l/ /n/ /@U/ /3/ /ey/ /I/ /iy/ /k/ /l/ /m/ /n/ /r/ /s/ /v02/ /p/ /r/ /s/ /t/ /v02/ /E/ /U/ /k/ /l/ /n/ /s/ /S/ /z/ /v03/ /ay/ /uw/ /sil/ /sil/ /sil/ /sp/ /sil/ /sil/ /sil/ /sp/ /v03/ /I@/ /N/ /uw/ /v/ /v04/ /O@/ /gar/ /gar/ /A/ /2/ /AU/ /@/ /gar/ /gar/ /A/ // /2/ /O/ /v04/ /tS/ /6/ /v05/ /2/ /AU/ /E/ /f/ /g/ /H/ /I@/ /@/ /@/ /b/ /tS/ /E/ /v05/ /ay/ /b/ /d/ /w/ /v06/ /A/ /I@/ /I@/ /dZ/ /m/ /N/ /6/ /E/ /eh/ /3/ /ey/ /g/ /v06/ /f/ /m/ /v07/ /g/ /H/ /t/ /v/ /6/ /@U/ /OI/ /p/ /r/ /g/ /H/ /I@/ /I/ /iy/ /v07/ /A/ /g/ /H/ /v08/ /p/ /w/ /y/ /r/ /s/ /S/ /t/ /T/ /iy/ /N/ /6/ /@U/ /OI/ sp05 /v08/ /@U/ /S/ /v09/ /d/ /D/ /f/ /dZ/ /T/ /O@/ /U/ /uw/ /v/ /OI/ /p/ /t/ /T/ /O@/ /v09/ /dZ/ /l/ /m/ /n/ /r/ /s/ /v/ /y/ /z/ /Z/ /O@/ /U/ /v/ /w/ /y/ /v10/ /dZ/ /s/ /S/ /y/ /z/ /Z/ /v11/ /E/ /y/ /v10/ /N/ /T/ /v12/ /T/ /v11/ /b/ /tS/ /v13/ /2/ /AU/ /v12/ /Z/ /v14/ /Z/ /sil/ /sil/ /sil/ /sp/ /v15/ /O@/ /gar/ /gar/ /@/ /OI/ /v16/ /j/ /h/ /gar/ /gar/ /@/ /OI/ /sil/ /sp/ 42 Table A.23: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp06 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /@/ /ay/ /d/ /D/ /v01/ /A/ // /2/ /@/ /v01/ /H/ /N/ /6/ /@U/ /v01/ /A/ // /2/ /@/ /eh/ /I/ /k/ /l/ /n/ /3/ /I@/ /I/ /6/ /@U/ /v02/ /@/ /ay/ /d/ /D/ /3/ /I@/ /I/ /6/ /@U/ /n/ /p/ /s/ /t/ /@U/ /eh/ /I/ /k/ /l/ /n/ /@U/ /v02/ /v/ /w/ /y/ /z/ /v02/ /sil/ /uw/ /n/ /p/ /s/ /t/ /v02/ /k/ /l/ /m/ /n/ /v03/ /m/ /v03/ /ay/ /ey/ /iy/ /U/ /sil/ /sil/ /sil/ /sp/ /r/ /s/ /S/ /t/ /v/ /v04/ /H/ 
/N/ /6/ /@U/ /v04/ /AU/ /O@/ /gar/ /gar/ /A/ // /2/ /O/ /v/ /w/ /y/ /z/ /v05/ /ey/ /iy/ /r/ /S/ /v05/ /E/ /@/ /b/ /tS/ /3/ /ey/ /sil/ /sil/ /sil/ /sp/ /v06/ /I@/ /v06/ /@/ /ey/ /f/ /g/ /I@/ /iy/ /gar/ /gar/ /O/ /AU/ /ay/ /@/ /v07/ /A/ // /2/ /3/ /v07/ /O/ /iy/ /dZ/ /m/ /OI/ /r/ /tS/ /d/ /D/ /E/ /ey/ /v08/ /f/ /T/ /O@/ /v08/ /k/ /l/ /m/ /n/ /r/ /S/ /T/ /O@/ /U/ /ey/ /f/ /g/ /H/ /iy/ sp06 /v09/ /uw/ /r/ /s/ /S/ /t/ /v/ /U/ /uw/ /v/ /w/ /y/ /iy/ /dZ/ /N/ /OI/ /T/ /v10/ /uw/ /v/ /w/ /y/ /z/ /y/ /z/ /Z/ /T/ /O@/ /U/ /uw/ /Z/ /v11/ /b/ /tS/ /g/ /v09/ /b/ /tS/ /d/ /D/ /Z/ /v12/ /O/ /dZ/ /g/ /dZ/ /v13/ /Z/ /v10/ /Z/ /v14/ /sil/ /v11/ /H/ /T/ /v15/ /@/ /v12/ /N/ /v16/ /AU/ /sil/ /sil/ /sil/ /sp/ /v17/ /u/ /w/ /gar/ /gar/ /OI/ /gar/ /gar/ /OI/ /sp/ Table A.24: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp08 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /eh/ /f/ /H/ /I/ /v01/ /A/ // /O/ /@/ /v01/ /eh/ /f/ /H/ /I/ /v01/ /A/ // /O/ /@/ /l/ /m/ /N/ /p/ /r/ /eh/ /ey/ /I/ /iy/ /uw/ /l/ /m/ /N/ /p/ /r/ /eh/ /ey/ /I/ /iy/ /uw/ /r/ /s/ /t/ /uw/ /r/ /s/ /t/ /uw/ /v02/ /A/ // /O/ /@/ /v02/ /U/ /sil/ /sil/ /sil/ /sp/ /v02/ /k/ /l/ /n/ /p/ /ey/ /n/ /U/ /v03/ /6/ /@U/ /gar/ /gar/ /A/ // /2/ /O/ /s/ /t/ /T/ /v/ /w/ /v03/ /ay/ /b/ /uw/ /v04/ /I@/ /@/ /ay/ /@/ /b/ /tS/ /w/ /z/ /v04/ /g/ /v05/ /AU/ /E/ /tS/ /d/ /D/ /E/ /3/ /sil/ /sil/ /sil/ /sp/ /v05/ /tS/ /v06/ /2/ /3/ /3/ /ey/ /g/ /I@/ /dZ/ /gar/ /gar/ /2/ /AU/ /@/ /b/ /v06/ /S/ /y/ /v07/ /@/ /dZ/ /k/ /n/ /6/ /@U/ /d/ /D/ /E/ /3/ /f/ /v07/ /6/ /v08/ /k/ /l/ /n/ /p/ /@U/ /OI/ /S/ /T/ /O@/ /f/ /g/ /H/ /I@/ /dZ/ sp08 /v08/ /k/ /s/ /t/ /T/ /v/ /w/ /O@/ /U/ /uw/ /v/ /w/ /dZ/ /m/ /N/ /6/ /@U/ /v09/ /dZ/ /w/ /z/ /w/ /y/ /z/ /Z/ /@U/ /OI/ /S/ /O@/ /U/ /v10/ /D/ /v/ /w/ /z/ /v09/ /d/ /D/ /f/ /H/ /U/ /y/ /Z/ /v11/ /T/ /Z/ /N/ /v12/ /3/ /I@/ /v10/ /g/ /dZ/ /v13/ /AU/ /@U/ /v11/ /b/ /tS/ /S/ /v14/ /2/ /E/ /v12/ /Z/ /v15/ /O@/ /v13/ 
/y/ /v16/ /@/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /OI/ /sil/ /sp/ /gar/ /gar/ /OI/ /O@/ 43 Table A.25: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp09 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /D/ /E/ /3/ /g/ /v01/ /E/ /3/ /v01/ /O/ /N/ /6/ /@U/ /v01/ /2/ /@U/ /k/ /l/ /m/ /n/ /p/ /v02/ /A/ // /O/ /@/ /v02/ /D/ /E/ /3/ /g/ /v02/ /A/ // /O/ /@/ /p/ /eh/ /ey/ /I/ /iy/ /6/ /k/ /l/ /m/ /n/ /p/ /eh/ /ey/ /I/ /iy/ /6/ /v02/ /I@/ /y/ /6/ /p/ /6/ /v03/ /ay/ /r/ /s/ /S/ /v03/ /U/ /uw/ /v03/ /ay/ /r/ /s/ /S/ /v03/ /k/ /l/ /m/ /n/ /v/ /w/ /z/ /v04/ /O@/ /v/ /w/ /z/ /p/ /r/ /s/ /S/ /t/ /v04/ // /2/ /@/ /b/ /v05/ /I@/ /sil/ /sil/ /sil/ /sp/ /t/ /T/ /z/ /T/ /v06/ /AU/ /gar/ /gar/ /A/ // /2/ /AU/ /sil/ /sil/ /sil/ /sp/ /v05/ /eh/ /ey/ /f/ /I/ /v07/ /2/ /@U/ /@/ /b/ /tS/ /d/ /eh/ /gar/ /gar/ /AU/ /@/ /b/ /tS/ /v06/ /O/ /N/ /6/ /@U/ /v08/ /sil/ /eh/ /ey/ /f/ /H/ /I@/ /D/ /E/ /3/ /f/ /g/ /v07/ /A/ /v09/ /k/ /l/ /m/ /n/ /I@/ /I/ /dZ/ /OI/ /T/ /g/ /H/ /I@/ /dZ/ /OI/ sp09 /v08/ /A/ /p/ /r/ /s/ /S/ /t/ /T/ /O@/ /U/ /uw/ /y/ /OI/ /O@/ /U/ /uw/ /v/ /v09/ /U/ /uw/ /t/ /T/ /z/ /y/ /Z/ /v/ /w/ /y/ /Z/ /v10/ /dZ/ /v10/ /f/ /v11/ /tS/ /v11/ /d/ /D/ /dZ/ /v12/ /Z/ /v12/ /g/ /v/ /w/ /y/ /v13/ /O@/ /v13/ /b/ /v14/ /sil/ /v14/ /tS/ /H/ /v15/ /H/ /v15/ /Z/ /v16/ /AU/ /sil/ /sil/ /sil/ /sp/ /v17/ /a/ /a/ /gar/ /gar/ /@/ /OI/ /gar/ /gar/ /@/ /OI/ /sp/ Table A.26: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp10 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /I/ /iy/ /dZ/ /l/ /v01/ /@/ /ay/ /eh/ /3/ /v01/ /@/ /uw/ /v/ /w/ /v01/ // /2/ /O/ /U/ /N/ /I/ /iy/ /6/ /@U/ /v02/ /H/ /n/ /6/ /@U/ /v02/ /@/ /ay/ /eh/ /3/ /v02/ /H/ /n/ /6/ /@U/ /v02/ // /2/ /O/ /U/ /r/ /s/ /t/ /T/ /I/ /iy/ /6/ /@U/ /r/ /s/ /t/ /T/ /v03/ /O@/ /sil/ /sil/ /sil/ /sp/ /v03/ /d/ /D/ /f/ /H/ 
/v03/ /b/ /v04/ /E/ /uw/ /gar/ /gar/ /A/ // /2/ /O/ /l/ /m/ /n/ /p/ /r/ /v04/ // /d/ /D/ /E/ /v05/ /A/ /I@/ /ay/ /@/ /b/ /tS/ /d/ /r/ /s/ /t/ /v/ /w/ /ey/ /f/ /v06/ /AU/ /d/ /D/ /E/ /eh/ /3/ /w/ /z/ /v05/ /k/ /v07/ /sil/ /3/ /ey/ /f/ /g/ /I@/ /v04/ /b/ /tS/ /y/ /v06/ /@/ /uw/ /v/ /w/ /v08/ /OI/ /I@/ /I/ /iy/ /dZ/ /k/ /sil/ /sil/ /sil/ /sp/ /v07/ /ay/ /S/ /sil/ /v09/ /@/ /k/ /l/ /m/ /N/ /OI/ /gar/ /gar/ /A/ /AU/ /@/ /E/ sp10 /v08/ /U/ /v10/ /d/ /D/ /f/ /H/ /OI/ /S/ /O@/ /U/ /z/ /I@/ /dZ/ /N/ /OI/ /S/ /v09/ /2/ /O/ /z/ /l/ /m/ /n/ /p/ /r/ /z/ /Z/ /S/ /T/ /O@/ /uw/ /Z/ /v10/ /I@/ /r/ /s/ /t/ /v/ /w/ /Z/ /v11/ /tS/ /g/ /w/ /z/ /v12/ /@/ /3/ /v11/ /S/ /v13/ /A/ /AU/ /v12/ /g/ /dZ/ /N/ /v14/ /Z/ /v13/ /b/ /tS/ /y/ /v15/ /O@/ /v14/ /Z/ /v16/ /OI/ /v15/ /T/ /gar/ /gar/ /sp/ /sil/ /sil/ /sil/ /sp/ 44 Table A.27: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp11 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /iy/ /k/ /m/ /n/ /v01/ /uw/ /v01/ /O/ /@/ /ay/ /tS/ /v01/ // /@/ /ay/ /E/ /6/ /p/ /r/ /s/ /t/ /v02/ // /@/ /ay/ /E/ /ey/ /3/ /ey/ /I/ /iy/ /t/ /3/ /ey/ /I/ /iy/ /v02/ /d/ /D/ /f/ /v02/ /dZ/ /k/ /l/ /m/ /v02/ /v/ /v03/ /A/ /v03/ /iy/ /k/ /m/ /n/ /N/ /p/ /r/ /s/ /t/ /v03/ /O/ /@/ /ay/ /tS/ /v04/ /2/ /O/ /@U/ /6/ /p/ /r/ /s/ /t/ /t/ /w/ /ey/ /v05/ /6/ /t/ /sil/ /sil/ /sil/ /sp/ /v04/ /d/ /D/ /f/ /v06/ /U/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /A/ /2/ /O/ /AU/ /v05/ /w/ /v07/ /O@/ /gar/ /gar/ /A/ // /2/ /AU/ /b/ /tS/ /d/ /D/ /f/ /v06/ /S/ /v08/ /sil/ /b/ /E/ /3/ /g/ /H/ /f/ /g/ /H/ /I@/ /6/ /v07/ /A/ // /2/ /b/ /v09/ /OI/ /H/ /I@/ /I/ /dZ/ /l/ /6/ /@U/ /OI/ /S/ /T/ /v08/ /AU/ /E/ /3/ /I/ /v10/ /I@/ /l/ /@U/ /OI/ /S/ /T/ /T/ /O@/ /U/ /uw/ /v/ /v09/ /T/ /O@/ /v11/ /AU/ /T/ /O@/ /U/ /uw/ /v/ /v/ /y/ /z/ /Z/ sp11 /v10/ /@U/ /v12/ /dZ/ /k/ /l/ /m/ /v/ /w/ /y/ /z/ /Z/ /v11/ /g/ /y/ /z/ /N/ /p/ /r/ /s/ /t/ /Z/ /v12/ /H/ /l/ /t/ /w/ /v13/ /I@/ /uw/ /v13/ /d/ /f/ /g/ 
/H/ /v14/ /Z/ /v14/ /S/ /v15/ /U/ /v15/ /y/ /z/ /v16/ /sil/ /v16/ /D/ /T/ /v/ /v17/ /dZ/ /v17/ /tS/ /v18/ /@/ /v18/ /Z/ /gar/ /gar/ /OI/ /sp/ /v19/ /b/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /@/ Table A.28: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp13 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /O/ /d/ /I/ /k/ /v01/ // /O/ /@/ /ay/ /v01/ /O/ /d/ /I/ /k/ /v01/ // /O/ /@/ /ay/ /n/ /p/ /s/ /uw/ /v/ /3/ /ey/ /I@/ /I/ /iy/ /n/ /p/ /s/ /uw/ /v/ /3/ /ey/ /I@/ /I/ /iy/ /v/ /z/ /Z/ /iy/ /v/ /z/ /Z/ /iy/ /v02/ /I@/ /v02/ /E/ /6/ /@U/ /uw/ /sil/ /sil/ /sil/ /sp/ /v02/ /d/ /f/ /g/ /k/ /v03/ /3/ /f/ /g/ /r/ /v03/ /AU/ /gar/ /gar/ /A/ // /2/ /AU/ /m/ /n/ /N/ /p/ /s/ /v04/ /b/ /D/ /E/ /eh/ /v04/ /2/ /U/ /ay/ /@/ /b/ /tS/ /D/ /s/ /t/ /v/ /w/ /z/ /v05/ /tS/ /v05/ /A/ /O@/ /D/ /E/ /eh/ /3/ /ey/ /z/ /v06/ /AU/ /iy/ /6/ /@U/ /v06/ /sil/ /ey/ /f/ /g/ /H/ /I@/ /sil/ /sil/ /sil/ /sp/ /v07/ /@/ /U/ /v07/ /@/ /I@/ /iy/ /dZ/ /m/ /N/ /gar/ /gar/ /A/ /2/ /AU/ /@/ /v08/ // /2/ /@/ /ay/ /v08/ /dZ/ /r/ /S/ /y/ /N/ /6/ /@U/ /OI/ /r/ /tS/ /D/ /E/ /H/ /dZ/ sp13 /v09/ /A/ /y/ /v09/ /d/ /f/ /g/ /k/ /r/ /S/ /t/ /T/ /O@/ /dZ/ /6/ /@U/ /OI/ /r/ /v10/ /m/ /sil/ /t/ /T/ /m/ /n/ /N/ /p/ /s/ /O@/ /U/ /w/ /y/ /r/ /S/ /T/ /O@/ /U/ /v11/ /S/ /s/ /t/ /v/ /w/ /z/ /U/ /uw/ /y/ /Z/ /v12/ /ey/ /z/ /v13/ /O@/ /w/ /v10/ /H/ /v14/ /N/ /v11/ /b/ /tS/ /D/ /gar/ /gar/ /OI/ /sp/ /v12/ /Z/ /v13/ /T/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /OI/ 45 Table A.29: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp14 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /tS/ /iy/ /dZ/ /m/ /v01/ // /O/ /@/ /ay/ /v01/ // /3/ /ey/ /f/ /v01/ // /O/ /@/ /ay/ /@U/ /p/ /r/ /s/ /t/ /eh/ /3/ /ey/ /I/ /iy/ /v02/ /S/ /v/ /w/ /y/ /eh/ /3/ /ey/ /I/ /iy/ /t/ /T/ /iy/ /v03/ /tS/ /iy/ /dZ/ /m/ /iy/ /v02/ /@/ /ay/ /N/ /v02/ 
/uw/ /@U/ /p/ /r/ /s/ /t/ /v02/ /D/ /f/ /H/ /k/ /v03/ /O/ /b/ /d/ /D/ /v03/ /U/ /t/ /T/ /m/ /n/ /r/ /s/ /S/ /l/ /v04/ /I@/ /6/ /@U/ /v04/ /O/ /b/ /d/ /D/ /S/ /t/ /v/ /w/ /v04/ /S/ /v/ /w/ /y/ /v05/ /2/ /sil/ /l/ /sil/ /sil/ /sil/ /sp/ /v05/ /g/ /H/ /k/ /v06/ /AU/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /A/ /2/ /AU/ /@/ /v06/ /E/ /U/ /v07/ /A/ /gar/ /gar/ /A/ /2/ /AU/ /@/ /tS/ /d/ /g/ /I@/ /dZ/ /v07/ // /3/ /ey/ /f/ /v08/ /A/ /@/ /E/ /g/ /H/ /I@/ /dZ/ /N/ /6/ /@U/ /OI/ /v08/ /A/ /uw/ /v09/ /O@/ /I@/ /k/ /N/ /6/ /OI/ /OI/ /p/ /T/ /O@/ /U/ /v09/ /I@/ /v10/ /a/ /a/ /OI/ /O@/ /U/ /uw/ /Z/ /U/ /uw/ /y/ /z/ /Z/ sp14 /v10/ /2/ /6/ /v11/ /D/ /f/ /H/ /k/ /Z/ /Z/ /v11/ /I@/ /m/ /n/ /r/ /s/ /S/ /v12/ /Z/ /S/ /t/ /v/ /w/ /v13/ /O@/ /v12/ /z/ /v14/ /sil/ /v13/ /y/ /v15/ /AU/ /v14/ /b/ /tS/ /d/ /T/ /v16/ /i/ /a/ /v15/ /p/ /gar/ /gar/ /@/ /OI/ /sp/ /v16/ /g/ /v17/ /dZ/ /N/ /v18/ /Z/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /@/ /OI/ Table A.30: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition con- fusions for RMAV speaker sp15 Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes /v01/ /@/ /d/ /D/ /ey/ /v01/ /@/ /ay/ /eh/ /ey/ /v01/ /@/ /d/ /D/ /ey/ /v01/ /@/ /ay/ /eh/ /ey/ /I/ /iy/ /k/ /l/ /m/ /iy/ /@U/ /uw/ /I/ /iy/ /k/ /l/ /m/ /iy/ /@U/ /uw/ /m/ /n/ /y/ /v02/ /2/ /O/ /AU/ /E/ /m/ /n/ /y/ /v02/ /b/ /d/ /D/ /f/ /v02/ /I@/ /p/ /r/ /s/ /OI/ /sil/ /sil/ /sil/ /sp/ /k/ /l/ /m/ /n/ /N/ /t/ /T/ /z/ /v03/ /6/ /gar/ /gar/ /A/ // /2/ /O/ /N/ /p/ /v/ /v03/ /eh/ /@U/ /v04/ /A/ // /3/ /ay/ /@/ /b/ /tS/ /E/ /sil/ /sil/ /sil/ /sp/ /v04/ /A/ // /2/ /O/ /v05/ /sil/ /O@/ /E/ /eh/ /3/ /g/ /H/ /gar/ /gar/ /A/ // /2/ /O/ /v05/ /6/ /v06/ /U/ /H/ /I@/ /dZ/ /N/ /6/ /@/ /tS/ /E/ /3/ /H/ /v06/ /N/ /uw/ /v/ /v07/ /@/ /6/ /@U/ /OI/ /p/ /r/ /H/ /I@/ /dZ/ /6/ /OI/ /v07/ /U/ /v08/ /b/ /d/ /D/ /f/ /r/ /s/ /S/ /t/ /T/ /OI/ /r/ /s/ /S/ /t/ sp15 /v08/ /g/ /H/ /dZ/ /k/ /l/ /m/ /n/ /N/ /T/ /O@/ /U/ /uw/ /v/ /t/ /T/ /O@/ /U/ /w/ /v09/ /3/ /N/ /p/ /v/ /v/ 
/w/ /z/ /Z/ /w/ /y/ /z/ /Z/ /v10/ /b/ /tS/ /v09/ /r/ /s/ /S/ /t/ /v11/ /3/ /z/ /v12/ /ay/ /E/ /v10/ /dZ/ /v13/ /sil/ /O@/ /v11/ /Z/ /v14/ /AU/ /OI/ /v12/ /w/ /y/ /v15/ /Z/ /v13/ /H/ /v16/ /e/ /r/ /v14/ /tS/ /gar/ /gar/ /@/ /sp/ /sil/ /sil/ /sil/ /sp/
Electrical Engineering and Systems Science, arXiv (Cornell University)


References (89)

Yuxuan Lan, B. Theobald, R. Harvey (2012)
View Independent Computer Lip-Reading
2012 IEEE International Conference on Multimedia and Expo
B. Theobald (2003)
Visual speech synthesis using shape and appearance models
S. Happy, A. Routray (2015)
Automatic facial expression recognition using features of salient facial patches
IEEE Transactions on Affective Computing, 6
Soonkyu Lee, Dongsuk Yook (2002)
Audio-to-Visual Conversion Using Hidden Markov Models
C. Binnie, P. Jackson, A. Montgomery (1976)
Visual intelligibility of consonants: a lipreading screening test with implications for aural rehabilitation.
The Journal of speech and hearing disorders, 41 4
Nidhi Khatri, Z. Shah, Samip Patel (2014)
Facial Expression Recognition: A Survey
Luca Cappelletta, N. Harte (2012)
Phoneme-to-viseme Mapping for Visual Speech Recognition
S. Bauman, G. Hambrecht (1995)
Analysis of View Angle Used in Speechreading Training of Sentences
American Journal of Audiology, 4
Adrian Pass, Jianguo Zhang, Darryl Stewart (2010)
An investigation into features for multi-view lipreading
2010 IEEE International Conference on Image Processing
Zhicheng Yan, V. Jagadeesh, D. DeCoste, Wei Di, Robinson Piramuthu (2014)
HD-CNN: Hierarchical Deep Convolutional Neural Network for Image Classification
ArXiv, abs/1410.0736
Timothy Hazen, Kate Saenko, C. La, James Glass (2004)
A segment-based audio-visual speech recognizer: data collection, development, and initial experiments
Brian Walden, R. Prosek, Allen Montgomery, Charlene Scherr, Carla Jones (1977)
Effects of training on the visual recognition of consonants.
Journal of speech and hearing research, 20 1
Joon Chung, Andrew Zisserman (2016)
Lip Reading in the Wild
S. Young, G. Evermann, M. Gales, Thomas Hain, Dan Kershaw, G. Moore, J. Odell, D. Ollason, Daniel Povey, Valtchev, P. Woodland (2006)
The HTK book version 3.4
M. Sundermeyer, R. Schlüter, H. Ney (2012)
LSTM Neural Networks for Language Modeling
R. Shor (1978)
The Production and Judgment of Smile Magnitude
Journal of General Psychology, 98
A. Razavian, Hossein Azizpour, Josephine Sullivan, S. Carlsson (2014)
CNN Features Off-the-Shelf: An Astounding Baseline for Recognition
2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops
Helen Bear, R. Harvey, B. Theobald, Yuxuan Lan (2014)
Which Phoneme-to-Viseme Maps Best Improve Visual-Only Computer Lip-Reading?
ArXiv, abs/1710.01093
Jongju Shin, Jin Lee, Daijin Kim (2011)
Real-time lip reading system for isolated Korean word recognition
Pattern Recognit., 44
S. Cox, R. Harvey, Yuxuan Lan, Jacob Newman, B. Theobald (2008)
The challenge of multispeaker lip-reading
E. Bozkurt, C. Erdem, E. Erzin, T. Erdem, M. Ozkan (2007)
Comparison of Phoneme and Viseme Based Acoustic Units for Speech Driven Realistic lip Animation
2007 3DTV Conference
Jintao Jiang, A. Alwan, L. Bernstein, E. Auer, P. Keating (2002)
Similarity structure in perceptual and physical measures for visual Consonants across talkers
2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1
Helen Bear, R. Harvey (2016)
Decoding visemes: Improving machine lip-reading
2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
K. Thangthai, R. Harvey, S. Cox, B. Theobald (2015)
Improving lip-reading performance for robust audiovisual speech recognition using DNNs
Stephen Moore, Richard Bowden (2011)
Local binary patterns for multi-view facial expression recognition
Comput. Vis. Image Underst., 115
E. Patterson, S. Gurbuz, Z. Tufekci, J. Gowdy (2002)
CUAVE: A new audio-visual database for multimodal human-computer interface research
2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2
A. Markides (1979)
Speechreading (lipreading).
Child: care, health and development, 5 1
J. Kumari, R. Rajesh, K. Pooja (2015)
Facial Expression Recognition: A Survey
Procedia Computer Science, 58
K. Wilson (1993)
The Columbia Guide to Standard American English
UK. BEEP pronunciation dictionary
I. Matthews, Simon Baker (2004)
Active Appearance Models Revisited
International Journal of Computer Vision, 60
Jacob Newman, B. Theobald, S. Cox (2010)
Limitations of visual speech recognition
Eng-Jon Ong, R. Bowden (2008)
Robust Lip-Tracking using Rigid Flocks of Selected Linear Predictors
Helen Bear, Sarah Taylor (2017)
Visual speech recognition: aligning terminologies for better understanding
ArXiv, abs/1710.01292
Helen Bear, S. Cox, R. Harvey (2015)
Speaker-independent machine lip-reading with speaker-dependent viseme classifiers
ArXiv, abs/1710.01122
Helen Bear, R. Harvey, B. Theobald, Yuxuan Lan (2014)
Resolution limits on visual speech recognition
2014 IEEE International Conference on Image Processing (ICIP)
Xiaoming Zhao, Shiqing Zhang, Xiaohua Wang, Gang Zhang (2014)
Multimodal Emotion Recognition Integrating Affective Speech with Facial Expression
M. Vitkovitch, P. Barber (1996)
Visible Speech as a Function of Image Quality: Effects of Display Parameters on Lipreading Ability
Applied Cognitive Psychology, 10
Kathleen Finn, A. Montgomery (1988)
Automatic optically-based recognition of speech
Pattern Recognit. Lett., 8
(2014)
Read my lips: Facial animation techniques
R. Bouckaert, E. Frank (2004)
Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms
S. Lesner (1981)
Visual vowel and diphthong perception across speakers
, 14
Sarah Taylor, B. Theobald, I. Matthews (2014)
The effect of speaking rate on audio and visual speech
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
A. Blokland, A. Anderson (1998)
Effect of low frame-rate video on intelligibility of speech
Speech Commun., 26
Jingjie Yan, Wenming Zheng, Qinyu Xu, G. Lu, Haibo Li, Bei Wang (2016)
Sparse Kernel Reduced-Rank Regression for Bimodal Emotion Recognition From Facial Expression and Speech
IEEE Transactions on Multimedia, 18
(1997)
Cambridge University, UK
M. Woodward, C. Barber (1960)
Phoneme perception in lipreading.
Journal of speech and hearing research, 3
I. Matthews, J. Bangham, Richard Harvey, S. Cox (1998)
Nonlinear scale decomposition based features for visual speech recognition
9th European Signal Processing Conference (EUSIPCO 1998)
J. Luettin, N. Thacker, S. Beet (1996)
Speaker identification by lipreading
Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96, 1
A. Montgomery, P. Jackson (1983)
Physical characteristics of the lips underlying vowel lipreading performance.
The Journal of the Acoustical Society of America, 73 6
Yoshua Bengio, Yves Grandvalet
No Unbiased Estimator of the Variance of K-fold Cross-validation
J. Franks, J. Kimble (1972)
The confusion of English consonant clusters in lipreading.
Journal of speech and hearing research, 15 3
M. Heckmann, F. Berthommier, C. Savariaux, K. Kroschel (2003)
Effects of image distortions on audio-visual speech recognition
(1986)
The DARPA speech recognition research database: specifications and status
Elmer Owens, Barbara Blazek (1985)
Visemes observed by hearing-impaired and normal-hearing adult viewers.
Journal of speech and hearing research, 28 3
Sascha Fagel (2009)
Effects of Smiling on Articulation: Lips, Larynx and Acoustics
Kshitiz Kumar, Tsuhan Chen, R. Stern (2007)
Profile View Lip Reading
2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 4
Fu Huang, Tsuhan Chen (2000)
Tracking of multiple faces for human-computer interfaces and virtual environments
2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532), 3
F. Neuberger (1971)
[Lip reading].
Monatsschrift fur Ohrenheilkunde und Laryngo-Rhinologie, 105 6
Jacob Newman, S. Cox (2010)
Speaker independent visual-only language identification
2010 IEEE International Conference on Acoustics, Speech and Signal Processing
Ross Cutler, L. Davis (2000)
Look who's talking: speaker detection using video and audio correlation
2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532), 3
J. Demšar (2006)
Statistical Comparisons of Classifiers over Multiple Data Sets
J. Mach. Learn. Res., 7
I. Matthews, Tim Cootes, J. Bangham, S. Cox, Richard Harvey (2002)
Extraction of Visual Features for Lipreading
IEEE Trans. Pattern Anal. Mach. Intell., 24
T. Saitoh, R. Konishi (2010)
A study of influence of word lip reading by change of frame rate
Yoni Bauduin (2004)
Audio-Visual Speech Recognition
R. Kaucic, A. Blake (1998)
Accurate, real-time, unadorned lip tracking
Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271)
Wael Hamza, E. Eide, R. Bakis, M. Picheny, J. Pitrelli (2004)
The IBM expressive speech synthesis system
(1943)
On a measure of divergence between two statistical population defined by their population distributions
Yuxuan Lan, R. Harvey, B. Theobald, Eng-Jon Ong, R. Bowden (2009)
Comparing visual features for lipreading
S. Baker (2014)
Inverse Compositional Algorithm
K. Paleček (2016)
Lipreading using spatiotemporal histogram of oriented gradients
2016 24th European Signal Processing Conference (EUSIPCO)
G. Potamianos, C. Neti, J. Luettin, I. Matthews (2004)
Audio-Visual Automatic Speech Recognition: An Overview
Jesus Perez, Alejandro Frangi, Eduardo Solano, K. Lukas (2005)
Lip reading for robust speech recognition on embedded devices
Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., 1
Miriam Kienast, A. Paeschke, W. Sendlmeier (1999)
Articulatory reduction in emotional speech
Timothy Hazen (2006)
Automatic alignment and error correction of human generated transcripts for long speech recordings
Helen Bear, Gari Owen, R. Harvey, B. Theobald (2014)
Some observations on computer lip-reading: moving from the dream to the reality
, 9253
Tim Cootes, G. Edwards, C. Taylor (1998)
Active Appearance Models
Felix Shaw, B. Theobald (2016)
Expressive Modulation of Neutral Visual Speech
IEEE MultiMedia, 23
Rowan Seymour, D. Stewart, J. Ming (2008)
Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos
EURASIP Journal on Image and Video Processing, 2008
Miriam Kienast, W. Sendlmeier (2000)
Acoustical analysis of spectral and temporal changes in emotional speech
, 69
P. Lucey, G. Potamianos, S. Sridharan (2008)
Visual speech recognition across multiple views
J. Gower (1975)
Generalized procrustes analysis
Psychometrika, 40
Q. Zhu, A. Alwan (2000)
On the use of variable frame rate analysis in speech recognition
2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), 3
Michael Wand, J. Koutník, J. Schmidhuber (2016)
Lipreading with long short-term memory
2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Yuxuan Lan, B. Theobald, R. Harvey, Eng-Jon Ong, R. Bowden (2010)
Improving visual features for lip-reading
Dominic Howell, B. Theobald, S. Cox (2013)
Confusion modelling for automated lip-reading using weighted finite-state transducers
Wonmin Byeon, T. Breuel, Federico Raue, M. Liwicki (2015)
Scene labeling with LSTM recurrent neural networks
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Cletus Fisher (1968)
Confusions among visually perceived consonants.
Journal of speech and hearing research, 11 4
S. Lesner (1982)
Differences in visual intelligibility across talkers
, 84

ISSN: 0167-6393
eISSN: ARCH-3348
DOI: 10.1016/j.specom.2017.07.001

Abstract

Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is "a set of phonemes which have identical appearance on the lips". Therefore a phoneme falls into one viseme class but a viseme may represent many phonemes: a many-to-one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguity damaging to the performance of audio-visual classifiers operating on real expressive speech, there is also considerable choice between possible mappings. In this paper we explore the issue of this choice of viseme-to-phoneme map. We show that there is a definite difference in performance between viseme-to-phoneme mappings and explore why some maps appear to work better than others. We also devise a new algorithm for constructing phoneme-to-viseme mappings from labeled speech data. These new visemes, 'Bear' visemes, are shown to perform better than previously known units.

Keywords: lipreading, speaker-dependent, viseme, phoneme, resolution, speech recognition, classification, visual speech, visual units.

1. Introduction

Recognition and synthesis of expressive audio-visual speech has proven to be a most challenging problem. When comparing audio-visual speech with acoustic recognition, one can identify several sources of difficulty. Firstly, the visual component of speech brings new problems such as pose, lighting, frame rate, resolution, and so on. Secondly, old problems in acoustic recognition, such as person specificity or the optimal recognition units, appear in new ways in the visual domain. While some of these aspects have been partially studied, progress has been hampered by very small datasets. Furthermore, reliable tracking has eluded many researchers, which in turn has led to sub-optimal feature extraction, consequent poor performance and hence incorrect conclusions about the parts of the problem that are tractable or intractable.
A further challenge is the lack of consensus on the recognition units, and it is commonplace to need to compare, say, word error rates with viseme error rates computed from a different set of visemes. Our contention is that progress in expressive audio-visual speech will remain stunted while this fundamental uncertainty remains. In this paper we review the choice of visual recognition units and provide a comprehensive set of evaluations of the competing phoneme-to-viseme mappings. We give guidance on what works well and provide explanations for the differences in performance. We also devise new algorithms for selecting optimal visual units should this be desired. We should note that while this paper tends to focus on visual-only recognition, or lipreading, this aspect is by far the most challenging, so progress on lipreading can be used to provide more useful audio-visual systems.

Corresponding author. Email address: h.l.bear@uel.ac.uk (Helen L Bear). URL: https://www.uea.ac.uk/computing/people/profile/r-w-harvey (Richard Harvey). Preprint submitted to Speech Communication Special Issue on AV expressive speech and gesture, July 2017. arXiv:1805.02934v1 [cs.CV] 8 May 2018.

The rest of this paper is structured as follows: we discuss the current restrictions on a conventional lipreading system and identify the limitation of each upon the system. We then study the current sets of published visemes, before presenting a new speaker-dependent clustering algorithm for creating sets of visemes for individual speakers. We show that creating these speaker-dependent visemes follows from simple clustering and merge algorithms. These new visemes are tested on both isolated words and continuous speech datasets before we evaluate the efficacy of the improved performance against the extra investment into a new lipreading system.
Since it is computationally simple to develop these speaker-dependent visemes, we contend they are also a useful step in the analysis of speaker variability, which itself is one of the more challenging problems in general lipreading.

2. Limitations in lipreading systems

It is often said that lipreading is difficult because not all sounds appear on the lips. This is true, but in reality there are a number of problems that can corrupt the lipreading signal even before one reaches the problem of trying to decode the visual signal. Table 1 provides a taxonomy of the challenges in lipreading. Some of them relate to the problems of extracting useful information from the visual signal whereas some appear later in the signal processing chain and relate to the coding and classification of the visual signal.

Motion is an important part of almost all realistic settings. It is therefore essential to have either some form of tracking or to devise features that are invariant to non-informational motions. An early dataset which captured speaker motion (not camera motion) is CUAVE [37]. Lipreading experiments on this dataset such as [38] examine two different features, one based on the Discrete Cosine Transform (DCT) and another on the Active Appearance Model (AAM). The AAM (which can be a shape-only, appearance-only or shape and appearance model) [4] is sometimes preceded by Linear Predictors (LP) [2]. An AAM [4] is a model trained on a combination of shape and/or appearance information from a subset of video frames. The model is usually built from video frames manually labeled with landmarks which are chosen to cover the full range of motion throughout the video. In [38] the authors prefer the DCT but note that there were implementation difficulties with the AAM which meant it was improperly tracked.
Further lip-reading experiments on CUAVE [39] clarify how challenging comparing results is, because there is no agreed evaluation protocol which could account for the motion challenge/face alignment. This is attributed to their partial success with particular speakers.

Footnote (on the claim that not all sounds appear on the lips): [1] compares the performance of a system that measures, via electromagnetic articulography, the hidden and visual parts of the mouth, so the extent of this statement can be quantified.

Table 1: Challenges to successful machine lipreading. Each challenge has some references.

Evaluation              Previously studied?
Motion                  Yes, [2–4]
Pose                    Yes, [5–11]
Expression              Yes, [6, 7]
Frame rate              Yes, [12, 13]
Video quality           Yes, [14–16]
Color                   Yes, [9]
Unit choice             Yes, [17–21]
Feature                 Yes, [3, 4, 22–24]
Classifier technology   Yes, [17, 25–28]
Multiple persons        Yes, [29–32]
Speaker identity        Yes, [33–35]
Rate of speech          Yes, [21, 36]

The majority of automatic lipreading systems use a frontal pose in which the speaker's facial plane is normal to the principal ray of the camera. However, in [7] for example, an improvement in expression recognition is seen by both computers and humans when the pose is rotated to 45°. Other work [8, 9] looks more specifically at visual speech recognition and suggests that a profile view of a speaker may not lead to catastrophically low accuracies. This observation is consistent with [10], which measures human sentence perception from three viewing angles: full-frontal view (0°), angled view (45°), and side view (90°). In this single-subject study a post-lingually deaf woman was tested to measure accuracy at the three angles independently. The three angles were randomly presented in every lipreading session. The results indicated that the side-view angle is most effective. A model for pose-mismatched lipreading is presented in [11], in which it is shown that without training data at the correct pose, the recognition accuracy falls dramatically.
However, the authors also show that this can be mitigated by projecting the features back to a canonical pose. This transformation principle is also used in [5], which presents a view-independent lipreading system. This investigation uses a continuous speech corpus, compared to the small vocabulary dataset in [11]. This later study acknowledges human lipreaders' preference for a non-frontal view and suggests it could be attributed to lip protrusion. They show that the 45° angle is preferable. In short, when it comes to pose, there is evidence that it can be accounted for and need not be insurmountable. Therefore, for this work we stick to frontal pose.

Expression can be difficult to disentangle from the spoken word when lipreading natural speech. Smiling (a happy expression) has a known effect on lip motions during speech [40]. Effects on the inner lips, outer lips and lip protrusion have been measured in [41], which shows that smiling during speech (particularly vowels) places a restriction on lip motion, with greater demand placed on the inner lips as variation in outer lips and lip protrusion is reduced. This in turn creates a greater challenge when lipreading non-neutral speech as gestures become less distinct. Furthermore, expression also affects the temporal property of speech [42, 43]. When a particular phoneme is uttered, its duration can be shortened (for example, when a speaker is angry, vowels in particular become shorter) or elongated (for example, when a speaker is sad).

To the best of our knowledge there is no systematic study which specifically investigates lipreading expressive speech. Rather, tasks focus on either synthesizing expression in faces [44–46] or expression recognition during speech [47–49].
Studies such as [12], on the effect of low video frame rate on human speech intelligibility during video communications, suggest that lower frame rates, if they are visible to the speaker, encourage humans to over-articulate to compensate for the reduced visual information available, akin to a visual Lombard effect. Accuracy is maximized when the same frame rate is used for both training and testing [13]. The authors of [13] further recommend that when the training data cannot be recorded at the same frame rate as the test data, it is best if the training data has a higher frame rate (for feature extraction) than the test data. A further observation is that word classification rates vary in a non-linear fashion as the frame rate is reduced.

When it comes to the dependence of lipreading on video quality, an investigation into the effects of compression artifacts, visual noise (simulated with white noise) and localization errors in training is presented in [15] and in [16]. The authors undertake two experiments, of which the first includes some attention to spatial resolution (the number of pixels). However, here, resolution varies along with other parameters. Neither of these papers considers the simple removal of information from a smaller image compared to a larger one. A more systematic study of resolution can be found in [14], in which video of varying resolution is parameterized using AAMs [50]. This work shows that machines can lipread continuous speech with as little as two pixels per lip.

Color has been surprisingly underused. In [9] algorithms are derived which contain three key components: shape models, motion models, and focused color feature detectors. In early works it was common to use colored lipstick or markers to help track the lips (tracking remains challenging), but many authors convert the image to grayscale and use grayscale features.

Unit choice refers to the question of whether to use phonemes, visemes, words or something else.
Classifiers built on phonemes [18], visemes [19], and words [20] have all been previously presented. Sometimes the unit choice is linked to the problem: word classifiers often use word units, whereas continuous speech has to use phonemes or visemes. It is essentially a trade-off, since using phonemes means accepting that there will be units that do not appear on the lips (the words "bad", "pad", and "mad" are usually said to be visually indistinguishable), whereas using visemes leads to better unit accuracy but there is then the problem of homopheny (words that have identical visemic transcriptions but different spellings). One study has reviewed how the unit selection affects recognition in relation to the unit selection of the supporting language model [21] and has shown that phoneme networks work best for both phoneme and viseme classifiers. However, the practical reality is that many systems use visemes and there is a need to resolve which choice of visemes works best. Comparative studies such as [17] have attempted to compare some previous viseme sets, but these often only consider a few different sets rather than the wide range available.

Lan et al. present in [24] a comparison of different features first presented in [4]. Revisited in [3], AAM features are produced as either model-based (using shape information) or pixel-based (using appearance information). In [24] Lan et al. observed that state-of-the-art AAM features with appearance parameters outperform other feature types like sieve features, 2D DCT, and eigen-lip features, suggesting appearance is more informative than shape. Also, pixel methods benefit from image normalisation to remove shape and affine variation from the region of interest (in this example, the mouth and lips). The method in [24] classified words with an audio-visual dataset known as RMAV but recommended creating classifiers with viseme labels for lipreading in future, and advises that most information comes from the inside of the mouth.
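To make the unit-choice trade-off described above concrete, the following minimal sketch groups words whose phoneme transcriptions collapse to the same viseme string under a many-to-one map. The map, the viseme labels and the pronunciations (`TOY_P2V`, `TOY_PRONUNCIATIONS`) are toy assumptions chosen for illustration; they are not one of the published mappings compared in this paper.

```python
# Illustrative only: the viseme labels, phoneme symbols and pronunciations
# below are hypothetical, not a published phoneme-to-viseme (P2V) map.
TOY_P2V = {
    "b": "v_bilabial", "p": "v_bilabial", "m": "v_bilabial",
    "ae": "v_open",
    "d": "v_alveolar", "t": "v_alveolar", "n": "v_alveolar",
}

TOY_PRONUNCIATIONS = {
    "bad": ["b", "ae", "d"],
    "pad": ["p", "ae", "d"],
    "mad": ["m", "ae", "d"],
    "mat": ["m", "ae", "t"],
    "tab": ["t", "ae", "b"],
}


def to_visemes(phonemes, p2v):
    """Translate a phoneme sequence into its viseme sequence."""
    return tuple(p2v[p] for p in phonemes)


def homophene_groups(pronunciations, p2v):
    """Return sorted groups of words sharing one viseme transcription."""
    groups = {}
    for word, phonemes in pronunciations.items():
        groups.setdefault(to_visemes(phonemes, p2v), []).append(word)
    # Only groups with more than one word are ambiguous (homophenous).
    return [sorted(words) for words in groups.values() if len(words) > 1]


if __name__ == "__main__":
    for group in homophene_groups(TOY_PRONUNCIATIONS, TOY_P2V):
        print(group)
```

Under this toy map, "bad", "pad", "mad" and "mat" share one viseme transcription while "tab" remains distinct; a coarser map would merge more words and a finer one fewer, which is the tension the choice of mapping has to resolve.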
Some works have attempted to adapt features to address dierent problems, such as motion described above. For example, in [51] the authors suggest altering HMM modeling to permit either frozen or occluded frames, and demonstrate that even low level jitter will signi cantly aect the quality of lip reading features. When it comes to the choice of classi er technology it is the norm that machine lipreading systems adapt methods from acoustic recognition. This not only follows from the observation that visual and acoustic speech have the same origins but also from the practical observation that language models are expensive to create and it makes sense to re-use the models across the two modalities. The conventional classi er process is 1) data preparation (an acoustic example is creating MFCC's [27], whereas a visual example might be [17]), 2) build Hidden Markov Model classi ers, and 3) feed the classi cation outputs through a language network to produce a transcript. Like feature selection, the choice of classi er is aected by the problem in hand. An optimal audio recognizer will not guarantee optimal performance in an audio-visual, or visual only domain. In [52], for example, it is noted that their audio-visual results should not be \read across" to lipreading. More modern deep learning techniques for lipreading are an alternative approach which require much more training data [28]. A key disadvantage of these methods is a lack of understanding about what exactly a neural network is learning in order for it to classify unseen gestures. So often the results from deep learning are good but the scienti c insight can be poor. Thus recent work has begun to demonstrate performance of dierent deep learning approaches with a variety of neural network architectures. Convolution neural networks (CNN) have been particularly prevalent for image classi cation ([53, 54]) and Long Short Term Memory networks (LSTM) are performing well on temporal problems (e.g. 
language modeling [55] or scene labeling [56]). For lipreading, we have evidence that both of these achieve good recognition rates in end-to-end systems: in [57] a CNN achieves 61.1% top-1 accuracy and in [58] an LSTM achieves 79.6% top-1 accuracy on a small dataset. However, our lipreading problem is a combination of these challenges, that is, a temporal-visual classification problem.

For lipreading multiple persons, [30, 31] detailed human lipreading of multiple people: [30] recognizes consonants, and [31] visual vowels. [32] presents an audio-visual system for HCI which automatically detects a talking person (both spatially and temporally) using video and audio data from a single microphone. In summary, there is no reason to think that multi-person lipreading is any less viable than single-person lipreading, although the challenge of variability due to speaker identity is real.

Speaker identity is a major challenge in machine lipreading because visual speech is not consistent across individuals. Sometimes this can be advantageous, as in [33] where lipreading is used to identify speakers. With known speakers, lipreading recognition rates can be high, but with unknown speakers (referred to as speaker-independent lipreading) performance is not yet at the same standard as speaker-dependent lipreading. In [34] results show that classifiers trained and tested on distinct speakers are statistically significantly different from those trained and tested on the same speakers. This is supported in [35], where the authors strive to discriminate languages from visual speech and conclude that improving performance requires moving away from speaker-dependent features. For acoustic speech it is acknowledged that people have different speaking styles, accents and rates of speech.
For visual speech there is the additional confusion of what we call a "visual accent", in which very similar sounds can be made by persons with very different mouth shapes; examples of visual accent effects include people who talk out of the side of their mouths, ventriloquists, and mimics.

The rate of speech alters both an utterance's duration and the articulator positions. Therefore both the sounds produced and, particularly, the visible appearance are altered. In [36], the authors present an experiment which measures the effect of speech rate and shows that the effect is significantly higher on visual speech than on acoustic speech. Anecdotal evidence suggests that a speaker's visual style can evolve as speakers age, due to co-articulation reduction as a person travels and interacts with other adults [21].

In summary, while audio-visual speech processing has a great number of challenges, one of the pivotal ones is the question of the visual units and how they should be derived. Since all language models are defined in terms of phonemes, the practical question is the choice of the mapping from phonemes to visemes. The literature has presented a great number of these phoneme-to-viseme (P2V) mappings and few consistent comparisons between them, so this is the topic of the next section.

3. Comparison of phoneme-to-viseme mappings

A summary of published P2V maps is provided in [59], Tables 2.3 and 2.4. This list is not exhaustive and these mappings are motivated by: a focus on just consonants [60–63]; being speaker-dependent [64]; prioritizing particular visemes [65]; or a focus on vowels [66, 67]. These are useful starting points, but for the purpose of this study we would like the phoneme-to-viseme mappings to include all phonemes in the transcript of the dataset, to accurately reflect the range of phonemes used in a full vocabulary. Therefore, some mappings used here are a pairing of two mappings suggested in the literature, e.g. one map for the vowels and one map for the consonants.
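The pairing of a vowel map with a consonant map can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the two toy maps, the phoneme symbols, the viseme labels (`V01`, `C01`, ...) and the word pronunciations are hypothetical and are not taken from any published map. It also shows why homophenous words such as "bad", "pad" and "mad" collapse to a single viseme string, and it sends any phoneme missing from both maps to one shared garbage label.

```python
# Toy maps for illustration only: the phoneme symbols and groupings are
# assumptions, not a transcription of any published map.
VOWEL_MAP = [{"ae", "eh"}, {"iy", "ih"}, {"uw", "uh"}]
CONSONANT_MAP = [{"p", "b", "m"}, {"f", "v"}, {"t", "d", "n"}]


def build_p2v(vowel_sets, consonant_sets):
    """Pair a vowel map with a consonant map into one phoneme -> viseme lookup."""
    p2v = {}
    for prefix, viseme_sets in (("V", vowel_sets), ("C", consonant_sets)):
        for index, phonemes in enumerate(viseme_sets, start=1):
            for phoneme in sorted(phonemes):
                # A many-to-one map: each phoneme may belong to one viseme only.
                if phoneme in p2v:
                    raise ValueError(f"{phoneme!r} is assigned to two visemes")
                p2v[phoneme] = f"{prefix}{index:02d}"
    return p2v


def to_visemes(phonemes, p2v, garbage="gar"):
    """Translate a phoneme sequence; unmapped phonemes share one garbage label."""
    return [p2v.get(phoneme, garbage) for phoneme in phonemes]


P2V = build_p2v(VOWEL_MAP, CONSONANT_MAP)

# Hypothetical pronunciations of three words that differ only in the first consonant.
WORDS = {"bad": ["b", "ae", "d"], "pad": ["p", "ae", "d"], "mad": ["m", "ae", "d"]}
VISEME_WORDS = {word: to_visemes(phones, P2V) for word, phones in WORDS.items()}

for word, visemes in VISEME_WORDS.items():
    print(word, visemes)
```

All three words map to the same viseme sequence, which is the ambiguity a viseme classifier must leave to the language model. Pairing every vowel map with every consonant map in this way is what turns a handful of published maps into a much larger set of full-vocabulary P2V maps.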
A full list of the mappings used is in Tables 2 and 3. Of these mappings, the most common are 'the Disney 12' [66], the 'lipreading 18' by Nichie [68], and Fisher's [61]. In total, eight vowel maps and fifteen consonant maps are identified here, and all of these are paired with each other to provide 120 P2V maps to test.

Table 2: Vowel phoneme-to-viseme maps previously presented in literature.
Classification    Viseme phoneme sets
Bozkurt [69]      {/ei/ /2/} {/ei/ /e/ /æ/} {/3/} {/i/ /I/ /@/ /y/} {/AU/} {/O/ /A/ /OI/ /@U/} {/u/ /U/ /w/}
Disney [66]       {/U/ /h/} {/E@/ /i/ /ai/ /e/ /2/} {/u/} {/U@/ /O/ /O@/}
Hazen [19]        {/AU/ /U/ /u/ /@U/ /O/ /w/ /OI/} {/2/ /A/} {/æ/ /e/ /ai/ /ei/} {/@/ /I/ /i/}
Jeffers [70]      {/A/ /æ/ /2/ /ai/ /e/ /ei/ /I/ /i/ /O/ /@/ /I/} {/OI/ /O/} {/AU/} {/3/ /@U/ /U/ /u/}
Lee [71]          {/i/ /I/} {/e/ /ei/ /æ/} {/A/ /AU/ /ai/ /2/} {/O/ /OI/ /@U/} {/U/ /u/}
Montgomery [67]   {/i/ /I/} {/e/ /æ/ /ei/ /ai/} {/A/ /O/ /2/} {/U/ /3/ /@/} {/OI/} {/i/ /hh/} {/AU/ /@U/} {/u/ /u/}
Neti [72]         {/O/ /2/ /A/ /3/ /OI/ /AU/ /H/} {/u/ /U/ /@U/} {/æ/ /e/ /ei/ /ai/} {/I/ /i/ /@/}
Nichie [68]       {/uw/} {/U/ /@U/} {/AU/} {/i/ /2/ /ay/} {/2/} {/iy/ /æ/} {/e/ /I@/} {/u/} {/@/ /ei/}

Recent comparisons between maps include [17] and part of [59]. In [59] the following list of reasons is given for discrepancies between classifier sets.
Variation between speakers - i.e. speaker identity.
Variation between viewers - indicating that lipreading ability varies by individual; those with more practice are better able to identify visemes.
The context of the speech presented - context has an influence on how consonants appear on the lips. In real tasks the context will enable easier distinction between phonemes that are indistinguishable in syllable-only tests.
Clustering criteria - the grouping methods vary between authors. For example, 'phonemes are said to belong to a viseme if, when clustered, the percent correct identification for the viseme is above some threshold, which is typically between 70 - 75% correct.
A stricter grouping criterion has a higher threshold, so more visemes are identified.' [59].

These last two points are reinforced by [17], who achieved the highest accuracy with the phoneme-to-viseme map of Jeffers in an HMM-based lipreading system. They attribute this to the use of continuous speech, which encapsulates the same viseme in more contexts within the training data, and suggest that the Jeffers map has better clustering of consonant visemes for those contexts.

In Table 4 we have described the sources and derivation methods for all of the phoneme-to-viseme maps used in our comparison study. We see that the majority are constructed using human testing with few test subjects; for example, Finn [73] used only one lipreader, and Kricos [64] twelve. Data-driven methods are the most recent, e.g. Lee's [71] visemes were presented in 2002 and Hazen's [19] in 2004. The remaining visemes are based around linguistic/phonemic rules.

As an example, the clustering method of Hazen [19] involved bottom-up clustering using maximum Bhattacharyya distances [76] to measure similarity between the phoneme-labeled Gaussian models. Before clustering, some phonemes were manually merged: /em/ with /m/, /en/ with /n/, and /Z/ with /S/.

Table 3: Consonant phoneme-to-viseme maps previously presented in literature.
Classification    Viseme phoneme sets
Binnie [60]       {/p/ /b/ /m/} {/f/ /v/} {/T/ /D/} {/S/ /Z/} {/k/ /g/} {/w/} {/r/} {/l/ /n/} {/t/ /d/ /s/ /z/}
Bozkurt [69]      {/g/ /H/ /k/ /N/} {/l/ /d/ /n/ /t/} {/s/ /z/} {/tS/ /S/ /dZ/ /Z/} {/T/ /D/} {/r/} {/f/ /v/} {/p/ /b/ /m/}
Disney [66]       {/p/ /b/ /m/} {/w/} {/f/ /v/} {/T/} {/l/} {/d/ /t/ /z/ /s/ /r/ /n/} {/S/ /tS/ /j/} {/y/ /g/ /k/ /N/}
Finn [73]         {/p/ /b/ /m/} {/T/ /D/} {/w/ /s/} {/k/ /h/ /g/} {/S/ /Z/ /tS/ /j/} {/y/} {/z/} {/f/} {/v/} {/t/ /d/ /n/ /l/ /r/}
Fisher [61]       {/k/ /g/ /N/ /m/} {/p/ /b/} {/f/ /v/} {/S/ /Z/ /dZ/ /tS/} {/t/ /d/ /n/ /T/ /D/ /z/ /s/ /r/ /l/}
Franks [62]       {/p/ /b/ /m/} {/f/} {/r/ /w/} {/S/ /dZ/ /tS/}
Hazen [19]        {/l/} {/r/} {/y/} {/b/ /p/} {/m/} {/s/ /z/ /h/} {/tS/ /dZ/ /S/ /Z/} {/t/ /d/ /T/ /D/ /g/ /k/} {/N/} {/f/ /v/}
Heider [74]       {/p/ /b/ /m/} {/f/ /v/} {/k/ /g/} {/S/ /tS/ /dZ/} {/T/} {/n/ /t/ /d/} {/l/} {/r/}
Jeffers [70]      {/f/ /v/} {/r/ /q/ /w/} {/p/ /b/ /m/} {/T/ /D/} {/tS/ /dZ/ /S/ /Z/} {/s/ /z/} {/d/ /l/ /n/ /t/} {/g/ /k/ /N/}
Kricos [64]       {/p/ /b/ /m/} {/f/ /v/} {/w/ /r/} {/t/ /d/ /s/ /z/} {/k/ /n/ /j/ /h/ /N/ /g/} {/l/} {/T/ /D/} {/S/ /Z/ /tS/ /dZ/}
Lee [71]          {/d/ /t/ /s/ /z/ /T/ /D/} {/g/ /k/ /n/ /N/ /l/ /y/ /H/} {/dZ/ /tS/ /S/ /Z/} {/r/ /w/} {/f/ /v/} {/p/ /b/ /m/}
Neti [72]         {/l/ /r/ /y/} {/s/ /z/} {/t/ /d/ /n/} {/S/ /Z/ /dZ/ /tS/} {/p/ /b/ /m/} {/N/ /k/ /g/ /w/} {/f/ /v/} {/T/ /D/}
Nichie [68]       {/p/ /b/ /m/} {/f/ /v/} {/W/ /w/} {/r/} {/s/ /z/} {/S/ /Z/ /tS/ /j/} {/T/} {/l/} {/k/ /g/ /N/} {/H/} {/t/ /d/ /n/} {/y/}
Walden [63]       {/p/ /b/ /m/} {/f/ /v/} {/T/ /D/} {/S/ /Z/} {/w/} {/s/ /z/} {/r/} {/l/} {/t/ /d/ /n/ /k/ /g/ /j/}
Woodward [75]     {/p/ /b/ /m/} {/f/ /v/} {/w/ /r/ /W/} {/t/ /d/ /n/ /l/ /T/ /D/ /s/ /z/ /tS/ /dZ/ /S/ /Z/ /j/ /k/ /g/ /h/}

A P2V map may be summarized as a ratio we call the "compression factor", CF,
\[ CF = \frac{N_V}{N_P} \tag{1} \]
which is the ratio of the number of output visemes, $N_V$, to input phonemes, $N_P$. The compression factors for the P2V maps are listed in Table 5.
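As a small worked instance of Equation (1), the sketch below counts the visemes and phonemes in the Binnie consonant map as it is listed in Table 3 and computes the ratio. The function names are hypothetical, and the phoneme symbols are plain ASCII stand-ins for the symbols in the table.

```python
def compression_factor(num_visemes, num_phonemes):
    """Equation (1): CF = N_V / N_P."""
    if num_phonemes <= 0:
        raise ValueError("a map must cover at least one phoneme")
    return num_visemes / num_phonemes


def compression_factor_of_map(viseme_sets):
    """Count visemes (sets) and phonemes (members) in a P2V map, then apply Eq. (1)."""
    num_visemes = len(viseme_sets)
    num_phonemes = sum(len(phonemes) for phonemes in viseme_sets)
    return compression_factor(num_visemes, num_phonemes)


# Binnie consonant map as listed in Table 3 (ASCII stand-ins for the symbols).
BINNIE = [
    {"p", "b", "m"}, {"f", "v"}, {"T", "D"}, {"S", "Z"}, {"k", "g"},
    {"w"}, {"r"}, {"l", "n"}, {"t", "d", "s", "z"},
]

BINNIE_CF = compression_factor_of_map(BINNIE)
print(f"Binnie: {len(BINNIE)} visemes, CF = {BINNIE_CF:.2f}")
```

Nine visemes over nineteen phonemes gives a ratio of about 0.47, the 9:19 entry reported for Binnie in Table 5. A smaller CF therefore means a coarser map, with more phonemes folded into each viseme.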
Silence and garbage visemes are not included in compression factors. Because we have a British English dataset and some works were formulated using American English diacritics [77], we omit the following phonemes from some mappings: /si/ (Disney [66]); /axr/ /en/ /el/ /em/ (Bozkurt [69]); /axr/ /em/ /epi/ /tcl/ /dcl/ /en/ /gcl/ /kcl/ (Hazen [19]); and /axr/ /em/ /el/ /nx/ /en/ /dx/ /eng/ /ux/ (Jeffers [70]).

Moreover, Kricos provides speaker-dependent visemes [64]. These have been generalized for our tests using the most common mixtures of phonemes. Where a viseme map does not include phonemes present in the ground truth transcript, these are grouped into one viseme denoted (/gar/). Note that all phonemes in each P2V map are in the dataset but no mapping includes all 29 phonemes in the AVL2 vocabulary.

Table 4: A comparison of literature phoneme-to-viseme maps.
Author      Year  Inspiration            Description                                                         Test subjects
Binnie      1976  Human testing          Confusion patterns                                                  unknown
Bozkurt     2007  Subjective linguistics Common tri-phones                                                   462
Disney      -     Speech synthesis       Observations                                                        unknown
Finn        1988  Human perception       Montgomery's visemes and /H/                                        1
Fisher      1986  Human testing          Multiple-choice intelligibility test                                18
Franks      1972  Human perception       Confusions among sounds produced in similar articulatory positions  unknown 275
Hazen       2004  Data-driven            Bottom-up clustering                                                223
Heider      1940  Human perception       Confusions post-training                                            unknown
Jeffers     1971  Linguistics            Sensory and cognitive correlates                                    unknown
Kricos      1982  Human testing          Hierarchical clustering                                             12
Lee         2002  Data-driven            Merging of Fisher visemes                                           unknown
Montgomery  1983  Human perception       Confusion patterns                                                  10
Neti        2000  Linguistics            Decision tree clusters                                              26
Nichie      1912  Human observations     Human observation of lip movements                                  unknown
Walden      1977  Human testing          Hierarchical clustering                                             31
Woodward    1960  Linguistics            Language rules and context                                          unknown

Table 5: Compression factors for viseme maps previously presented in literature.
Consonant Map  V:P    CF      Vowel Map   V:P   CF
Woodward       4:24   0.16    Jeffers     3:19  0.16
Disney         6:22   0.18    Neti        4:20  0.20
Fisher         5:21   0.23    Hazen       4:18  0.22
Lee            6:24   0.25    Disney      4:11  0.36
Franks         5:17   0.29    Lee         5:14  0.36
Kricos         8:24   0.33    Bozkurt     7:19  0.37
Jeffers        8:23   0.35    Montgomery  8:19  0.42
Neti           8:23   0.35    Nichie      9:15  0.60
Bozkurt        8:22   0.36    -           -     -
Finn           10:23  0.43    -           -     -
Walden         9:20   0.45    -           -     -
Binnie         9:19   0.47    -           -     -
Hazen          10:21  0.48    -           -     -
Heider         8:16   0.50    -           -     -
Nichie         18:33  0.54    -           -     -

3.1. Data preparation

The AVLetters2 (AVL2) dataset [78] is used to train and test HMM classifiers based upon our 120 P2V mappings with HTK [26]. AAM features (concatenated as in (4)) are used as they are known to outperform other feature methods in machine lipreading [17]. AVL2 [78] is an HD version of the AVLetters dataset [22]. It is a single-word dataset of five male British English speakers reciting the alphabet seven times. We use four of these speakers, as the fifth tracked too poorly for us to have confidence in lipreading accuracy. The speakers in this dataset are illustrated in [79]. AVL2 has 28 videos of between 1,169 and 1,499 frames, between 47 s and 58 s in duration. As the dataset provides isolated words of single letters, it lends itself to controlled experiments without needing to address matters such as varying co-articulation.

Table 6: The number of parameters in shape, appearance and combined shape & appearance AAM features for each speaker in the AVLetters2 dataset. Features retain 95% variance of facial information.
Speaker  Shape  Appearance  Combined
S1       11     27          38
S2       9      19          28
S3       9      17          25
S4       9      17          25

Table 6 describes the features extracted from the AVL2 videos.
These features have been derived after tracking a full-face Active Appearance Model throughout the video before extracting features containing only the lip area. Therefore, they contain information representing only the speaker's lips and none of the rest of the face. Speakers 2, 3 and 4 are similar in the number of parameters contained in the features. The combined features are the concatenation of the shape and appearance features [3]. All features retain 95% variance of facial shape and appearance information.

The RMAV dataset consists of 20 British English speakers (we use 12 speakers, seven male and five female, who have been tracked, to maintain comparability with earlier work), with 200 utterances per speaker of a subset of the Resource Management (RM) context-independent sentences from [80], which totals around 1000 words each. The sentences are selected to maintain a good coverage of all phonemes [81] and to represent the coverage of phonemes in spoken speech. The original videos were recorded in high definition and in a full-frontal position. Individual speakers are tracked using Active Appearance Models [3] and AAM features of concatenated shape and appearance information have been extracted. Figure 1 plots the frequency of all phonemes within the RMAV dataset over 200 sentences and Table 7 lists the number of parameters of shape, appearance, and combined shape and appearance AAM features where the features retain 95% variance of facial information.

Figure 1: Occurrence frequency of phonemes in the RMAV dataset.

3.2. Classification method

The method for these speaker-dependent classification tests on our combined shape and appearance features uses HMM classifiers built with HTK [26]. The features selected are from the AVL2 and RMAV datasets. The videos are tracked with a full-face AAM (Figure 2 (left)) and the features extracted consist of only the lip information (Figure 2 (right)).
This means that we obtain a robust tracking from the full-face model and then, using this fit information, we apply a sub-active appearance model of only the lips.

Table 7: The number of parameters of shape, appearance, and combined shape and appearance AAM features for the RMAV dataset speakers. Features retain 95% variance of facial information.
Speaker  Shape  Appearance  Combined
S1       13     46          59
S2       13     47          60
S3       13     43          56
S4       13     47          60
S5       13     45          58
S6       13     47          60
S7       13     37          50
S8       13     46          59
S9       13     45          58
S10      13     45          58
S11      14     72          86
S12      13     45          58

The HMM classifiers are based upon viseme labels within each P2V map. A ground truth for measuring correct classification is a viseme transcription produced using the BEEP British English pronunciation dictionary [82] and a word transcription. The phonetic transcript is converted to a viseme transcript assuming the visemes in the mapping being tested (Tables 2 and 3). We test using leave-one-out seven-fold cross-validation. Seven folds are selected as we have seven utterances of the alphabet per speaker in AVL2; this is increased to 10-fold cross-validation for RMAV speakers. The HMMs are initialized using 'flat start' training and re-estimated eight times, and then force-aligned using HTK's HVite. Training is completed by re-estimating the HMMs three more times with the force-aligned transcript.

3.3. Active appearance models

An example full-face shape model is in Figure 2, where there are 76 landmarks, 34 of which model the inner and outer lip contours.

Figure 2: Example Active Appearance Model shape mesh (left), a lips only model is on the right.
The shape $s$ of an AAM is the collection of coordinates of the $v$ vertices (landmarks) which make up a mesh,
\[ s = (x_1, y_1, x_2, y_2, \ldots, x_v, y_v) \tag{2} \]
These landmarks are aligned and normalized via Procrustes analysis [83] and then analyzed via a Principal Component Analysis (PCA) to give
\[ s = s_0 + \sum_{i=1}^{n} p_i s_i \tag{3} \]
where $s_0$ is the mean shape, $p_i$ are the shape parameters (coefficients), and $s_i$ are the eigenvectors of the covariance matrix corresponding to the $n$ largest eigenvalues [3].

Having built an Active Shape Model, the next step is to augment it with appearance data and hence compute an Active Appearance Model (AAM). Each shape model is used to warp the image data back to the mean shape. The appearance of those warped images is now modeled, again using PCA [4],
\[ A(x) = A_0(x) + \sum_{i=1}^{m} \lambda_i A_i(x) \quad \forall x \in s_0 \tag{4} \]
where $\lambda_i$ are the appearance parameters, $A_0$ is the shape-free mean appearance, and $A_i(x)$ are the appearance image eigenvectors of the covariance matrix. Usually the best results are obtained using both shape and appearance information combined within a single AAM [4, 25]. Therefore, unless explicitly stated otherwise, we use these. Once an AAM is built and trained, we fit the model using the Inverse Compositional algorithm [84] to all frames in the video sequence [3].

3.4. Comparison of current phoneme-to-viseme maps

Recognition performance of the HMMs can be measured by both correctness, $C$, and accuracy, $A$,
\[ C = \frac{N - D - S}{N} \tag{5} \]
\[ A = \frac{N - D - S - I}{N} \tag{6} \]
where $S$ is the number of substitution errors, $D$ is the number of deletion errors, $I$ is the number of insertion errors and $N$ the total number of labels in the reference transcriptions [26]. An insertion error (insertions are notoriously common in lipreading [85]) occurs when the recognizer output has extra words/visemes missing from the original transcript [26]. As an example one could say "Once upon a midnight dreary", but the recognizer outputs "Once upon upon midnight dreary dreary".
Here the recognizer has inserted two words which were never present and has deleted one. Once this utterance has been translated to viseme labels rather than words, for example using Montgomery's visemes, the sentence becomes "v09 v12 v04 v05 - v12 v01 v12 v04 - v12 - v01 v10 v04 v11 v04 - v04 v07 v16 v07 v16" (hyphens are included to show breaks between words). In this case, the same insertion errors would create predicted outputs of "v09 v12 v04 v05 - v12 v01 v12 v04 - v12 v01 v12 v04 - v01 v10 v04 v11 v04 - v04 v07 v16 v07 v16 - v04 v07 v16 v07 v16."

In this experiment, classification performance of the HMMs is measured by correctness, C (5), as there are no insertion errors to consider [26]. It is acknowledged that word classification is not as high performing as viseme classification. However, as each viseme set being tested has a different number of phonemes and visemes, words are used so that we can compare different viseme sets. It is the difference between each set, rather than the individual performance, which is of interest in this investigation.

Figure 3 shows the correctness of each pair of viseme sets. On the top is the isolated word case (the AVL2 data) and on the bottom the continuous data (RMAV). Each diagram is ordered by the mean correctness over all speakers. For the isolated words the Lee vowel and consonant sets [71] are the best, with the Montgomery vowels [67] and Hazen consonants [19] close behind. The worst performers are the Disney vowels [66] and the Franks [62] and Woodward consonants [75]. For continuous speech the Disney vowels are the best performer [66], as are the Woodward consonants [75]. It is notable that for continuous speech the more heavily compressed viseme sets (few visemes per phoneme, that is, a low CF) work better than those with larger numbers of visemes. The most likely explanation is that continuous speech has additional variability due to co-articulation, so a few coarsely defined visemes are better than a greater number of finely defined ones.
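To make the two measures in Equations (5) and (6) concrete, the sketch below evaluates them for the "midnight dreary" example, using the error counts as the text describes them (five reference words, one deletion, no substitutions, two insertions). The function names are hypothetical. A scoring tool is free to align the same strings differently, for instance as one substitution plus one insertion, so the counts here follow the description in the text rather than the output of any particular aligner.

```python
def correctness(n_labels, deletions, substitutions):
    """Equation (5): C = (N - D - S) / N."""
    return (n_labels - deletions - substitutions) / n_labels


def accuracy(n_labels, deletions, substitutions, insertions):
    """Equation (6): A = (N - D - S - I) / N."""
    return (n_labels - deletions - substitutions - insertions) / n_labels


REFERENCE = "once upon a midnight dreary".split()
N = len(REFERENCE)   # five labels in the reference transcript
D, S, I = 1, 0, 2    # counts as described in the text for this example

C = correctness(N, D, S)
A = accuracy(N, D, S, I)
print(f"C = {C:.2f}, A = {A:.2f}")  # prints "C = 0.80, A = 0.40"
```

Correctness ignores insertions, so it stays at 0.8 here while accuracy drops to 0.4. When a task produces no insertions, as in the isolated word experiment, the two measures coincide.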
Figure 4 shows the mean word correctness, C, over all speakers, ±1 s.e., for pairings of vowel and consonant maps, ordered by correctness from left to right. Again, isolated word results (the AVL2 data) are at the top and continuous speech results (RMAV) at the bottom. As previously, for isolated words, the Disney vowels are significantly worse than all others when paired with all consonant maps, a difference that holds over the whole group. The Lee [71], Montgomery [67] and Bozkurt [69] vowels are consistently above the mean and above the upper error bar for the Disney [66], Jeffers [70] and Hazen [19] vowels. In comparing the consonants, Lee [71] and Hazen [19] are the best whereas Woodward [75] and Franks [62] are the bottom performers. There is a significant difference between the 'best' visemes for individual speakers, which arises from the unique way in which everyone articulates their speech.

The continuous speech experiment results in Figure 4 (bottom) show that, for vowel visemes, the Disney set surpasses all others, whereas Woodward's consonants are now a better fit. This is interesting as neither viseme set is data-derived. We recall that Disney's [66] are designed from human perception for synthesis of characters, and Woodward's [75] are from a pilot investigation into phoneme perception in lipreading using linguistic rules. As we move to more realistic data, that is continuous speech, many of the data-driven approaches degrade, which implies that the data used to derive these visemes was unrealistic. For example, the Lee visemes [71] were derived without any use of video data at all, so it is hardly surprising that they are fragile when presented with more realistic data.

The idea that vowel and consonant visemes should be treated differently is no surprise.
The suggestion that vowel visemes are essentially mouth shapes and that the consonants govern how we move in and out of them was first presented by Nichie in 1912 from human observations by a profoundly deaf educator [68], and is supported by results in [86] which show we should not mix vowel and consonant visemes for best results. Therefore, it is reassuring to see that the better speaker-independent phoneme-to-viseme mapping for continuous speech is a combination of two previous maps, where the two maps have differing derivation methods: perception and language rules.

Generally speaking, the continuous case (bottom of Figure 4) gives improved accuracies compared to the isolated word case (top of Figure 4). The first response to explain this is to suggest that the increase is caused by better training of classifiers with the greater volume of training samples in RMAV than in AVL2. However, we should note that this effect is marginally countered by the co-articulation effects in continuous speech, so a set of classifiers trained on a larger isolated word dataset and compared to AVL2 would provide a greater increase in recognition.

Figure 5 shows critical difference plots between the viseme class sets based upon their classification performance [87] with isolated word training. Critical difference is a measure of the confidence intervals between different machine learning algorithms derived from Friedman tests on the ranked scores (here p = 0.05). Two assumptions within critical difference are: all measured results are 'reliable', and all algorithms are evaluated using the same random samples [87]. As we use the HTK standard metrics [88], and use results with consistent random sampling across folds, these assumptions are not a concern. We have selected critical differences here as these evaluate the performance of multiple classifiers on different datasets, whereas alternatives such as [89, 90] often require paired data or identical datasets.

Figure 5 shows a significant difference between some sub-sets of visemes.
This is shown by the horizontal bars which do not overlap all viseme sets. Where the horizontal bars do overlap, the viseme sets are indistinguishable at 95% confidence. When comparing isolated words with continuous speech we see fewer significant differences with continuous speech, despite there being more test data.

Table 8 summarises the best-performing visemes (consonants and vowels) for the isolated and continuous word data. The first column shows that the Lee consonants are the best performing for isolated words, but also that Hazen, Nichie, Neti, etc. are indistinguishable from Lee (they are within Lee's critical difference). For continuous speech, the Woodward consonant visemes are the best but Fisher, Franks, Disney, etc. are indistinguishable. In bold are the viseme sets that are common to both isolated words and continuous speech: Lee, Hazen, Finn and Fisher. For the vowels (second column) there are no common sets. However, if we look at best and second-best (the third column of Table 8) then Hazen and Neti emerge as common.

Looking across all sets, the common method that performs near the top is that due to Hazen [19]. Interestingly, these visemes were derived using the most realistic data (an audio-visual corpus based on TIMIT) and formed by a tree-based clustering of phoneme-trained HMMs. Note that the Hazen visemes were derived from American English data whereas here we use British English speakers.

The effectiveness of each mapping as a function of compression factor is presented in Figure 6. The two plots representing continuous speech (bottom of Figure 6) show improving performance with decreasing compression factor; we speculated earlier that the coarser visemes were better able to handle co-articulation. For the isolated word case (top) there is little difference. Very roughly, the best performing methods appear to have around 2 to 4 phonemes per viseme.
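For readers unfamiliar with the critical difference plots used above, the sketch below shows one common way such a comparison is computed. It assumes the Nemenyi post-hoc form described in [87], where two methods differ significantly if their mean ranks differ by more than $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ for $k$ methods and $N$ test sets; whether the plots in Figure 5 use exactly this statistic is an assumption. The scores are hypothetical, and the constant 2.343 is the tabulated $q_{0.05}$ for three methods in [87].

```python
import math


def mean_ranks(scores):
    """scores[d][a] is the score of method a on test set d (higher is better).

    Returns the mean rank of each method, where rank 1 is best and tied
    methods share the average of the ranks they span.
    """
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda a: -row[a])
        i = 0
        while i < n_methods:
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                totals[order[k]] += shared_rank
            i = j + 1
    return [total / len(scores) for total in totals]


def critical_difference(n_methods, n_test_sets, q_alpha):
    """Nemenyi critical difference: q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_test_sets))


# Hypothetical correctness scores: 4 test sets (rows) x 3 viseme maps (columns).
SCORES = [
    [0.40, 0.35, 0.20],
    [0.42, 0.30, 0.25],
    [0.38, 0.39, 0.22],
    [0.41, 0.33, 0.33],
]
RANKS = mean_ranks(SCORES)
CD = critical_difference(n_methods=3, n_test_sets=4, q_alpha=2.343)
print(RANKS, round(CD, 3))
```

In this toy case the best and worst maps have mean ranks of 1.25 and 2.875, a gap of 1.625 that falls just inside the critical difference of about 1.657, so a horizontal bar would join all three maps. With so few test sets only very large rank gaps can be separated, and the CD shrinks as more test sets are added.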
So far we have seen that there are noticeable differences between classification performances associated with a variety of viseme sets in the literature. Given that quite a few of the viseme sets are incremental improvements on previous sets, it is good to see confirmation that these sets have rather similar performance. We have identified the best sets for the various conditions and have used critical difference plots to explain the similarity between methods. We have identified that the most robust methods seem to be based on clustering large amounts of data, but a question arises when it comes to individual speakers: is it viable to create viseme sets per speaker and, if so, how similar are they? This is the topic of the next section.

Table 8: Critically different viseme sets change with isolated word and continuous speech data. Sets are listed in the order they appear in Figure 5.
                    First Position  First Position  Second Position
                    Consonants      Vowels          Vowels
Isolated words      Lee             Lee             Montgomery
                    Hazen           Montgomery      Nichie
                    Nichie          Nichie          Bozkurt
                    Neti            Bozkurt         Hazen
                    Walden                          Neti
                    Jeffers
                    Kricos
                    Binnie
                    Finn
                    Bozkurt
                    Fisher
Continuous speech   Woodward        Disney          Jeffers
                    Fisher          Jeffers         Hazen
                    Franks          Hazen           Neti
                    Disney
                    Lee
                    Heider
                    Hazen
                    Finn
Figure 3: Speaker-dependent all-speaker mean word classification, C, comparing viseme classes on isolated word speech (top) and continuous speech (bottom). Axes: consonant visemes and vowel visemes.

Figure 4: Speaker-independent all-speaker mean word classification, C ± 1 s.e. For a given mapping (x-axis) the performance is measured after pairing with all vowel mappings (left) and vice versa on the right, on AVL2 isolated words (top) and RMAV continuous speech (bottom). y-axis: all-speaker mean word correctness, C%.

Figure 5: Critical difference of all phoneme-to-viseme maps
independent of phoneme-to-viseme pair partner. Vowel maps are on the left side, consonants on the right. Isolated words are in the top row, and continuous speech along the bottom row. Continuous speech Isolated words Lee Haz Nic 40 40 Net 38 38 Kri Fin 36 36 Jef Wal 34 34 Bin Dis 32 32 Boz Hei 30 30 Fis 28 28 Fra Woo 26 26 Lee Mon 24 24 Boz Nic 22 22 Net Haz 20 20 0 0.5 1|0 0.5 1 Jef Viseme set compression factors v Dis Woo Fis Fra Dis 38 Lee Hei Haz Fin Boz Bin 32 32 Jef Kri 30 30 Net 28 28 Wal Nic 26 26 Dis Jef 24 24 Haz Net 22 22 Lee Boz 20 20 0 0.5 1|0 0.5 1 Nic Viseme set compression factors Mon Figure 6: Scatter plot showing the relationship between compression factors, CF (x-axes), and word correctness, C, classi cation (y-axes) with consonant phoneme-to-viseme maps (left) and vowel phoneme- to-viseme maps (right), isolated word results are at the top, and continuous speech along the bottom. All speaker mean word correctness, C % All speaker mean word correctness, C % 4. Encoding speaker-dependent visemes In the second part of our phoneme-to-viseme mapping study, two approaches are used to nd a better method of mapping phonemes to visemes. These approaches are both speaker-dependent and data-driven from phoneme classi cation. Two cases are considered: 1. a strictly coupled map, where a phoneme can be grouped into a viseme only if it has been confused with all the phonemes within the viseme, and 2. a relaxed coupled case, where phonemes can be grouped into a viseme if it has been confused with any phoneme within the viseme. With all new P2V mappings each phoneme can be allocated to only one viseme class. These new P2V maps are tested on the AVL2 dataset using the same classi cation method as described in Section 3.2. The results from the best performing P2V map from our comparison study (Lee [71] or Woodward [75] and Disney [66]) is the benchmark to measure improvements with respect to the training data. 4.1. 
Viseme classes with strictly confusable phonemes Our approaches for identifying visemes are speaker-dependent, data-driven and based on phoneme confusions within the classi er. The idea of speaker-dependent visemes is not new [31, 34] but our algorithm is, and in conjunction with the xed outputs available from HTK enables easy reuse. The rst undertaking in this work is to complete classi cation using phoneme labeled HHM classi ers. The classi ers are built in HTK with at-start HMMs and force-aligned training-data for each speaker. The HMMs are re-estimated 11 times in total over seven folds of leave-one-out cross validation. This overall classi cation task does not perform well (see Table 9) particularly for an isolated word dataset. However, the HTK tool HResults is used to output a confusion matrix for each fold detailing which phoneme labels confuse with others and how often. For both data-driven speaker-dependent approaches, this is the rst step of completing Table 9: Mean per speaker Correctness, C, of phoneme-labeled HMM classi ers. Speaker 1 Speaker 2 Speaker 3 Speaker 4 Phoneme C 24:72 23:63 57:69 43:41 phoneme classi cation is essential to create the data to derive the P2V maps from. This is completed for each speaker in both AVL2 and RMAV datasets. Now, let us use a smaller seven-unit confusion matrix example, as in Table 10, to explain our clustering method. For the `strictly-confused' viseme set (remembering there is one per speaker), the sec- ond step of deriving the P2V map is to check for single-phoneme visemes. Any phonemes which have only been correctly recognized and have no false positive/negative classi ca- tions are permitted to be single phoneme visemes. In Table 10 we have highlighted the true positive classi cations in red and both false positives and false negative classi ca- tions in blue which shows =p6= is the only phoneme to t our `single-phoneme viseme' de nition. =p6= has a true positive value of +4 and zero false classi cations. 
Therefore this is our first viseme: /v1/ = {/p6/}.

Table 10: Demonstration confusion matrix showing confusions between phoneme-labeled classifiers to be used for clustering to create new speaker-dependent visemes. True positive classifications are shown in red; confusions, whether false positives or false negatives, are shown in blue. The estimated classes are listed horizontally and the real classes are vertical.

        /p1/  /p2/  /p3/  /p4/  /p5/  /p6/  /p7/
/p1/     1     0     0     0     0     0     4
/p2/     0     0     0     2     0     0     0
/p3/     1     0     0     0     0     0     1
/p4/     0     2     1     0     2     0     0
/p5/     3     0     1     1     1     0     0
/p6/     0     0     0     0     0     4     0
/p7/     1     0     3     0     0     0     1

This action is followed by defining all combinations of remaining phonemes which can be grouped into visemes, and identifying the grouping that contains the largest number of confusions by ordering all the viseme possibilities by descending size (Table 11).

Table 11: List of all possible subgroups of phonemes with an example set of seven phonemes.

Six phonemes:
{/p1/, /p2/, /p3/, /p4/, /p5/, /p7/}

Five phonemes:
{/p1/, /p2/, /p3/, /p4/, /p5/}  {/p1/, /p2/, /p3/, /p4/, /p7/}  {/p1/, /p2/, /p3/, /p5/, /p7/}
{/p1/, /p2/, /p4/, /p5/, /p7/}  {/p1/, /p3/, /p4/, /p5/, /p7/}  {/p2/, /p3/, /p4/, /p5/, /p7/}

Four phonemes:
{/p1/, /p2/, /p3/, /p4/}  {/p1/, /p2/, /p3/, /p5/}  {/p1/, /p2/, /p3/, /p7/}
{/p2/, /p3/, /p4/, /p5/}  {/p2/, /p3/, /p4/, /p7/}  {/p3/, /p4/, /p5/, /p7/}
{/p1/, /p3/, /p4/, /p5/}  {/p1/, /p4/, /p5/, /p7/}  {/p2/, /p4/, /p5/, /p7/}

Three phonemes:
{/p1/, /p2/, /p3/}  {/p1/, /p2/, /p4/}  {/p1/, /p2/, /p5/}  {/p1/, /p2/, /p7/}
{/p2/, /p3/, /p4/}  {/p2/, /p3/, /p5/}  {/p2/, /p3/, /p7/}  {/p3/, /p4/, /p5/}
{/p3/, /p4/, /p7/}  {/p1/, /p3/, /p4/}  {/p4/, /p5/, /p7/}  {/p1/, /p4/, /p5/}
{/p2/, /p4/, /p5/}  {/p1/, /p5/, /p7/}  {/p2/, /p5/, /p7/}  {/p3/, /p5/, /p7/}
{/p1/, /p3/, /p5/}  {/p1/, /p3/, /p7/}  {/p1/, /p4/, /p7/}  {/p2/, /p4/, /p7/}

Two phonemes:
{/p1/, /p3/}  {/p1/, /p4/}  {/p1/, /p5/}  {/p1/, /p7/}  {/p2/, /p3/}
{/p2/, /p4/}  {/p2/, /p5/}  {/p2/, /p7/}  {/p3/, /p4/}  {/p3/, /p5/}
{/p4/, /p5/}  {/p4/, /p7/}  {/p5/, /p7/}

Our grouping rule states that phonemes can be grouped into a viseme
class only if all of the phonemes within the candidate group are mutually confusable. This means each pair of phonemes within a viseme must have a total false positive and false negative classification count greater than zero. Once a phoneme has been assigned to a viseme class it can no longer be considered for grouping, and so any possible phoneme combinations that include this phoneme are discarded. This ensures phonemes can belong to only a single viseme.

By iterating through our list of all possibilities in order, we check if all the phonemes are mutually confused. This means all phonemes have a positive confusion value (a blue value in Table 10) with all others. The first phoneme possibility in our list where this is true is {/p1/, /p3/, /p7/}. This is confirmed by the Table 10 values:

$N\{/p1/ \mid /p3/\} + N\{/p3/ \mid /p1/\} = 0 + 1 = 1 > 0$,
$N\{/p1/ \mid /p7/\} + N\{/p7/ \mid /p1/\} = 4 + 1 = 5 > 0$, and
$N\{/p3/ \mid /p7/\} + N\{/p7/ \mid /p3/\} = 1 + 3 = 4 > 0$.

This becomes our second viseme and thus our current viseme list looks like Table 12.

Table 12: Demonstration example 1: first iteration of clustering, a phoneme-to-viseme map for strictly-confused phonemes.

Viseme   Phonemes
/v1/     {/p6/}
/v2/     {/p1/, /p3/, /p7/}

We now have only three remaining phonemes to cluster: /p2/, /p4/ and /p5/. This reduces our list of possible combinations substantially; see Table 13.

Table 13: List of all possible subgroups of phonemes with an example set of seven phonemes after the first viseme is formed.

{/p2/, /p4/, /p5/}  {/p2/, /p4/}  {/p2/, /p5/}  {/p4/, /p5/}

The next iteration of our clustering algorithm identifies the combination of remaining phonemes which corresponds to the next largest number of confusions, and so on, until no phonemes can be merged. This leaves us with the final visemes in Table 14.

Table 14: Demonstration example 2: final phoneme-to-viseme map for strictly-confused phonemes.
Viseme   Phonemes
/v1/     {/p6/}
/v2/     {/p1/, /p3/, /p7/}
/v3/     {/p2/, /p4/}
/v4/     {/p5/}

Our original phoneme classification has produced confusion matrices which permit confusions between vowel and consonant phonemes. We can see from the previously presented P2V maps in Section 3.1 (Tables 2 and 3) that vowel and consonant phonemes are not commonly mixed within visemes. Therefore, we make two types of P2V maps: one which permits vowel and consonant phonemes to be mixed within the same viseme, and a second which restricts visemes to be vowel-only or consonant-only by adding an extra condition when checking for confusions greater than zero.

It should be remembered that not all phonemes present in the ground truth transcripts will have been recognized and included in the phoneme confusion matrix. Any remaining phonemes which have not been assigned to a viseme are grouped into a single garbage viseme, /gar/. This approach ensures any phonemes which have been confused are grouped into a viseme and we do not lose any of the 'rarer', less common visual phonemes. For example, /ea/, /oh/, /ao/, and /r/ are not in the original transcript and so can be placed into /gar/. For Speaker 2, /gar/ also contains /ay/ and /p/, and for Speaker 4 /gar/ also contains /p/ and /z/, as these do not show up in the speaker's phoneme classification outputs. This task has been undertaken for all four speakers in our dataset. The final P2V maps are shown in Table 15.

Table 15: Strictly-confused phoneme speaker-dependent visemes. The score in brackets is the compression factor. B1 visemes are listed on top, B2 visemes are listed at the bottom.
Classification P2V mapping - permitting mixing of vowels and consonants

Speaker 1 (CF: 0.48): {/2/ /ai/ /i/ /n/ /@U/} {/b/ /e/ /ei/ /y/} {/d/ /s/} {/tS/ /l/} {/@/ /v/} {/w/} {/f/} {/k/} {/dZ/ /z/} {/A/ /u/} {/t/}
Speaker 2 (CF: 0.44): {/@/ /ai/ /ei/ /i/ /s/} {/e/ /v/ /w/ /y/} {/l/ /m/ /n/} {/b/ /d/ /p/} {/z/} {/tS/} {/t/} {/A/} {/dZ/ /k/} {/2/ /f/} {/@U/ /u/}
Speaker 3 (CF: 0.68): {/ei/ /f/ /n/} {/d/ /t/ /p/} {/b/ /s/} {/l/ /m/} {/@/ /e/} {/i/} {/u/} {/A/} {/dZ/} {/@U/} {/z/} {/y/} {/tS/} {/ai/} {/2/} {/k/ /w/} {/v/}
Speaker 4 (CF: 0.64): {/2/ /ai/ /i/ /ei/} {/m/ /n/} {/@/ /e/ /p/} {/k/ /w/} {/d/ /s/} {/dZ/ /t/} {/f/} {/v/} {/A/} {/z/} {/tS/} {/b/} {/@U/} {/l/} {/u/}

Classification P2V mapping - restricting mixing of vowels and consonants

Speaker 1 (CF: 0.50): {/2/ /i/ /@U/ /u/} {/A/ /ei/} {/@/ /e/ /ei/} {/d/ /s/ /t/} {/tS/ /l/} {/k/} {/z/} {/w/} {/f/} {/m/ /n/} {/dZ/ /v/} {/b/ /y/}
Speaker 2 (CF: 0.58): {/ai/ /ei/ /i/ /u/} {/@U/} {/@/} {/e/} {/2/} {/A/} {/v/ /w/} {/dZ/ /p/ /y/} {/d/ /b/} {/t/} {/k/} {/tS/} {/l/ /m/ /n/} {/f/ /s/}
Speaker 3 (CF: 0.68): {/ei/ /i/} {/ai/} {/@/ /e/} {/2/} {/d/ /p/ /t/} {/l/ /m/} {/k/ /w/} {/v/} {/tS/} {/@U/} {/y/} {/u/} {/A/} {/z/} {/f/ /n/} {/b/ /s/} {/dZ/}
Speaker 4 (CF: 0.65): {/2/ /ai/ /i/ /ei/} {/@/ /e/} {/m/ /n/} {/k/ /l/} {/dZ/ /t/} {/d/ /s/} {/tS/} {/@U/} {/y/} {/u/} {/A/} {/w/} {/f/} {/v/} {/b/}

4.2. Viseme classes with relaxed confusions between phonemes

A disadvantage of the strictly confusable viseme set is that it contains some spurious single-phoneme visemes, where the phoneme cannot be grouped because it is not confused with all other phonemes in the viseme. These phonemes are likely to be either borderline cases at the extremes of a viseme cluster, i.e. they have subtle visual similarities to more than one phoneme cluster, or phonemes that do not occur frequently enough in the training data to be differentiated from other phonemes.
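To make this disadvantage concrete, the strict grouping of Section 4.1 can be sketched in Python on the Table 10 demonstration matrix. This is a minimal sketch rather than the authors' implementation: the function names (`pair_confusion`, `group_confusion`, `mutually_confused`, `strict_visemes`) are hypothetical, and breaking ties between equally sized candidate groups by their total confusion count is an assumption based on the 'largest number of confusions' wording. The /gar/ viseme and the vowel/consonant restriction are not modeled.

```python
from itertools import combinations

# Table 10: rows are real classes, columns are estimated classes.
PHONEMES = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
TABLE_10 = [
    [1, 0, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 2, 1, 0, 2, 0, 0],
    [3, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 4, 0],
    [1, 0, 3, 0, 0, 0, 1],
]


def pair_confusion(matrix, i, j):
    """N{i|j} + N{j|i}: confusions between classes i and j, both directions."""
    return matrix[i][j] + matrix[j][i]


def group_confusion(matrix, group):
    """Total confusion summed over every pair in the group."""
    return sum(pair_confusion(matrix, a, b) for a, b in combinations(group, 2))


def mutually_confused(matrix, group):
    """True if every pair in the group has a confusion count above zero."""
    return all(pair_confusion(matrix, a, b) > 0 for a, b in combinations(group, 2))


def strict_visemes(matrix, names):
    """Greedy strict grouping; a hypothetical helper, not the authors' code."""
    n = len(names)
    # Phonemes never confused with anything become single-phoneme visemes.
    visemes = [[i] for i in range(n)
               if all(pair_confusion(matrix, i, j) == 0 for j in range(n) if j != i)]
    remaining = [i for i in range(n) if [i] not in visemes]
    while remaining:
        best = None
        # Candidate groups are tried in descending size.
        for size in range(len(remaining), 1, -1):
            valid = [g for g in combinations(remaining, size)
                     if mutually_confused(matrix, g)]
            if valid:
                # Assumed tie-break: prefer the group with the most confusions.
                best = max(valid, key=lambda g: group_confusion(matrix, g))
                break
        if best is None:
            # Nothing left can be merged, so the rest stay as singletons.
            visemes.extend([i] for i in remaining)
            break
        visemes.append(list(best))
        remaining = [i for i in remaining if i not in best]
    return [[names[i] for i in group] for group in visemes]


print(strict_visemes(TABLE_10, PHONEMES))
```

On the demonstration matrix this yields {/p6/}, {/p1/, /p3/, /p7/}, {/p2/, /p4/} and the singleton {/p5/}, matching Table 14. Enumerating every subset is exponential in the number of phonemes, which is acceptable for a seven-phoneme illustration but would need pruning for a full phoneme set.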
To address these spurious single-phoneme visemes we complete a second pass through the strictly-confused visemes listed in Table 14. We begin with the visemes as they currently stand (in our demonstration example containing four classes) and relax the condition requiring confusion with all of the phonemes. Now any single-phoneme viseme (in our demonstration, /v4/) can be allocated to a previously existing viseme if it has been confused with any phoneme in the viseme. In Table 10 we see /p5/ was confused with /p1/, /p3/, and /p4/. Because /p4/ is not in the same viseme as /p1/ and /p3/, we use the value of confusion to decide which to allocate it to, as follows:

$N\{/p1/ \mid /p5/\} + N\{/p5/ \mid /p1/\} = 0 + 3 = 3$
$N\{/p3/ \mid /p5/\} + N\{/p5/ \mid /p3/\} = 0 + 1 = 1$
$N\{/p4/ \mid /p5/\} + N\{/p5/ \mid /p4/\} = 2 + 1 = 3$

Therefore, for /p5/ the total confusion with /v2/ is 3 + 1 = 4, whereas the total confusion with /v3/ is 3. We select the viseme with the most confusion to incorporate the unallocated phoneme /p5/. This reduces the number of viseme classes by merging single-phoneme visemes from Table 14 to form a second set, shown in Table 16. This has the added benefit that we have also increased the number of training samples for each classifier.

Table 16: Demonstration example 3: final phoneme-to-viseme map for relaxed-confused phonemes.

Viseme   Phonemes
/v1/     {/p6/}
/v2/     {/p1/, /p3/, /p5/, /p7/}
/v3/     {/p2/, /p4/}

Table 17: The four variations on speaker-dependent phoneme-to-viseme maps derived from phoneme confusion in phoneme classification.

Label       Vowels and consonants   Confusion of phonemes
Bear1, B1   Mixed                   Strict
Bear2, B2   Split                   Strict
Bear3, B3   Mixed                   Relaxed
Bear4, B4   Split                   Relaxed

Remember, as we have two versions of Table 14, one with mixed vowel and consonant phonemes and a second with divided vowel and consonant phonemes, the same still applies to our relaxed-confused viseme sets.
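The relaxed second pass can be sketched in the same way. This is an illustrative sketch under stated assumptions, not the authors' code: `confusion_with`, `relaxed_merge` and `compression_factor` are hypothetical names, the Table 10 matrix and the Table 14 visemes are hard-coded so the block is self-contained, ties between candidate visemes go to the first one listed, and the compression factor is taken to be the ratio of visemes to phonemes as stated in the Table 18 caption.

```python
# Table 10: rows are real classes, columns are estimated classes.
PHONEMES = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
TABLE_10 = [
    [1, 0, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 2, 1, 0, 2, 0, 0],
    [3, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 4, 0],
    [1, 0, 3, 0, 0, 0, 1],
]
INDEX = {name: k for k, name in enumerate(PHONEMES)}

# Strictly-confused visemes from Table 14.
STRICT = [["p6"], ["p1", "p3", "p7"], ["p2", "p4"], ["p5"]]


def confusion_with(matrix, phoneme, viseme):
    """Sum of N{p|q} + N{q|p} over every phoneme q already in the viseme."""
    p = INDEX[phoneme]
    return sum(matrix[p][INDEX[q]] + matrix[INDEX[q]][p] for q in viseme)


def relaxed_merge(matrix, visemes):
    """Move each confused single-phoneme viseme into its most confused viseme."""
    groups = [list(v) for v in visemes if len(v) > 1]
    kept = []
    for phoneme in (v[0] for v in visemes if len(v) == 1):
        scores = [confusion_with(matrix, phoneme, g) for g in groups]
        if scores and max(scores) > 0:
            # Confused with at least one member: join the best-scoring viseme.
            groups[scores.index(max(scores))].append(phoneme)
        else:
            # No confusion with any existing viseme: stay a singleton.
            kept.append([phoneme])
    return kept + [sorted(g) for g in groups]


def compression_factor(visemes, n_phonemes):
    """Ratio of visemes to phonemes (assumed definition, see lead-in)."""
    return len(visemes) / n_phonemes


relaxed = relaxed_merge(TABLE_10, STRICT)
print(relaxed)
print(round(compression_factor(STRICT, len(PHONEMES)), 2),
      round(compression_factor(relaxed, len(PHONEMES)), 2))
```

For the demonstration data /p5/ scores 4 against /v2/ and 3 against /v3/, so it joins /v2/ and the result matches Table 16, while /p6/ has no confusions and stays on its own. The compression factor of the toy map drops from 4/7 to 3/7, which mirrors the lower CF values of the relaxed maps.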
Taking the mixed and split variants of both the strict and relaxed methods, we end up with four types of speaker-dependent phoneme-to-viseme maps, described in Table 17. For our strictly-confused P2V maps in Table 15, these become the relaxed P2V maps in Table 18. In Table 17 we have labeled each of the four variations B1, B2, B3 and B4 for ease of reference.

Now, and this is why these visemes are defined as relaxed, for any remaining phonemes which have confusions but are so far not assigned to a viseme, the phoneme-pair confusions are used to map each remaining phoneme to an appropriate viseme, even though it does not confuse with all phonemes already in it. Any remaining phonemes which are not assigned to a viseme are grouped into a new garbage viseme, /gar/. This approach ensures any phonemes which have been confused with any other are grouped into a viseme.

Table 18: Relaxed-confused phoneme speaker-dependent visemes. The score in brackets is the ratio of visemes to phonemes. B3 visemes are on top, and B4 visemes are listed below.

Classification P2V mapping - permitting mixing of vowels and consonants

Speaker 1 (CF: 0.28): {/b/ /e/ /ei/ /p/ /w/ /y/ /k/} {/2/ /ai/ /f/ /i/ /m/ /n/ /@U/} {/dZ/ /z/} {/A/ /u/} {/d/ /s/ /t/} {/tS/ /l/} {/@/ /v/}
Speaker 2 (CF: 0.32): {/A/ /@/ /ai/ /ei/ /i/ /s/ /tS/} {/e/ /t/ /v/ /w/ /y/} {/l/ /m/ /n/} {/2/ /f/} {/z/} {/b/ /d/ /p/} {/@U/ /u/} {/dZ/ /k/}
Speaker 3 (CF: 0.40): {/2/ /ai/ /ei/ /f/ /i/ /n/} {/@/ /e/ /y/ /tS/} {/b/ /s/ /v/} {/l/ /m/ /u/} {/dZ/} {/@U/} {/z/} {/d/ /p/ /t/} {/k/ /w/} {/A/}
Speaker 4 (CF: 0.32): {/2/ /ai/ /tS/ /i/ /ei/} {/A/ /m/ /u/ /n/} {/@/ /e/ /p/ /v/ /y/} {/dZ/ /t/} {/k/ /l/ /w/} {/@U/} {/d/ /f/ /s/} {/b/}

Classification P2V mapping - restricting mixing of vowels and consonants

Speaker 1 (CF: 0.47): {/2/ /i/ /@U/ /u/} {/A/ /ai/} {/@/ /e/ /ei/} {/b/ /w/ /y/} {/d/ /f/ /s/ /t/} {/k/} {/z/} {/m/} {/l/} {/tS/} {/dZ/ /k/ /v/ /z/}
Speaker 2 (CF: 0.29): {/A/ /2/ /@/ /ai/ /ei/ /i/ /@U/ /u/} {/k/ /t/ /v/ /w/} {/tS/ /l/ /m/ /n/} {/f/ /s/} {/dZ/ /p/ /y/} {/b/ /d/} {/z/}
Speaker 3 (CF: 0.56): {/2/ /ai/ /i/ /ei/} {/@/ /e/} {/b/ /s/ /v/} {/d/ /p/ /t/} {/l/ /m/} {/y/} {/dZ/} {/@U/} {/z/} {/u/} {/k/ /w/} {/f/ /n/} {/A/} {/tS/}
Speaker 4 (CF: 0.50): {/2/ /ai/ /i/ /ei/} {/tS/ /k/ /l/ /w/} {/d/ /f/ /s/ /v/} {/m/ /n/} {/f/} {/A/} {/dZ/ /t/} {/@U/} {/u/} {/y/} {/b/}

4.3. Results analysis

Figure 7 (top) compares the new speaker-dependent viseme method with the Lee visemes, which are the benchmark from the isolated word study. For Speaker 1 and Speaker 3, no new viseme map significantly improves upon Lee's performance, although we do see improvements for both Speaker 2 and Speaker 4. The strictly-confused and split viseme map improves upon Lee's previous best word classification.

The second set of our experiments, with continuous speech training data (RMAV), repeats our investigation with speaker-dependent visemes. These have been derived with the same methods described in Sections 4.1 and 4.2 and are listed in full for each speaker in Appendix A. Our classification method is identical to that used previously with HMMs. In the previous work of [86], we see limited improvement in word classification with viseme classes due to the size of the dataset. In Figure 7 (bottom) we have plotted the word correctness achieved for each RMAV speaker using all four variants of the speaker-dependent visemes. Our first observation is that in this figure the correctness scores achieved range from 26.67% to 41.53%, whereas in Figure 7 (top) the values range from 20.60% to 36.53%. As before, this overall increase is attributed to the larger volume of training samples in RMAV compared to AVLetters2.

Compared to the benchmark of the Disney vowels and Montgomery consonant visemes, which has been plotted in black on Figure 7 (bottom), we see that the comparison between speaker-dependent visemes and the best speaker-independent visemes is subject to the speaker. For three out of 12 speakers (sp01, sp03, sp05), the speaker-dependent visemes are all worse than our benchmark.
For another three of our 12 speakers (sp02, sp09, sp14) all of the speaker-dependent visemes out-perform the benchmark. For all six remaining speakers, the results are mixed. This suggests that speaker-dependent visemes could improve on speaker-independent ones, but that it is essential that they are exactly right for the individual; otherwise they are at worst detrimental, or a lot of effort for no significant improvement.

Careful observation of Figure 7 (top) shows that when considering the performance of mixed or split visemes, split visemes significantly (> 1se) outperform mixed. When considering relaxed versus strict, the strict has a marginal advantage but it is not significant (< 1se).

The comparison of strict and split visemes for continuous speech (Figure 7 (bottom)) is consistent with the isolated word observations. The strictly-confused visemes perform better than those with a relaxed confusion, but not statistically significantly (< 1se). Again, we see that mixing vowel and consonant phonemes within individual viseme classes reduces the classification performance, but not significantly.

In Figure 8 we have plotted accuracy, A, and correctness, C, for our best performing speaker-dependent visemes (B1) on continuous speech. We also plot the accuracy scores of our benchmark from Woodward and Disney's visemes. These are compared with the correctness scores as a baseline to show the improvement. Whilst the improvement of speaker-dependent visemes is not significant when measured by correctness, by plotting the accuracy of the viseme classifiers we can see that they do have a positive influence in reducing insertion errors, which are a bugbear of lipreading.

[Figure 7 plots: legend B1, B2, B3, B4 and Lee (top) or Benchmark (bottom); x-axes list the test speakers, Speaker 1 to Speaker 4 (top) and sp01, sp02, sp03, sp05, sp06, sp08, sp09, sp10, sp11, sp13, sp14, sp15 (bottom).]

Figure 7: Word classification correctness, $C \pm 1$se, using all four new methods of deriving speaker-dependent visemes.
AVL2 (top) and RMAV (bottom) speakers against Lee (top) and Woodward and Disney (bottom) benchmarks in black. [y-axes: HTK correctness, C %.]

[Figure 8 plot: x-axis lists the speakers sp01 to sp15; y-axis shows HTK accuracy %; the legend includes the Woodward and Disney benchmark A %.]

Figure 8: Comparing the accuracy change between strict and relaxed visemes to show the improvement in accuracy/reduction in insertion errors for all 12 speakers in continuous speech. The baseline is the correctness classification, which ignores insertion error penalties.

5. Performance of individual visemes

In Figures 9 and 10, the contribution of each viseme has been listed in descending order along the x-axis for each speaker in AVL2. The contribution of each viseme is measured as the probability of each class, $\Pr\{v \mid \hat{v}\}$. These values have been calculated from the HResults confusion matrices. This analysis of visemes within a set is also used in [91], which proposes a threshold subject to the information in the features. The same viseme comparison analysis has been repeated for our continuous speech recognition experiments and the results are shown in Figures 11 and 12.

In the isolated word data (Figures 9 and 10) the difference between a high-performing speaker map and a poor one is striking. Speaker 3, for example, has at least five visemes in which $\Pr\{v \mid \hat{v}\} = 1$ (more in some configurations) whereas Speaker 1 has only one good viseme. Referring to Tables 15 and 18 there is no consistency on the best viseme, although generally visual silence appears to be easy to spot. This variation is to be expected: speaker variability is a very serious problem in lipreading.

Figures 11 and 12 show the same thing for the continuous speech data. Now there is a shallower drop-off to the curve and there are certainly no visemes for which $\Pr\{v \mid \hat{v}\} = 1$.
Although there appears to be less variability among speakers, this is an illusion caused by the poorly-performing visemes being similar among speakers: within the top five visemes there are significant differences among speakers.

[Figures 9 and 10 plots: one curve per speaker, Speaker 1 to Speaker 4; x-axes show the ordered viseme classes for each speaker; y-axes show recognition $\Pr\{v \mid \hat{v}\}$.]

Figure 9: Individual viseme classification, $\Pr\{v \mid \hat{v}\}$, with speaker-dependent visemes for four speakers with isolated word training of classifiers. B1 visemes (top) and B2 visemes (bottom).

Figure 10: Individual viseme classification, $\Pr\{v \mid \hat{v}\}$, with speaker-dependent visemes for four speakers with isolated word training of classifiers. B3 visemes (top) and B4 visemes (bottom).
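One way to read the per-viseme measure of Section 5 off a confusion matrix is sketched below. It assumes the convention of Table 10 (rows are real classes, columns are estimated classes) and interprets $\Pr\{v \mid \hat{v}\}$ as the fraction of outputs labeled with a viseme whose real class is that same viseme, i.e. the diagonal entry divided by its column sum. That interpretation, the function name `per_viseme_probability` and the three-viseme matrix are illustrative assumptions, not values or code from the experiments.

```python
def per_viseme_probability(matrix, names):
    """Return (viseme, probability) pairs sorted in descending order.

    matrix[r][c] counts items of real class r that were estimated as class c.
    """
    scores = {}
    for k, name in enumerate(names):
        column_total = sum(row[k] for row in matrix)  # all outputs labeled k
        scores[name] = matrix[k][k] / column_total if column_total else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy confusion matrix for three hypothetical visemes (made-up counts).
VISEMES = ["v1", "v2", "v3"]
COUNTS = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 5],
]

for viseme, probability in per_viseme_probability(COUNTS, VISEMES):
    print(viseme, round(probability, 3))
```

Sorting the result in descending order gives the same kind of ordered viseme list that forms the x-axes of Figures 9 to 12.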
[Figures 11 and 12 plots: one curve per speaker; x-axes show the ordered viseme classes, 1 to 19; y-axes show recognition $\Pr\{v \mid \hat{v}\}$.]

Figure 11: Individual viseme classification, $\Pr\{v \mid \hat{v}\}$, with speaker-dependent visemes for twelve speakers with continuous speech training of classifiers. B1 visemes (top) and B2 visemes (bottom).

Figure 12: Individual viseme classification, $\Pr\{v \mid \hat{v}\}$, with speaker-dependent visemes for twelve speakers with continuous speech training of classifiers. B3 visemes (top) and B4 visemes (bottom).

6. Conclusions

While lipreading, and hence expressive audio-visual speech recognition, faces a number of challenges, one of the persistent difficulties has been the multiplicity of mappings between phonemes and visemes. This paper has described a study of previously suggested Phoneme-to-Viseme (P2V) maps. For isolated word classification, Lee's [71] is the best of the previously published maps. For continuous speech a combination of Woodward's and Disney's visemes is better. The best performing viseme sets have, on average, between two and four phonemes per viseme.
When looking at speaker-independent visemes, whilst most viseme sets do not experience any difference in correctness between isolated and continuous speech, it is interesting to note that Woodward consonant visemes are better for continuous speech and are linguistically derived, whereas Lee visemes are better for isolated words and are data-derived. This suggests that an optimal set of visemes for all speakers would need to consider both the visual speech gestures of the individual and the rules of language. This, in essence, is the dilemma for visemes: does one choose units that make sense in terms of likely visual gestures, or in terms of the linguistic problem one is trying to solve?

Figure 13: A simple augmentation of the conventional lip-reading system to include speaker-dependent visemes.

We have also derived some new visemes, the 'Bear' visemes. These new data-driven visemes respect speaker individuality in speech and use this property to demonstrate that our second data-driven method tested, a strictly-confused viseme derivation with split vowel and consonant phonemes, can improve word classification. The best of the Bear visemes is the strictly-confused set with split vowels and consonants (B2), for both isolated and continuous speech.

Furthermore, a review of these speaker-dependent visemes (listed in Tables 15, 18, and Appendix A) shows that formerly 'accepted' visemes such as {/p/ /b/ /m/} and {/S/ /Z/ /dZ/ /tS/} are no longer present. Similarly with our previous vowel-based visemes, six of our eight prior viseme sets pair /2/ with /A/ (albeit not as a complete viseme, others are also present) but with our best speaker-dependent visemes these two phonemes are not paired. This is an interesting insight because it suggests that formerly 'accepted' strong visemes might not be so useful for all speakers, and some adaptability, or further investigation into understanding viseme variation, is still needed.
Our suggestion at this time is that linguistics, or co-articulation in continuous speech, is a strong influence causing this variation.

In practical terms, our new viseme derivation method is simple and can be included within a conventional lipreading system easily. This is demonstrated in Figure 13, where our clustering method is shown in dashed boxes. We recommend this approach for viseme classification since speaker-independent visemes are unlikely to perform well.

In general, speaker-dependent visemes reduce insertion errors when classifying continuous speech. This is thought to be because the phoneme confusions in speaker-dependent visemes are affected by speaker-specific visual co-articulation. For all viseme sets, not mixing vowel and consonant phonemes significantly improves classification.

7. Acknowledgments

We gratefully acknowledge the assistance of Dr Yuxuan Lan and Dr Barry-John Theobald, formerly of the University of East Anglia, for their help with HTK and general advice and guidance. This work was conducted while Helen L. Bear was in receipt of a studentship from the UK Engineering and Physical Sciences Research Council (EPSRC).

References

[1] J. L. Newman, B.-J. Theobald, S. J. Cox, Limitations of visual speech recognition, in: Proceedings of the International Conference on Audio-Visual Speech Processing, 2010.
[2] E. Ong, R. Bowden, Robust lip-tracking using rigid flocks of selected linear predictors, in: 8th IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG2008), 2008, pp. 247–254.
[3] I. Matthews, S. Baker, Active appearance models revisited, International Journal of Computer Vision 60 (2) (2004) 135–164. URL http://www.springerlink.com/openurl.asp?id=doi:10.1023/B:VISI.0000029666.37597.d3
[4] T. Cootes, G. Edwards, C. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 681–685. doi:10.1109/34.927467.
[5] Y. Lan, B.-J. Theobald, R.
Harvey, View independent computer lip-reading, in: IEEE International Conference on Multimedia and Expo (ICME), 2012, pp. 432–437. doi:10.1109/ICME.2012.192.
[6] A. Pass, J. Zhang, D. Stewart, An investigation into features for multi-view lipreading, in: Image Processing (ICIP), 2010 17th IEEE International Conference on, IEEE, 2010, pp. 2417–2420.
[7] S. Moore, R. Bowden, Local binary patterns for multi-view facial expression recognition, Computer Vision and Image Understanding 115 (4) (2011) 541–558. doi:10.1016/j.cviu.2010.12.001.
[8] K. Kumar, T. Chen, R. Stern, Profile view lip reading, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 4, 2007, pp. IV-429–IV-432. doi:10.1109/ICASSP.2007.366941.
[9] R. Kaucic, A. Blake, Accurate, real-time, unadorned lip tracking, in: Computer Vision, 1998. Sixth International Conference on, IEEE, 1998, pp. 370–375.
[10] S. L. Bauman, G. Hambrecht, Analysis of view angle used in speechreading training of sentences, American Journal of Audiology 4 (3) (1995) 67–70. URL http://aja.asha.org/cgi/content/abstract/4/3/67
[11] P. Lucey, G. Potamianos, S. Sridharan, Visual speech recognition across multiple views, in: A. W.-C. Liew, S. Wang (Eds.), Visual Speech Recognition: Lip Segmentation and Mapping, 2009. doi:10.4018/978-1-60566-186-5.
[12] A. Blokland, A. H. Anderson, Effect of low frame-rate video on intelligibility of speech, Speech Communication 26 (1-2) (1998) 97–103. doi:10.1016/S0167-6393(98)00053-3.
[13] T. Saitoh, R. Konishi, A study of influence of word lip reading by change of frame rate, in: Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), 2010.
[14] H. Bear, R. W. Harvey, B.-J. Theobald, Y. Lan, Resolution limits on visual speech recognition, in: IEEE International Conference on Image Processing, 2014, pp. 2009–2013. doi:10.1109/ICIP.2014.7025274.
[15] M. Heckmann, F. Berthommier, C. Savariaux, K.
Kroschel, Effects of image distortions on audio-visual speech recognition, in: AVSP 2003-International Conference on Audio-Visual Speech Processing, 2003, pp. 163–168.
[16] M. Vitkovitch, P. Barber, Visible speech as a function of image quality: Effects of display parameters on lipreading ability, Applied Cognitive Psychology 10 (2) (1996) 121–140. doi:10.1002/(SICI)1099-0720(199604)10:2<121::AID-ACP371>3.0.CO;2-V. URL http://dx.doi.org/10.1002/(SICI)1099-0720(199604)10:2<121::AID-ACP371>3.0.CO;2-V
[17] L. Cappelletta, N. Harte, Phoneme-to-viseme mapping for visual speech recognition, in: ICPRAM (2), 2012, pp. 322–329.
[18] D. Howell, B.-J. Theobald, S. J. Cox, Confusion modelling for automated lip-reading using weighted finite-state transducers, in: AVSP, 2013, pp. 197–202.
[19] T. J. Hazen, K. Saenko, C.-H. La, J. R. Glass, A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments, in: Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI '04, ACM, New York, NY, USA, 2004, pp. 235–242. doi:10.1145/1027933.1027972. URL http://doi.acm.org/10.1145/1027933.1027972
[20] J. Shin, J. Lee, D. Kim, Real-time lip reading system for isolated Korean word recognition, Pattern Recognition 44 (3) (2011) 559–571.
[21] H. L. Bear, R. Harvey, Decoding visemes: Improving machine lip-reading, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 2009–2013.
[22] I. Matthews, J. Bangham, R. Harvey, S. Cox, Non-linear scale decomposition based features for visual speech recognition, Proceedings of the IX European Signal Processing Conference (EUSIPCO) (1998) 303–305.
[23] Y. Lan, R. Harvey, B. Theobald, E.-J. Ong, R. Bowden, Comparing visual features for lipreading, in: International Conference on Auditory-Visual Speech Processing 2009, 2009, pp. 102–106.
[24] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, R.
Bowden, Improving visual features for lip-reading, Proceedings of the International Conference on Audio-Visual Speech Processing (AVSP) 7 (3) (2010) 42–48.
[25] I. Matthews, T. Cootes, J. Bangham, S. Cox, R. Harvey, Extraction of visual features for lipreading, Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (2) (2002) 198–213. doi:10.1109/34.982900.
[26] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book (for HTK Version 3.4), Cambridge University Engineering Department, 2006. URL http://htk.eng.cam.ac.uk/docs/docs.shtml
[27] Q. Zhu, A. Alwan, On the use of variable frame rate analysis in speech recognition, in: Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on, Vol. 3, IEEE, 2000, pp. 1783–1786.
[28] K. Thangthai, R. Harvey, S. Cox, B.-J. Theobald, Improving lip-reading performance for robust audiovisual speech recognition using DNNs, in: Proc. FAAVSP, 1st Joint Conference on Facial Analysis, Animation and Audio-Visual Speech Processing, 2015.
[29] F. J. Huang, T. Chen, Tracking of multiple faces for human-computer interfaces and virtual environments, in: Proceedings of IEEE International Conference on Multimedia and Expo (ICME), Vol. 3, 2000, pp. 1563–1566. doi:10.1109/ICME.2000.871067.
[30] J. Jiang, A. Alwan, L. E. Bernstein, E. T. Auer, P. A. Keating, Similarity structure in perceptual and physical measures for visual consonants across talkers, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, 2002, pp. I-441–I-444. doi:10.1109/ICASSP.2002.5743749.
[31] S. Lesner, P. Kricos, Visual vowel and diphthong perception across speakers, Journal of the Academy of Rehabilitative Audiology 14 (1981) 252–258.
[32] R. Cutler, L.
Davis, Look who's talking: speaker detection using video and audio correlation, in: IEEE International Conference on Multimedia and Expo (ICME), Vol. 3, 2000, pp. 1589–1592. doi:10.1109/ICME.2000.871073.
[33] J. Luettin, N. Thacker, S. Beet, Speaker identification by lipreading, in: Proceedings of the Fourth International Conference on Spoken Language (ICSLP), Vol. 1, 1996, pp. 62–65. doi:10.1109/ICSLP.1996.607030.
[34] H. L. Bear, S. J. Cox, R. W. Harvey, Speaker-independent machine lip-reading with speaker-dependent viseme classifiers, Facial Animation and Audio-Visual Speech Processing (FAAVSP) 2015 (2015) 190–195. URL http://www.isca-speech.org/archive/avsp15/papers/av15_190.pdf
[35] J. L. Newman, S. J. Cox, Speaker independent visual-only language identification, in: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, IEEE, 2010, pp. 5026–5029.
[36] S. Taylor, B.-J. Theobald, I. Matthews, The effect of speaking rate on audio and visual speech, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3037–3041. doi:10.1109/ICASSP.2014.6854158.
[37] E. K. Patterson, S. Gurbuz, Z. Tufekci, J. N. Gowdy, CUAVE: A new audio-visual database for multimodal human-computer interface research, in: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, 2002, pp. II-2017–II-2020. doi:10.1109/ICASSP.2002.
[38] J. F. G. Perez, A. F. Frangi, E. L. Solano, K. Lukas, Lip reading for robust speech recognition on embedded devices, in: Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., Vol. 1, 2005, pp. 473–476. doi:10.1109/ICASSP.2005.
[39] K. Paleček, Lipreading using spatiotemporal histogram of oriented gradients, in: 2016 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 1882–1885. doi:10.1109/EUSIPCO.2016.
[40] R. E.
Shor, The production and judgment of smile magnitude, The Journal of General Psychology 98 (1) (1978) 79{96. [41] S. Fagel, Eects of smiling on articulation: Lips, larynx and acoustics, in: Development of multi- modal interfaces: active listening and synchrony, Springer, 2010, pp. 294{303. [42] M. Kienast, A. Paeschke, W. Sendlmeier, Articulatory reduction in emotional speech, in: Sixth European Conference on Speech Communication and Technology, 1999. [43] M. Kienast, W. F. Sendlmeier, Acoustical analysis of spectral and temporal changes in emotional speech, in: ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000. [44] F. Shaw, B.-J. Theobald, Expressive modulation of neutral visual speech, IEEE MultiMedia 23 (4) (2016) 68{78. [45] W. Hamza, E. Eide, R. Bakis, M. Picheny, J. Pitrelli, The ibm expressive speech synthesis system, in: Eighth International Conference on Spoken Language Processing, 2004. [46] N. N. Khatri, Z. H. Shah, S. A. Patel, Facial expression recognition: A survey, International Journal of Computer Science and Information Technologies (IJCSIT) 5 (1) (2014) 149{152. [47] S. Happy, A. Routray, Automatic facial expression recognition using features of salient facial patches, IEEE transactions on Aective Computing 6 (1) (2015) 1{12. [48] J. Yan, W. Zheng, Q. Xu, G. Lu, H. Li, B. Wang, Sparse kernel reduced-rank regression for bimodal emotion recognition from facial expression and speech, IEEE Transactions on Multimedia 18 (7) (2016) 1319{1329. [49] S. Zhang, X. Wang, G. Zhang, X. Zhao, Multimodal emotion recognition integrating aective speech with facial expression, WSEAS Transactions on Signal Processing 10 (2014) (2014) 526{537. [50] T. Cootes, G. Edwards, C. Taylor, Active appearance models, Pattern Analysis and Machine Intel- ligence, IEEE Transactions on 23 (6) (2001) 681 {685. doi:10.1109/34.927467. [51] R. Seymour, D. Stewart, J. 
Ming, Comparison of image transform-based features for visual speech recognition in clean and corrupted videos, Journal on Image and Video Processing 2008 (2008) 14. [52] G. Potamianos, C. Neti, J. Luettin, I. Matthews, Audio-visual automatic speech recognition: An overview, Issues in Visual and Audio-Visual Speech Processing 22 (2004) 23. [53] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, Cnn features o-the-shelf: an astounding baseline for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806{813. [54] Z. Yan, V. Jagadeesh, D. DeCoste, W. Di, R. Piramuthu, Hd-cnn: Hierarchical deep convolutional neural network for image classi cation, in: International Conference on Computer Vision (ICCV), Vol. 2, 2015. 38 [55] M. Sundermeyer, R. Schluter, H. Ney, Lstm neural networks for language modeling., in: Interspeech, 2012, pp. 194{197. [56] W. Byeon, T. M. Breuel, F. Raue, M. Liwicki, Scene labeling with lstm recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3547{3555. [57] J. S. Chung, A. Zisserman, Lip Reading in the Wild, Springer International Publishing, Cham, 2017, pp. 87{103. doi:10.1007/978-3-319-54184-6_6. URL http://dx.doi.org/10.1007/978-3-319-54184-6_6 [58] M. Wand, J. Koutn k, J. Schmidhuber, Lipreading with long short-term memory, in: Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, IEEE, 2016, pp. 6115{6119. [59] B.-J. Theobald, Visual speech synthesis using shape and appearance models, Ph.D. thesis, Univer- sity of East Anglia (2003). [60] C. A. Binnie, P. L. Jackson, A. A. Montgomery, Visual intelligibility of consonants: A lipreading screening test with implications for aural rehabilitation, Journal of Speech and Hearing Disorders 41 (4) (1976) 530. [61] C. G. 
Fisher, Confusions among visually perceived consonants, Journal of Speech, Language and Hearing Research 11 (4) (1968) 796. [62] J. R. Franks, J. Kimble, The confusion of english consonant clusters in lipreading, Journal of Speech, Language and Hearing Research 15 (3) (1972) 474. [63] B. E. Walden, R. A. Prosek, A. A. Montgomery, C. K. Scherr, C. J. Jones, Eects of training on the visual recognition of consonants, Journal of Speech, Language and Hearing Research 20 (1) (1977) [64] P. B. Kricos, S. A. Lesner, Dierences in visual intelligibility across talkers., The Volta Review 82 (1982) 219{226. [65] E. Owens, B. Blazek, Visemes observed by hearing-impaired and normal-hearing adult viewers, Journal of Speech and Hearing Research 28 (3) (1985) 381. [66] J. Lander, Read my lips: Facial animation techniques, http://www.gamasutra.com/view/feature/ 131587/read_my_lips_facial_animation_.php, accessed: 2014-01-28 (2014). [67] A. A. Montgomery, P. L. Jackson, Physical characteristics of the lips underlying vowel lipreading performance, The Journal of the Acoustical Society of America 73 (1983) 2134{2144. [68] E. B. Nitchie, Lip-Reading, principles and practise: A handbook for teaching and self-practise, Frederick A Stokes Co, New York, 1912. [69] E. Bozkurt, C. Erdem, E. Erzin, T. Erdem, M. Ozkan, Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation, in: 3DTV Conference, IEEE, 2007, pp. 1{4. [70] J. Jeers, M. Barley, Speechreading (lipreading), Thomas Spring eld, IL:, 1971. [71] S. Lee, D. Yook, Audio-to-visual conversion using Hidden Markov Models, in: PRICAI 2002: Trends in Arti cial Intelligence, Springer, 2002, pp. 563{570. [72] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, J. Zhou, Audio-visual speech recognition, in: Final Workshop 2000 Report, Vol. 764, 2000. [73] K. E. Finn, A. A. 
Montgomery, Automatic optically-based recognition of speech, Pattern Recogni- tion Letters 8 (3) (1988) 159{164. [74] F. Heider, G. M. Heider, An experimental investigation of lipreading, Psychological Monographs 52 (232) (1940) 124{153. [75] M. F. Woodward, C. G. Barber, Phoneme perception in lipreading, Journal of Speech, Language and Hearing Research 3 (3) (1960) 212. [76] A. Bhattachayya, On a measure of divergence between two statistical population de ned by their population distributions, Bulletin Calcutta Mathematical Society 35 (1943) 99{109. [77] K. Wilson, The Columbia guide to standard American English, New York : Columbia University Press, 1993. [78] S. Cox, R. Harvey, Y. Lan, J. Newman, B. Theobald, The challenge of multispeaker lip-reading, in: International Conference on Auditory-Visual Speech Processing, 2008, pp. 179{184. [79] H. L. Bear, Decoding visemes: improving machine lip-reading. PhD thesis, University of East Anglia, 2016. [80] W. M. Fisher, G. R. Doddington, K. M. Goudie-Marshall, The DARPA speech recognition research database: speci cations and status, in: Proceedings of the DARPA Workshop on speech recognition, 1986, pp. 93{99. [81] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, R. Bowden, Improving visual features for lip-reading, in: Proceedings of International Conference on Auditory-Visual Speech Processing, Vol. 201, 2010. 39 [82] Cambridge University, UK. BEEP pronounciation dictionary [online] (1997) [cited Jan 2013]. [83] J. Gower, Generalized procrustes analysis, Psychometrika 40 (1) (1975) 33{51. doi:10.1007/ BF02291478. URL http://dx.doi.org/10.1007/BF02291478 [84] S. Baker, Inverse compositional algorithm, in: K. Ikeuchi (Ed.), Computer Vision, Springer US, 2014, pp. 426{428. doi:10.1007/978-0-387-31439-6_759. URL http://dx.doi.org/10.1007/978-0-387-31439-6_759 [85] T. J. Hazen, Automatic alignment and error correction of human generated transcripts for long speech recordings., in: INTERSPEECH, Vol. 2006, 2006, pp. 1606{1609. 
[86] H. L. Bear, R. W. Harvey, B.-J. Theobald, Y. Lan, Which phoneme-to-viseme maps best improve visual-only computer lip-reading?, in: Advances in Visual Computing, Springer, 2014, pp. 230–239. doi:10.1007/978-3-319-14364-4_22.
[87] J. Demšar, Statistical comparisons of classifiers over multiple datasets, Journal of Machine Learning Research 7 (2006) 1–30.
[88] S. J. Young, G. Evermann, M. Gales, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK book version 3.4 (2006).
[89] R. R. Bouckaert, E. Frank, Evaluating the replicability of significance tests for comparing learning algorithms, in: Advances in knowledge discovery and data mining, Springer, 2004, pp. 3–12.
[90] Y. Bengio, Y. Grandvalet, No unbiased estimator of the variance of k-fold cross-validation, The Journal of Machine Learning Research 5 (2004) 1089–1105.
[91] H. L. Bear, G. Owen, R. Harvey, B.-J. Theobald, Some observations on computer lip-reading: moving from the dream to the reality, in: SPIE Security+ Defence, International Society for Optics and Photonics, 2014, pp. 92530G–92530G. doi:10.1117/12.2067464.

Appendix A. RMAV Speaker-dependent P2V maps

Table A.19: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp01
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /dZ/ /m/ /v01/ // /2/ /@/ /ay/ /v01/ /I@/ /t/ /T/ /uw/ /v01/ // /2/ /@/ /ay/ /v02/ /3/ /I/ /iy/ /k/ /eh/ /I@/ /I/ /iy/ /z/ /eh/ /I@/ /I/ /iy/ /n/ /N/ /r/ /s/ /v02/ /6/ /@U/ /v02/ /3/ /I/ /iy/ /k/ /v02/ /S/ /T/ /v/ /w/ /v03/ /ey/ /v03/ /O/ /3/ /ey/ /n/ /N/ /r/ /s/ /z/ /v04/ /@/ /D/ /E/ /eh/ /v04/ /A/ /sil/ /sil/ /sil/ /sil/ /sp/ /v03/ /b/ /d/ /f/ /k/ /U/ /v05/ /uw/ /gar/ /gar/ /A/ // /2/ /O/ /m/ /n/ /N/ /p/ /r/ /v05/ /A/ /v06/ /U/ /@/ /ay/ /@/ /b/ /tS/ /r/ /s/ /t/ /v06/ /I@/ /t/ /T/ /uw/ /v07/ /O@/ /tS/ /d/ /D/ /E/ /eh/ /sil/ /sil/ /sil/ /sp/ /z/ /v08/ /OI/ /eh/ /ey/ /f/ /g/ /H/ /gar/ /gar/ /A/ /O/ /AU/ /@/ /v07/ /6/ /@U/ /p/ /w/ /v09/ /@/ /H/ /dZ/ /m/ /6/ /@U/ /D/ /3/ /ey/ /g/ /H/ /v08/ /S/ /v10/ /AU/ /@U/ /OI/ /p/ /S/ /O@/ /H/ /dZ/ /6/ /@U/ /OI/ /v09/ /O/ /v11/ /b/ /d/ /f/ /k/ /O@/ /U/ /w/ /y/ /Z/ /OI/ /O@/ /U/ /uw/ /Z/ sp01 /v10/ // /m/ /n/ /N/ /p/ /r/ /Z/ /Z/ /v11/ /d/ /g/ /H/ /r/ /s/ /t/ /v12/ /b/ /v12/ /D/ /dZ/ /v13/ /y/ /v13/ /S/ /T/ /v/ /w/ /v14/ /2/ /ay/ /z/ /v15/ /Z/ /v14/ /g/ /v16/ /O@/ /v15/ /tS/ /H/ /v17/ /sil/ /v16/ /Z/ /v18/ /OI/ /sil/ /sil/ /sil/ /sp/ /v19/ /tS/ /v20/ /@/ /v21/ /AU/ /gar/ /gar/ /sp/

Table A.20: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp02
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /l/ /m/ /n/ /p/ /v01/ /@/ /ay/ /E/ /eh/ /v01/ /@/ /ay/ /b/ /d/ /v01/ /@/ /ay/ /E/ /eh/ /s/ /S/ /t/ /v/ /w/ /ey/ /I/ /iy/ /eh/ /ey/ /dZ/ /ey/ /I/ /iy/ /w/ /v02/ /O/ /I@/ /6/ /@U/ /v02/ /l/ /m/ /n/ /p/ /v02/ /b/ /m/ /n/ /N/ /v02/ /g/ /H/ /I@/ /I/ /v03/ // /2/ /AU/ /OI/ /s/ /S/ /t/ /v/ /w/ /r/ /s/ /S/ /t/ /v/ /k/ /v04/ /U/ /uw/ /w/ /v/ /w/ /y/ /z/ /v03/ /@/ /ay/ /b/ /d/ /v05/
/O@/ /sil/ /sil/ /sil/ /sp/ /sil/ /sil/ /sil/ /sp/ /eh/ /ey/ /dZ/ /v06/ /sil/ /gar/ /gar/ /A/ // /2/ /O/ /gar/ /gar/ /A/ // /2/ /O/ /v04/ /A/ /O/ /v07/ /A/ /@/ /tS/ /E/ /3/ /f/ /@/ /tS/ /d/ /D/ /f/ /v05/ /3/ /uw/ /y/ /z/ /v08/ /b/ /m/ /n/ /N/ /f/ /g/ /H/ /I@/ /I/ /f/ /g/ /H/ /I@/ /dZ/ sp02 /v06/ /6/ /@U/ /r/ /s/ /S/ /t/ /v/ /I/ /iy/ /k/ /N/ /6/ /dZ/ /k/ /l/ /6/ /@U/ /v07/ // /2/ /AU/ /OI/ /v/ /w/ /y/ /z/ /6/ /@U/ /OI/ /T/ /O@/ /@U/ /OI/ /T/ /O@/ /U/ /v08/ /f/ /N/ /O@/ /v09/ /dZ/ /O@/ /U/ /uw/ /y/ /z/ /U/ /uw/ /Z/ /v09/ /E/ /v10/ /d/ /D/ /f/ /g/ /z/ /Z/ /v10/ /tS/ /T/ /k/ /l/ /v11/ /Z/ /v11/ /tS/ /T/ /v12/ /U/ /v12/ /Z/ /v13/ /sil/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /@/ /sp/ /gar/ /gar/ /@/

Table A.21: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp03
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /ey/ /f/ /I/ /iy/ /v01/ /E/ /3/ /sil/ /uw/ /v01/ /ey/ /f/ /I/ /iy/ /v01/ /ay/ /eh/ /ey/ /I@/ /k/ /l/ /m/ /n/ /S/ /v02/ /U/ /k/ /l/ /m/ /n/ /S/ /iy/ /6/ /@U/ /S/ /v03/ /ay/ /eh/ /ey/ /I@/ /S/ /v02/ /g/ /k/ /l/ /m/ /v02/ /D/ /g/ /iy/ /6/ /@U/ /v02/ /E/ /r/ /s/ /t/ /p/ /r/ /s/ /t/ /T/ /v03/ /E/ /r/ /s/ /sil/ /v04/ /O/ /z/ /T/ /uw/ /z/ /v05/ /O@/ /sil/ /sil/ /sil/ /sp/ /sil/ /sil/ /sil/ /sp/ /v04/ /d/ /T/ /v/ /w/ /v06/ // /2/ /@/ /gar/ /gar/ /A/ // /2/ /O/ /gar/ /gar/ /A/ // /2/ /O/ /v05/ /O/ /@U/ /p/ /v07/ /@/ /@/ /ay/ /@/ /b/ /tS/ /@/ /@/ /b/ /tS/ /d/ /v06/ // /v08/ /AU/ /tS/ /d/ /D/ /eh/ /3/ /d/ /D/ /E/ /3/ /f/ /v07/ /@/ /ay/ /b/ /tS/ /v09/ /A/ /3/ /g/ /H/ /I@/ /N/ /f/ /H/ /dZ/ /N/ /OI/ sp03 /v08/ /N/ /v10/ /g/ /k/ /l/ /m/ /N/ /6/ /@U/ /OI/ /p/ /OI/ /S/ /O@/ /U/ /uw/ /v09/ /H/ /p/ /r/ /s/ /t/ /T/ /p/ /T/ /O@/ /U/ /v/ /uw/ /v/ /w/ /y/ /z/ /v10/ /A/ /eh/ /3/ /T/ /v/ /w/ /y/ /Z/ /z/ /Z/ /v11/ /O@/ /U/ /v11/ /tS/ /d/ /D/ /f/ /v12/ /2/ /I@/ /v12/ /dZ/ /v/ /w/ /z/ /v13/ /Z/ /v13/ /b/ /v14/ /@/ /v14/ /S/ /Z/ /v15/ /AU/ /v15/ /H/ /N/ /gar/ /gar/ /OI/ /sp/ /sil/
/sil/ /sil/ /sp/ /gar/ /gar/ /OI/

Table A.22: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp05
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ // /O/ /@/ /D/ /v01/ // /O/ /@/ /eh/ /v01/ /ay/ /b/ /d/ /w/ /v01/ /ay/ /uw/ /3/ /ey/ /I/ /iy/ /k/ /ey/ /I/ /iy/ /6/ /@U/ /v02/ // /O/ /@/ /D/ /v02/ /d/ /D/ /f/ /dZ/ /k/ /l/ /n/ /@U/ /3/ /ey/ /I/ /iy/ /k/ /l/ /m/ /n/ /r/ /s/ /v02/ /p/ /r/ /s/ /t/ /v02/ /E/ /U/ /k/ /l/ /n/ /s/ /S/ /z/ /v03/ /ay/ /uw/ /sil/ /sil/ /sil/ /sp/ /sil/ /sil/ /sil/ /sp/ /v03/ /I@/ /N/ /uw/ /v/ /v04/ /O@/ /gar/ /gar/ /A/ /2/ /AU/ /@/ /gar/ /gar/ /A/ // /2/ /O/ /v04/ /tS/ /6/ /v05/ /2/ /AU/ /E/ /f/ /g/ /H/ /I@/ /@/ /@/ /b/ /tS/ /E/ /v05/ /ay/ /b/ /d/ /w/ /v06/ /A/ /I@/ /I@/ /dZ/ /m/ /N/ /6/ /E/ /eh/ /3/ /ey/ /g/ /v06/ /f/ /m/ /v07/ /g/ /H/ /t/ /v/ /6/ /@U/ /OI/ /p/ /r/ /g/ /H/ /I@/ /I/ /iy/ /v07/ /A/ /g/ /H/ /v08/ /p/ /w/ /y/ /r/ /s/ /S/ /t/ /T/ /iy/ /N/ /6/ /@U/ /OI/ sp05 /v08/ /@U/ /S/ /v09/ /d/ /D/ /f/ /dZ/ /T/ /O@/ /U/ /uw/ /v/ /OI/ /p/ /t/ /T/ /O@/ /v09/ /dZ/ /l/ /m/ /n/ /r/ /s/ /v/ /y/ /z/ /Z/ /O@/ /U/ /v/ /w/ /y/ /v10/ /dZ/ /s/ /S/ /y/ /z/ /Z/ /v11/ /E/ /y/ /v10/ /N/ /T/ /v12/ /T/ /v11/ /b/ /tS/ /v13/ /2/ /AU/ /v12/ /Z/ /v14/ /Z/ /sil/ /sil/ /sil/ /sp/ /v15/ /O@/ /gar/ /gar/ /@/ /OI/ /v16/ /j/ /h/ /gar/ /gar/ /@/ /OI/ /sil/ /sp/

Table A.23: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp06
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /@/ /ay/ /d/ /D/ /v01/ /A/ // /2/ /@/ /v01/ /H/ /N/ /6/ /@U/ /v01/ /A/ // /2/ /@/ /eh/ /I/ /k/ /l/ /n/ /3/ /I@/ /I/ /6/ /@U/ /v02/ /@/ /ay/ /d/ /D/ /3/ /I@/ /I/ /6/ /@U/ /n/ /p/ /s/ /t/ /@U/ /eh/ /I/ /k/ /l/ /n/ /@U/ /v02/ /v/ /w/ /y/ /z/ /v02/ /sil/ /uw/ /n/ /p/ /s/ /t/ /v02/ /k/ /l/ /m/ /n/ /v03/ /m/ /v03/ /ay/ /ey/ /iy/ /U/ /sil/ /sil/ /sil/ /sp/ /r/ /s/ /S/ /t/ /v/ /v04/ /H/
/N/ /6/ /@U/ /v04/ /AU/ /O@/ /gar/ /gar/ /A/ // /2/ /O/ /v/ /w/ /y/ /z/ /v05/ /ey/ /iy/ /r/ /S/ /v05/ /E/ /@/ /b/ /tS/ /3/ /ey/ /sil/ /sil/ /sil/ /sp/ /v06/ /I@/ /v06/ /@/ /ey/ /f/ /g/ /I@/ /iy/ /gar/ /gar/ /O/ /AU/ /ay/ /@/ /v07/ /A/ // /2/ /3/ /v07/ /O/ /iy/ /dZ/ /m/ /OI/ /r/ /tS/ /d/ /D/ /E/ /ey/ /v08/ /f/ /T/ /O@/ /v08/ /k/ /l/ /m/ /n/ /r/ /S/ /T/ /O@/ /U/ /ey/ /f/ /g/ /H/ /iy/ sp06 /v09/ /uw/ /r/ /s/ /S/ /t/ /v/ /U/ /uw/ /v/ /w/ /y/ /iy/ /dZ/ /N/ /OI/ /T/ /v10/ /uw/ /v/ /w/ /y/ /z/ /y/ /z/ /Z/ /T/ /O@/ /U/ /uw/ /Z/ /v11/ /b/ /tS/ /g/ /v09/ /b/ /tS/ /d/ /D/ /Z/ /v12/ /O/ /dZ/ /g/ /dZ/ /v13/ /Z/ /v10/ /Z/ /v14/ /sil/ /v11/ /H/ /T/ /v15/ /@/ /v12/ /N/ /v16/ /AU/ /sil/ /sil/ /sil/ /sp/ /v17/ /u/ /w/ /gar/ /gar/ /OI/ /gar/ /gar/ /OI/ /sp/

Table A.24: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp08
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /eh/ /f/ /H/ /I/ /v01/ /A/ // /O/ /@/ /v01/ /eh/ /f/ /H/ /I/ /v01/ /A/ // /O/ /@/ /l/ /m/ /N/ /p/ /r/ /eh/ /ey/ /I/ /iy/ /uw/ /l/ /m/ /N/ /p/ /r/ /eh/ /ey/ /I/ /iy/ /uw/ /r/ /s/ /t/ /uw/ /r/ /s/ /t/ /uw/ /v02/ /A/ // /O/ /@/ /v02/ /U/ /sil/ /sil/ /sil/ /sp/ /v02/ /k/ /l/ /n/ /p/ /ey/ /n/ /U/ /v03/ /6/ /@U/ /gar/ /gar/ /A/ // /2/ /O/ /s/ /t/ /T/ /v/ /w/ /v03/ /ay/ /b/ /uw/ /v04/ /I@/ /@/ /ay/ /@/ /b/ /tS/ /w/ /z/ /v04/ /g/ /v05/ /AU/ /E/ /tS/ /d/ /D/ /E/ /3/ /sil/ /sil/ /sil/ /sp/ /v05/ /tS/ /v06/ /2/ /3/ /3/ /ey/ /g/ /I@/ /dZ/ /gar/ /gar/ /2/ /AU/ /@/ /b/ /v06/ /S/ /y/ /v07/ /@/ /dZ/ /k/ /n/ /6/ /@U/ /d/ /D/ /E/ /3/ /f/ /v07/ /6/ /v08/ /k/ /l/ /n/ /p/ /@U/ /OI/ /S/ /T/ /O@/ /f/ /g/ /H/ /I@/ /dZ/ sp08 /v08/ /k/ /s/ /t/ /T/ /v/ /w/ /O@/ /U/ /uw/ /v/ /w/ /dZ/ /m/ /N/ /6/ /@U/ /v09/ /dZ/ /w/ /z/ /w/ /y/ /z/ /Z/ /@U/ /OI/ /S/ /O@/ /U/ /v10/ /D/ /v/ /w/ /z/ /v09/ /d/ /D/ /f/ /H/ /U/ /y/ /Z/ /v11/ /T/ /Z/ /N/ /v12/ /3/ /I@/ /v10/ /g/ /dZ/ /v13/ /AU/ /@U/ /v11/ /b/ /tS/ /S/ /v14/ /2/ /E/ /v12/ /Z/ /v15/ /O@/ /v13/
/y/ /v16/ /@/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /OI/ /sil/ /sp/ /gar/ /gar/ /OI/ /O@/

Table A.25: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp09
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /D/ /E/ /3/ /g/ /v01/ /E/ /3/ /v01/ /O/ /N/ /6/ /@U/ /v01/ /2/ /@U/ /k/ /l/ /m/ /n/ /p/ /v02/ /A/ // /O/ /@/ /v02/ /D/ /E/ /3/ /g/ /v02/ /A/ // /O/ /@/ /p/ /eh/ /ey/ /I/ /iy/ /6/ /k/ /l/ /m/ /n/ /p/ /eh/ /ey/ /I/ /iy/ /6/ /v02/ /I@/ /y/ /6/ /p/ /6/ /v03/ /ay/ /r/ /s/ /S/ /v03/ /U/ /uw/ /v03/ /ay/ /r/ /s/ /S/ /v03/ /k/ /l/ /m/ /n/ /v/ /w/ /z/ /v04/ /O@/ /v/ /w/ /z/ /p/ /r/ /s/ /S/ /t/ /v04/ // /2/ /@/ /b/ /v05/ /I@/ /sil/ /sil/ /sil/ /sp/ /t/ /T/ /z/ /T/ /v06/ /AU/ /gar/ /gar/ /A/ // /2/ /AU/ /sil/ /sil/ /sil/ /sp/ /v05/ /eh/ /ey/ /f/ /I/ /v07/ /2/ /@U/ /@/ /b/ /tS/ /d/ /eh/ /gar/ /gar/ /AU/ /@/ /b/ /tS/ /v06/ /O/ /N/ /6/ /@U/ /v08/ /sil/ /eh/ /ey/ /f/ /H/ /I@/ /D/ /E/ /3/ /f/ /g/ /v07/ /A/ /v09/ /k/ /l/ /m/ /n/ /I@/ /I/ /dZ/ /OI/ /T/ /g/ /H/ /I@/ /dZ/ /OI/ sp09 /v08/ /A/ /p/ /r/ /s/ /S/ /t/ /T/ /O@/ /U/ /uw/ /y/ /OI/ /O@/ /U/ /uw/ /v/ /v09/ /U/ /uw/ /t/ /T/ /z/ /y/ /Z/ /v/ /w/ /y/ /Z/ /v10/ /dZ/ /v10/ /f/ /v11/ /tS/ /v11/ /d/ /D/ /dZ/ /v12/ /Z/ /v12/ /g/ /v/ /w/ /y/ /v13/ /O@/ /v13/ /b/ /v14/ /sil/ /v14/ /tS/ /H/ /v15/ /H/ /v15/ /Z/ /v16/ /AU/ /sil/ /sil/ /sil/ /sp/ /v17/ /a/ /a/ /gar/ /gar/ /@/ /OI/ /gar/ /gar/ /@/ /OI/ /sp/

Table A.26: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp10
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /I/ /iy/ /dZ/ /l/ /v01/ /@/ /ay/ /eh/ /3/ /v01/ /@/ /uw/ /v/ /w/ /v01/ // /2/ /O/ /U/ /N/ /I/ /iy/ /6/ /@U/ /v02/ /H/ /n/ /6/ /@U/ /v02/ /@/ /ay/ /eh/ /3/ /v02/ /H/ /n/ /6/ /@U/ /v02/ // /2/ /O/ /U/ /r/ /s/ /t/ /T/ /I/ /iy/ /6/ /@U/ /r/ /s/ /t/ /T/ /v03/ /O@/ /sil/ /sil/ /sil/ /sp/ /v03/ /d/ /D/ /f/ /H/
/v03/ /b/ /v04/ /E/ /uw/ /gar/ /gar/ /A/ // /2/ /O/ /l/ /m/ /n/ /p/ /r/ /v04/ // /d/ /D/ /E/ /v05/ /A/ /I@/ /ay/ /@/ /b/ /tS/ /d/ /r/ /s/ /t/ /v/ /w/ /ey/ /f/ /v06/ /AU/ /d/ /D/ /E/ /eh/ /3/ /w/ /z/ /v05/ /k/ /v07/ /sil/ /3/ /ey/ /f/ /g/ /I@/ /v04/ /b/ /tS/ /y/ /v06/ /@/ /uw/ /v/ /w/ /v08/ /OI/ /I@/ /I/ /iy/ /dZ/ /k/ /sil/ /sil/ /sil/ /sp/ /v07/ /ay/ /S/ /sil/ /v09/ /@/ /k/ /l/ /m/ /N/ /OI/ /gar/ /gar/ /A/ /AU/ /@/ /E/ sp10 /v08/ /U/ /v10/ /d/ /D/ /f/ /H/ /OI/ /S/ /O@/ /U/ /z/ /I@/ /dZ/ /N/ /OI/ /S/ /v09/ /2/ /O/ /z/ /l/ /m/ /n/ /p/ /r/ /z/ /Z/ /S/ /T/ /O@/ /uw/ /Z/ /v10/ /I@/ /r/ /s/ /t/ /v/ /w/ /Z/ /v11/ /tS/ /g/ /w/ /z/ /v12/ /@/ /3/ /v11/ /S/ /v13/ /A/ /AU/ /v12/ /g/ /dZ/ /N/ /v14/ /Z/ /v13/ /b/ /tS/ /y/ /v15/ /O@/ /v14/ /Z/ /v16/ /OI/ /v15/ /T/ /gar/ /gar/ /sp/ /sil/ /sil/ /sil/ /sp/

Table A.27: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp11
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /iy/ /k/ /m/ /n/ /v01/ /uw/ /v01/ /O/ /@/ /ay/ /tS/ /v01/ // /@/ /ay/ /E/ /6/ /p/ /r/ /s/ /t/ /v02/ // /@/ /ay/ /E/ /ey/ /3/ /ey/ /I/ /iy/ /t/ /3/ /ey/ /I/ /iy/ /v02/ /d/ /D/ /f/ /v02/ /dZ/ /k/ /l/ /m/ /v02/ /v/ /v03/ /A/ /v03/ /iy/ /k/ /m/ /n/ /N/ /p/ /r/ /s/ /t/ /v03/ /O/ /@/ /ay/ /tS/ /v04/ /2/ /O/ /@U/ /6/ /p/ /r/ /s/ /t/ /t/ /w/ /ey/ /v05/ /6/ /t/ /sil/ /sil/ /sil/ /sp/ /v04/ /d/ /D/ /f/ /v06/ /U/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /A/ /2/ /O/ /AU/ /v05/ /w/ /v07/ /O@/ /gar/ /gar/ /A/ // /2/ /AU/ /b/ /tS/ /d/ /D/ /f/ /v06/ /S/ /v08/ /sil/ /b/ /E/ /3/ /g/ /H/ /f/ /g/ /H/ /I@/ /6/ /v07/ /A/ // /2/ /b/ /v09/ /OI/ /H/ /I@/ /I/ /dZ/ /l/ /6/ /@U/ /OI/ /S/ /T/ /v08/ /AU/ /E/ /3/ /I/ /v10/ /I@/ /l/ /@U/ /OI/ /S/ /T/ /T/ /O@/ /U/ /uw/ /v/ /v09/ /T/ /O@/ /v11/ /AU/ /T/ /O@/ /U/ /uw/ /v/ /v/ /y/ /z/ /Z/ sp11 /v10/ /@U/ /v12/ /dZ/ /k/ /l/ /m/ /v/ /w/ /y/ /z/ /Z/ /v11/ /g/ /y/ /z/ /N/ /p/ /r/ /s/ /t/ /Z/ /v12/ /H/ /l/ /t/ /w/ /v13/ /I@/ /uw/ /v13/ /d/ /f/ /g/
/H/ /v14/ /Z/ /v14/ /S/ /v15/ /U/ /v15/ /y/ /z/ /v16/ /sil/ /v16/ /D/ /T/ /v/ /v17/ /dZ/ /v17/ /tS/ /v18/ /@/ /v18/ /Z/ /gar/ /gar/ /OI/ /sp/ /v19/ /b/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /@/

Table A.28: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp13
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /O/ /d/ /I/ /k/ /v01/ // /O/ /@/ /ay/ /v01/ /O/ /d/ /I/ /k/ /v01/ // /O/ /@/ /ay/ /n/ /p/ /s/ /uw/ /v/ /3/ /ey/ /I@/ /I/ /iy/ /n/ /p/ /s/ /uw/ /v/ /3/ /ey/ /I@/ /I/ /iy/ /v/ /z/ /Z/ /iy/ /v/ /z/ /Z/ /iy/ /v02/ /I@/ /v02/ /E/ /6/ /@U/ /uw/ /sil/ /sil/ /sil/ /sp/ /v02/ /d/ /f/ /g/ /k/ /v03/ /3/ /f/ /g/ /r/ /v03/ /AU/ /gar/ /gar/ /A/ // /2/ /AU/ /m/ /n/ /N/ /p/ /s/ /v04/ /b/ /D/ /E/ /eh/ /v04/ /2/ /U/ /ay/ /@/ /b/ /tS/ /D/ /s/ /t/ /v/ /w/ /z/ /v05/ /tS/ /v05/ /A/ /O@/ /D/ /E/ /eh/ /3/ /ey/ /z/ /v06/ /AU/ /iy/ /6/ /@U/ /v06/ /sil/ /ey/ /f/ /g/ /H/ /I@/ /sil/ /sil/ /sil/ /sp/ /v07/ /@/ /U/ /v07/ /@/ /I@/ /iy/ /dZ/ /m/ /N/ /gar/ /gar/ /A/ /2/ /AU/ /@/ /v08/ // /2/ /@/ /ay/ /v08/ /dZ/ /r/ /S/ /y/ /N/ /6/ /@U/ /OI/ /r/ /tS/ /D/ /E/ /H/ /dZ/ sp13 /v09/ /A/ /y/ /v09/ /d/ /f/ /g/ /k/ /r/ /S/ /t/ /T/ /O@/ /dZ/ /6/ /@U/ /OI/ /r/ /v10/ /m/ /sil/ /t/ /T/ /m/ /n/ /N/ /p/ /s/ /O@/ /U/ /w/ /y/ /r/ /S/ /T/ /O@/ /U/ /v11/ /S/ /s/ /t/ /v/ /w/ /z/ /U/ /uw/ /y/ /Z/ /v12/ /ey/ /z/ /v13/ /O@/ /w/ /v10/ /H/ /v14/ /N/ /v11/ /b/ /tS/ /D/ /gar/ /gar/ /OI/ /sp/ /v12/ /Z/ /v13/ /T/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /OI/

Table A.29: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp14
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /tS/ /iy/ /dZ/ /m/ /v01/ // /O/ /@/ /ay/ /v01/ // /3/ /ey/ /f/ /v01/ // /O/ /@/ /ay/ /@U/ /p/ /r/ /s/ /t/ /eh/ /3/ /ey/ /I/ /iy/ /v02/ /S/ /v/ /w/ /y/ /eh/ /3/ /ey/ /I/ /iy/ /t/ /T/ /iy/ /v03/ /tS/ /iy/ /dZ/ /m/ /iy/ /v02/ /@/ /ay/ /N/ /v02/
/uw/ /@U/ /p/ /r/ /s/ /t/ /v02/ /D/ /f/ /H/ /k/ /v03/ /O/ /b/ /d/ /D/ /v03/ /U/ /t/ /T/ /m/ /n/ /r/ /s/ /S/ /l/ /v04/ /I@/ /6/ /@U/ /v04/ /O/ /b/ /d/ /D/ /S/ /t/ /v/ /w/ /v04/ /S/ /v/ /w/ /y/ /v05/ /2/ /sil/ /l/ /sil/ /sil/ /sil/ /sp/ /v05/ /g/ /H/ /k/ /v06/ /AU/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /A/ /2/ /AU/ /@/ /v06/ /E/ /U/ /v07/ /A/ /gar/ /gar/ /A/ /2/ /AU/ /@/ /tS/ /d/ /g/ /I@/ /dZ/ /v07/ // /3/ /ey/ /f/ /v08/ /A/ /@/ /E/ /g/ /H/ /I@/ /dZ/ /N/ /6/ /@U/ /OI/ /v08/ /A/ /uw/ /v09/ /O@/ /I@/ /k/ /N/ /6/ /OI/ /OI/ /p/ /T/ /O@/ /U/ /v09/ /I@/ /v10/ /a/ /a/ /OI/ /O@/ /U/ /uw/ /Z/ /U/ /uw/ /y/ /z/ /Z/ sp14 /v10/ /2/ /6/ /v11/ /D/ /f/ /H/ /k/ /Z/ /Z/ /v11/ /I@/ /m/ /n/ /r/ /s/ /S/ /v12/ /Z/ /S/ /t/ /v/ /w/ /v13/ /O@/ /v12/ /z/ /v14/ /sil/ /v13/ /y/ /v15/ /AU/ /v14/ /b/ /tS/ /d/ /T/ /v16/ /i/ /a/ /v15/ /p/ /gar/ /gar/ /@/ /OI/ /sp/ /v16/ /g/ /v17/ /dZ/ /N/ /v18/ /Z/ /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ /@/ /OI/

Table A.30: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp15
Speaker Bear1 Bear2 Bear3 Bear4 Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
/v01/ /@/ /d/ /D/ /ey/ /v01/ /@/ /ay/ /eh/ /ey/ /v01/ /@/ /d/ /D/ /ey/ /v01/ /@/ /ay/ /eh/ /ey/ /I/ /iy/ /k/ /l/ /m/ /iy/ /@U/ /uw/ /I/ /iy/ /k/ /l/ /m/ /iy/ /@U/ /uw/ /m/ /n/ /y/ /v02/ /2/ /O/ /AU/ /E/ /m/ /n/ /y/ /v02/ /b/ /d/ /D/ /f/ /v02/ /I@/ /p/ /r/ /s/ /OI/ /sil/ /sil/ /sil/ /sp/ /k/ /l/ /m/ /n/ /N/ /t/ /T/ /z/ /v03/ /6/ /gar/ /gar/ /A/ // /2/ /O/ /N/ /p/ /v/ /v03/ /eh/ /@U/ /v04/ /A/ // /3/ /ay/ /@/ /b/ /tS/ /E/ /sil/ /sil/ /sil/ /sp/ /v04/ /A/ // /2/ /O/ /v05/ /sil/ /O@/ /E/ /eh/ /3/ /g/ /H/ /gar/ /gar/ /A/ // /2/ /O/ /v05/ /6/ /v06/ /U/ /H/ /I@/ /dZ/ /N/ /6/ /@/ /tS/ /E/ /3/ /H/ /v06/ /N/ /uw/ /v/ /v07/ /@/ /6/ /@U/ /OI/ /p/ /r/ /H/ /I@/ /dZ/ /6/ /OI/ /v07/ /U/ /v08/ /b/ /d/ /D/ /f/ /r/ /s/ /S/ /t/ /T/ /OI/ /r/ /s/ /S/ /t/ sp15 /v08/ /g/ /H/ /dZ/ /k/ /l/ /m/ /n/ /N/ /T/ /O@/ /U/ /uw/ /v/ /t/ /T/ /O@/ /U/ /w/ /v09/ /3/ /N/ /p/ /v/ /v/
/w/ /z/ /Z/ /w/ /y/ /z/ /Z/ /v10/ /b/ /tS/ /v09/ /r/ /s/ /S/ /t/ /v11/ /3/ /z/ /v12/ /ay/ /E/ /v10/ /dZ/ /v13/ /sil/ /O@/ /v11/ /Z/ /v14/ /AU/ /OI/ /v12/ /w/ /y/ /v15/ /Z/ /v13/ /H/ /v16/ /e/ /r/ /v14/ /tS/ /gar/ /gar/ /@/ /sp/ /sil/ /sil/ /sil/ /sp/

Journal

Electrical Engineering and Systems Science – arXiv (Cornell University)

Published: May 8, 2018
