
Figure 5.5: Characteristic images (a)-(c) taken from the test sequence. The predicted face is to the left and the true face to the right.


read from the sequences. There are systems that outperform this one in terms of lipreading ability, e.g. Ezzat et al. (2002) and Cohen and Massaro (2002).

These systems are based on a phonetic alphabet, and hence require a phonetic transcription of the spoken sentences.

It is no surprise that a phoneme-to-image mapping produces better results than a sound-to-image mapping; just think of the ambiguities revealed by the McGurk effect. The lipreading ability of the phoneme-based systems indicates that, as speech recognition gets better, phoneme representations are a more natural way to go when creating visual speech. However, besides the difficulties in transcribing speech correctly, a main drawback of the phonetic approach is that there are many sounds that are not present in the phonetic alphabet.

Yawning, sneezing, and `hmm' sounds do not have a phonetic transcription.

Furthermore, the construction of models in different languages requires different phonetic alphabets and hence extraction of different key frames.

In contrast to this, the approach proposed in this work requires only that the training data contains the desired sounds and the matching movements. The speech can be in any language, and non-speech sounds are mapped as well as speech sounds. One of the greatest strengths of this kind of mapping is the easy setup for new conditions. Creating a system for a new person, a new vocabulary or an entirely new language is as easy as collecting video of the desired condition.

Then it is simply a matter of extracting the features and creating the map, both of which can be set up to run automatically.
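As a rough illustration of the audio side of the feature-extraction step, the Python sketch below computes MFCC vectors from a recording using the librosa library; the file name is a placeholder, and librosa is only one possible choice of tool, not the implementation used in this work. The corresponding AAM parameter extraction from the video frames would be run in the same automated fashion.

    import librosa

    # Placeholder path; any recording of the desired speaker, vocabulary
    # or language can be dropped in here.
    wav_path = "training_sentence.wav"

    # Load the audio and compute one 13-dimensional MFCC vector per analysis frame.
    signal, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

    # mfcc has shape (13, n_frames); these vectors form the audio side of the
    # audio-to-face mapping, with the AAM parameters as the video side.
    print(mfcc.shape)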

Finally, unlike other approaches, the use of continuous State-Space Models ensures that the video is smooth; there are no problems with jerky motion or unnatural transitions.
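To see why the state-space formulation gives smooth output, consider the minimal Kalman filter sketch below. It is a scalar toy example with made-up noise parameters, whereas the actual model in this work is multivariate and driven by the audio features; the point is only that the filtered trajectory varies far less from frame to frame than the raw, noisy observations.

    import numpy as np

    # Illustrative 1-D linear Gaussian state-space model:
    #   x_t = A x_{t-1} + w_t,  w_t ~ N(0, Q)
    #   y_t = C x_t     + v_t,  v_t ~ N(0, R)
    A, C, Q, R = 1.0, 1.0, 0.01, 0.5

    rng = np.random.default_rng(0)
    true_x = np.cumsum(rng.normal(0.0, np.sqrt(Q), 200))  # smooth latent trajectory
    y = true_x + rng.normal(0.0, np.sqrt(R), 200)         # jerky observations

    x_hat, P = 0.0, 1.0
    filtered = []
    for y_t in y:
        # Predict
        x_pred = A * x_hat
        P_pred = A * P * A + Q
        # Update
        K = P_pred * C / (C * P_pred * C + R)              # Kalman gain
        x_hat = x_pred + K * (y_t - C * x_pred)
        P = (1.0 - K * C) * P_pred
        filtered.append(x_hat)

    # Frame-to-frame variation of the filtered estimate vs. the raw observations.
    print(np.std(np.diff(filtered)), np.std(np.diff(y)))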

Possible uses:

As discussed later, there is a range of possibilities for improving the system, but even with the drawback that lipreading is not possible, there are still applications where cosmetic correction or generation of lip movements can be used:

• Low-bandwidth transmission for mobile phones and video conferences (lipreading accuracy not important).

• Correction of lip movements in synchronized movies and commercials.

• Rough animation in cartoons (could be fine-tuned by hand afterwards).

• In-game dialogue.

In low-bandwidth communication it is important to get a sense of presence, often on a relatively low-resolution screen. Lip movements that are time-aligned with the speech are an important factor for this. Even at a higher resolution, temporal accuracy is more important than entirely correct movements. The same holds for synchronized movies: removing the most obvious mismatches between sound and mouth movements would greatly improve the illusion of language change. In computer games more and more in-game dialogue is appearing; the dialogue is part of the game play, and even though the speech is prerecorded, lip movements can be generated when they are needed.

Problems (and fixes):

Besides the fact that the same information is not present in the sound and image domains, there are a number of other reasons why the mapping is not perfect:

• The training corpus is too small.

• The system is speaker dependent.

• The model is of the entire face, not only the mouth.

• The MFCC features are probably not optimal.

• The lip movements have too small an amplitude.

• The model is linear and Gaussian.

The most prominent reason for the deficiencies in the mapping is that the training corpus is too small; the VidTimit database contains only ten examples, leaving at most nine to train the system. The homemade recordings contain more examples, but still far from the 20 min. used in Ezzat et al. (2002). To improve performance, a more complete training set needs to be created. Increasing the amount of data is, however, a two-edged sword: the computations are already quite time consuming as it is, and increasing the training set would only make this worse.

An alternative way to get more training data is to train the system on several different speakers. The assumption would be that most of the variation is governed by the speech and that only a relatively small fraction of the face movements is person specific. Even if the face movements are highly correlated with the specific speaker, a large enough number of speakers would ensure that these inter-person variabilities were handled as noise. Preliminary experiments with this approach, however, did not produce viable results. Trying to understand why, two main obstacles come to mind. Firstly, the approach taken in this work models the entire face and not just the mouth. As described elsewhere, the reason for this is that a free-floating mouth is very unnatural to look at.

However, when trying to model several persons at a time, the appearance of the rest of the face becomes an important factor. A conceptually simple but time-consuming modification of the AAM would allow a hierarchical model, where the mouth is modelled by itself and then pasted back onto the face. Such a hierarchical scheme would not only help in the multi-person case, but it would also reduce the Kalman model complexity. Such a model would also allow separate control of the eyes, giving the possibility to blink and thereby add realism.
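A minimal sketch of the paste-back step in such a hierarchical scheme is given below, assuming the mouth region has already been synthesized separately as a small image patch together with a soft blending mask; the coordinates, sizes and weights in the toy usage are made up for the example, not taken from the actual model.

    import numpy as np

    def paste_mouth(face, mouth_patch, mask, top, left):
        """Blend a separately modelled mouth patch back into the face image.

        face:        (H, W, 3) float array, the full-face model output
        mouth_patch: (h, w, 3) float array, the synthesized mouth region
        mask:        (h, w) float array in [0, 1], soft blending weights
        top, left:   upper-left corner of the mouth region in the face image
        """
        out = face.copy()
        h, w = mask.shape
        region = out[top:top + h, left:left + w]
        alpha = mask[..., None]  # broadcast the mask over the colour channels
        out[top:top + h, left:left + w] = alpha * mouth_patch + (1 - alpha) * region
        return out

    # Toy usage with synthetic data and made-up coordinates.
    face = np.zeros((100, 80, 3))
    mouth = np.ones((20, 30, 3))
    mask = 0.8 * np.ones((20, 30))
    composited = paste_mouth(face, mouth, mask, top=65, left=25)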