

9.4.2 Emotion Recognition

For the emotion recognition task, it is likewise investigated whether the motion capture features affect the performance of the classifier positively.

The number of dyads for which the synchronization difference has been extracted, and for which the manual annotations of the child's emotional state have been carried out, is limited to 6 dyads. This means that 5 dyads are used to fit the HMM and one is used as the test set.

9.4.2.1 Parameter Optimization

After the synchronization has been performed, two models are again constructed: one for the sound-based features only, referred to as the sound-based HMM, and one that combines the sound features with the motion capture features, the sound/mocap-based HMM. Since the data set is reduced to half of the original set-up, see section 6.1, it is decided to re-estimate the optimal codebook size and the number of states for each model, using the full feature vectors.

Here the same approach is used as for the original model: the number of states is fixed at S = 5 while the codebook size K is estimated, after which K is fixed at its optimum and S is varied to determine the optimal number of states.
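As an illustration of this two-stage search, the following minimal Python sketch mirrors the procedure. The function evaluate_model is a hypothetical stand-in for fitting the discrete HMM on the 5 training dyads and scoring the held-out dyad, and the parameter grids are illustrative rather than the ones used here.

```python
import numpy as np

def evaluate_model(K, S, rng):
    # Hypothetical stand-in: fit the discrete HMM with codebook size K and
    # S states on the 5 training dyads, classify the held-out test dyad,
    # and return the error rate of one replication. The random value below
    # exists only to keep the sketch runnable.
    return rng.uniform(0.2, 0.5)

def mean_error(K, S, n_reps=15, seed=0):
    # Mean error rate over 15 replications, together with the standard
    # deviation of the mean (the red bars in figures 9.22 and 9.23).
    rng = np.random.default_rng(seed)
    errs = np.array([evaluate_model(K, S, rng) for _ in range(n_reps)])
    return errs.mean(), errs.std(ddof=1) / np.sqrt(n_reps)

# Stage 1: fix the number of states at S = 5 and sweep the codebook size K.
S_FIXED = 5
K_best = min(range(2, 11), key=lambda K: mean_error(K, S_FIXED)[0])

# Stage 2: fix K at its optimum and sweep the number of states S.
S_best = min(range(2, 16), key=lambda S: mean_error(K_best, S)[0])

print(f"optimal K = {K_best}, optimal S = {S_best}")
```

Note that the selection used below is not a strict minimization: S is chosen as the first value at which the error rate reaches its plateau with a small standard deviation of the mean, so the min above is a simplification.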

The feature vectors of the two models can be seen in table 9.18.

Model                   Features
Sound-based HMM         MFCC, delta-MFCC, energy, zcr
Sound/mocap-based HMM   MFCC, delta-MFCC, energy, zcr, mocap-energy, mocap-distance, mocap-angle

Table 9.18: The feature compositions for the two models that exclude and include mocap features, respectively.

For the sound-based HMM, the estimation of the optimal K and S is shown in figures 9.22(a) and 9.22(b), respectively. It is to be noted again that the choice of the number of states is based on the best codebook size.

Figure 9.22(a) illustrates that the optimal codebook size is clearly K = 2. For K = 3 the mean error rate increases heavily, whereupon it stabilizes around 45 % for K > 3. This course of the error rate as a function of the codebook size is very different from that of the full data set from section 9.2. This must be due to the much smaller data set size, which, as can be seen, has a large impact on the parameter estimation.

Figure 9.22: The choice of parameters for the sound-based HMM. (a) shows the estimation of the size of the codebook, and (b) the estimation of the number of states. The y-axis in both figures shows the mean error rate obtained over 15 replications. The red vertical lines indicate the standard deviation of the mean.
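For reference, the standard deviation of the mean shown as the red bars is assumed here to be the usual sample estimate over the $n = 15$ replications,

\[
\hat{\sigma}_{\bar{e}} = \frac{s}{\sqrt{n}},
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(e_i - \bar{e}\right)^2,
\]

where $e_i$ denotes the error rate obtained in replication $i$ and $\bar{e}$ their mean.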

With K = 2 fixed, S is varied to find the optimal number of states. From figure 9.22(b) it is observed that for increasing S the error rate stabilizes around 22 %. The first time the error rate reaches 22 % is at S = 10, where a small standard deviation of the mean is also seen. S = 10 is therefore chosen, and the optimal error rate obtained with this sound-based HMM on 5 dyads is thus 22 %.

For the sound/mocap-based HMM, the estimation of K and S can be seen in figures 9.23(a) and 9.23(b), respectively.

In figure 9.23(a) it can be observed that the best error rate is obtained for a codebook size of K = 3. The error rate here is 46 %, and it increases heavily with increasing K. Again, the surprisingly bad error rates must be caused by the much smaller data set as well as the inclusion of the motion capture features.

With K = 3 fixed, S is varied to find the optimal number of states.

Figure 9.23(b) illustrates the stabilization of the error rate with increasing S. The optimal number of states is observed to be S = 14. The best error rate for this feature combination is thereby 46 %.

Figure 9.23: The choice of parameters for the sound/mocap-based HMM. (a) shows the estimation of the size of the codebook, and (b) the estimation of the number of states. The y-axis in both figures shows the mean error rate obtained over 15 replications. The red vertical lines indicate the standard deviation of the mean.

9.4.2.2 Test of Features

Since the optimal combination of sound-based features was tested for the original set-up with the full data set, this combination is assumed to be valid for the sound-based HMM investigated here on the smaller data set as well.

Different feature compositions are, on the other hand, tested for the sound/mocap-based HMM. The full set of sound-based features is included in all compositions, since this set showed the best performance in the model for the larger data set, whereas the combinations of mocap-based features are varied. Table 9.19 below lists the compositions of features.

It can be observed in the table that the optimal feature composition uses the mocap-energy feature only, in combination with the sound-based features, at an error rate of 36 %. Although this is the best composition, the error rate obtained with the same data set but for sound-based features only reached a minimum of 22 %. From this it must be concluded that the motion capture features deteriorate the classifier for the emotion recognition task. If more synchronized files were available, and thereby a larger data set were at hand, it is possible that the inclusion of motion capture features could have a positive effect on the classifier's performance, or at least be indifferent to the classification, as was the case in the previous section on including mocap features in the speaker identification task.


Feature Composition                                                         Error rate
MFCC, delta-MFCC, energy, zcr, mocap-energy, mocap-distance, mocap-angle   46 %
MFCC, delta-MFCC, energy, zcr, mocap-energy, mocap-distance                 46 %
MFCC, delta-MFCC, energy, zcr, mocap-energy, mocap-angle                    65 %
MFCC, delta-MFCC, energy, zcr, mocap-distance, mocap-angle                  46 %
MFCC, delta-MFCC, energy, zcr, mocap-energy                                 36 %
MFCC, delta-MFCC, energy, zcr, mocap-distance                               46 %
MFCC, delta-MFCC, energy, zcr, mocap-angle                                  66 %

Table 9.19: The error rates for the sound/mocap-based HMM from 5 dyads, based on features from both sound and motion capture. The features with prefix "mocap" are from the motion capture modality, whereas the ones with no prefix are the features from the sound modality.
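The seven compositions in table 9.19 are exactly the full sound-based feature set combined with every non-empty subset of the three mocap features. A minimal Python sketch of this enumeration, in the same order as the table (the commented evaluation call is a hypothetical placeholder):

```python
from itertools import combinations

SOUND = ["MFCC", "delta-MFCC", "energy", "zcr"]
MOCAP = ["mocap-energy", "mocap-distance", "mocap-angle"]

# Every non-empty subset of the mocap features, always on top of the
# full sound-based feature set: 2^3 - 1 = 7 compositions in total.
for r in range(len(MOCAP), 0, -1):
    for subset in combinations(MOCAP, r):
        features = SOUND + list(subset)
        # error_rate, sem = mean_error(features)  # hypothetical evaluation call
        print(", ".join(features))
```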