
9.1.2 Confusion Matrix

9.1 Speaker Identification 93

the imbalanced data set and thereby unequal prior probabilities for each class; recall table 5.4, which showed the prior probabilities for each class in the training set and the test set.

From table 5.4 it can be seen that the prior probabilities of the classes both and mother are 13 % and 32 %, respectively. This, as mentioned, might be the reason why no observations from the class no one are classified as the both class.

That the two classes mother and no one both have very large priors probably explains the 121 and 166 misclassified observations for the ANN and TREE, respectively. The difference between these two classes is presumed to be large in the feature space, so few misclassifications between the two would be expected. Clearly, this is an example of the effect of the prior probabilities in classification tasks.
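The effect of unequal priors can be illustrated with a minimal Bayes-rule sketch. Only the both, mother and no one priors below come from table 5.4; the child prior and all likelihood values are invented for illustration:

```python
# Minimal sketch of how unequal priors tip a Bayes decision.
# Priors for "both", "mother" and "no one" follow table 5.4;
# the "child" prior and the likelihoods are invented for illustration.
priors = {"mother": 0.32, "both": 0.13, "no one": 0.45, "child": 0.10}

# Two classes that the features barely separate for some ambiguous frame:
likelihood = {"mother": 0.50, "no one": 0.55}

# Posterior p(c|x) is proportional to p(x|c) p(c): the larger prior dominates
posterior = {c: likelihood[c] * priors[c] for c in likelihood}
decision = max(posterior, key=posterior.get)
print(decision)  # -> no one
```

Even though the likelihoods are nearly identical, the 45 % prior pulls the decision towards no one, mirroring the pattern seen in the confusion matrices.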

Because the GMM is the most widely applied classifier in the speaker identification problem [57], [31], [33], it is interesting to analyse the confusion matrix for the GMM as well. This is shown in figure 9.8, from which it can be seen that the confusion is largest when the child speaks. Furthermore, the misclassifications of the GMM are somewhat similar to the mistakes made by the ANN and TREE, only with a higher frequency.
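A confusion matrix of the kind analysed here is tabulated directly from the label sequences; a minimal sketch (the label sequences below are toy data, not the thesis results):

```python
import numpy as np

classes = ["mother", "child", "both", "no one"]
idx = {c: i for i, c in enumerate(classes)}

# Toy label sequences; in the thesis the true labels are the Babylab
# annotations and the predictions are the classifier outputs.
true_labels = ["mother", "mother", "child", "no one", "both", "child"]
pred_labels = ["mother", "no one", "child", "no one", "both", "mother"]

# Rows: true class, columns: predicted class.
cm = np.zeros((len(classes), len(classes)), dtype=int)
for t, p in zip(true_labels, pred_labels):
    cm[idx[t], idx[p]] += 1

error_rate = 1.0 - cm.trace() / cm.sum()
print(cm)
print(f"error rate: {error_rate:.0%}")
```

Off-diagonal entries such as `cm[idx["mother"], idx["no one"]]` correspond directly to the mother-to-no-one confusions discussed below.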

Figure 9.8: Confusion matrix for the GMM.

In continuation of the discussion of the confusion matrices shown in this section, it is interesting to move deeper into the discussion of the errors that the classifiers make. The confusion matrices show that 274, 227 and 263 observations belonging to the class mother are classified as the class no one by the three classifiers ANN, TREE and GMM, respectively. The manually annotated labels made at Babylab are, as already stated, used as the true classes, and naturally human errors will occur in the coding process. In continuation hereof, certain


guidelines have been made at Babylab for these codings. One of the instructions for the manual annotations is that when the mother whispers, the true class label is set to the class mother.

The low power in the signal at time instances where the mother whispers could be the reason why these 274, 227 and 263 observations are classified as the class no one. Of course it should be kept in mind that this class, no one, as seen from table 5.4, has a prior probability of 45 %, but because the class mother also has a high prior probability (32 %), this fact can probably explain only a few of the misclassifications of mother into the no one group. In continuation hereof, the intuition is that the classifier would normally be able to distinguish between these two groups, due to their presumed dissimilarity in the feature space. It is therefore assumed that by far the majority of the misclassifications of the class mother to the class no one are due to the manually annotated labels, where whisperings of the mother are assigned to the class mother.
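The low-power argument can be checked with a short-time energy sketch on synthetic signals (the frame and hop sizes, and the noise stand-ins for voiced and whispered speech, are arbitrary illustrative choices, not the thesis settings):

```python
import numpy as np

def short_time_energy(x, frame_len=160, hop=80):
    """Sum of squares per frame; whispered segments sit near the noise floor."""
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([np.sum(x[i:i + frame_len] ** 2) for i in starts])

rng = np.random.default_rng(0)
voiced = rng.normal(0.0, 1.0, 800)    # stand-in for normally voiced speech
whisper = rng.normal(0.0, 0.05, 800)  # similar character, far less power
energy = short_time_energy(np.concatenate([voiced, whisper]))

# Frames in the whispered half carry orders of magnitude less energy,
# which is why whispered mother frames resemble the "no one" class.
print(energy[:3].round(1), energy[-3:].round(1))
```

With a 400-fold difference in variance, the energy gap between the two halves dwarfs any frame-to-frame fluctuation.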

Another example of these guidelines for the manual codings is that the child's burping and hiccups are not included in the class child, with the argument that these are not to be used in the further analysis of the labels. This could also cause confusion in the classification of speaker identity, because the signal segments of these occurrences contain energy as well as spectral content.

When the annotations are carried out at Babylab, one coder annotates the full 10 minutes of the recording while another coder annotates 2 minutes of the same recording, for the sake of reliability testing. The confusion between two coders can be seen for two different dyads, 018 and 012, in the confusion matrices in figure 9.9. The confusion between two coders for dyads 006 and 020 can be found in appendix D.3.

The confusion matrices in figures 9.9(a) and 9.9(b) show that the error rates between the two coders' labels are 19 % and 7 %, respectively, whereas the ones shown in the appendix are 8 % and 31 %, respectively. This gives rise to the question: what are the true labels?
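The coder-versus-coder error rate used here is simply the frame-wise disagreement over the doubly coded stretch; a minimal sketch (the label sequences are toy data):

```python
# Frame-wise labels from two coders over the doubly coded stretch (toy data).
coder_a = ["mother", "mother", "no one", "child", "no one",
           "mother", "no one", "both", "no one", "mother"]
coder_b = ["mother", "no one", "no one", "child", "no one",
           "mother", "mother", "both", "no one", "mother"]

# Disagreement rate: fraction of frames where the two labels differ.
disagreements = sum(a != b for a, b in zip(coder_a, coder_b))
error_rate = disagreements / len(coder_a)
print(f"coder-vs-coder error rate: {error_rate:.0%}")  # -> 20%
```

The same computation against the classifier output, instead of a second coder, gives the automatic error rates reported in this thesis.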

The pursuit of as small an error rate as possible should of course be held up against the fact that no exact definition of the true labels is available. This means that if an error rate of 3 % is obtained when comparing the automatically estimated labels with the annotations of one coder, an error rate of 15 % might be obtained if the same automatically estimated labels were compared to the annotations made by another coder.

It should be mentioned that the recording with the coder confusion of 31 % is being re-annotated by Babylab due to this very poor reliability. In continuation of the discussion about the true labels, it should be noticed that the precision with which Babylab performs the annotations in Praat is 10 ms. It can be doubted that a precision of this size will always result in the true class labels, because the human ear simply cannot validate the reliability of labels at this resolution.

Figure 9.9: The human confusion between two coders at Babylab shown for (a) dyad 018 and (b) dyad 012.

It should be noted that only these four data sets with double codings for reliability were available from Babylab. If double codings were available for all dyads, a probability density function over the human error rates could be calculated. Given such a density, it would be possible to see whether the error rates obtained with the machine learning approach in this thesis belong to it; if so, the classifier performance would be just as good as the manual annotations. Since this is not possible, it can only be assumed that the classifier confusions are similar in size to the human confusions.
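With only four doubly coded dyads a density estimate is out of reach, but a crude version of the comparison proposed above can be sketched as a range check (the classifier error rate below is hypothetical):

```python
import numpy as np

# The four coder-vs-coder error rates reported in this section and appendix D.3
human_error_rates = np.array([0.19, 0.07, 0.08, 0.31])

# Hypothetical automatic error rate to hold up against the human spread
classifier_error = 0.15

within_human_range = (human_error_rates.min() <= classifier_error
                      <= human_error_rates.max())
print(within_human_range)  # -> True
```

With enough dyads, the range check could be replaced by an actual density estimate and a tail-probability test, but the four available values only support this coarse comparison.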

To conclude this section, it is chosen to exclude the MNR and KNN classifiers in the subsequent sections due to their higher error rates compared to the ANN and TREE; this is why the ANN and TREE are used in the following experiments. It is furthermore chosen to investigate the GMM as well, because this classifier, as mentioned, is the most commonly used in the literature on speaker identification.