

Stød detection

4.3 Detection experiments


Phones    Samples   Full    PLP    Select+
l? / l       26     0.788   0.788  0.588
m? / m        5     0.700   0.900  0.600
n? / n       58     0.638   0.664  0.569
N? / N        7     0.500   0.714  0.571
6? / 6        5     0.800   0.500  0.700
Mean accuracy       0.685   0.713  0.600
Std. dev.           0.220   0.266  0.104

Table 4.7: Stød occurrence and mean classification accuracy on the JHP sample for three feature sets.

4.3.5 Analysis

interpretability of the projected features. Training and development set evaluation shows that the performance of classifiers trained on the 40-dimensional projected features is not as good as that of classifiers trained using simple feature selection with 17 features. Using select+ features, an improvement can be observed only for the NB classifier, while the remaining classifiers perform similarly according to F1, yet with different precision and recall scores.

The high number of features is also due to the (almost) zero-knowledge approach to stød detection.

That a number of features are irrelevant to stød detection is therefore not surprising, e.g. in the case of ΔPLP, ΔΔPLP, ΔMFCC and ΔΔMFCC because these features are engineered to model the speed and acceleration of speech organs that are not correlated with the glottis.

There is also the possibility of collinear features. Both non-salience and collinearity are supported by the regularisation parameter α for the LogReg. and SVM classifiers, which is consistently below 1.0. For α < 1.0, the classifiers learn a sparse model. In sparse models, a number of learned coefficients become zero, yielding classification that relies on information from only a subset of features.
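The sparsity effect of strong regularisation can be sketched with scikit-learn. This is an illustrative stand-in, not the thesis' toolchain: the data is synthetic, and scikit-learn parameterises regularisation as an inverse strength C rather than α, so a small C corresponds to a strong penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the acoustic feature matrix: 120 features,
# of which only 15 are informative (sizes are illustrative).
X, y = make_classification(n_samples=500, n_features=120,
                           n_informative=15, random_state=0)

# A strong L1 penalty (small C) zeroes out coefficients for
# non-salient or collinear features, yielding a sparse model.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

n_zero = int(np.sum(clf.coef_ == 0))
print(f"{n_zero} of {clf.coef_.size} coefficients are exactly zero")
```

With an L2 penalty instead, the coefficients shrink but rarely become exactly zero, which is why the sparse solution is the one that signals irrelevant or collinear features.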

Though feature selection outperforms PCA, an average decrease in performance of 12.5% relative F1 (0.04 points absolute) compared to the full feature set is observed for classifiers trained on select+ features.

Considering that 85.8% of the full feature set is discarded and similar performance can be obtained on the JHP sample, select+ retains most of the salient information for stød detection. Unweighted LogReg. and SVM classifiers achieve the same F1 score (with different precision/recall balance) using both the full and select+ feature sets.
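As a sketch of the kind of reduction select+ performs, univariate selection in scikit-learn can keep 17 out of roughly 120 features (120 is inferred from 17 features remaining after 85.8% are discarded; the thesis' actual ranking procedure may differ):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative stand-in for the full feature set (~120 dimensions).
X, y = make_classification(n_samples=500, n_features=120,
                           n_informative=15, random_state=0)

# Keep the 17 highest-scoring features by a univariate F-test,
# analogous in spirit to the select+ subset.
selector = SelectKBest(score_func=f_classif, k=17)
X_sel = selector.fit_transform(X, y)
print(X_sel.shape)  # (500, 17)
```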

Although salient features have been discovered, the classes are not adequately separated in the feature space for linear classification. This is especially clear when some of the most salient features are plotted against each other in Figure 4.8.

While some samples are separated from each other and could be classified using a linear classifier, there are many samples from both classes that are clustered together. The optimal sample weight found using F1 optimisation has consistently been lower than 1 for the stød-bearing class. This indicates noisy annotation and is supported by the sample weight found when re-estimating a decision boundary using logistic regression or SVM and by the scatter plots in Figure 4.8.
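Down-weighting the stød-bearing class can be expressed through per-class weights at training time. The sketch below uses scikit-learn on synthetic skewed data; the weight 0.5 is illustrative, not the fitted value from the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Skewed synthetic data: ~95% negative (stød-less) samples.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.95], random_state=0)

# A class weight below 1.0 for the positive (stød-bearing) class
# down-weights its samples, mirroring the sub-unity sample weight
# that F1 optimisation converged to for noisy positive labels.
clf = LinearSVC(class_weight={0: 1.0, 1: 0.5}, dual=False)
clf.fit(X, y)
```

A weight below 1 tells the learner to tolerate misclassifying some positive samples, which is the expected behaviour when part of the positive annotation is noise.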

4.3.5.3 Class skewness

Due to the skewness of the distribution of stød, stød detection is a difficult classification task. Accounting for skewness using an inverse prior improves the precision/recall balance and optimising F1 rather than

Figure 4.8: Training samples plotted by salient features according to feature selection. Stød-less samples are blue and stød-bearing samples are purple. Panels: (a) pitch vs. energy; (b) Δprobability-of-voicing vs. energy; (c) peak slope vs. energy; (d) peak slope vs. pitch; (e) peak slope vs. Δprobability-of-voicing; (f) pitch vs. Δprobability-of-voicing.

          Predicted n   Predicted p
True n       6685          2707
True p        163           297

(a) Unnormalised detection counts for stød detection.

(b) Normalised confusion matrix for stød detection.

Figure 4.9: Raw classification counts and confusion matrix normalised by class support for visual presentation. The counts in 4.9a correspond to the classes in 4.9b. Classified using an unweighted linear SVM in the extended annotation condition on JHP.

accuracy prevents the classifiers from learning a decision boundary that simply classifies all samples as stød-less8. While the results are insufficient for practical application, the success of the classification is difficult to determine because of the skewness.
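The roughly 95% accuracy of the trivial all-negative classifier (footnote 8) follows directly from the class counts in Figure 4.9a:

```python
# Majority-class baseline from the development set counts in
# Figure 4.9a: 6685 + 2707 actual stød-less samples vs
# 163 + 297 actual stød-bearing samples.
negatives = 6685 + 2707  # true class n
positives = 163 + 297    # true class p

# Predicting "stød-less" for everything gets all negatives right
# and all positives wrong: high accuracy, but recall of zero.
accuracy = negatives / (negatives + positives)
print(round(accuracy, 3))  # 0.953
```

This is why F1, which is insensitive to the abundant true negatives, is the optimisation target rather than accuracy.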

A classifier with a low number of false positives could still be useful for downstream applications in both academia and industry, i.e. a high precision classifier with low recall. The best performing classifier on JHP data is the unweighted SVM trained in the raw/select+ condition. The success can be visualised using confusion matrices. The raw development set classification counts and normalised confusion matrix can be seen in Figure 4.9.

The matrix in Figure 4.9a shows the counts of true negatives, false positives9, false negatives and true positives from top left to bottom right. Figure 4.9b illustrates the true negative rate, false positive rate, false negative rate and true positive rate in the same order. Darker greens illustrate a higher rate after normalisation by the number of true class samples.

In this case, the true positive rate is high and the false positive rate is low which is desired in a high precision/low recall classifier. However, the proportion of false positives out of the total predicted positives

8Results in ca. 95% accuracy.

9Aka. false alarms.

Figure 4.10: Receiver operating characteristic curves for different sample weights on development data.

(false discovery rate) is 0.9, which indicates that due to the low prevalence of stød, even the best classifier in our experiments is not able to learn a good decision boundary.
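The rates in Figure 4.9 can be recomputed from the raw counts in 4.9a; the false discovery rate of 0.9 falls out directly:

```python
# Counts from Figure 4.9a, top left to bottom right.
tn, fp = 6685, 2707
fn, tp = 163, 297

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
fdr = fp / (fp + tp)  # false discovery rate: FP among predicted positives
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, FDR={fdr:.2f}")
```

Note that FDR and FPR have different denominators: FPR divides by the many actual negatives and so looks small, while FDR divides by the few predicted positives and exposes how unreliable a positive prediction is under heavy skew.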

The classifiers do exhibit some desirable properties. This is best illustrated using Receiver Operating Characteristic (ROC) curves. Figures 4.10 and 4.11 plot the true positive rate as a function of the false positive rate. The dashed line corresponds to random classification. These plots illustrate that the classifiers predict stød better than chance on both development and test data.
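A ROC curve of the kind shown in Figures 4.10 and 4.11 can be produced with scikit-learn; the sketch below substitutes synthetic skewed data for the stød development set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_curve, roc_auc_score

# Skewed synthetic stand-in: ~95% negative samples.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LinearSVC(dual=False).fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

# roc_curve yields the (FPR, TPR) pairs that the figures plot;
# an AUC above 0.5 means better-than-chance detection.
fpr, tpr, _ = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
print(f"AUC = {auc:.2f}")
```

Because ROC curves are computed from continuous decision scores rather than hard labels, they remain informative under class skew where raw accuracy does not.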

4.3.5.4 Discrimination experiment

Unlike stød detection, the phone discrimination experiments show a high degree of accuracy. As in Yoon et al. (2006), PLP features can to a certain degree discriminate between the stød-bearing and stød-less variants of the same phone. These features can be replaced by the full and select+ feature sets to obtain similar results.

If some distinctions maintained in semi-fine IPA are removed, the evaluation improves significantly, which indicates that the features included in select+ contain information that models distinctions not directly related to the stød+phone discrimination task. Removing these distinctions also adds a significant amount of training data, e.g. in the case of [O:?], which increases the number of training samples for [O?] by a third. The PLP-based SVM in particular gains significantly from the larger sample sizes.

Figure 4.11: Receiver operating characteristic curves for all annotations on test data.

On the JHP sample, mean accuracy decreases while variance increases. Combined with the low variance in cross-validation, this indicates that the statistical model overfits the training data and is not able to generalise to unseen data from a different speech genre. This is corroborated by the small variance for the classes from Table 4.7 in Appendix A.3 and the significant increase in variance on test data. Similar effects were observed in the F1 evaluation in the binary classification experiments and in the feature ranking experiments.

The variance does not increase in the evaluation of the SVM trained on select+ features; while its mean accuracy is lower, the model did not overfit the training data and therefore performs similarly on unseen data.