

Stød detection

4.3 Detection experiments


Phones    Samples   Full    PLP    Select+
l? / l       26     0.788   0.788  0.588
m? / m        5     0.700   0.900  0.600
n? / n       58     0.638   0.664  0.569
N? / N        7     0.500   0.714  0.571
6? / 6        5     0.800   0.500  0.700
Mean accuracy       0.685   0.713  0.600
Std. dev.           0.220   0.266  0.104

Table 4.7: Stød occurrence and mean classification accuracy on the JHP sample for three feature sets.

4.3.5 Analysis

interpretability of the projected features. Training and development set evaluation shows that the performance of classifiers trained on the 40-dimensional projected features is not as good as that of classifiers trained using simple feature selection with 17 features. Using select+ features, an improvement can be observed only for the NB classifier, while the remaining classifiers perform similarly according to F1, yet with different precision and recall scores.

The high number of features is also due to the (almost) zero-knowledge approach to stød detection.

That a number of features are irrelevant to stød detection is therefore not surprising, e.g. in the case of ΔPLP, ΔΔPLP, ΔMFCC and ΔΔMFCC because these features are engineered to model the speed and acceleration of speech organs that are not correlated with the glottis.

There is also the possibility of collinear features. Both non-salience and collinearity are supported by the regularisation parameter α for the LogReg. and SVM classifiers, which is consistently below 1.0. For α < 1.0, the classifiers learn a sparse model. In sparse models, a number of learned coefficients become zero, yielding classification that relies on information from only a subset of features.
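The sparsity effect of strong regularisation can be sketched with scikit-learn. This is an illustrative stand-in, not the thesis' toolchain: the data is synthetic, and scikit-learn parameterises regularisation as an inverse strength C rather than α, so a small C corresponds to a strong penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the acoustic feature matrix: 120 features,
# of which only 15 are informative (sizes are illustrative).
X, y = make_classification(n_samples=500, n_features=120,
                           n_informative=15, random_state=0)

# A strong L1 penalty (small C) zeroes out coefficients for
# non-salient or collinear features, yielding a sparse model.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

n_zero = int(np.sum(clf.coef_ == 0))
print(f"{n_zero} of {clf.coef_.size} coefficients are exactly zero")
```

With an L2 penalty instead, the coefficients shrink but rarely become exactly zero, which is why the sparse solution is the one that signals irrelevant or collinear features.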

Though feature selection outperforms PCA, an average decrease in performance of 12.5% relative F1 (0.04 points absolute) compared to the full feature set is observed for classifiers trained on select+ features.

Considering that 85.8% of the full feature set is discarded and similar performance can be obtained on the JHP sample, select+ retains most of the salient information for stød detection. Unweighted LogReg. and SVM classifiers achieve the same F1 score (with different precision/recall balance) using both the full and select+ feature sets.
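As a sketch of the kind of reduction select+ performs, univariate selection in scikit-learn can keep 17 out of roughly 120 features (120 is inferred from 17 features remaining after 85.8% are discarded; the thesis' actual ranking procedure may differ):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative stand-in for the full feature set (~120 dimensions).
X, y = make_classification(n_samples=500, n_features=120,
                           n_informative=15, random_state=0)

# Keep the 17 highest-scoring features by a univariate F-test,
# analogous in spirit to the select+ subset.
selector = SelectKBest(score_func=f_classif, k=17)
X_sel = selector.fit_transform(X, y)
print(X_sel.shape)  # (500, 17)
```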

Although salient features have been discovered, the classes are not adequately separated in the feature space for linear classification. This is especially clear when some of the most salient features are plotted against each other in Figure 4.8.

While some samples are separated from each other and could be classified using a linear classifier, there are many samples from both classes that are clustered together. The optimal sample weight found using F1 optimisation has consistently been lower than 1 for the stød-bearing class. This indicates noisy annotation and is supported by the sample weight found when re-estimating a decision boundary using logistic regression or SVM and by the scatter plots in Figure 4.8.
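Down-weighting the stød-bearing class can be expressed through per-class weights at training time. The sketch below uses scikit-learn on synthetic skewed data; the weight 0.5 is illustrative, not the fitted value from the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Skewed synthetic data: ~95% negative (stød-less) samples.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.95], random_state=0)

# A class weight below 1.0 for the positive (stød-bearing) class
# down-weights its samples, mirroring the sub-unity sample weight
# that F1 optimisation converged to for noisy positive labels.
clf = LinearSVC(class_weight={0: 1.0, 1: 0.5}, dual=False)
clf.fit(X, y)
```

A weight below 1 tells the learner to tolerate misclassifying some positive samples, which is the expected behaviour when part of the positive annotation is noise.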

4.3.5.3 Class skewness

Due to the skewness of the distribution of stød, stød detection is a difficult classification task. Accounting for skewness using an inverse prior improves the precision/recall balance and optimising F1 rather than

Figure 4.8: Training samples plotted by salient features according to feature selection. Stød-less samples are blue and stød-bearing samples are purple. Panels: (a) pitch vs. energy; (b) Δprobability-of-voicing vs. energy; (c) peak slope vs. energy; (d) peak slope vs. pitch; (e) peak slope vs. Δprobability-of-voicing; (f) pitch vs. Δprobability-of-voicing.

          Predicted n   Predicted p
True n       6685          2707
True p        163           297

(a) Unnormalised detection counts for stød detection.

(b) Normalised confusion matrix for stød detection.

Figure 4.9: Raw classification counts and confusion matrix normalised by class support for visual presentation. The counts in 4.9a correspond to the classes in 4.9b. Classified using an unweighted linear SVM in the extended annotation condition on JHP.

accuracy prevents the classifiers from learning a decision boundary that simply classifies all samples as stød-less8. While the results are insufficient for practical application, the success of the classification is difficult to determine because of the skewness.
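The roughly 95% accuracy of the trivial all-negative classifier (footnote 8) follows directly from the class counts in Figure 4.9a:

```python
# Majority-class baseline from the development set counts in
# Figure 4.9a: 6685 + 2707 actual stød-less samples vs
# 163 + 297 actual stød-bearing samples.
negatives = 6685 + 2707  # true class n
positives = 163 + 297    # true class p

# Predicting "stød-less" for everything gets all negatives right
# and all positives wrong: high accuracy, but recall of zero.
accuracy = negatives / (negatives + positives)
print(round(accuracy, 3))  # 0.953
```

This is why F1, which is insensitive to the abundant true negatives, is the optimisation target rather than accuracy.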

A classifier with a low number of false positives could still be useful for downstream applications in both academia and industry, i.e. a high precision classifier with low recall. The best performing classifier on JHP data is the unweighted SVM trained in the raw/select+ condition. The success can be visualised using confusion matrices. The raw development set classification counts and normalised confusion matrix can be seen in Figure 4.9.

The matrix in Figure 4.9a shows the counts of true negatives, false positives9, false negatives and true positives from top left to bottom right. Figure 4.9b illustrates the true negative rate, false positive rate, false negative rate and true positive rate in the same order. Darker greens illustrate a higher rate after normalisation by the number of true class samples.

In this case, the true positive rate is high and the false positive rate is low which is desired in a high precision/low recall classifier. However, the proportion of false positives out of the total predicted positives

8Results in ca. 95% accuracy.

9Aka. false alarms.

Figure 4.10: Receiver operating characteristic curves for different sample weights on development data.

(false discovery rate) is 0.9, which indicates that due to the low prevalence of stød, even the best classifier in our experiments is not able to learn a good decision boundary.
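The rates in Figure 4.9 can be recomputed from the raw counts in 4.9a; the false discovery rate of 0.9 falls out directly:

```python
# Counts from Figure 4.9a, top left to bottom right.
tn, fp = 6685, 2707
fn, tp = 163, 297

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
fdr = fp / (fp + tp)  # false discovery rate: FP among predicted positives
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, FDR={fdr:.2f}")
```

Note that FDR and FPR have different denominators: FPR divides by the many actual negatives and so looks small, while FDR divides by the few predicted positives and exposes how unreliable a positive prediction is under heavy skew.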

The classifiers do exhibit some desirable properties. This is best illustrated using Receiver Operating Characteristic (ROC) curves. Figures 4.10 and 4.11 plot the true positive rate as a function of the false positive rate. The dashed line corresponds to random classification. These plots illustrate that the classifiers predict stød better than chance on both development and test data.
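A ROC curve of the kind shown in Figures 4.10 and 4.11 can be produced with scikit-learn; the sketch below substitutes synthetic skewed data for the stød development set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_curve, roc_auc_score

# Skewed synthetic stand-in: ~95% negative samples.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LinearSVC(dual=False).fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

# roc_curve yields the (FPR, TPR) pairs that the figures plot;
# an AUC above 0.5 means better-than-chance detection.
fpr, tpr, _ = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)
print(f"AUC = {auc:.2f}")
```

Because ROC curves are computed from continuous decision scores rather than hard labels, they remain informative under class skew where raw accuracy does not.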

4.3.5.4 Discrimination experiment

Unlike stød detection, the phone discrimination experiments show a high degree of accuracy. As in Yoon et al. (2006), PLP features can to a certain degree discriminate between the stød-bearing and stød-less variants of the same phone. These features can be replaced by the full and select+ feature sets to obtain similar results.

If some distinctions maintained in semi-fine IPA are removed, the evaluation improves significantly, which indicates that the features included in select+ contain information that models distinctions not directly related to the stød+phone discrimination task. Removing these distinctions also adds a significant amount of training data, e.g. in the case of [O:?], which increases the number of training samples for [O?] by a third. The PLP-based SVM in particular gains significantly from the larger sample sizes.

Figure 4.11: Receiver operating characteristic curves for all annotations on test data.

On the JHP sample, mean accuracy decreases while variance increases. Combined with the low variance in cross-validation, this indicates that the statistical model overfits the training data and is not able to generalise to unseen data from a different speech genre. This is corroborated by the small variance for the classes from Table 4.7 in Appendix A.3 and the significant increase in variance on test data. Similar effects were observed in the F1 evaluation in the binary classification experiments and in the feature ranking experiments.

The variance does not increase in the evaluation of the SVM trained on select+ features; while its mean accuracy is lower, the model did not overfit the training data and therefore performs similarly on unseen data.