
4.3 Detection experiments

4.3.3 Binary classification experiment

In the binary stød detection experiments, performance of the classifiers will be evaluated on the development set according to recall, precision and F1-score. Precision is measured as

\[
\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \tag{4.4}
\]

For a perfect classifier, precision (and recall) becomes 1. As the proportion of False positives increases, precision decreases. In the context of the experiments in this chapter, precision can be described as the ability of a classifier not to classify stød-less samples as stød-bearing.

Recall is measured as

\[
\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \tag{4.5}
\]

Recall decreases as the proportion of False negatives increases. Recall is interpreted as the ability of a classifier to label all samples in the data annotated with stød as stød-bearing.

F1 is the harmonic mean of precision and recall where recall and precision both have equal weight. The parameter set and sample weight that optimises F1 on the development set is used to train the classifier on all the training data before evaluating on the test data.
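The three metrics above can be sketched as follows. This is an illustrative stand-alone implementation of Equations 4.4 and 4.5 and the harmonic mean, not the evaluation code used in the experiments; the toy labels at the bottom are invented for the example.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = stød-bearing)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall (equal weights).
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: 1 = stød-bearing, 0 = stød-less.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

Note that, because F1 is a harmonic mean, a classifier cannot compensate for very low precision with high recall (or vice versa), which is visible in the sample-weighted results below.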

Results

To distinguish the different data sets used, experiments using the original data set will be referred to as raw, while experiments using the extended annotation will be denoted as extended. In addition, the feature set names full, select+ and PCA will indicate the acoustic features used in the experiment.

The classification results on development data with classifiers trained on the full feature set are displayed in Table 4.1.

Grid search finds regularisation values of 0.001 and smaller (SVM and LogReg), indicating that the best performance is obtained with sparse statistical models. This suggests that some features are superfluous for the classification task, i.e. the coefficients for some features become zero and do not inform the classification.

F1 is maximised using a sample weight of 0.3 for the stød-bearing class in the raw condition. When the annotation is extended, we see an increase in F1 for all classifiers with the exception of GMM. The optimal sample weight found in the extended condition for both classifiers is 0.4, which is higher than the sample weight found in the raw condition.
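The selection procedure described above can be sketched as a simple loop: every combination of regularisation strength and stød-class sample weight is fit on the training data and scored by F1 on the development set, and the best pair is kept for the final refit. The `fit_classifier` and `f1_on_dev` functions below are hypothetical stand-ins for the real estimators (LogReg, SVM) and evaluation code; the toy scoring function is invented so the example runs.

```python
from itertools import product

def select_hyperparameters(fit_classifier, f1_on_dev, cs, stoed_weights):
    """Return the (C, sample weight) pair that maximises dev-set F1."""
    best_f1, best_params = -1.0, None
    for c, w in product(cs, stoed_weights):
        model = fit_classifier(c=c, stoed_weight=w)
        f1 = f1_on_dev(model)
        if f1 > best_f1:
            best_f1, best_params = f1, (c, w)
    return best_params, best_f1

# Toy stand-ins: pretend dev-set F1 peaks at C=0.001, weight=0.3,
# mimicking the raw-condition result reported above.
def fit_classifier(c, stoed_weight):
    return (c, stoed_weight)

def f1_on_dev(model):
    c, w = model
    return 0.30 - abs(w - 0.3) - (0.1 if c != 0.001 else 0.0)

params, f1 = select_hyperparameters(
    fit_classifier, f1_on_dev,
    cs=[0.001, 0.01, 0.1, 1.0],
    stoed_weights=[0.2, 0.3, 0.4, 0.5])
print(params, round(f1, 2))
```

The winning parameter pair is then used to retrain on all training data before the test-set evaluation, as described above.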

                      Raw                        Extended
Classifier    Precision  Recall  F1      Precision  Recall  F1
NB            0.10       0.69    0.18    0.14       0.69    0.24
GMM           0.08       0.94    0.14    0.02       0.09    0.03
LogReg        0.13       0.79    0.22    0.17       0.76    0.28
LogReg+sw     0.24       0.40    0.30    0.22       0.61    0.32
SVM           0.13       0.80    0.22    0.17       0.78    0.28
SVM+sw        0.27       0.36    0.31    0.21       0.60    0.32

Table 4.1: Precision, Recall and F1 for classifiers trained on the full feature set. The best metrics in a column are bold faced. LogReg and SVM were refit using sample weights 0.3/1 in the raw condition and 0.4/1 in the extended condition for the stød-bearing/stød-less classes, respectively.

4.3.3.1 Classification with feature selection

The experiments are repeated using the select+ feature set and the results are reported in Table 4.2. Apart from the feature set, the experimental setup is identical to the experiments in the section above.

                      Raw                        Extended
Classifier    Precision  Recall  F1      Precision  Recall  F1
NB            0.12       0.31    0.17    0.16       0.32    0.21
GMM           0.12       0.76    0.21    0.12       0.76    0.21
LogReg        0.14       0.73    0.23    0.15       0.68    0.24
LogReg+sw     0.18       0.50    0.27    0.19       0.43    0.26
SVM           0.14       0.74    0.23    0.14       0.74    0.23
SVM+sw        0.18       0.51    0.27    0.20       0.39    0.27

Table 4.2: Precision, Recall and F1 for classifiers trained on select+ features. The best metrics in a column are bold faced. LogReg and SVM were refit using sample weights 0.4/1 for the stød-bearing/stød-less classes in the raw condition and 0.5/1 in the extended condition.

Comparing the raw/select+ condition in Table 4.2 to the raw/full condition in Table 4.1, the GMM classifier performs better by 0.07 F1 absolute, while NB, whose recall is less than half of that in the previous experiment, performs worse by 0.01 F1. F1 scores are higher using select+ features for LogReg and SVM, but only by 0.01 F1 absolute, and an increase in precision counterbalances a lower recall. Sample weighting reverses the comparison, and the raw/full condition obtains higher F1. The effect of sample weighting is to even out the imbalance between precision and recall, but neither LogReg nor SVM outperforms the classifiers in Table 4.1.

Using the extended annotation, NB performance improves by 0.04 F1 absolute compared to the raw condition in Table 4.2. GMM results are not influenced by the change in annotation, and the SVM classifier performs identically. GMM is the only classifier that performs better using select+ features in both annotation conditions. Sample weighting improves F1 scores, but with select+ features the classifiers do not reach the same level of performance.

4.3.3.2 Classification with feature projection

Classifiers trained on the full feature set outperform classifiers trained on select+ features. Linear dimensionality reduction using PCA may be able to reduce the number of features while retaining more information than feature selection. Due to hardware limitations, batch PCA could not be applied to the full feature set, and Incremental PCA is applied instead to learn a model that projects the full feature vectors to 40 dimensions. The results can be seen in Table 4.3.

                      Train                      Development
Classifier    Precision  Recall  F1      Precision  Recall  F1
NB            0.14       0.31    0.19    0.14       0.31    0.19
GMM           0.03       0.24    0.06    0.03       0.24    0.06
LogReg        0.10       0.74    0.18    0.10       0.74    0.18
SVM           0.10       0.76    0.18    0.10       0.77    0.18

Table 4.3: Precision, Recall and F1 for classifiers trained on 40-dimensional PCA-projected training data.

This experiment only includes a raw/PCA condition. Using incremental PCA for stød detection reduces precision compared to the select+ conditions, but does not lead to an increase in recall, and results in comparatively low F1 scores on both training and development data. Incremental PCA projection to 60, 80 and 100 features has also been performed to determine whether a 40-dimensional feature space was too small to retain salient information, but PCA projection does not outperform feature selection.
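The idea behind incremental PCA when the full feature matrix does not fit in memory can be sketched as follows: the mean and covariance are accumulated batch by batch, and the top-k eigenvectors of the covariance give the projection. This is a minimal illustration of the principle (here k = 2 on toy data; the experiments used k = 40), not the actual implementation used, which would typically be a library routine such as scikit-learn's IncrementalPCA.

```python
import numpy as np

def fit_pca_in_batches(batches, n_features, k):
    """Accumulate mean/covariance over batches, return top-k projection."""
    n = 0
    s = np.zeros(n_features)                  # running sum of x
    ss = np.zeros((n_features, n_features))   # running sum of x x^T
    for batch in batches:
        batch = np.asarray(batch, dtype=float)
        n += batch.shape[0]
        s += batch.sum(axis=0)
        ss += batch.T @ batch
    mean = s / n
    cov = ss / n - np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalue order
    components = eigvecs[:, ::-1][:, :k]      # keep top-k eigenvectors
    return mean, components

def project(x, mean, components):
    """Centre the data and project onto the k principal components."""
    return (np.asarray(x, dtype=float) - mean) @ components

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
batches = [data[i:i + 50] for i in range(0, 200, 50)]
mean, comps = fit_pca_in_batches(batches, n_features=5, k=2)
reduced = project(data, mean, comps)
print(reduced.shape)  # (200, 2)
```

Only one pass over the data is needed, which is what makes the approach viable under the hardware limitations mentioned above.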

Exponential features

1st, 2nd and 3rd order exponential features were computed on select+. The best NB classifier obtained 0.18 F1 on both training and development data and the best SVM achieved 0.22 F1. Like incremental PCA projection, exponential and interaction features did not improve performance over classifiers trained on full or select+ features.
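The kind of feature expansion referred to above can be sketched as follows. The exact construction of the exponential and interaction features is assumed here: per-feature powers up to order 3 plus pairwise products, which is one common way of building such an expansion.

```python
from itertools import combinations

def expand_features(x, max_order=3):
    """Augment a feature vector with per-feature powers and
    pairwise interaction terms (hypothetical construction)."""
    expanded = list(x)
    # 2nd- and 3rd-order powers of each feature.
    for order in range(2, max_order + 1):
        expanded.extend(v ** order for v in x)
    # Pairwise interaction terms.
    expanded.extend(a * b for a, b in combinations(x, 2))
    return expanded

print(expand_features([2.0, 3.0]))  # [2.0, 3.0, 4.0, 9.0, 8.0, 27.0, 6.0]
```

The expansion grows the feature space quickly, which may explain why the sparse regularised models found in grid search did not benefit from it.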

4.3.3.3 JHP evaluation

In this section, classifiers trained on raw/full, raw/select+, extended/full and extended/select+ conditions are evaluated on the JHP data set. The annotation created by IPA3 will form the basis for evaluation.

IPA3 was chosen because the MACE evaluation ranked IPA3 as one of the two most competent annotators.

The evaluation of classifiers trained on raw annotations can be seen in Table 4.4. Compared to training and development set evaluation, a decrease in all measures is observed in Table 4.4. The spontaneous speech genre poses a difficult task for classifiers trained on data extracted from elicited speech. Using select+ features leads to generally higher precision, while classifiers trained on the full feature set favour recall. It is, however, possible to achieve the same F1 score with both feature sets.

The effect of sample weighting is a large decrease in F1 in the JHP evaluation. While the sample-weighted classifiers achieve the best precision, recall drops to between 0.02 and 0.13.

                              Full                       Select+
Classifier            Precision  Recall  F1      Precision  Recall  F1
NB                    0.09       0.28    0.13    0.08       0.08    0.08
LogReg                0.11       0.45    0.17    0.12       0.29    0.17
LogReg+sw (0.3/0.4)   0.18       0.08    0.11    0.33       0.02    0.03
SVM                   0.10       0.52    0.16    0.12       0.33    0.17
SVM+sw (0.3/0.4)      0.19       0.13    0.15    0.24       0.04    0.07
GMM                   0.03       0.28    0.05    0.07       0.73    0.13

Table 4.4: Precision, Recall and F1 evaluation for binary classification on the JHP sample. The best metrics in a column are bold faced.

In the test condition with extended annotation, sample weighting leads to worse performance than equal sample weighting (no sample weights). SVM achieves a higher F1 score than the other classifiers using both the select+ and full feature sets in Table 4.5. While extending the annotation has a tendency to decrease recall on training and development data, a general increase in recall and a slight reduction in precision are observed for the full feature set. The same influence of the extended annotation on performance can be seen in the select+ conditions.

                              Full                       Select+
Classifier            Precision  Recall  F1      Precision  Recall  F1
NB                    0.08       0.31    0.13    0.08       0.08    0.08
LogReg                0.10       0.59    0.17    0.09       0.31    0.14
LogReg (0.3/0.5)      0.16       0.09    0.11    0.18       0.02    0.03
SVM                   0.10       0.65    0.17    0.10       0.32    0.15
SVM (0.3/0.5)         0.14       0.13    0.13    0.19       0.09    0.12
GMM                   0.07       0.72    0.13    0.07       0.73    0.13

Table 4.5: Precision, Recall and F1 evaluation for binary classification on the JHP sample with extended annotation. The highest F1 scores are in bold face.

The effect of sample weighting is stable across annotation and feature set conditions: sample weights optimised for F1 on development data have an adverse effect on the JHP evaluation.