
4.3 Detection experiments

4.3.3 Binary classification experiment

In the binary stød detection experiments, performance of the classifiers will be evaluated on the development set according to recall, precision and F1-score. Precision is measured as

\[
\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \tag{4.4}
\]

For a perfect classifier, precision (and recall) becomes 1. As the proportion of False positives increases, precision decreases. In the context of the experiments in this chapter, precision can be described as the ability of a classifier not to classify stød-less samples as stød-bearing.

Recall is measured as

\[
\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \tag{4.5}
\]

Recall decreases as the proportion of False negatives increases. Recall is interpreted as the ability of a classifier to label all samples in the data annotated with stød as stød-bearing.

F1 is the harmonic mean of precision and recall where recall and precision both have equal weight. The parameter set and sample weight that optimises F1 on the development set is used to train the classifier on all the training data before evaluating on the test data.
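The three metrics above can be sketched as follows. This is an illustrative stand-alone implementation of Equations 4.4 and 4.5 and the harmonic mean, not the evaluation code used in the experiments; the toy labels at the bottom are invented for the example.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = stød-bearing)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall (equal weights).
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: 1 = stød-bearing, 0 = stød-less.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

Note that, because F1 is a harmonic mean, a classifier cannot compensate for very low precision with high recall (or vice versa), which is visible in the sample-weighted results below.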

Results

To distinguish the different data sets used, experiments using the original data set will be referred to as raw, while experiments using the extended annotation will be denoted as extended. In addition, the feature set names full, select+ and PCA will indicate the acoustic features used in the experiment.

The classification results on development data with classifiers trained on the full feature set are displayed in Table 4.1.

Grid search finds regularisation values of 0.001 and smaller (SVM and LogReg), indicating that the best performance is obtained with sparse statistical models. This suggests that some features are superfluous for the classification task, i.e. the coefficients for some features become zero and do not inform the classification.

F1 is maximised using a sample weight of 0.3 for the stød-bearing class in the raw condition. When the annotation is extended, we see an increase in F1 for all classifiers with the exception of GMM. The optimal sample weight found in the extended condition for both classifiers is 0.4, which is higher than the sample weight found in the raw condition.
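The selection procedure described above can be sketched as a simple loop: every combination of regularisation strength and stød-class sample weight is fit on the training data and scored by F1 on the development set, and the best pair is kept for the final refit. The `fit_classifier` and `f1_on_dev` functions below are hypothetical stand-ins for the real estimators (LogReg, SVM) and evaluation code; the toy scoring function is invented so the example runs.

```python
from itertools import product

def select_hyperparameters(fit_classifier, f1_on_dev, cs, stoed_weights):
    """Return the (C, sample weight) pair that maximises dev-set F1."""
    best_f1, best_params = -1.0, None
    for c, w in product(cs, stoed_weights):
        model = fit_classifier(c=c, stoed_weight=w)
        f1 = f1_on_dev(model)
        if f1 > best_f1:
            best_f1, best_params = f1, (c, w)
    return best_params, best_f1

# Toy stand-ins: pretend dev-set F1 peaks at C=0.001, weight=0.3,
# mimicking the raw-condition result reported above.
def fit_classifier(c, stoed_weight):
    return (c, stoed_weight)

def f1_on_dev(model):
    c, w = model
    return 0.30 - abs(w - 0.3) - (0.1 if c != 0.001 else 0.0)

params, f1 = select_hyperparameters(
    fit_classifier, f1_on_dev,
    cs=[0.001, 0.01, 0.1, 1.0],
    stoed_weights=[0.2, 0.3, 0.4, 0.5])
print(params, round(f1, 2))
```

The winning parameter pair is then used to retrain on all training data before the test-set evaluation, as described above.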

                      Raw                        Extended
Classifier    Precision  Recall  F1      Precision  Recall  F1
NB            0.10       0.69    0.18    0.14       0.69    0.24
GMM           0.08       0.94    0.14    0.02       0.09    0.03
LogReg        0.13       0.79    0.22    0.17       0.76    0.28
LogReg+sw     0.24       0.40    0.30    0.22       0.61    0.32
SVM           0.13       0.80    0.22    0.17       0.78    0.28
SVM+sw        0.27       0.36    0.31    0.21       0.60    0.32

Table 4.1: Precision, Recall and F1 for classifiers trained on the full feature set. The best metrics in a column are bold faced. LogReg and SVM were refit using sample weights 0.3/1 in the raw condition and 0.4/1 in the extended condition for the stød-bearing/stød-less classes, respectively.

4.3.3.1 Classification with feature selection

The experiments are repeated using the select+ feature set and the results are reported in Table 4.2. Apart from the feature set, the experimental setup is identical to the experiments in the section above.

                      Raw                        Extended
Classifier    Precision  Recall  F1      Precision  Recall  F1
NB            0.12       0.31    0.17    0.16       0.32    0.21
GMM           0.12       0.76    0.21    0.12       0.76    0.21
LogReg        0.14       0.73    0.23    0.15       0.68    0.24
LogReg+sw     0.18       0.50    0.27    0.19       0.43    0.26
SVM           0.14       0.74    0.23    0.14       0.74    0.23
SVM+sw        0.18       0.51    0.27    0.20       0.39    0.27

Table 4.2: Precision, Recall and F1 for classifiers trained on select+ features. The best metrics in a column are bold faced. LogReg and SVM were refit using sample weights 0.4/1 for the stød-bearing/stød-less classes in the raw condition and 0.5/1 in the extended condition.

Comparing the raw/select+ condition in Table 4.2 to the raw/full condition in Table 4.1, the GMM classifier performs better by 0.07 F1 absolute, while NB, whose recall is less than half of that in the previous experiment, performs worse by 0.01 F1. F1 scores are higher using select+ features for LogReg and SVM, but only by 0.01 F1 absolute, and an increase in precision counterbalances a lower recall. Sample weighting reverses the comparison, and the raw/full condition obtains higher F1. The effect of sample weighting is to even out the imbalance between precision and recall, but neither LogReg nor SVM outperforms the classifiers in Table 4.1.

Using the extended annotation, NB performance improves by 0.04 F1 absolute compared to the raw condition in Table 4.2. GMM results are not influenced by the change in annotation, and the SVM classifier performs identically. GMM is the only classifier that performs better using select+ features in both annotation conditions. Sample weighting improves F1 scores, but with select+ features the classifiers do not reach the same level of performance.

4.3.3.2 Classification with feature projection

Classifiers trained on the full feature set outperform classifiers trained on select+ features. Linear dimensionality reduction using PCA may be able to reduce the number of features while retaining more information than feature selection. Due to hardware limitations, batch PCA could not be applied to the full feature set, and Incremental PCA is applied instead to learn a model that projects the full feature vectors to 40 dimensions. The results can be seen in Table 4.3.

                      Train                      Development
Classifier    Precision  Recall  F1      Precision  Recall  F1
NB            0.14       0.31    0.19    0.14       0.31    0.19
GMM           0.03       0.24    0.06    0.03       0.24    0.06
LogReg        0.10       0.74    0.18    0.10       0.74    0.18
SVM           0.10       0.76    0.18    0.10       0.77    0.18

Table 4.3: Precision, Recall and F1 for classifiers trained on 40-dimensional PCA-projected training data.

This experiment only includes a raw/PCA condition. Using incremental PCA for stød detection reduces precision compared to the select+ conditions, but does not lead to an increase in recall, and results in comparatively low F1 scores on both training and development data. Incremental PCA projection to 60, 80 and 100 features has also been performed to determine whether a 40-dimensional feature space was too small to retain salient information, but PCA projection does not outperform feature selection.
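The idea behind incremental PCA when the full feature matrix does not fit in memory can be sketched as follows: the mean and covariance are accumulated batch by batch, and the top-k eigenvectors of the covariance give the projection. This is a minimal illustration of the principle (here k = 2 on toy data; the experiments used k = 40), not the actual implementation used, which would typically be a library routine such as scikit-learn's IncrementalPCA.

```python
import numpy as np

def fit_pca_in_batches(batches, n_features, k):
    """Accumulate mean/covariance over batches, return top-k projection."""
    n = 0
    s = np.zeros(n_features)                  # running sum of x
    ss = np.zeros((n_features, n_features))   # running sum of x x^T
    for batch in batches:
        batch = np.asarray(batch, dtype=float)
        n += batch.shape[0]
        s += batch.sum(axis=0)
        ss += batch.T @ batch
    mean = s / n
    cov = ss / n - np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalue order
    components = eigvecs[:, ::-1][:, :k]      # keep top-k eigenvectors
    return mean, components

def project(x, mean, components):
    """Centre the data and project onto the k principal components."""
    return (np.asarray(x, dtype=float) - mean) @ components

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
batches = [data[i:i + 50] for i in range(0, 200, 50)]
mean, comps = fit_pca_in_batches(batches, n_features=5, k=2)
reduced = project(data, mean, comps)
print(reduced.shape)  # (200, 2)
```

Only one pass over the data is needed, which is what makes the approach viable under the hardware limitations mentioned above.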

Exponential features

1st, 2nd and 3rd order exponential features were computed on select+. The best NB classifier obtained 0.18 F1 on both training and development data and the best SVM achieved 0.22 F1. Like incremental PCA projection, exponential and interaction features did not improve performance over classifiers trained on full or select+ features.
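The kind of feature expansion referred to above can be sketched as follows. The exact construction of the exponential and interaction features is assumed here: per-feature powers up to order 3 plus pairwise products, which is one common way of building such an expansion.

```python
from itertools import combinations

def expand_features(x, max_order=3):
    """Augment a feature vector with per-feature powers and
    pairwise interaction terms (hypothetical construction)."""
    expanded = list(x)
    # 2nd- and 3rd-order powers of each feature.
    for order in range(2, max_order + 1):
        expanded.extend(v ** order for v in x)
    # Pairwise interaction terms.
    expanded.extend(a * b for a, b in combinations(x, 2))
    return expanded

print(expand_features([2.0, 3.0]))  # [2.0, 3.0, 4.0, 9.0, 8.0, 27.0, 6.0]
```

The expansion grows the feature space quickly, which may explain why the sparse regularised models found in grid search did not benefit from it.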

4.3.3.3 JHP evaluation

In this section, classifiers trained on raw/full, raw/select+, extended/full and extended/select+ conditions are evaluated on the JHP data set. The annotation created by IPA3 will form the basis for evaluation.

IPA3 was chosen because the MACE evaluation ranked IPA3 as one of the two most competent annotators.

The evaluation of classifiers trained on raw annotations can be seen in Table 4.4. Compared to training and development set evaluation, a decrease in all measures is observed in Table 4.4. The spontaneous speech genre poses a difficult task for classifiers trained on data extracted from elicited speech. Using select+ features leads to generally higher precision, while classifiers trained on the full feature set favour recall. It is, however, possible to achieve the same F1 score with both feature sets.

The effect of sample weighting is a large decrease in F1 in the JHP evaluation. While the sample-weighted classifiers achieve the best precision, recall drops to between 0.02 and 0.13.

                              Full                       Select+
Classifier            Precision  Recall  F1      Precision  Recall  F1
NB                    0.09       0.28    0.13    0.08       0.08    0.08
LogReg                0.11       0.45    0.17    0.12       0.29    0.17
LogReg+sw (0.3/0.4)   0.18       0.08    0.11    0.33       0.02    0.03
SVM                   0.10       0.52    0.16    0.12       0.33    0.17
SVM+sw (0.3/0.4)      0.19       0.13    0.15    0.24       0.04    0.07
GMM                   0.03       0.28    0.05    0.07       0.73    0.13

Table 4.4: Precision, Recall and F1 evaluation for binary classification on the JHP sample. The best metrics in a column are bold faced.

In the test condition with extended annotation, sample weighting leads to worse performance than equal sample weighting (no sample weights). SVM achieves a higher F1 score than the other classifiers using both the select+ and full feature sets in Table 4.5. While extending the annotation has a tendency to decrease recall on training and development data, a general increase in recall and a slight reduction in precision are observed for the full feature set. The same influence of the extended annotation on performance can be seen in the select+ conditions.

                              Full                       Select+
Classifier            Precision  Recall  F1      Precision  Recall  F1
NB                    0.08       0.31    0.13    0.08       0.08    0.08
LogReg                0.10       0.59    0.17    0.09       0.31    0.14
LogReg (0.3/0.5)      0.16       0.09    0.11    0.18       0.02    0.03
SVM                   0.10       0.65    0.17    0.10       0.32    0.15
SVM (0.3/0.5)         0.14       0.13    0.13    0.19       0.09    0.12
GMM                   0.07       0.72    0.13    0.07       0.73    0.13

Table 4.5: Precision, Recall and F1 evaluation for binary classification on the JHP sample with extended annotation. The highest F1 scores are in bold face.

The effect of sample weighting is stable across annotation and feature set conditions: sample weights optimised for F1 on development data have an adverse effect on the JHP evaluation.