Analysis - Stød detection - Danish Stød and Automatic Speech Recognition

Stød detection

5.3 Experiment

5.3.5 Analysis

The MFCC+stød tri3b system is significantly slower than other MFCC-based tri3b systems, but the difference between the MFCC baseline and MFCC+stød+pitch is not significant.

The tri4a MFCC+stød+pitch system is significantly faster than all other tri4a systems, with the exception of the MFCC baseline. The PLP+stød+pitch system is significantly faster than both the baseline and PLP+stød (p = 0.025).

There is no significant difference between the MFCC-based tri4b systems. The tri4b PLP+stød+pitch system is significantly faster than the MFCC+stød system, but not significantly faster than the PLP baseline.

The PLP+stød+pitch nnet5c system is significantly faster than other PLP-based nnet5c systems (p < 0.001), while there is no significant change between the MFCC+stød+pitch and MFCC+stød systems.

Figure 5.5: The impact of modelling stød in the phonetic dictionary and adding pitch-related features on the real-time factor. Adding stød increases the factor, but adding pitch-related features compensates.
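The real-time factor compared above is conventionally the ratio of decoding time to audio duration, so values below 1 mean faster than real time. A minimal sketch under that assumption follows; the function name and the timings are hypothetical and are not taken from the experiments.

```python
def real_time_factor(decode_seconds: float, audio_seconds: float) -> float:
    """Return decoding time divided by audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return decode_seconds / audio_seconds


# Hypothetical timings: 30 s of decoding for 60 s of audio.
print(real_time_factor(30.0, 60.0))
```

A system with a lower factor on the same audio is the faster one, which is the sense of "faster" and "slower" used in the comparisons above.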

between the MFCC baseline and MFCC+stød conditions goes from not significant (tri2a) to significant (tri2b) and to highly significant (tri3b) with the change in feature type.

The correlation between WER improvement and feature type can be observed in Tables 5.14 and 5.15 and suggests that features which represent a wide acoustic context are better at modelling stød.

GMM AM complexity

Table 5.17 gives an indication of the change in the AM as a result of adding stød in tri4b systems.

When adding stød, the number of transition-states increases by approximately 11500-12200 states and 22000-25000 more transitions need to be trained, while the number of Gaussians remains almost the same. The number of estimated probability distribution functions (PDFs) remains stable across baseline and stød-informed systems and is not impacted by the increase in transition-states and transition-ids. Stød therefore increases the complexity of the AM and the number of transitions, but does not increase the descriptive power of the AM in terms of estimated PDFs.

Though the model becomes more complex, the increase in performance from the MFCC baseline to MFCC+stød is significant at p = 0.001. The increase from the PLP baseline to PLP+stød is significant at p = 0.0241 and supports the conjecture that explicit stød modelling in the PM is beneficial in Danish ASR.

tri4b        Phones   PDFs   Transition-states   Transition-ids   Gaussians
PLP          325      3834   31965               63970            60102
MFCC         325      3834   31555               63150            60097
PLP+stød     453      3848   44182               88404            60088
MFCC+stød    453      3761   43039               86118            60113

Table 5.17: AM statistics for tri4b systems. When estimating the PDT, the maximum number of leaves was specified as 4800 and the maximum number of Gaussians was 60000. Phones refers to the number of word-position dependent phones and includes 5 silence phones.
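The change caused by adding stød can be read directly off Table 5.17 by subtracting each baseline row from its stød-informed counterpart. The sketch below does only that arithmetic; the dictionary layout and the function name are illustrative choices, and the numbers are copied from the table.

```python
# AM statistics copied from Table 5.17 (tri4b systems).
FIELDS = ("phones", "pdfs", "transition_states", "transition_ids", "gaussians")
AM_STATS = {
    "PLP": (325, 3834, 31965, 63970, 60102),
    "MFCC": (325, 3834, 31555, 63150, 60097),
    "PLP+stød": (453, 3848, 44182, 88404, 60088),
    "MFCC+stød": (453, 3761, 43039, 86118, 60113),
}


def stod_delta(feature: str) -> dict:
    """Difference between the stød-informed system and its baseline."""
    base = AM_STATS[feature]
    stod = AM_STATS[feature + "+stød"]
    return {name: s - b for name, b, s in zip(FIELDS, base, stod)}


for feature in ("PLP", "MFCC"):
    print(feature, stod_delta(feature))
```

The differences show 128 additional phones in both cases, a five-digit growth in transition-states and transition-ids, and changes of fewer than 100 in PDFs and Gaussians.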

Stød independence

We have now determined that stød has an impact on the AM, but whether the AM actually models stød-bearing phones separately has not been confirmed. If the AM models stød-bearing phones separately, we can observe this in system-specific equivalence classes.

For the tri4b MFCC+stød system, there are 165 equivalence classes of word-position dependent phones, of which 59 contain stød-bearing phones. Of these, 43 are independent equivalence classes such as [’?e_B, ?e_B], [’?y_I, ?y_I] or [?o_E, ’?o_E], whose members are word-position dependent phones that we have forced to become phone aliases. Some independent equivalence classes contain phones from different word positions, such as [’?y_B, ’?y_S, ?y_S, ?y_B] and [?A_E, ?A_I, ’?A_I, ’?A_E, ?A_S, ’?A_S]. Table 5.18 shows the statistics of all stød-informed systems, and the independent and mixed equivalence classes can be seen in Appendix B.4.

Experimental condition   Classes   Independent   Mixed
PLP+stød                 167       45            15
MFCC+stød                165       43            16
PLP+stød+pitch           151       34            24
MFCC+stød+pitch          171       43            19

Table 5.18: Stød equivalence classes for tri4b systems. Independent classes contain only stød-bearing phones and mixed classes contain both stød-less and stød-bearing phones. All phones are word-position dependent and silence phones are not included.
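The distinction between independent and mixed classes can be sketched as a simple check over the phone symbols in a class. The sketch assumes that a "?" in the symbol marks a stød-bearing phone, as the pair [?m_E, m_E] suggests; the function names are hypothetical and this is not the procedure used to produce Table 5.18.

```python
def is_stod_bearing(phone: str) -> bool:
    # Assumption: "?" in the phone symbol marks stød, e.g. "?m_E" vs "m_E".
    return "?" in phone


def classify(equivalence_class: list) -> str:
    """Label a class as 'independent', 'mixed' or 'stød-less'."""
    if not equivalence_class:
        raise ValueError("equivalence class must not be empty")
    flags = [is_stod_bearing(phone) for phone in equivalence_class]
    if all(flags):
        return "independent"  # only stød-bearing phones
    if any(flags):
        return "mixed"        # stød-bearing and stød-less phones together
    return "stød-less"


print(classify(["’?e_B", "?e_B"]))  # class quoted in the text
print(classify(["?m_E", "m_E"]))    # class quoted in the text
```

Counting the labels over all classes of a system would give the Independent and Mixed columns, with the remaining classes containing no stød-bearing phones.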

Irrespective of the experimental condition, independent equivalence classes outnumber mixed classes. The PLP+stød+pitch system has fewer independent classes and more mixed classes out of fewer total classes, but also features much larger equivalence classes, e.g. [?W+_S, ?W_E, ’?W_S, ?W+_E, ’?W_E, ’?W+_S, ’?W+_E, ?W_S]. Appendix B.4 also shows the equivalence classes for nnet5c systems, where only the MFCC+stød nnet5c system differs by having an extra mixed class ([?m_E, m_E]).

Because independent classes contain phones from different word positions, merging word-position dependent phones in some cases decreases the likelihood of the data less than merging stød-bearing and stød-less phones. This indicates that the distinction between stød-bearing and stød-less phones is sometimes more important than word position.

5.3.5.2 Recognition Errors

The top 10 confusion pairs, substitutions, deletions and insertions for the MFCC+stød and MFCC+stød+pitch nnet5c systems evaluated on Stasjon06 are displayed in Appendix B.5. For both systems, the most frequent recognition errors involve short function words, which is common in large-vocabulary ASR systems. The top 6 confusion pairs are phonetically similar, such as [de/di], [u/o], [i/e] and [Ob@-n/Obn].

The most common recognition error is deletion, insertion or substitution of the word punktum (EN: period). punktum is not part of any frequent confusion pair, but is included in 337 low-frequency confusion pairs, which are displayed in Appendix B.5.3. The words often confused with punktum bear no phonetic resemblance to punktum or to each other and can be nouns, verbs, function words, named entities, etc. No pattern is discernible from the confusion pairs, but manual investigation of the alignment suggests the problem is inconsistent transcription in the training and test data.

The text preprocessing converts sentence-final punctuation to spoken form because punctuation is usually dictated. However, the dictation turns out to be inconsistent, i.e. there is a period at the end of a sentence whether it is spoken aloud or not, and recognition errors such as the one in Figure 5.6 are frequent in the evaluation.

id: (46-r6110007-379)
Scores: (#C #S #D #I) 6 0 1 0
REF: arrangér alle markerede afsnit efter længde PUNKTUM
HYP: arrangér alle markerede afsnit efter længde *******
Eval: D

Figure 5.6: Recognition error report from sctk featuring punktum (EN: Sort all highlighted paragraphs by length). Capitalised words are erroneous and the error type D stands for deletion.

The ASR output (HYP) does not end in punktum because it is not spoken in the audio. Unfortunately, the reference contains the word and a deletion is counted towards the final WER. This would explain the high number of errors, the inconsistent pattern in the confusion pairs in Appendix B.5.3 and the fact that the error occurs across ASR systems with different feature and stød combinations.
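To make the cost of such a deletion concrete, the counts in the Scores line of Figure 5.6 can be turned into an utterance-level word error rate with the standard formula (S + D + I) / N, where N = C + S + D is the number of reference words. The function below is an illustrative sketch, not part of sctk.

```python
def word_error_rate(correct: int, substitutions: int,
                    deletions: int, insertions: int) -> float:
    """WER = (S + D + I) / N, with N = C + S + D reference words."""
    reference_length = correct + substitutions + deletions
    if reference_length == 0:
        raise ValueError("reference must contain at least one word")
    return (substitutions + deletions + insertions) / reference_length


# Counts from the Scores line of Figure 5.6: #C=6, #S=0, #D=1, #I=0.
print(f"{100 * word_error_rate(6, 0, 1, 0):.1f}%")  # 14.3%
```

A single unspoken punktum in the reference therefore adds one error in seven words for this utterance, even though every spoken word was recognised correctly.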

The large number of insertions seems to arise primarily when the ASR system tries to decode named entities or repetition utterances. Named entities are e.g. names of towns or a first name and surname without an utterance-final period, and repetition utterances consist of the same word repeated three times.

If the decoding fails, as in the second and fourth example in Appendix B.5.3.2, the LM frequently inserts punktum before predicting the end of the utterance. We conjecture that the many occurrences of sentence-final punktum in the LM training data lead to over-generation.

This problem seems to be specific to the data set. We use the transcripts from the training data to estimate the language model, rather than a much larger newswire corpus, because they fit the domain, which includes utterances that consist only of repetitions and named entities. However, the transcription does not always faithfully reflect the utterance and the text preprocessing cannot take this into account.
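The mismatch can be illustrated with a minimal sketch of a text-only preprocessing rule. The function name and the mapping are hypothetical and only cover the period, the one case described above; the actual preprocessing used for the experiments may differ.

```python
# Hypothetical mapping; only the period is attested in the text above.
SPOKEN_FORMS = {".": "punktum"}


def verbalise_final_punctuation(sentence: str) -> str:
    """Replace sentence-final punctuation with its spoken form."""
    stripped = sentence.rstrip()
    if stripped and stripped[-1] in SPOKEN_FORMS:
        body = stripped[:-1].rstrip()
        return (body + " " + SPOKEN_FORMS[stripped[-1]]).strip()
    return stripped


print(verbalise_final_punctuation("arrangér alle markerede afsnit efter længde."))
```

Because the rule sees only the written transcript, it appends punktum whenever the transcript ends in a period, regardless of whether the speaker dictated it.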

In document Danish Stød and Automatic Speech Recognition (pages 136-140)