
4.5 Chapter conclusions

that signal stød can occur where stød is not perceived and there is no lexical function to fulfill. The true distribution of stød is not reflected in the binary classification experiments, but it is present in the phonetic symbols, and the de facto factorisation of stød classes into stød-bearing phones in the multi-class classification experiments is therefore beneficial.

We do not conclude that it is impossible to detect stød in audio. One avenue of research we can identify is to normalise the acoustic features based on a mean and variance estimated for each phone, e.g. estimate the mean and variance on samples labelled [a] and use them to standardise the features extracted for [aˀ], or a similar standardisation. This research is beyond the scope of this thesis because we would not be able to apply the standardisation in ASR without predicting the phone first.
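To make the idea concrete, the following is a minimal sketch of phone-conditional standardisation using NumPy. The function name, array layout and label symbols are illustrative assumptions, not part of the experiments in this thesis.

    import numpy as np

    def phone_conditional_standardise(features, phone_labels,
                                      base_phone='a', target_phone='aˀ'):
        """Standardise frames of a stød-bearing phone using statistics
        estimated on its stød-less counterpart.

        features     : (n_frames, n_dims) acoustic feature matrix
        phone_labels : length-n_frames NumPy array of phone symbols, one per frame
        """
        # Estimate mean and variance on the stød-less variant, e.g. [a]
        base = features[phone_labels == base_phone]
        mu = base.mean(axis=0)
        sigma = base.std(axis=0) + 1e-8        # guard against zero variance
        # Apply the statistics to the stød-bearing variant, e.g. [aˀ]
        target = features[phone_labels == target_phone]
        return (target - mu) / sigma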

The feature overlap between select+ and standard ASR features, the poor binary classification results and the relatively good multi-class classification results suggest that the best way to integrate stød in ASR is to extend the acoustic feature vector input rather than to add a specific feature for stød, and to model phone and stød jointly.
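As a compact illustration of what this amounts to, the sketch below extends each frame's feature vector with additional features and folds stød into the phone label so that a single model predicts phone and stød jointly. The array names, dimensionalities and the 'ˀ' label suffix are assumptions for illustration only.

    import numpy as np

    def joint_inputs_and_targets(plp, extra, phones, stoed):
        """plp    : (n_frames, d1) standard ASR features, e.g. PLP
        extra  : (n_frames, d2) additional features, e.g. pitch or voice quality
        phones : length-n_frames phone symbols
        stoed  : length-n_frames booleans marking stød-bearing frames
        """
        # Extend the acoustic input by appending the extra features to every frame.
        x = np.hstack([plp, extra])
        # Fold stød into the target label, e.g. 'a' -> 'aˀ',
        # instead of predicting it with a separate stød detector.
        y = np.array([p + 'ˀ' if s else p for p, s in zip(phones, stoed)])
        return x, y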

Chapter 5

Modelling stød in automatic speech recognition

The intended application for the stød detection experiments in Chapter 4 is automatic speech recognition (ASR). Stød has a distinguishing lexical function, and implementing and exploiting this function in ASR is the objective of the experiments reported in this chapter. In the previous chapters, we have confirmed our assumptions about stød, namely that stød annotation is reliable, that we can use stød annotation to discover features that signal the presence of stød, and that we can detect stød from acoustic features. The last assumption was only partially confirmed because it was necessary to predict phone and stød jointly.

ASR systems can model stød in the acoustic model (AM) only if the phone set includes stød-bearing phones. The studies in Chapter 4 demonstrated that a support vector machine with a radial basis function kernel trained on Perceptual Linear Prediction (PLP) features can discriminate between stød-bearing and stød-less phone variants. Using the select+ feature set improves classification accuracy on the semi-fine IPA annotation, and with a coarser-grained set of classes select+ performs well but is outperformed by both the PLP features and the full feature set. The conclusion is that stød detection is possible using standard ASR features, but can potentially be improved with voice quality features.
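This kind of classifier can be reproduced with off-the-shelf tools; the sketch below uses scikit-learn with an RBF-kernel SVM on per-segment feature vectors. The file names, hyperparameter values and data layout are illustrative assumptions, not the actual setup from Chapter 4.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # X: (n_segments, n_dims) PLP or select+ feature vectors, one per phone segment
    # y: segment labels, e.g. stød-bearing vs. stød-less (binary)
    #    or one class per phone variant (multi-class)
    X = np.load('plp_segments.npy')                       # illustrative file names
    y = np.load('segment_labels.npy', allow_pickle=True)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # RBF-kernel SVM; C and gamma would normally be tuned, e.g. by grid search
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
    clf.fit(X_train, y_train)
    print('held-out accuracy:', clf.score(X_test, y_test))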

This chapter presents the development of a baseline ASR system as well as experiments where stød is integrated into an ASR system. The purpose is to implement and exploit stød using conventional ASR tools and techniques, and the experiments entail adding stød annotation to the phonetic dictionary and extending the feature input with pitch-related features. Extending the phone set should be sufficient because the classifiers in Section 4.3.4 were able to discriminate stød-bearing and stød-less phones using standard ASR features.
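A minimal sketch of what extending the phonetic dictionary can look like, assuming a Kaldi-style lexicon with one "WORD phone phone ..." entry per line and pronunciations in which stød is marked with a 'ˀ' after the phone it attaches to. The marker, the composite symbol format ('a_stoed') and the file names are assumptions, not the format used in the recipe in Appendix B.2.1.

    STOED_MARK = 'ˀ'

    def extend_lexicon(src_path, dst_path):
        """Fold stød markers into the phone symbols, so the phone set is
        extended with stød-bearing variants instead of a separate stød feature."""
        with open(src_path, encoding='utf-8') as src, \
             open(dst_path, 'w', encoding='utf-8') as dst:
            for line in src:
                if not line.strip():
                    continue
                word, *phones = line.split()
                out = [p[:-len(STOED_MARK)] + '_stoed' if p.endswith(STOED_MARK) else p
                       for p in phones]
                dst.write(word + ' ' + ' '.join(out) + '\n')

    extend_lexicon('lexicon_with_stoed_marks.txt', 'lexicon_stoed_phones.txt')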

There is little existing publicly available research on, or resources for, Danish ASR. In a white paper on the state of Danish language technology and NLP (Pedersen et al., 2012), the quality of speech technology was not ranked due to disagreements between researchers and industry, and the availability of speech technology was ranked as poor or fragmented. Danish speech corpora were ranked as medium quality, with low coverage and maturity. The existing corpora we know of that can be used to train ASR systems are subject to access barriers: DanPASS, DK-Parole and LANCHART are not publicly available, and EUROM1 and Aurora2-3 can be acquired for a fee. The white paper concludes that support for speech technology as a whole is fragmentary.

Fortunately, ASR toolkits can be shared across natural languages, and there are open and freely available ASR toolkits such as Sphinx (Placeway et al., 1997), Kaldi (Povey et al., 2011), the Hidden Markov Model Toolkit (HTK; Young, 1993) and RASR (Rybach et al., 2011), to name a few. The toolkits are based on machine learning techniques and can therefore be trained as long as data is available, and they contain scoring software to evaluate performance.

Though DK-Parole is a single-speaker corpus and Aurora3 only contains spoken digits, ASR systems have been trained on these corpora (Henrichsen & Kirkedal, 2011; Kirkedal, 2013; Rajnoha & Pollák, 2011).

These are academic systems for restricted domains (speaker-dependent ASR and spoken digits in noise), and their Word Error Rate (WER) performance is summarised in Table 5.1.

Corpus                                      Task              %WER
DK-Parole (Henrichsen & Kirkedal, 2011)     Single speaker     5.7
Aurora3 (Rajnoha & Pollák, 2011)            Spoken digits     24.39

Table 5.1: Published ASR evaluations for spoken Danish.

The methodology, or recipe, for training these systems is unavailable, and the results may not be reproducible; reproducibility is necessary for meaningful comparison with the present work. To facilitate reproducible research, we have added the recipe developed for the experiments in this chapter in Appendix B.2.1. To develop the recipe, we first built a Danish ASR system that does not model stød, which we use as a baseline to evaluate the influence of adding stød.

We wish to experiment with both standard GMM-based ASR systems and systems that make use of AMs based on neural networks. Of the ASR toolkits mentioned above, Kaldi is distributed as open source under a permissive license, has the necessary code to train deep neural network (DNN) AMs and contains several recipes describing methodologies for training ASR systems on English, Arabic, Czech and other languages for a variety of tasks. We use recipes for similar corpora as inspiration for the baseline system.

To train ASR systems and especially DNN AMs, a large amount of data is required – more than is available in DK-Parole and DanPASS (LANCHART is sufficiently large). It turns out that a large Danish corpus exists that was not listed in Pedersen et al. (2012). The Norwegian National Library Service hosts a large public domain corpus of read-aloud speech that also contains a Danish part. The corpus, Språkbanken, is large enough that it is possible to train DNN AMs, and because the speech genre is read-aloud speech, it is easier to work with than LANCHART.

We describe the Språkbanken corpus in Section 5.1 and the development of an open source Danish ASR system using Kaldi and Språkbanken in Section 5.2. The recipe described in that section forms the basis for all subsequent experiments. Baseline evaluation and experiments with stød modelling will be reported in Section 5.3, and the results will be analysed in Section 5.3.5. Section 5.4 will discuss the insights from Section 5.3.5 on acoustic and language modelling as they relate to stød.