
Stød detection

4.1 Data

The JHP sample from Chapter 3 is modified and reused in this chapter. The recording is 1 minute 38.54 seconds long. The annotation starts 16.67 seconds into the recording and ends at 91.95 seconds. In the annotation study, the initial and final silence count as only two items, but when the audio is sampled at a 10 ms rate, they contribute a large number of silent samples and skew the test data. The unannotated parts of the JHP sample are therefore discarded, based on the annotation start and end times.
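As an illustration, the trimming step can be expressed in a few lines of Python; the file names and the use of the soundfile package are assumptions for this sketch, not part of the original setup.

    import soundfile as sf  # assumption: audio is read with the soundfile package

    ANNOTATION_START = 16.67  # seconds into the recording
    ANNOTATION_END = 91.95    # seconds into the recording

    def trim_to_annotation(wav_in, wav_out, start=ANNOTATION_START, end=ANNOTATION_END):
        """Discard the unannotated head and tail of the JHP sample."""
        audio, sr = sf.read(wav_in)
        sf.write(wav_out, audio[int(round(start * sr)):int(round(end * sr))], sr)

    # e.g. trim_to_annotation("jhp_sample.wav", "jhp_sample_trimmed.wav")  # hypothetical file names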

Stød support in the JHP sample, i.e. the number of 10 ms samples labelled with stød, ranges from 339 to 554 (depending on the annotator) out of a total of 7535 samples. Thus 4.5-7.4% of the data is labelled as stød-bearing. This is a very small number of samples, and because statistical analysis relies on the law of large numbers, more data is necessary to apply statistical analyses to the acoustic features. Data from three corpora will therefore be used in the experiments in this chapter.

The JHP sample will serve as a test set, while data from DanPASS and DK-Parole will serve as training data. The training data is chosen because it has been manually annotated by phonetic experts and it is the only one of its kind available.

4.1.1 Danish Phonetically Annotated Spontaneous Speech corpus (DanPASS)

DanPASS (Grønnum, 2006, 2009) consists of monologues and dialogues of unscripted speech. Only the monologues are used in this experiment; they were collected during three separate tasks: two description tasks and a map task.

The first task is a description task. The speaker is presented with a network of geometric shapes and asked to describe the network. The task was designed to reveal whether the speakers look ahead and signal utterance boundaries using prosodic information prior to the boundary.

The second task is a map task where the speaker guides the experimenter through 4 different routes on a city map.

In the last task, the speaker is given a model of a house and the individual building blocks of the house. The speaker describes how to assemble the blocks to resemble the house.

The monologues were recorded in 1996 using a Sennheiser ME64 microphone under lab conditions and were later digitised at a 48 kHz sampling rate. The recorded speech is one-way communication: the experimenter offered no feedback once the instructions were given.

The group of speakers consisted of 13 men and 5 women aged 20-68. They all originate from the Greater Copenhagen area and have no known language deficiencies. The monologues total 2 hours and 51 minutes of speech, 1075 word forms and 21170 running words.

The DanPASS annotation includes orthography, detailed and simplified part-of-speech tags, and semi-fine IPA annotation at the word and syllable levels. The phonetic annotation was carried out separately by two annotators using Praat; in all, three pairs of annotators were involved. For each file and speaker, the annotations were compared, and in cases where the annotators disagreed, Grønnum served as arbiter.

An overall good agreement between annotators is cited as an indication of the validity of the phonetic annotation. With regard to the reliability of the stød annotation, there is an overlap between the annotators used in DanPASS and the JHP sample.

Because only the monologues are used, the DanPASS sub-corpus will be referred to as DanPASS-mono in subsequent chapters.

4.1.2 DK-Parole

DK-Parole (Henrichsen, 2007) contains text from newspapers and recordings of read-aloud speech from a single male speaker. As in DanPASS and the JHP sample, all annotations are stored in Praat TextGrids. The annotation uses time-coded X-SAMPA transcription and is not as fine-grained as the DanPASS annotation, because the granularity of the phonetic transcription is at the word level. The transcription is manual, and there is an overlap with the annotators from the JHP sample.

The audio was recorded in 2006 and 2008 at Copenhagen Business School. The speaker was situated in a lab and the recordings are without noise. DK-Parole is much larger than DanPASS-mono, approximately 17 hours. To balance the need for additional data against the risk of letting a single male speaker dominate the data, a sub-corpus of 48 min. was selected randomly from DK-Parole.

For simplicity, the 48 min DK-Parole sub-corpus will be referred to as Parole48 henceforth.

4.1.3 Phonetic alignment

All phonetic annotations are mapped into IPA and represented in UTF-8 character encoding, and a boolean indicator variable is extracted from the phonetic annotation to indicate the presence of stød in a sample. Unlike the phonetic transcription in the JHP sample, stress is annotated as a diacritic on phones in DanPASS-mono and Parole48. To make the phonetic symbols as comparable as possible, stress annotation is removed from all phonetic annotations. DanPASS-mono and Parole48 also lack a phone-level alignment, as shown in Figure 4.1. Creating a phone-level alignment manually was not feasible, so the transcriptions in Parole48 and DanPASS-mono are segmented automatically.
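A minimal sketch of this label preprocessing is given below; the concrete stress and stød symbols are placeholders rather than the actual DanPASS/Parole conventions, which are not reproduced here.

    import unicodedata

    STRESS_MARKS = {"ˈ", "ˌ"}   # placeholder primary/secondary stress symbols
    STOD_MARK = "ʔ"             # placeholder stød symbol (assumption)

    def normalise_label(ipa_label):
        """Return a stress-free label and a boolean stød indicator for one sample."""
        ipa_label = unicodedata.normalize("NFC", ipa_label)
        has_stoed = STOD_MARK in ipa_label
        stripped = "".join(ch for ch in ipa_label if ch not in STRESS_MARKS)
        return stripped, has_stoed

    # e.g. normalise_label("ˈkɛlɐʔ") -> ("kɛlɐʔ", True)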

In ASR, the alignment between transcription and sound is computed using a two-step iterative machine learning algorithm. The method is known as embedded training with the Expectation-Maximisation algorithm and computes an alignment and a segmentation at the same time. In the first step, an equidistant segmentation of the speech data is assumed for each recording and aligned to the transcription symbols.

From this alignment, a simple model for each annotation symbol is estimated (the Maximisation step), and using these models a new segmentation is computed and aligned to the transcription symbols (the Expectation step). The algorithm continues until a fixed number of iterations has been completed or convergence is reached, i.e. until there is little or no difference in the segmentation/alignment between iterations.
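Embedded training is not used in this thesis (see below), but the loop can be illustrated with a toy sketch in which the per-symbol 'models' are simple mean feature vectors and re-segmentation is a monotonic dynamic-programming assignment; real embedded training uses HMMs and proper acoustic models instead.

    import numpy as np

    def flat_start(n_frames, n_symbols):
        """Equidistant initial segmentation: split the frames evenly over the symbols."""
        bounds = np.linspace(0, n_frames, n_symbols + 1).astype(int)
        labels = np.empty(n_frames, dtype=int)
        for i in range(n_symbols):
            labels[bounds[i]:bounds[i + 1]] = i
        return labels

    def estimate_models(features, labels, n_symbols):
        """Toy 'model' per symbol: the mean feature vector of the frames assigned to it."""
        return np.stack([features[labels == i].mean(axis=0) for i in range(n_symbols)])

    def realign(features, models):
        """Monotonic re-segmentation: reassign frames to symbols, in order, so that the
        total squared distance to the symbol means is minimal (simple dynamic programme)."""
        n, k = len(features), len(models)
        cost = ((features[:, None, :] - models[None, :, :]) ** 2).sum(axis=2)
        dp = np.full((n, k), np.inf)
        back = np.zeros((n, k), dtype=int)
        dp[0, 0] = cost[0, 0]
        for t in range(1, n):
            for j in range(k):
                stay = dp[t - 1, j]
                step = dp[t - 1, j - 1] if j > 0 else np.inf
                if step < stay:
                    dp[t, j], back[t, j] = step + cost[t, j], j - 1
                else:
                    dp[t, j], back[t, j] = stay + cost[t, j], j
        labels, j = np.empty(n, dtype=int), k - 1
        for t in range(n - 1, -1, -1):
            labels[t] = j
            j = back[t, j]
        return labels

    def embedded_training(features, n_symbols, max_iter=10):
        """Alternate model estimation and re-segmentation until the alignment stabilises
        (assumes features is a 2-D array with at least n_symbols frames)."""
        labels = flat_start(len(features), n_symbols)
        for _ in range(max_iter):
            models = estimate_models(features, labels, n_symbols)
            new_labels = realign(features, models)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return models, labels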

Embedded training is scale-dependent; the data in this study (3 h 39 min in total) cannot be considered large scale, and the representation of individual labels is very small. Instead, a heuristic segmentation approach similar to the first step in embedded training was employed to segment and align the data.

Figure 4.1: Phonetic transcription of mulighederne from Parole48 (a) and of forneden from DanPASS-mono (b). The segmentation is above segment level and sometimes above syllable level, e.g. [heD!C].

The heuristic approach applied here uses the following steps:

1. Divide a transcription D, e.g. ["kElA:n̩] (DA: kælderen, EN: basement), into phones d_1, d_2, ..., d_I, i.e. ["k E l A: n̩] (I = 5).

2. Detect whether d_i carries a duration annotation, such as [A:].

3. Weight the segments d_1, d_2, ..., d_I according to duration: w(d_i) = 2 for all phones, unless the phone is suffixed with [:], in which case w(d_i) = 3.

4. Divide the transcription duration D_T by the sum of the segment weights:

\[
dt_i = \frac{D_T}{\sum_{i=1}^{I} w(d_i)}
\]

where w(d_i) is the duration weight of the segment at index i, 2 dt_i is the estimated duration of a regular segment, and 3 dt_i is the estimated duration of a long vowel, e.g. [A:].

If the pronunciation of our example, ["kElA:n̩], takes 650 ms, the duration of a regular phone is estimated to be 2 dt_i = 118.18 ms and the duration of the long vowel to be 3 dt_i = 177.27 ms.

The heuristic relies on the existing time-coded transcription to extract the duration of words or syllables and uses the syllable and word boundaries to guide the segmentation. The quality depends on the manual annotation of the time codes and on the original annotation level, i.e. word-to-phone segmentation is likely to be less accurate than syllable-to-phone segmentation. We have applied the heuristic alignment to map the word-level and syllable-level alignments to a phonetic alignment.
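A minimal sketch of the heuristic in Python; the function name and the input format (a list of phone strings plus word or syllable boundaries in seconds) are illustrative choices, not the implementation used in the thesis.

    def heuristic_segmentation(phones, start, end):
        """Distribute the duration of a time-coded word or syllable over its phones.
        Regular phones get weight 2; phones suffixed with ':' (long) get weight 3."""
        weights = [3 if p.endswith(":") else 2 for p in phones]
        unit = (end - start) / sum(weights)          # dt_i in the text
        segments, t = [], start
        for phone, w in zip(phones, weights):
            segments.append((phone, t, t + w * unit))
            t += w * unit
        return segments

    # Worked example from the text: ["k E l A: n] spoken over 650 ms.
    # Sum of weights = 2 + 2 + 2 + 3 + 2 = 11, so dt_i = 650/11 ≈ 59.09 ms;
    # a regular phone gets 2 dt_i ≈ 118.18 ms and the long vowel 3 dt_i ≈ 177.27 ms.
    print(heuristic_segmentation(["k", "E", "l", "A:", "n"], 0.0, 0.650))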

4.1.4 Feature extraction

All features described here are extracted using short-term acoustic analysis (see Section 2.4). We use three different software toolkits – Praat, Covarep and Kaldi – to extract the features described in Chapter 2, because not all features can be extracted with a single toolkit. We use a sample shift of 10 ms because the application scenario is ASR, where 10 ms seems to be a de facto standard². The size of the context window used depends on the feature.

Amplitude and harmonics-to-noise ratio are extracted using the To Harmonicity (cc)... function in Praat. The function outputs one measurement for amplitude and an n-best list of harmonics-to-noise ratio measurements. The most likely hypothesis is chosen as the harmonics-to-noise ratio for experimentation.
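The thesis does not state how Praat was scripted; one possible way to invoke the same Praat command from Python is via the parselmouth wrapper, sketched below with illustrative parameter values (time step, minimum pitch, silence threshold, periods per window) that may differ from the settings actually used.

    import parselmouth
    from parselmouth.praat import call

    snd = parselmouth.Sound("jhp_sample_trimmed.wav")  # hypothetical file name
    # "To Harmonicity (cc)..." arguments: time step (s), minimum pitch (Hz),
    # silence threshold, periods per window.
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    # Query the HNR (in dB) at a given time, e.g. 0.5 s into the recording.
    hnr_at_half_second = call(harmonicity, "Get value at time", 0.5, "cubic")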

24 MFCC features, 8 glottal flow parameters and 38 phase features³ are extracted using Covarep. Covarep is a repository of speech analysis tools implemented in Matlab/Octave. Degottex et al. (2014) created the repository to share implementations of complex methods for speech analysis, such as phase processing, glottal flow parametrisation and pitch tracking, with other researchers and to make it easier to reproduce research results.

39 PLP, 3 probability-of-voicing, 3 Pitch and 3 ΔPitch⁴ features are extracted using Kaldi (Povey et al., 2011). The aims of the Kaldi project are similar in many respects to those of Covarep. The Kaldi pitch tracker (Ghahremani et al., 2014) implements a version of the Robust Algorithm for Pitch Tracking (Talkin, 1995)⁵. The main difference is that the algorithm does not make binary voicing decisions for a frame but assigns a probability, and in unvoiced regions pitch values are interpolated in a straight line from adjacent frames.

4.1.4.1 Feature preprocessing

As a first step, the audio is filtered with a low-pass Hann filter that removes frequencies above 1 kHz. The cut-off was chosen manually by the author by listening to the DanPASS monologues, so that stød remains maximally audible while high frequencies are removed. All features mentioned in Section 2.4 are then extracted from the filtered audio.
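The exact filter implementation is not specified in the text; as a rough stand-in, the step can be sketched with a Hann-windowed FIR low-pass filter in Python (Praat-style Hann band filtering operates in the frequency domain, so this is an approximation, not the identical operation).

    import soundfile as sf
    from scipy.signal import firwin, filtfilt

    def lowpass_1khz(wav_in, wav_out, cutoff=1000.0, numtaps=1025):
        """Remove frequencies above 1 kHz with a Hann-windowed FIR low-pass filter."""
        audio, sr = sf.read(wav_in)
        if audio.ndim > 1:                       # mix down to mono if necessary
            audio = audio.mean(axis=1)
        taps = firwin(numtaps, cutoff, window="hann", fs=sr)
        sf.write(wav_out, filtfilt(taps, [1.0], audio), sr)

    # e.g. lowpass_1khz("jhp_sample_trimmed.wav", "jhp_sample_lp1k.wav")  # hypothetical names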

² See the CMU Sphinx FAQ: http://www.speech.cs.cmu.edu/sphinxman/FAQ.html. All English Kaldi recipes also use a 10 ms sampling shift.

³ 25 PDM and 13 PDD measurements.

⁴ First and second order derivatives are included.

⁵ Before calculating pitch values, the Kaldi pitch tracker also low-pass filters the audio at 1 kHz.

Harmonics-to-noise ratio is undefined in non-harmonic regions of speech. Praat assigns a value of -200 to these regions, which is problematic for the estimation of means and variances needed by machine learning algorithms that assume a Gaussian distribution of the data. To alleviate the problem, we compute the minimum harmonics-to-noise ratio value HNR_min over the harmonic regions of speech in the training data, i.e. where HNR ≠ -200. HNR_lowbound is then HNR_min rounded down to the nearest 10:

\[
\mathit{HNR}_{\mathrm{lowbound}} = \left\lfloor \frac{\mathit{HNR}_{\min}}{10} \right\rfloor \times 10 \tag{4.1}
\]

All samples with a harmonics-to-noise ratio value of -200 are reset to HNR_lowbound. Test data is normalised using the HNR_lowbound calculated on the training data. If this resetting is not done, the subsequent scaling is meaningless, as the value -200 is arbitrary and was chosen by the developers of Praat.
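A compact sketch of this reset, assuming the HNR values are held in NumPy arrays (variable and function names are illustrative):

    import numpy as np

    PRAAT_UNDEFINED = -200.0  # value Praat assigns to non-harmonic frames

    def hnr_lowbound(train_hnr):
        """Lowest defined HNR in the training data, rounded down to the nearest 10 (Eq. 4.1)."""
        defined = train_hnr[train_hnr != PRAAT_UNDEFINED]
        return np.floor(defined.min() / 10.0) * 10.0

    def reset_undefined(hnr, lowbound):
        """Replace Praat's -200 placeholder with the training-set lower bound."""
        return np.where(hnr == PRAAT_UNDEFINED, lowbound, hnr)

    # lb = hnr_lowbound(train_hnr)
    # train_hnr, test_hnr = reset_undefined(train_hnr, lb), reset_undefined(test_hnr, lb)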

Subsequently, the acoustic features are standardised. Standardising features is a prerequisite for many machine learning classification methods such as GMMs or SVMs. A standard approach is to apply mean subtraction or centering followed by feature scaling. In the first step, we subtract a mean value calculated on the training data to 'center' the data. We then divide each feature by that feature's standard deviation to scale the variance across features. By standardising the parameter scales, the information contribution of a feature with a range of e.g. [-0.5, 0.5] (Peak Slope) is not overshadowed by a feature with a range of [0, 440] (Pitch).

The aim is to make the data resemble a Gaussian with zero mean and unit variance, because classifiers may not perform as expected unless the data is properly standardised. However, the mean and standard deviation can be computed on different bases. The most common examples are per utterance, recording session, speaker or corpus. In ASR, Cepstral Mean and Variance Normalisation subtracts a mean estimated per utterance and is designed to reduce channel noise, whereas Vocal Tract Length Normalisation estimates its normalisation per speaker.

We experimented with speaker-, corpus- and gender-based means but did not observe any change in performance. A simple global cross-corpus feature standardisation is computed on the training set and applied to the acoustic features of both the training and test set.
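A sketch of this global standardisation, fitted on the training set only and then applied to both sets; scikit-learn's StandardScaler would give an equivalent result.

    import numpy as np

    def fit_standardiser(train_features):
        """Per-dimension mean and standard deviation, estimated on the training set only."""
        mean = train_features.mean(axis=0)
        std = train_features.std(axis=0)
        std[std == 0.0] = 1.0  # guard against constant features
        return mean, std

    def standardise(features, mean, std):
        """Center and scale so each feature has roughly zero mean and unit variance."""
        return (features - mean) / std

    # mean, std = fit_standardiser(train_X)
    # train_X, test_X = standardise(train_X, mean, std), standardise(test_X, mean, std)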