

3.2.3 Analysis

Majority voting The definition of majority in the majority voting scheme has a large impact on the agreement measure, as can be seen in Table 3.1. Agreement on the number of stød annotations fluctuates between 78% (N_agree >= 2) and 55% (N_agree >= 4) of the 78 possible stød assignments (according to the Any majority). More informative statistics are necessary to evaluate the quality of the transcriptions.

Majority     Any   Tie   Majority   All   Total #labels
#stød         78    61         50    43             995

Table 3.1: Number of stød annotations using different majority definitions: Any = 1 annotator annotates for stød, Tie = 2 annotators agree on stød, Majority = 3 annotators agree, All = 4 annotators agree.
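To make the four majority definitions concrete, the following is a minimal sketch of the counting, assuming that an item counts as a stød item when at least N_agree annotators assigned a label containing a stød marker; the marker character and the example labels are illustrative assumptions, not the thesis's actual data format.

```python
# Minimal sketch of the majority-voting counts in Table 3.1 (assumptions:
# stød is identifiable by a marker character inside the label string, and
# each item comes with one label per annotator).
STOD_MARKER = "?"   # assumed stød marker in the transcription labels

def count_stod_items(items, n_agree):
    """items: iterable of per-item label lists (one label per annotator)."""
    return sum(
        1 for labels in items
        if sum(STOD_MARKER in label for label in labels) >= n_agree
    )

example_items = [
    ["æ:?", "æ:?", "æ:?", "a"],   # three annotators mark stød
    ["o:", "o:", "o:?", "o:"],    # one annotator marks stød
]
for name, n in [("Any", 1), ("Tie", 2), ("Majority", 3), ("All", 4)]:
    print(name, count_stod_items(example_items, n))
# -> Any 2, Tie 1, Majority 1, All 0
```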

Inter-annotator agreement Tables 3.2 and 3.3 are pairwise Cohen's κ confusion matrices.

When evaluating on the full label set, there are 178 observed labels and 995 items. All pairwise comparisons as well as the per-annotator average κ are above the lower bound of 0.6 for adequate annotation.

The mean agreement², 0.74, is also significantly above the lower bound, with a standard deviation of 0.02.

² Average of Avg. κ.

          IPA1   IPA2   IPA3   IPA4
#labels    107     94     99    107
IPA1      1.00   0.69   0.74   0.74
IPA2      0.69   1.00   0.75   0.76
IPA3      0.74   0.75   1.00   0.78
IPA4      0.74   0.76   0.78   1.00
Avg.      0.72   0.74   0.76   0.76

Table 3.2: Inter-annotator agreement confusion matrix calculated with Cohen's κ on the JHP sample. The basis for this matrix is all observed labels. #labels = the number of labels used by that annotator.

Only IPA1 is more than one standard deviation below the mean. This is high agreement considering that any single annotator uses at most 60% (107/178 = 0.601) of the full label set and that the annotators have only 55 labels in common.
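For concreteness, the pairwise scores and per-annotator averages in Table 3.2 could be computed along the lines of the following sketch, which uses scikit-learn's cohen_kappa_score on toy label sequences; the data structure and the toy labels are assumptions, not the actual JHP sample.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Toy label sequences stand in for the 995 aligned items of the JHP sample.
annotations = {
    "IPA1": ["æ:?", "b", "s", "d", "æ:?", "n"],
    "IPA2": ["æ:?", "b", "s", "t", "æ:?", "n"],
    "IPA3": ["æ:?", "p", "s", "d", "a",   "n"],
    "IPA4": ["æ:?", "b", "s", "d", "æ:?", "m"],
}

# Pairwise Cohen's kappa over the full observed label set.
pairwise = {
    (a, b): cohen_kappa_score(annotations[a], annotations[b])
    for a, b in combinations(sorted(annotations), 2)
}
for (a, b), k in pairwise.items():
    print(f"{a} vs {b}: {k:.2f}")

# Per-annotator average kappa (the Avg. row in Table 3.2).
for a in sorted(annotations):
    ks = [k for pair, k in pairwise.items() if a in pair]
    print(a, f"avg = {sum(ks) / len(ks):.2f}")
```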

The reason for this high agreement can be seen in Figure 3.2. The values on the x-axis are raw agreement counts per item: the count is 0 if the annotator assigned a label that none of the other annotators used for that item, and 3 if all annotators assigned the same label. The bar plots differ from the confusion matrix in that they compare one annotator to all other annotators per item, rather than making a global, pairwise comparison.

The graphs suggest that a subset of labels is highly prevalent and that the annotators agree on these labels. In such a scenario, it is likely that the disagreements are few and not systematic. The plots for IPA2, IPA3 and IPA4 all show a Zipfian tendency. We assume that the label distribution depends on the word distribution, i.e. the distribution of labels will change if the word distribution changes, but it does not follow that the label distribution should be Zipfian. It seems reasonable to attribute the difference to pronunciation variation and inter-annotator disagreement.

The distribution for IPA1 is not Zipfian. IPA1 assigns labels different from those of the other annotators for approximately 50 items. This is reflected in the κ statistics in Table 3.2, where IPA1 receives the lowest pairwise and average agreement scores.

To investigate the assumption that annotators agree on a small subset of highly frequent labels, the label frequency histogram is plotted in Figure 3.3a. Indeed, there is a small number of prevalent labels and, as can be seen from Figure 3.3b, the annotators agree to a high extent on this small subset.

Figure 3.2: Pairwise label agreement by annotator.

The only differences between the plots in Figure 3.3 are that [a] is missing from 3.3b, [n?] is missing from 3.3a, and the ordering of the labels and label pairs differs.

Another result that can be extracted from the plots in Figure 3.3 is that the annotators only agree on one or two labels containing stød. The raw agreement counts for stød in 3.3b are 36 and 29. Note that the method of counting used here and the one in Table 3.1 are not directly comparable: majority is defined using >=.

As a result, an item with 3 identical labels would count as 1 in majority voting using N_agree >= 2, but as two agreement pairs, because annotator A1 used the same label as A2 and the same label as A3. This difference is important in order not to be deceived into believing that the two highly frequent pairs make up 90% of the stød agreement in Table 3.1. Applying majority counting to the stød labels in Figure 3.3b gives agreement counts of 6 and 11.
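The difference between the two counting schemes can be made explicit with a small worked sketch; the single item and its labels below are hypothetical.

```python
# One hypothetical item on which three of four annotators chose the same
# stød-bearing label. Under majority voting with N_agree >= 2 it contributes
# 1 to the stød count, while from IPA1's point of view it contributes two
# raw agreement pairs (same label as IPA2 and as IPA3).
item = {"IPA1": "æ:?", "IPA2": "æ:?", "IPA3": "æ:?", "IPA4": "a"}
STOD_MARKER = "?"   # assumed stød marker

def majority_vote(labels, n_agree):
    """1 if at least n_agree annotators assigned a stød label, else 0."""
    return int(sum(STOD_MARKER in l for l in labels.values()) >= n_agree)

def raw_agreement(labels, annotator):
    """Number of other annotators that used the same label (0-3)."""
    return sum(
        1 for other, label in labels.items()
        if other != annotator and label == labels[annotator]
    )

print(majority_vote(item, n_agree=2))   # 1
print(raw_agreement(item, "IPA1"))      # 2
```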

Binary labelling While it is a positive indication that two labels containing stød are among the 30 most frequently used, it is not a strong enough indicator that stød annotation is reliable. To investigate this further, the agreement on a binary label set that only considers stød is studied next. This filtering is motivated by cases where annotators disagree only on the segment but not on the stød annotation, as is also the case in Figure 3.1.

(a) Label frequency (b) Label pair agreement frequency

Figure 3.3: Histograms of raw label frequency (a) and raw label pair agreement frequency (b) across all annotators. Two assignments of the same label to an item count as one agreement pair. The histograms display only the 30 most frequent labels.

The label sequences IPA1-segment, IPA2-segment, IPA3-segment and IPA4-segment are binarised, e.g. [æ:?], [D?], [D:G?] and [?n] in Figure 3.1 become 1 and the remaining labels become 0.
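A minimal sketch of this binarisation step, assuming stød can be identified by a marker character in the segment labels (the marker and the example tier are illustrative, not the actual contents of Figure 3.1):

```python
STOD_MARKER = "?"   # assumed marker for stød in the segment labels

def binarise(segment_labels):
    """Map a tier of segment labels to binary stød labels (1 = stød)."""
    return [int(STOD_MARKER in label) for label in segment_labels]

# Illustrative segment tier for one word:
print(binarise(["b", "e", "s", "d", "y:?", "5", "l", "s", "?n"]))
# -> [0, 0, 0, 0, 1, 0, 0, 0, 1]
```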

A side effect of this filtering is a skewed data set. The non-stød class represents 92-96% of the assigned labels³, which would not produce meaningful statistics for interpreting the reliability of stød annotation.

Again, the chance agreement correction of κ becomes important for the trustworthiness of the statistical analysis.
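Why the chance correction matters on such a skewed set can be illustrated with a small sketch: two annotators who almost never mark stød reach very high raw agreement while κ stays low. The sequences below are fabricated solely for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# 100 items, ~95% non-stød; the two annotators disagree on most of the
# few stød items, yet raw agreement remains high.
a1 = [0] * 95 + [1, 1, 0, 0, 0]
a2 = [0] * 95 + [1, 0, 1, 0, 0]

raw = sum(x == y for x, y in zip(a1, a2)) / len(a1)
print(f"raw agreement = {raw:.2f}")                 # 0.98
print(f"kappa = {cohen_kappa_score(a1, a2):.2f}")   # ~0.49
```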

The agreement scores for the binary label set in Table 3.3 are even higher than the κ-scores in Table 3.2.

This is expected, as a labelling task with two labels is easier than a task with 178 (observed) labels, even though prosodic annotation is a very difficult task. The average agreement is at least 0.80 for all annotators and the mean agreement is 0.82. The standard deviation is 0.02 and again IPA1 is more than one standard deviation below the mean.

Figure 3.4 illustrates the background for the κ-scores. To reduce skewness, the analysis ignores items that have not been labelled with stød by any annotator, so the agreement statistics for the remaining items focus only on the agreement of stød annotation. As expected from Table 3.1, the annotators agree completely on 43 label assignments. There is a low number of midrange disagreements, and 20-27 label assignments per annotator where they do not agree with any other annotator.

Error analysis The high number of disagreements contradicts the hypothesis that stød annotation is reliable and warrants manual investigation. I discovered that off-by-one errors in the alignment are frequent.

In 10 cases, the assignment of stød labels is off by one.

³ By majority counting.

         IPA1   IPA2   IPA3   IPA4
#stød      53     58     62     59
IPA1     1.00   0.78   0.84   0.85
IPA2     0.78   1.00   0.75   0.88
IPA3     0.84   0.75   1.00   0.83
IPA4     0.85   0.88   0.83   1.00
Avg      0.82   0.80   0.81   0.85

Table 3.3: Inter-annotator agreement confusion matrix calculated with Cohen's κ on the JHP sample. The basis for this matrix is binary +/- stød labels. #stød = the number of stød annotations made by that annotator.

In Figure 3.1, an alignment error is visible in the labelling of the second-to-last interval in tier IPA3-segment ([?n̩]). It is an error because stød is prefixed to the phone [n̩] rather than suffixed to it. This is not a phone or segment according to any definition of IPA known to the author and not a well-formed label. As reflected in the transcription of the entire word in tier IPA3, stød should have been affixed to the previous phone.

Additional examples where stød labels are off by one can be seen in Figure 3.5. The discrepancy can be caused by genuine disagreement or by different interpretations of the semi-fine IPA annotation. In 3.5a, the annotators disagree on whether to label the sound they heard as [æ:?] or [æI?]. Similarly, in 3.5b the annotators disagree on [o:?] vs. [o5?]. In 3.5c, the disagreement also stems from whether to use [y:], the long version of the vowel, or a combination of [y] and [5̯].

Figure 3.4: Pairwise stød label agreement per item by annotator.

While IPA1 and IPA3 most often use two labels with stød on the final sonorant instead of a long vowel, IPA2 and IPA4 prefer to assign a long vowel with stød, though on one occasion IPA2 opts for two labels.

(a) EN: High school (b) EN: Think that (c) EN: The board

Figure 3.5: Examples of off-by-one alignment errors.

As mentioned in Section 2.1.1, prosodic features affect not only the segment they are affixed to but also the phonetic context. The scope of prosodic features is in fact the syllable rather than the segment, and it is possible that stød influences both segments. For statistical purposes, however, such a discrepancy will give misleading results, and the notational variation must be addressed.

In all off-by-one examples found by the author, the discrepancy was caused either by an alignment error, as in Figure 3.1, or by the annotator's label choice for the vowel in a syllable. Irrespective of the annotators' label choices, stød is annotated on the nucleus of the same syllable.
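The correction principle, comparing stød at the syllable level rather than at the segment level, can be sketched as follows; the segment-to-syllable mapping and the toy tiers are assumptions for illustration.

```python
STOD_MARKER = "?"   # assumed stød marker

def stod_by_syllable(segment_labels, syllable_ids):
    """Collapse segment-level stød marks to the syllable level, so that
    off-by-one placements within the same syllable no longer differ."""
    syllables = {}
    for label, syl in zip(segment_labels, syllable_ids):
        syllables[syl] = syllables.get(syl, False) or STOD_MARKER in label
    return syllables

# Two hypothetical annotators place stød on different segments of the
# same (second) syllable; at the syllable level they agree.
syl_ids = [0, 0, 1, 1, 1]
ann_a   = ["b", "e", "s", "d", "y:?"]
ann_b   = ["b", "e", "s", "d?", "y:"]
print(stod_by_syllable(ann_a, syl_ids) == stod_by_syllable(ann_b, syl_ids))  # True
```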

Correction We created a modified version of the stød annotation in which alignment errors were corrected and cases where stød was annotated on the same syllable, as in Figure 3.5, were aligned to each other.

These modifications covered all 10 cases found during manual inspection. The corrected pairwise stød label agreement is shown in Figure 3.6.

The correction paints a different picture than Figure 3.4. We observe a marked reduction in disagreements, i.e. cases where an annotator labelled an item with stød and no other annotator did, and an expected increase in the total number of agreement pairs. This is a clear indication that stød annotation is reliable.

Competence This is further indicated by the annotator competence statistics in Table 3.4. It is also clear that the competence statistics for the original label set and for the corrected stød labelling are correlated.

Competence_phone uses the observed phonetic annotation as labels and Competence_stød refers to the binary labelling corrected for alignment errors.

Annotator   #labels   Competence_phone   #stød   Competence_stød
IPA1            107              0.760      53             0.770
IPA2             99              0.813      58             0.840
IPA3             94              0.823      62             0.894
IPA4            107              0.833      59             0.856

Table 3.4: Annotator statistics on the JHP sample computed with item-response models trained with MACE.
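MACE itself is a standalone tool; the following is only a simplified sketch of the item-response model behind it, fitted with plain maximum-likelihood EM and without MACE's priors or random restarts, so its estimates would not match Table 3.4 exactly. The data layout and all names are assumptions.

```python
import numpy as np

def mace_style_em(annotations, n_labels, n_iters=100, seed=0):
    """Simplified EM for a MACE-style item-response model.

    annotations: dict mapping (item, annotator) -> label index.
    Returns per-annotator competence estimates, i.e. the estimated
    probability of copying the latent true label rather than
    'spamming' from an annotator-specific label distribution.
    """
    rng = np.random.default_rng(seed)
    items = sorted({i for i, _ in annotations})
    coders = sorted({j for _, j in annotations})
    item_idx = {i: k for k, i in enumerate(items)}
    coder_idx = {j: k for k, j in enumerate(coders)}

    theta = rng.uniform(0.4, 0.9, size=len(coders))         # competence
    xi = np.full((len(coders), n_labels), 1.0 / n_labels)   # spamming dist.

    for _ in range(n_iters):
        # E-step: posterior over each item's true label.
        post = np.ones((len(items), n_labels))
        for (i, j), a in annotations.items():
            ii, jj = item_idx[i], coder_idx[j]
            lik = np.full(n_labels, (1 - theta[jj]) * xi[jj, a])
            lik[a] += theta[jj]                # faithful copy of the true label
            post[ii] *= lik
        post /= post.sum(axis=1, keepdims=True)

        # M-step: expected counts of faithful vs. spammed annotations.
        faithful = np.zeros(len(coders))
        totals = np.zeros(len(coders))
        spam = np.full((len(coders), n_labels), 1e-6)
        for (i, j), a in annotations.items():
            ii, jj = item_idx[i], coder_idx[j]
            p_copy = theta[jj] / (theta[jj] + (1 - theta[jj]) * xi[jj, a])
            e_copy = post[ii, a] * p_copy      # P(faithful | data) for this cell
            faithful[jj] += e_copy
            totals[jj] += 1
            spam[jj, a] += 1 - e_copy
        theta = faithful / totals
        xi = spam / spam.sum(axis=1, keepdims=True)

    return dict(zip(coders, theta))
```

Run on the four annotators' corrected binary stød labels, the θ estimates would play the role of the competence columns in Table 3.4, up to the modelling differences noted above.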