
3 Data description

3.4 Annotation

3.4.1 Inter-annotator agreement

The annotation of linguistic structures above the sentence is by no means an easy task. During my time as an annotator in the Copenhagen Dependency Treebanks from 2008 to 2010, I experienced how annotating low-level features such as syntactic structures required much less effort than annotating high-level features such as discourse structure; some of the struggles are documented in Buch-Kromann (2010). The same problems were experienced when the RST Treebank was created. Carlson et al. (2003, p. 103) note that:

Developing corpora with these kinds of rich annotation is a labour-intensive effort. Building the RST Corpus involved more than a dozen people on a full or part time basis over a one-year time frame (Jan-Dec 2000). Annotation of a single document could take anywhere from 30 minutes to several hours, depending on the length and topic. Re-tagging a large number of documents after major enhancements to the annotation guidelines was also time consuming.

One of the main reasons for this is that syntactic roles such as subject and object are much better defined, recognised and accounted for in the literature than rhetorical relations. The extra effort needed in annotating higher-level features consists not only in interpreting which relations hold between the various units of a text but also in obtaining a certain degree of tagging consistency, often referred to as inter-annotator agreement6 in cases where more than one annotator is involved (e.g. Cook & Bildhauer, 2011; Marcu et al., 1999).

The purpose of computing inter-annotator agreement scores is to demonstrate that the annotation guidelines can be understood and applied by people other than those who developed the coding schemes (in my case, the schemes include the EDU segmentation principles, RST trees and the inventory of RST relations). In addition, inter-annotator agreement is computed in order to ensure reproducibility of annotations. In one of the most frequently cited review articles on inter-annotator agreement, the goals of agreement studies are summarised as follows:

Researchers who wish to use hand-coded data—that is, data in which items are labelled with categories, whether to support an empirical claim or to develop and test a computational model—need to show that such data are reliable. The fundamental assumption behind the methodologies discussed in this article is that data are reliable if coders can be shown to agree on the categories assigned to units to an extent determined by the purposes of the study […]. If different coders produce consistently similar results, then we can infer that they have internalised a similar understanding of the annotation guidelines, and we can expect them to perform consistently under this understanding. Reliability is thus a prerequisite for demonstrating the validity of the coding scheme—that is, to show that the coding scheme captures the “truth” of the phenomenon being studied, in case this matters:

If the annotators are not consistent then either some of them are wrong or else the annotation scheme is inappropriate for the data. (Just as in real life, the fact that witnesses to an event disagree with each other makes it difficult for third parties to know what actually happened.) However, it is important to keep in mind that achieving good agreement cannot ensure validity: Two observers of the same event may well share the same prejudice while still being objectively wrong.

(Artstein & Poesio, 2008, pp. 556–557)

6 Elsewhere in the literature, inter-annotator agreement is also referred to as inter-coder agreement (e.g. Artstein & Poesio, 2008).

The literature on computing agreement scores proposes a number of ways to calculate them, depending on the nature of the study and on the methods and theories employed. For the analysis of RST trees as applied in this thesis, Marcu et al. (1999, p. 52) propose mapping the hierarchical RST structures onto sets of units labelled with categorical judgments, whose agreement is then measured with Cohen’s (1960) Kappa coefficient. This is because decisions at one level of the discourse tree affect decisions at other levels, which means that the levels are not independent of each other (Van der Vliet & Redeker, forthcoming (b)). The parameters and categorical judgments considered are listed below; the Kappa coefficient itself is sketched after the list:

- EDU segmentation (categories: yes or no)

- Spans (categories: yes or no)

- Nuclearity (categories: nucleus, satellite or none)

- Relation labelling (categories: the 32 different RST relations; see Appendix B)
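For reference, the standard formulation of Cohen’s (1960) Kappa coefficient is given below; it corrects the agreement observed between the two annotators for the agreement that would be expected by chance:

\[
\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad P_e = \sum_{c} p_1(c)\, p_2(c),
\]

where \(P_o\) is the proportion of units on which the two annotators assign the same category, and \(P_e\) is the chance agreement estimated from each annotator’s marginal distribution \(p_i(c)\) over the categories \(c\) of a given parameter (e.g. nucleus, satellite or none for nuclearity).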

For the present study, approximately five per cent of the 150 texts have been selected for computing agreement scores. The fellow annotator chosen was an experienced annotator from the Copenhagen Dependency Treebanks Project (CDT), who was familiar with the RST annotation style, the RST relations and the theory in general. However, as the other annotator had done most of his annotations with the CDT relation inventory, which is quite different from the RST inventory (see Buch-Kromann, Gylling, Jelsbech Knudsen, Korzen, & Müller, 2010), I arranged a meeting prior to the annotation task, during which the most important aspects of RST annotation and relations were presented and discussed. Moreover, I selected a number of troublesome cases from the corpus, which I anticipated the fellow annotator would also find challenging. After the meeting, the annotator was given seven texts to annotate following my guidelines. During the annotation process, a number of clarifying questions were asked about the EDU segmentation principles and the relation inventory, which I answered as generically as possible without examining the specific cases in question.

Table 3.3 shows the overall agreement scores obtained with the method proposed by Marcu et al. (1999). As can be seen from the table, there are acceptable levels of agreement on all four parameters, ranging from the highest agreement of K=0.95 on EDU segmentation to the lowest of K=0.63 on relation labelling. Landis & Koch (1977) regard Kappa values between 0.61 and 0.80 as ‘substantial’ results, and Kappa values between 0.81 and 1.00 (the maximum) as ‘almost perfect’. See also Artstein & Poesio (2008) and Marcu et al. (1999) for further information on how Kappa values and inter-annotator agreement scores are computed in general; a minimal illustration of the computation is given after the table.

Agreement on           Kappa values
EDU segmentation       0.95
Spans                  0.85
Nuclearity             0.80
Relation labelling     0.63

Table 3.3: Inter-annotator agreement
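By way of illustration, the short Python sketch below shows how a Kappa value like those in Table 3.3 can be computed from two annotators’ parallel categorical judgments. It is a minimal sketch only, not the actual procedure or data used in this study; the function name cohens_kappa and the example nuclearity labels are invented for the illustration.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators' parallel categorical judgments."""
    assert len(labels_a) == len(labels_b), "annotations must cover the same units"
    n = len(labels_a)

    # Observed agreement: proportion of units with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal distributions per category.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in freq_a.keys() | freq_b.keys())

    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical nuclearity judgments for ten units (categories: nucleus, satellite, none).
annotator_1 = ["nucleus", "satellite", "nucleus", "none", "satellite",
               "nucleus", "nucleus", "satellite", "none", "nucleus"]
annotator_2 = ["nucleus", "satellite", "satellite", "none", "satellite",
               "nucleus", "nucleus", "nucleus", "none", "nucleus"]

print(f"Kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")

The same function can be applied to each of the four parameters in turn, once the hierarchical annotations have been mapped onto unit-level judgments in the manner described by Marcu et al. (1999).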

Although Carletta (1996) suggests that Kappa values between 0.67 and 0.8 allow only ‘tentative conclusions’, the agreement of 0.63 obtained on relation labelling must be seen in light of the results obtained by others. Considering the scale and the limits of a one-man project, cf. also the discussion in Section 3.2, I regard the obtained agreement numbers as satisfactory. In addition, the results are very much in line with previous studies (Buch-Kromann et al., 2010; Marcu et al., 1999; Van der Vliet, Berzlánovich, Bouma, Egg, & Redeker, 2011; Wolf & Gibson, 2005), in which the Kappa values for relation labelling are markedly lower than those for the other parameters, such as EDU segmentation. Artstein & Poesio (2008, p. 580) also note that:

The analysis of discourse structure—and especially the identification of discourse segments [EDU segmentation]—is the type of annotation that, more than any other, led C[omputational] L[inguistics] researchers to look for ways of measuring reliability and agreement, as it made them aware of the extent of disagreement on even quite simple judgments … Subsequent research identified a number of issues with discourse structure annotation, above all the fact that segmentation, though problematic, is still much easier than marking more complex aspects of discourse structure, such as identifying the most important segments or the “rhetorical” relations between segments of different granularity. As a result, many efforts to annotate discourse structure concentrate only on segmentation.

In general, acceptable agreement scores are important because the coding or annotation of data plays a crucial role in the analysis of a study. In the present cross-linguistic study, however, what is perhaps more important is that the annotations have been carried out in exactly the same way across the three languages under investigation. This allows us to conduct analyses of the annotations in Danish, English and Italian without having to worry about different annotation preferences or divergent readings of the guidelines. Moreover, as I myself have annotated all 150 texts in all three languages, it has been possible for me to continuously adapt the segmentation principles to language-specific constructions and to compare textualisations of rhetorical relations in one language with textualisations in the other two languages by comparing both L1 with L1 texts and L1 with L2 texts.

Lastly, it must be mentioned that the increased focus on inter-annotator agreement scores has been criticised by a number of scholars. Buch-Kromann (2010, p. 9) warns scholars building treebanks, that is, annotated corpora, against excluding detailed descriptions of various linguistic phenomena merely to achieve high agreement scores:

Measuring Treebank quality is probably one of the hardest and most important outstanding problems in the field, and any research that can address these problems even tentatively should be encouraged by the field. […] [M]ore importantly, if used as a proxy for annotation quality by treebank designers and reviewers, an exaggerated focus on agreement may lead to distortions in the way treebanks are designed.

In the same way, Reidsma & Carletta (2008) argue that even Kappa measures above 0.8 are no guarantee that the results are actually reliable. Instead of concentrating purely on the numbers, scholars should look for patterns in the disagreement among annotators and assess what impact they will have; in an RST context, the scholars behind the MTO Corpus (Van der Vliet & Redeker, forthcoming (b)) propose a reconciliation of different annotations as a possible solution. Table 3.4 gives examples of each parameter considered in Table 3.3 where there was disagreement between the fellow annotator and me. In the first example, on EDU segmentation, the disagreement relates to whether the interrogative sentence should be segmented into one or three EDUs. In the second example, we can see how differently the first three sentences of a text can be annotated in terms of spans, nuclearity and relation labelling, simply because the relation between two EDUs (#2+#4/6) has been interpreted differently (Concession versus Contrast).

EDU segmentation

My annotation:
[Hvilke regler gælder,] [og hvilke rettigheder har vi,] [hvis noget går galt?]
('What rules apply, and what rights do we have if something goes wrong?') <ep-99-05-06.txt:42>

Fellow annotator:
[Hvilke regler gælder, og hvilke rettigheder har vi, hvis noget går galt?]

Spans, nuclearity and relation labelling

My annotation:
[Mr President, International Consumers' Day was celebrated in March, with the theme of electronic commerce and consumer protection.] [Commerce on the Internet does not in itself raise new problems of consumer policy,] [but [since we are talking about a new medium,] there is a need to establish security and confidence.] [Commerce on digital networks should be at least as secure and safe as commerce in the physical world.] <ep-99-05-06.txt:42>

[RST tree diagrams for my annotation and the fellow annotator's annotation of the Danish original are not reproduced here; the relations involved include Background, Elaboration, Same-unit, Nonvolitional-cause and Concession.]

Table 3.4: Disagreement examples

The question to be raised here is really whether a large number of texts can be annotated by two persons in exactly the same manner. Admittedly, some interpretations are more correct than others, but since annotators are given the extremely difficult task of inferring what type of rhetorical relation the writer intended to express between two units, I believe that a Kappa agreement score of 1.00 on relation labelling would itself be scientifically suspect, simply because texts are ambiguous. Or, as argued by Buch-Kromann (2010, p. 10) on the basis of the experience from the annotations in CDT:

The experience of the CDT annotators, and many others in the field, is that semantic distinctions are really hard to make, and that disagreements are often caused by truly ambiguous texts where the two differing analyses either lead to essentially the same meaning, or the context does not contain sufficient information about the speaker’s true intentions. But that does not necessarily imply that the distinction does not encode important information, it is just noisy information.

That being said, the agreement numbers of my annotations are acceptable, and no further reservations will be made in this regard.