Data, Evaluation and Tools
To evaluate the results of the different approaches we investigate, we need to decide what evaluation metrics to use. In this section we will describe the evaluation metrics used in the following sections.
In the introduction we mentioned that if the purpose of the created treebanks is to help human annotators create a hand-aligned treebank, then the ideal metric would measure how much time a human annotator needs to correct the errors made by the automatic method. This is not a realistic measure, because it would require human annotation every time we evaluate the output of a system. Instead, some kind of edit distance can be used, under the assumption that it reflects the ideal measure. An even simpler approach is to count the errors in the output of the system, as these are what the annotators need to address. We choose metrics based on the number of errors, as this is simple and allows us to use the standard metrics from parsing and alignment, which are themselves based on the number of errors in the output.
In dependency parsing the standard metrics are the following:
Labeled Attachment Score (LAS): The percentage of tokens that have the correct head and the correct label.
Unlabeled Attachment Score (UAS): The percentage of tokens that have the correct head.
Labeled Accuracy Score (LA): The percentage of tokens that have the correct label.
Often only non-punctuation tokens are included in the evaluation. This is the case in the CoNLL shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007), and as we use the evaluation script² from CoNLL-07 we also exclude punctuation in the evaluation.
¹ The exact snapshots used for the experiments are available by contacting the author.
Another commonly used metric is Exact Match. This is the percentage of sentences that are parsed completely correctly. We do not use this metric here.
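The attachment scores defined above can be sketched in a few lines of Python. This is only an illustration, not the CoNLL-07 evaluation script: the token layout `(form, head, label)`, the function names, and the punctuation test (every character of the form is Unicode punctuation) are assumptions made for the example.

```python
import unicodedata
from typing import List, Tuple

# Hypothetical token layout for illustration: (form, head index, dependency label).
Token = Tuple[str, int, str]


def is_punctuation(form: str) -> bool:
    """Assumed criterion: a non-empty form made up only of Unicode punctuation."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold: List[Token], pred: List[Token],
                      skip_punct: bool = True) -> Tuple[float, float, float]:
    """Return (LAS, UAS, LA) as percentages over the scored tokens."""
    scored = las = uas = la = 0
    for (form, g_head, g_label), (_, p_head, p_label) in zip(gold, pred):
        if skip_punct and is_punctuation(form):
            continue  # punctuation tokens are left out of the evaluation
        scored += 1
        head_ok = g_head == p_head
        label_ok = g_label == p_label
        uas += head_ok
        la += label_ok
        las += head_ok and label_ok
    if scored == 0:
        return 0.0, 0.0, 0.0
    return (100.0 * las / scored, 100.0 * uas / scored, 100.0 * la / scored)


gold = [("She", 2, "nsubj"), ("sleeps", 0, "root"), (".", 2, "punct")]
pred = [("She", 2, "obj"), ("sleeps", 0, "root"), (".", 1, "punct")]
las, uas, la = attachment_scores(gold, pred)
print(f"LAS={las:.1f} UAS={uas:.1f} LA={la:.1f}")
```

With punctuation excluded, the wrongly attached full stop does not count against the parser; passing `skip_punct=False` scores all three tokens.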
In the word alignment literature there is often a distinction between sure and possible links, where the latter are more questionable links. Some aligners also include this distinction in their output, but not all. The metrics used in alignment are precision, recall and AER. If S are the sure links in the gold standard, P the possible (and sure) links, and A the links in the alignment being evaluated, the metrics are defined as follows (Och and Ney, 2003):
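\[
\text{Precision} = \frac{|P \cap A|}{|A|}, \qquad
\text{Recall} = \frac{|S \cap A|}{|S|}, \qquad
\text{AER} = 1 - \frac{|P \cap A| + |S \cap A|}{|A| + |S|}
\]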
It is important to note that the set P includes both the possible and the sure links. The idea is that you will get rewarded for having correct possible links but not punished for having missed possible links.
If the distinction between possible and sure alignments is dropped, the metrics become standard precision and recall, and AER is equal to 1 − F1-score, where the F1-score is the harmonic mean of precision and recall.
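The alignment metrics can be illustrated with a short Python sketch. It follows the definitions of Och and Ney (2003) given above; the representation of a link as a pair of token positions and the function name are assumptions made for the example.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # assumed representation: (source position, target position)


def alignment_scores(sure: Set[Link], possible: Set[Link],
                     predicted: Set[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, AER) for a predicted alignment A."""
    p = possible | sure  # P contains the sure links as well as the possible ones
    a_and_p = len(predicted & p)
    a_and_s = len(predicted & sure)
    precision = a_and_p / len(predicted) if predicted else 0.0
    recall = a_and_s / len(sure) if sure else 0.0
    denom = len(predicted) + len(sure)
    aer = 1.0 - (a_and_p + a_and_s) / denom if denom else 0.0
    return precision, recall, aer


sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
predicted = {(0, 0), (2, 1), (2, 2)}
precision, recall, aer = alignment_scores(sure, possible, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} AER={aer:.3f}")

# Without possible links, AER reduces to 1 - F1.
p2, r2, aer2 = alignment_scores(sure, set(), {(0, 0), (1, 1), (2, 2)})
f1 = 2 * p2 * r2 / (p2 + r2)
print(abs(aer2 - (1 - f1)) < 1e-12)
```

In the first call the correct possible link (2, 1) raises precision, while the missed sure link (1, 1) lowers recall; no possible link is ever counted as missed.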
We will use AER to evaluate alignments. We will also report precision and recall on both sure and possible links; that is, we report standard precision and recall on each of these sets, not the combined precision and recall defined above.³
3.2.3 Joint Parsing and Alignment
In most experiments we will simply report both parsing metrics for both languages and AER for the alignment. In some cases, we will also report a joint metric for the whole task of joint parsing and alignment. In most other work on creating parallel treebanks, phrase-structure-based parsing is used. For this, F1-scores are often used, and so it is straightforward to use F1-scores for the whole structure, i.e. the two trees and the alignment.
³ This is what is reported by the wa_eval_align.pl script from the shared task in the ACL 2005 Workshop on Building and Using Parallel Texts. The script is available from http://www.cse.unt.edu/~rada/wpt/code/wa_check_align.pl
F1-score is not used in dependency parsing, but this is simply because the single-head requirement implies that recall is equal to precision: the number of edges the parser suggests will always be equal to the number of edges in the gold standard. Therefore we could use F1-score over the entire structure as well. We will do almost this. We will use a weighted average of UAS for the two sentences and 1 − AER for the alignment, i.e. the parallel treebank score (PTS) will be:
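\[
\mathrm{PTS}_{\alpha_a,\alpha_b,\alpha_{ab}} = \frac{\alpha_a \cdot \mathrm{UAS}_a + \alpha_b \cdot \mathrm{UAS}_b + \alpha_{ab} \cdot (1 - \mathrm{AER})}{\alpha_a + \alpha_b + \alpha_{ab}}
\]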
We do this to retain the sure/possible distinction and the exclusion of punctuation tokens in the parsing evaluation. We use UAS instead of LAS because we are generally more interested in the structure of the parses than in the labels. We use αa = αb = αab = 1/3, but this can be changed if one of the parts is to be weighted higher than the others.
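As a small illustration, the parallel treebank score can be computed as below. The function name and parameter names are hypothetical, and the sketch assumes that UAS is given as a fraction in [0, 1] so that it is on the same scale as 1 − AER.

```python
def parallel_treebank_score(uas_a: float, uas_b: float, aer: float,
                            w_a: float = 1 / 3, w_b: float = 1 / 3,
                            w_ab: float = 1 / 3) -> float:
    """Weighted average of UAS for both languages and 1 - AER.

    Assumes uas_a, uas_b and aer are fractions in [0, 1].
    """
    total = w_a + w_b + w_ab
    if total <= 0:
        raise ValueError("the weights must sum to a positive value")
    return (w_a * uas_a + w_b * uas_b + w_ab * (1.0 - aer)) / total


# Equal weights, as used in this section.
print(round(parallel_treebank_score(0.9, 0.8, 0.25), 4))

# Weighting the parse of the first language twice as high as the other parts.
print(round(parallel_treebank_score(0.9, 0.8, 0.25, w_a=2, w_b=1, w_ab=1), 4))
```

Because the sum is divided by the total weight, the weights do not need to sum to one.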
3.2.4 Significance Tests
We test the statistical significance of the results from different approaches in all experiments. For parsing, we use McNemar's test, which we run with MaltEval (Nilsson and Nivre, 2008). For word alignments we use Dan Bikel's compare.pl script⁴. The test uses a type of stratified shuffling. We adapt the script to word alignments, and test only on sure links.
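To make the parsing test concrete, the following is a minimal sketch of McNemar's test in its exact (binomial) form, applied to per-token correctness of two parsers. It is not MaltEval's implementation, and whether MaltEval uses the exact or the chi-square form is not stated here; the function name and the input format (one boolean per token and system) are assumptions made for the example.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value from paired per-token correctness."""
    # Only discordant pairs matter: tokens that exactly one system gets right.
    only_a = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant pair favours either system
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# System A is right on 8 tokens that B gets wrong; B is right on 1 that A gets wrong.
a = [True] * 8 + [False] + [True] * 5
b = [False] * 8 + [True] + [True] * 5
print(mcnemar_exact(a, b))
```

Tokens on which both systems agree do not influence the p-value, which is why the five tokens both systems get right in the example have no effect.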
Unless otherwise stated we assume that results are significant if p <
If we compare more than two systems, we use cross-tables to report results from significance tests. If we compare only two systems, we use † to mark significance.