

Shift Push the token at the head of the buffer onto the stack.

Reduce Pop the token on the top of the stack.

Left-Arc$_l$ Add to the analysis an arc with label $l$ from the token at the head of the buffer to the token on the top of the stack, and pop the stack.

Right-Arc$_l$ Add to the analysis an arc with label $l$ from the token on the top of the stack to the token at the head of the buffer, and push the buffer-token onto the stack.
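As an illustration, the four transitions can be sketched as operations on a configuration consisting of a stack, a buffer, and a set of arcs. The representation below (word indices for tokens, (head, label, dependent) triples for arcs) is our own, and the preconditions that make the transitions sound are omitted:

```python
# Sketch of the four transitions on a configuration (stack, buffer, arcs).
# Tokens are represented by their indices; arcs are (head, label, dependent)
# triples. Preconditions (e.g. whether the stack top already has a head)
# are left out for brevity.

def shift(stack, buffer, arcs):
    # Push the token at the head of the buffer onto the stack.
    stack.append(buffer.pop(0))

def reduce_(stack, buffer, arcs):
    # Pop the token on the top of the stack.
    stack.pop()

def left_arc(stack, buffer, arcs, label):
    # Add an arc from the buffer head to the stack top, then pop the stack.
    arcs.add((buffer[0], label, stack[-1]))
    stack.pop()

def right_arc(stack, buffer, arcs, label):
    # Add an arc from the stack top to the buffer head, then push the buffer head.
    arcs.add((stack[-1], label, buffer[0]))
    stack.append(buffer.pop(0))
```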

Learning

The third part of the system is a classifier. Transition-based dependency parsing reduces parsing to a sequence of multi-class classification decisions: in each configuration, one out of a small, predefined number of transitions has to be chosen. The training instances are created by the oracle, so training is offline, and this means that essentially any classification algorithm can be plugged into the system.
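To make this concrete, the sketch below trains an off-the-shelf multi-class classifier on (configuration features, transition) pairs of the kind an oracle would produce. The feature names, the toy instances, and the choice of scikit-learn's LogisticRegression are our own assumptions, not the system described here:

```python
# Illustrative only: any multi-class classifier can be trained on
# oracle-derived instances, each mapping a parser configuration
# (described by hypothetical features) to the transition chosen by the oracle.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training instances created by an oracle.
configurations = [
    {"stack_top_pos": "NOUN", "buffer_head_pos": "VERB"},
    {"stack_top_pos": "VERB", "buffer_head_pos": "ADP"},
]
transitions = ["left_arc:nsubj", "right_arc:obl"]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(configurations)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, transitions)

# At parsing time the classifier predicts the next transition
# for the current configuration.
next_transition = classifier.predict(
    vectorizer.transform([{"stack_top_pos": "NOUN", "buffer_head_pos": "VERB"}])
)
```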

2.3 Alignment

In NLP, alignment is the task of aligning different linguistic entities, typically in two different languages. The two most common tasks are sentence alignment, where sentences with the same meaning in two or more languages are aligned, and word alignment, where words with the same meaning are aligned. Another type of alignment is sub-tree alignment, where the sub-trees in a syntactic tree in two (or more) languages are aligned.

Here we will only describe word alignment and sub-tree alignment.

As it turns out, the two are actually the same when we consider dependency structures: the nodes in a dependency tree are the words of the sentence, so aligning the sub-trees rooted in these nodes is equivalent to aligning the words.

2.3.1 Formal Definition

We will here formally define what we mean by word alignment in this work. We use the same definition of a sentence as we did in section 2.2.1, except we leave out the artificial root token.

Definition 9. $S = w_1, w_2, \ldots, w_n$

In alignment we have two sentences, so we superscript both the $S$ and the $w$, so that $S^a = w^a_1, w^a_2, \ldots, w^a_n$. Now we can define a word alignment:

Definition 10. A word alignment of two sentences $S^a$ and $S^b$ is an undirected bipartite graph $G = (V, A)$ consisting of the two disjoint sets of nodes $V^a$ and $V^b$ such that the following holds:

1. $V^a \subseteq \{w^a_1, w^a_2, \ldots, w^a_n\}$

2. $V^b \subseteq \{w^b_1, w^b_2, \ldots, w^b_m\}$

3. $V = V^a \cup V^b$

4. $A \subseteq V \times V$

The bipartite condition implies that $A \subseteq V^a \times V^b$, i.e. all edges are from vertices representing words from one sentence to vertices representing words from the other sentence.

Sometimes we will distinguish between sure and possible links. In this case we will instead need to define a labeled graph with the label set $R = \{\textit{sure}, \textit{possible}\}$, and then $A \subseteq V \times R \times V$.
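A small sketch of how such an alignment could be represented, following Definition 10 and the sure/possible labels above; the sentences, links, and the concrete data structure are our own illustration:

```python
# Representing a word alignment as a labeled bipartite graph.
# Vertices are (sentence, position) pairs; edges carry a label
# from R = {"sure", "possible"}.

sentence_a = ["the", "house", "is", "red"]
sentence_b = ["huset", "er", "rødt"]

# V^a and V^b: vertices for the words in each sentence.
V_a = {("a", i) for i in range(len(sentence_a))}
V_b = {("b", j) for j in range(len(sentence_b))}
V = V_a | V_b

# A subset of V^a x R x V^b: labeled links stored as triples.
A = {
    (("a", 0), "possible", ("b", 0)),  # the   - huset
    (("a", 1), "sure", ("b", 0)),      # house - huset
    (("a", 2), "sure", ("b", 1)),      # is    - er
    (("a", 3), "sure", ("b", 2)),      # red   - rødt
}
```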

2.3.2 Graph Based

One way to create alignments is to view the problem from a graph-theoretic point of view and use algorithms from this field to solve it. If each word (or node) in a sentence is represented by a vertex and links between words are edges, the graph will be bipartite, i.e. all edges connect a vertex from one sentence to a vertex from the other sentence. If all possible links are assigned a score (a weight in graph-theoretic terms), then the task of finding a maximum scoring alignment can be solved with standard graph algorithms, for instance a maximum matching algorithm. Figure 2.5 shows how such a graph would look for two sentences with three words each; the edges are the possible links.
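As a minimal sketch of this idea, assume the link scores are given in a matrix; the use of scipy's assignment solver below is our own choice of maximum matching algorithm, not necessarily the one referred to in the figure:

```python
# Finding a maximum scoring one-to-one alignment from a matrix of link scores.
# score[i][j] is the (made-up) score of linking word i in sentence a
# to word j in sentence b.
import numpy as np
from scipy.optimize import linear_sum_assignment

score = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.4],
    [0.1, 0.3, 1.8],
])

rows, cols = linear_sum_assignment(score, maximize=True)
alignment = list(zip(rows, cols))   # e.g. [(0, 0), (1, 1), (2, 2)]
total_score = score[rows, cols].sum()
```

Note that this solver forces every word to be matched; in practice one could drop links below some threshold afterwards to allow 1-0 and 0-1 links.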

One approach using this method is presented by Taskar, Lacoste-Julien, and Klein (2005), who describe a discriminative word aligner based on maximum matching. In the aligner they use linear programming to find the maximum matching, as this combines well with their learning algorithm, but combinatorial algorithms, as they note, can also be used.

They use Max-Margin Markov Networks (Taskar, Guestrin, and Koller, 2003) for learning and achieve very good results. Especially when using the output from GIZA++ as a feature they obtain very low alignment error rates, which underlines one of the advantages of discriminative learning: features like this can easily be incorporated.

Figure 2.5: Graph representing a word alignment task. By doing maximum matching the highest scoring alignment can be found.

This approach requires edge-factored features, i.e. the score of an edge cannot depend on the rest of the structure; if the features are not edge-factored, the maximum matching algorithms cannot be used. A drawback of the approach is that it only allows 1-1, 0-1, and 1-0 links.

Lacoste-Julien et al. (2006) present a word aligner that deals with this limitation.

The problem can be solved by viewing alignment as a minimum cost maximum flow problem, where the task is to find the maximum flow with the minimum cost through a network represented by a directed graph. The maximum matching problem can also be viewed this way by adding so-called source and sink vertices. To allow fertility greater than 1, extra edges from the source to the source-word vertices and extra edges from the target-word vertices to the sink are added. Figure 2.6 shows how the initial graph would look with three words in both sentences and a maximum fertility of 3. It should be noted that compared to the graph shown in (Lacoste-Julien et al., 2006) our graph contains some extra intermediary fertility nodes; this is to keep it a simple graph, i.e. with at most one edge from one vertex to another.

Figure 2.6: Directed graph representing a word alignment task with fertility higher than 1 (3 in this example). Solid dots represent the words. Dashed lines have no cost. The leftmost vertex is called the source and the rightmost the sink. The problem of finding a maximum scoring alignment can be solved using a minimum cost maximum flow algorithm.
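A sketch of this construction using networkx, with made-up integer scores; costs are negated scores so that a minimum cost flow corresponds to a maximum scoring alignment. For simplicity the fertility bound here is 1 and the intermediary fertility nodes of Figure 2.6 are left out; allowing fertility $k$ would amount to giving the source-side edges capacity $k$ (or adding parallel edges with increasing cost, as in the figure):

```python
# Sketch: word alignment as a minimum cost maximum flow problem.
# Scores are made up; costs are negated scores so that minimising
# cost maximises the total alignment score.
import networkx as nx

scores = {  # (source word, target word) -> integer link score
    ("the", "huset"): 1, ("house", "huset"): 4,
    ("is", "er"): 5, ("red", "rødt"): 5,
    ("the", "er"): 0, ("house", "rødt"): 1,
}

G = nx.DiGraph()
for (a, b), s in scores.items():
    G.add_edge("src_" + a, "tgt_" + b, capacity=1, weight=-s)

for a in {"the", "house", "is", "red"}:
    # Capacity 1 per source word gives a maximum fertility of 1.
    G.add_edge("source", "src_" + a, capacity=1, weight=0)
for b in {"huset", "er", "rødt"}:
    G.add_edge("tgt_" + b, "sink", capacity=1, weight=0)

flow = nx.max_flow_min_cost(G, "source", "sink")
links = [(a, b) for (a, b) in scores
         if flow["src_" + a]["tgt_" + b] > 0]
```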

As with the maximum matching problem, efficient algorithms exist to solve this problem. Apart from allowing fertility greater than one, Lacoste-Julien et al. (2006) introduce second-order features. They solve the inference problem as a quadratic assignment problem and achieve very low alignment error rates on the Hansard corpus.

The results with this aligner are very good, but the approach has one major drawback: the training time. Solving the inference problem is quite slow, and the learning algorithm in particular is not feasible for larger