
3 Data description

3.2 Data criteria


optional and regards the preparation phase; the second is the actual delivery during the sitting; the third is the transcription conducted by the Secretariat of the Parliament; the fourth consists of the revision; and the fifth of the translation into the other languages. Since this process and the constraints mentioned above apply to all three languages analysed in this study, no further reservations will be made regarding this.

Another constraint on corpus observations is related to the nature of speeches of all kinds: it can often be difficult to ascertain their specific origin, i.e. who composed, elaborated or dictated the text, since many politicians have professional speech writers who write, or help write, their speeches. This calls into question whether the individual speaker can also be considered the originator of the text. This constraint, too, is of minor importance to the findings of this study, since the object of study is language use in general and not the language use of individuals. In summary, the following example from the English part of the corpus neatly captures the various constraints of the Europarl texts discussed above.

3) Mr President, I am aware of the bad parliamentary habit of writing one's speeches in advance and then making them without listening to the rest of the debate. Indeed I have even been guilty of that myself on occasion. <ep-99-05-04.txt:159>


conversations with a total of 16,000 words; the RST Treebank (Carlson, Marcu, & Okurowski, 2003), with 385 texts totalling some 176,000 words from the Wall Street Journal Corpus; the Penn Discourse Treebank (Prasad et al., 2008), with over one million words, also from the Wall Street Journal Corpus; the Potsdam Commentary Corpus (Stede, 2004), with 170 texts from newspaper commentaries; Wolf & Gibson (2005), with 135 texts from the Wall Street Journal Corpus; and the Copenhagen Dependency Treebanks, in which approximately 60,000 words were annotated in Danish, English, German, Italian and Spanish (Korzen & Buch-Kromann, 2011). Note, however, that these studies employed different annotation techniques, had different scopes, drew on different theoretical frameworks and, most importantly, relied on varying numbers of trained annotators.

Figure 3.1: Extraction process

[Diagram labels: Original source corpus; Version 2 Europarl; Unbalanced subcorpus; Balanced subcorpora; L1+L2 Europarl DA+EN+IT; L1 Europarl DA+EN+IT; L2 Europarl DA+EN+IT]

The 150 texts were chosen from the Europarl subcorpus following a number of textual requirements: that there should be no more than one text by the same speaker, no texts from the same date and no texts from the same debate (referred to as CHAPTER in the metadata), or at least that there should be as much variety as possible. This was to some extent feasible in English and Italian, since there was a larger pool from which to choose. In Danish, I was forced to use several texts by the same speaker; see Appendix A for an overview.

A further requirement concerned speech length, so that the texts would also be comparable in terms of size. The texts in the balanced subcorpus contain between 150 and 700 words. The reason for this choice of size is found in van Halteren (2008, pp. 937–938), who argues that the short texts in Europarl (<380 words) tend to be more argumentative than the long texts (>2500 words), which he characterises as presentations of written reports (i.e. descriptive or expository text types). It may be difficult to ascertain how long these speeches were in terms of delivery, but a qualified estimate, based on the following extract of an English speech from the corpus containing 268 words and on my own read-aloud experiments, would be between two and eight minutes. The first line, introduced by the tag SPEAKER ID, contains relevant metadata such as a specific speaker ID (54), the language in which the speech was held (‘EN’ for English), the name of the speaker (Bushill-Matthews), and sometimes political affiliation (not indicated here). The numbers in the second line indicate year (‘01’ for 2001), month (‘01’ for January), day (15th) and a repetition of the speaker ID (54).
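The metadata layout and the selection requirements described above can be illustrated with a short Python sketch. It is not the procedure actually used to build the subcorpus; it only assumes the two line formats shown in the excerpt. The names `Speech`, `parse_speech` and `select_texts` are hypothetical, the debate number (CHAPTER) is assumed to be supplied by the caller, words are counted by splitting on whitespace, and the greedy first-come selection is a simplification of the "as much variety as possible" criterion.

```python
import re
from dataclasses import dataclass
from datetime import date

# Line formats as they appear in the excerpt (anything else is rejected).
SPEAKER_RE = re.compile(r'<SPEAKER ID=(\d+) LANGUAGE="([A-Z]{2})" NAME="([^"]*)"[^>]*>')
REF_RE = re.compile(r'<ep-(\d{2})-(\d{2})-(\d{2})\.txt:(\d+)>')


@dataclass(frozen=True)
class Speech:
    speaker_id: int
    language: str
    name: str
    day: date
    chapter: int  # debate number; assumed to be known to the caller
    text: str


def parse_speech(header: str, reference: str, text: str, chapter: int) -> Speech:
    """Build a Speech from the SPEAKER line and the <ep-yy-mm-dd.txt:id> line."""
    h = SPEAKER_RE.fullmatch(header.strip())
    r = REF_RE.fullmatch(reference.strip())
    if h is None or r is None:
        raise ValueError("unrecognised header or reference line")
    yy, mm, dd, ref_id = (int(g) for g in r.groups())
    if ref_id != int(h.group(1)):
        raise ValueError("speaker ID in reference does not match header")
    # Assumption: two-digit years from 90 upwards belong to the 1990s.
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return Speech(int(h.group(1)), h.group(2), h.group(3),
                  date(year, mm, dd), chapter, text)


def select_texts(speeches, min_words=150, max_words=700):
    """Greedy pass: keep a text only if its length is within bounds and
    its speaker, date and debate have not been used already."""
    seen_names, seen_days, seen_debates = set(), set(), set()
    chosen = []
    for s in speeches:
        n_words = len(s.text.split())  # whitespace tokenisation (assumption)
        debate = (s.day, s.chapter)
        if not (min_words <= n_words <= max_words):
            continue
        if s.name in seen_names or s.day in seen_days or debate in seen_debates:
            continue
        seen_names.add(s.name)
        seen_days.add(s.day)
        seen_debates.add(debate)
        chosen.append(s)
    return chosen
```

Where the pool is too small to satisfy all requirements, as was the case for Danish, the speaker check would have to be relaxed; the sketch does not model that fallback.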

Figure 3.2: Excerpt from a Europarl text

<SPEAKER ID=54 LANGUAGE="EN" NAME="Bushill-Matthews">

<ep-01-01-15.txt:54>

Mr President, I wish to begin by saying that although I have been allocated four minutes, I should like to do my bit for the simplification and streamlining of bureaucracy by speaking for less than half that time.

Gender and political affiliation are not variables in the study, and no generalisations are made with respect to them. The main reason for excluding these variables is, as mentioned above, that it is hard to determine whether a speech was actually written by the MEP indicated in the metadata or by a professional speech writer. Nonetheless, one might argue that it could be relevant to take the speakers’ gender, social background or political standpoint into consideration and to investigate, for instance, whether speakers belonging to the left-wing or right-wing groups in the Parliament use sentences of a certain length or more non-finite verb forms than their opponents. But since the present thesis does not adopt a Critical Discourse Analysis approach studying, for instance, the distribution of power in the Parliament, those parameters have not been taken into consideration. A quick look at the overall statistics of the texts (cf. Appendix A) also suggests that it is doubtful whether such a pattern is present at all: in the Danish texts, where there are several speeches by the same speakers, the difference between the shortest and longest sentences in speeches held by the same MEP is above 100 % (e.g. the speeches by Blak and Krarup). In addition, I found no evidence that gender and political standpoint (i.e. left or right) affect text structure (e.g. differences in syntactic structures), discourse structure (e.g. the use of different rhetorical relations) or information structure (e.g. differences in sentence lengths). Table 3.1 summarises the basic facts and data on the balanced L1 subcorpus of Europarl.

| | Danish L1 texts | English L1 texts | Italian L1 texts | Total |
|---|---|---|---|---|
| Number of texts | 50 | 50 | 50 | 150 |
| Number of words | 14,737 | 14,666 | 14,781 | 44,184 |
| Number of different speakers | 21 | 39 | 40 | 100 |

Table 3.1: Basic numbers of the balanced L1 Europarl subcorpus

As we can see from Table 3.1, the pools of Danish, English and Italian speakers differ. Whereas the vast majority of the English and Italian speeches were given by different speakers, the Danish speeches were more often given by the same speakers. The main reason for the lower number of different speakers in the Danish texts compared to the English and Italian texts lies in the discrepancy in the number of seats allocated in the European Parliament, and in the fact that only a few Danish speakers frequently participate in the parliamentary debates through speeches. At the time of writing, the countries represented by the languages in this study have been assigned the following numbers of seats in the Parliament: Denmark 13 (1.8 % of the total number of seats in the Parliament), Ireland 12 (1.6 %), and Italy and the United Kingdom 72 each (9.8 % each). This could pose a problem for generalisations based on the Danish data because the sample is not as heterogeneous as the English and Italian samples are.
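The difference in speaker pools can be made concrete with a small calculation over the figures in Table 3.1. The sketch below is illustrative only: the function name `summarise` is hypothetical, and the derived measures (mean words per text, mean texts per speaker) are simple averages that do not appear in the table itself.

```python
# Figures copied from Table 3.1 (balanced L1 subcorpus).
TABLE_3_1 = {
    "Danish":  {"texts": 50, "words": 14_737, "speakers": 21},
    "English": {"texts": 50, "words": 14_666, "speakers": 39},
    "Italian": {"texts": 50, "words": 14_781, "speakers": 40},
}


def summarise(table):
    """Return column totals and, per language, two derived averages."""
    totals = {
        key: sum(row[key] for row in table.values())
        for key in ("texts", "words", "speakers")
    }
    derived = {
        lang: {
            "words_per_text": row["words"] / row["texts"],
            "texts_per_speaker": row["texts"] / row["speakers"],
        }
        for lang, row in table.items()
    }
    return totals, derived


if __name__ == "__main__":
    totals, derived = summarise(TABLE_3_1)
    print(totals)
    for lang, values in derived.items():
        print(f"{lang}: {values['words_per_text']:.1f} words per text, "
              f"{values['texts_per_speaker']:.2f} texts per speaker")
```

On these figures, the Danish sample averages roughly 2.4 texts per speaker, against roughly 1.3 for English and Italian, while the mean text length is nearly identical across the three languages (about 295 words).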

In addition to the subcorpus of Danish, English and Italian L1 texts, a parallel corpus of the corresponding L2 texts in the three languages was created. The idea behind creating this parallel subcorpus was that although most translations were very similar to their source text in terms of linguistic structures, some translators had changed the syntactic structures in the translations, rendering explicit the rhetorical relations that they had inferred between two or more discourse units. This is, for example, the case in the Danish L2 text excerpt in example 4), translated from English L1, where the underlined English relative clause has been transformed into a subordinate finite clause with the discourse cue fordi (because), shown in bold-faced type.


4) Turning to the definition of child pornography in the Karamanou report, my group has problems with the definition which includes creating the impression that the person depicted is a child. <ep-01-06-11.txt:59>

Så til definitionen af børnepornografi i Karamanou-betænkningen. Her har min gruppe problemer med definitionen, fordi den indbefatter tilfælde, hvor man skaber det indtryk, at den afbildede er et barn.

Of course, the translator’s reproduction of the L1 construction needs to be approached critically. But in most cases, I found the syntactic changes very useful, sometimes confirming my own analyses of rhetorical relations. The L2 subcorpus has not been annotated in any way and has only been used as potential support for my L1 annotations and analyses. The L2 subcorpus was not used in the same way as the L1 subcorpus because parallel texts are best suited for improving machine translation, since they permit L1-L2 text alignment and evaluation, a matter which has been pointed out by several scholars (Baroni & Bernardini, 2005; McEnery et al., 2006). Comparable texts (i.e. texts in different languages or varieties that deal with the same overall topic), on the other hand, are well suited as the empirical basis for descriptive, and possibly cross-linguistic, comparisons. Translated texts are inappropriate for such comparisons because the filter of the translator and the translation strategies get in the way, and L2 texts may end up with a text structure very similar to that of the L1 (i.e. non-translated) texts. Baroni & Bernardini (2005, p. 260) refer to this phenomenon as ‘translationese’, a term adopted from Gellerstam (1986):

It is common, when reading translations, to feel that they are written in their own peculiar style. Translation scholars even speak of the language of translation as a separate ‘dialect’ within a language, which they call third code … or translationese ... Translationese has been originally described ... as the set of “fingerprints” that one language leaves on another when a text is translated between the two.

In the same vein, McEnery et al. (2006, p. 49) state that:

source and translated texts … alone serve as a poor basis for cross-linguistic contrasts, because translations (i.e. L2 texts) cannot avoid the effect of translationese ... [C]omparable corpora are a useful resource for contrastive studies and translation studies, when used in combination with parallel corpora.


A further precaution to take when using parallel corpora is to establish whether the translation has been produced directly or indirectly, that is, through another language. This is particularly important in EU texts due to the EU’s system of relay or pivot languages; recall that my subcorpora contain only Europarl texts from the period 1996-2003:

From personal discussion with a translator at the European Parliament, we know that after 2003, a pivot language was used (English), which implies that all statements were first translated into English and then into the 22 other target languages. Before 2003, however, it seems that the translations were made directly from all languages into others. (Cartoni & Meyer, 2012, p. 3)

Table 3.2 summarises the total number of words in the three parallel L2 subcorpora of Europarl.

| into \ from | Danish L1 | English L1 | Italian L1 |
|---|---|---|---|
| Danish L2 | - | 13,718 | 15,569 |
| English L2 | 16,732 | - | 15,909 |
| Italian L2 | 15,799 | 14,456 | - |

Table 3.2: Total numbers of words in the parallel L2 Europarl subcorpora
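As a rough illustration of how the L2 word counts relate to their sources, the sketch below divides each figure in Table 3.2 by the word count of the corresponding L1 subcorpus in Table 3.1. It assumes that each L2 subcorpus translates exactly the 50 L1 texts counted in Table 3.1, which the text implies but does not state outright. The function name `length_ratios` is hypothetical, and the ratios are a descriptive aid only, not a result reported in the study.

```python
# L1 word counts from Table 3.1, keyed by source language.
L1_WORDS = {"da": 14_737, "en": 14_666, "it": 14_781}

# L2 word counts from Table 3.2, keyed by (source language, target language).
L2_WORDS = {
    ("en", "da"): 13_718,
    ("it", "da"): 15_569,
    ("da", "en"): 16_732,
    ("it", "en"): 15_909,
    ("da", "it"): 15_799,
    ("en", "it"): 14_456,
}


def length_ratios(l1_words, l2_words):
    """Ratio of translated word count to source word count per language pair."""
    return {
        (source, target): count / l1_words[source]
        for (source, target), count in l2_words.items()
    }


if __name__ == "__main__":
    for (source, target), ratio in sorted(length_ratios(L1_WORDS, L2_WORDS).items()):
        print(f"{source} -> {target}: {ratio:.3f}")
```

A ratio above 1 means that the translations contain more words than their sources. Word counts across languages are influenced by orthographic conventions such as compounding and cliticised articles, so the ratios should not be read as a measure of explicitation.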