Note: this is no longer a wiki, only a static archive of the original!

Overview of the data used in the multilingual track

The table below gives basic statistics on the data sets of the multilingual track. There are a few notable differences from the corresponding table for the 2006 shared task, which result from different design decisions. In particular, in 2006 all data sets were converted (where necessary) so that every punctuation token (a) was attached as a dependent of some other token and (b) did not have any dependents of its own. While this made the data sets more homogeneous, it also distorted the analysis of specific constructions, notably coordination. For 2007 we therefore decided to leave the annotation decisions of the individual treebanks intact, which means that the treatment of punctuation varies more across data sets. More specific comments can be found below the table.
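The two 2006 punctuation conditions can be checked mechanically. The following is a minimal sketch, not the organizers' conversion script. It assumes that a sentence is given as a list of (id, form, head) triples with head 0 standing for the artificial root, that "attached" means HEAD is not 0, and that a punctuation token is any token whose form consists solely of Unicode punctuation characters. The treebanks themselves mark punctuation through their own tags, so `is_punct` is a hypothetical stand-in, as is the function name `punctuation_violations`.

```python
import unicodedata


def is_punct(form):
    # Assumption: a token counts as punctuation if every character of its
    # form is in a Unicode punctuation category (Pc, Pd, Po, ...).
    return bool(form) and all(unicodedata.category(ch).startswith("P") for ch in form)


def punctuation_violations(sentence):
    """Return (unattached, with_dependents) for one sentence.

    sentence: list of (id, form, head) triples, head 0 = artificial root.
    unattached: ids of punctuation tokens with HEAD=0, violating (a).
    with_dependents: ids of punctuation tokens that head another token,
    violating (b).
    """
    punct_ids = {tid for tid, form, _ in sentence if is_punct(form)}
    unattached = sorted(tid for tid, _, head in sentence
                        if tid in punct_ids and head == 0)
    with_dependents = sorted({head for _, _, head in sentence
                              if head in punct_ids})
    return unattached, with_dependents


# Toy coordination in which the comma heads both conjuncts.
toy = [(1, "apples", 2), (2, ",", 0), (3, "pears", 2), (4, ".", 0)]
unattached, with_dependents = punctuation_violations(toy)
print(unattached, with_dependents)
```

In the toy sentence the comma acts as the head of the coordination, which is the kind of analysis that the 2006 conversion had to rewrite and that the 2007 data leave untouched.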

| | Arabic | Basque | Catalan | Chinese | Czech | English | Greek | Hungarian | Italian | Turkish |
|---|---|---|---|---|---|---|---|---|---|---|
| no. of tokens (*1000) | 112 | 51 | 431 | 337 | 432 | 447 | 65 | 132 | 71 | 65 |
| no. of sentences (*1000) | 2.9 | 3.2 | 15.0 | 57.0 | 25.4 | 18.6 | 2.7 | 6.0 | 3.1 | 5.6 |
| tokens per sentence | 38.3 | 15.8 | 28.8 | 5.9 | 17.0 | 24.0 | 24.2 | 21.8 | 22.9 | 11.6 |
| LEMMA present | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes |
| no. of different CPOSTAG values | 15 | 25 | 17 | 13 | 12 | 31 | 18 | 16 | 14 | 14 |
| no. of different POSTAG values | 21 | 64 | 54 | 294 | 59 | 45 | 38 | 43 | 28 | 31 |
| no. of different FEATS values (separated by '\|') | 21 | 359 | 33 | 0 | 71 | 0 | 31 | 50 | 21 | 78 |
| no. of different DEPREL values | 29 | 35 | 42 | 69 | 46 | 20 | 46 | 49 | 22 | 25 |
| no. of different DEPREL values with HEAD=0 | 18 | 17 | 1 | 1 | 8 | 1 | 22 | 1 | 1 | 1 |
| % of tokens with HEAD=0 | 8.7 | 9.7 | 3.5 | 16.9 | 11.6 | 4.2 | 8.3 | 4.6 | 5.4 | 12.8 |
| % of tokens with HEAD to the left | 79.2 | 44.5 | 60.0 | 24.7 | 46.9 | 49.0 | 44.8 | 27.4 | 65.0 | 3.8 |
| % of tokens with HEAD to the right | 12.1 | 45.8 | 36.5 | 58.4 | 41.5 | 46.9 | 46.9 | 68.0 | 29.6 | 83.4 |
| no. of tokens with HEAD=0 per sentence | 3.3 | 1.5 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.2 | 1.5 |
| % of relations that are non-projective | 0.4 | 2.9 | 0.1 | 0.0 | 1.9 | 0.3 | 1.1 | 2.9 | 0.5 | 5.5 |
| % of sentences with at least 1 non-projective relation | 10.1 | 26.2 | 2.9 | 0.0 | 23.2 | 6.7 | 20.3 | 26.4 | 7.4 | 33.3 |
| punctuation attached (always (A), sometimes (S), never (N)) | S | S | A | S | S | A | S | A | A | S |
| no. of DEPREL values for punctuation | 10 | 13 | 6 | 29 | 16 | 13 | 15 | 1 | 10 | 12 |
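Several rows of the table (tokens per sentence, the HEAD=0 and head-direction percentages, and the two non-projectivity rows) can be recomputed from a treebank file. The sketch below is not the script used by the organizers. It assumes input in the tab-separated CoNLL-X column layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences, and it assumes well-formed, acyclic trees. It also assumes one common definition of non-projectivity: a relation is non-projective if some token strictly between head and dependent is not dominated by the head, with the artificial root 0 dominating every token. The table may have been computed under a different convention, so small deviations are possible. All function names are hypothetical.

```python
def read_conll(text):
    """Split CoNLL-X text into sentences; each token is a list of columns."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.split("\t"))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def is_nonprojective(heads, dep):
    """heads maps token id -> head id (0 = artificial root)."""
    head = heads[dep]
    lo, hi = min(head, dep), max(head, dep)
    for k in range(lo + 1, hi):
        # Climb from the intervening token k towards the root; the relation
        # is projective only if the climb passes through `head`.
        node = k
        while node != head and node != 0:
            node = heads[node]
        if node != head:
            return True
    return False


def treebank_stats(sentences):
    n_tok = n_root = n_left = n_right = n_nonproj = n_sent_nonproj = 0
    for sent in sentences:
        heads = {int(tok[0]): int(tok[6]) for tok in sent}  # ID and HEAD columns
        sent_has_nonproj = False
        for dep, head in heads.items():
            n_tok += 1
            if head == 0:
                n_root += 1
            elif head < dep:
                n_left += 1   # head precedes the token
            else:
                n_right += 1  # head follows the token
            if is_nonprojective(heads, dep):
                n_nonproj += 1
                sent_has_nonproj = True
        n_sent_nonproj += sent_has_nonproj
    n_sent = len(sentences)

    def pct(part, total):
        return round(100.0 * part / total, 1)

    return {
        "tokens_per_sent": round(n_tok / n_sent, 1),
        "pct_head_root": pct(n_root, n_tok),
        "pct_head_left": pct(n_left, n_tok),
        "pct_head_right": pct(n_right, n_tok),
        "roots_per_sent": round(n_root / n_sent, 1),
        "pct_nonproj_relations": pct(n_nonproj, n_tok),
        "pct_nonproj_sentences": pct(n_sent_nonproj, n_sent),
    }


def row(tid, head):
    # Hypothetical helper: builds one 10-column line with placeholder values.
    return "\t".join([str(tid), "w%d" % tid, "_", "_", "_", "_", str(head), "_", "_", "_"])


# Two toy sentences; in the first, the relation 3 -> 1 crosses the root token 2.
sample = "\n".join([row(1, 3), row(2, 0), row(3, 2), row(4, 2), "", row(1, 2), row(2, 0)])
stats = treebank_stats(read_conll(sample))
print(stats)
```

Under this reading, "HEAD to the left" means that the head precedes the token in the sentence, which matches the low left-headed share reported for Turkish and the high one reported for Arabic.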

Note the following:

DataOverview (last edited 2007-02-09 18:54:27 by SandraKuebler)