The multilingual track of the shared task is organized in the same way as the 2006 task (see http://nextens.uvt.nl/~conll/) with annotated training and test data from a wide range of languages, which should be processed with one and the same parsing system.
The task in this track is to learn how to derive labeled dependency structures for a range of languages by means of a fully automatic dependency parser. The input consists of (minimally) tokenized and part-of-speech tagged sentences. Each sentence is represented as a sequence of tokens plus additional features such as lemma, part-of-speech, or morphological properties. For each token, the parser must output its head and the corresponding dependency relation (secondary dependencies are not taken into consideration). (More information about the representation of input and output can be found on the data format page.)
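As a rough illustration of this representation, the sketch below reads one sentence in a tab-separated column format of the kind described on the data format page (the column layout follows the CoNLL-X convention: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL; the exact columns and the example values here are illustrative, not taken from the shared-task data):

```python
from typing import NamedTuple, List

class Token(NamedTuple):
    id: int        # token position in the sentence, starting at 1
    form: str      # word form
    lemma: str     # lemma, or "_" if unavailable
    cpostag: str   # coarse-grained part-of-speech tag
    postag: str    # fine-grained part-of-speech tag
    feats: str     # morphological features, "|"-separated, or "_"
    head: int      # ID of the head token (0 = artificial root)
    deprel: str    # dependency relation to the head

def parse_sentence(lines: List[str]) -> List[Token]:
    """Turn the lines of one sentence block into Token records."""
    tokens = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                            cols[4], cols[5], int(cols[6]), cols[7]))
    return tokens

# Hypothetical three-token sentence for illustration only.
example = [
    "1\tEconomic\teconomic\tADJ\tJJ\t_\t2\tNMOD",
    "2\tnews\tnews\tNOUN\tNN\t_\t3\tSBJ",
    "3\tspread\tspread\tVERB\tVBD\t_\t0\tROOT",
]
sent = parse_sentence(example)
```

A parser's job is then to predict the last two fields (head and deprel) given the others.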
Although data and settings may vary between languages, the same parser should handle all languages. The parser must therefore be able to learn from training data, to generalize to unseen test data, and to handle multiple languages, possibly by adjusting a small number of hyper-parameters. Participants in the multilingual track are expected to submit parsing results for all languages involved (see below).
The main evaluation metric is the labeled attachment score, i.e. the proportion of tokens that are assigned both the correct head and the correct dependency relation. Scores will also be given for unlabeled attachment score (the proportion of tokens assigned the correct head) and label accuracy (the proportion of tokens assigned the correct dependency relation). Unlike in 2006, punctuation tokens will be included in all evaluation metrics. Some gold standard dependency structures against which systems are scored will be non-projective. A system that produces only projective structures will nevertheless be scored against the partially non-projective gold standard. (More information about the evaluation procedure can be found on the software page.)
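The three metrics can be sketched as follows, over parallel lists of gold and predicted (head, deprel) pairs; this is a minimal illustration of the definitions above, not the official evaluation script:

```python
from typing import List, Tuple

def attachment_scores(gold: List[Tuple[int, str]],
                      predicted: List[Tuple[int, str]]):
    """Labeled attachment score (LAS), unlabeled attachment score (UAS),
    and label accuracy over all tokens, punctuation included."""
    assert len(gold) == len(predicted) and gold
    n = len(gold)
    # LAS: both head and dependency relation are correct.
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    # UAS: the head is correct (the relation label is ignored).
    uas = sum(gh == ph for (gh, _), (ph, _) in zip(gold, predicted)) / n
    # Label accuracy: the dependency relation is correct.
    la = sum(gd == pd for (_, gd), (_, pd) in zip(gold, predicted)) / n
    return las, uas, la

# Hypothetical three-token sentence: one label error, no head errors.
gold = [(2, "NMOD"), (3, "SBJ"), (0, "ROOT")]
pred = [(2, "NMOD"), (3, "OBJ"), (0, "ROOT")]
las, uas, la = attachment_scores(gold, pred)
```

In this toy example the parser attaches every token correctly (UAS = 1.0) but mislabels one of the three relations, so LAS and label accuracy are both 2/3.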
The multilingual track will include data from the following languages:
The selection of languages is motivated by the desire to have a typologically diverse set of languages, to include languages that were not included in the 2006 shared task, and to include languages for which genuine dependency treebanks are available (as opposed to conversions from other types of annotation).
Training sets will vary in size from 50,000 to 500,000 tokens, while test sets for all languages will consist of approximately 5,000 tokens. (The decision to reduce the size of the largest data sets, as well as to limit the number of languages, was based on experience from the 2006 shared task, where some teams simply did not have time to complete the training phase for all languages.)