Note: this is no longer a wiki, only a static archive of the original!


Domain Adaptation Track

The domain adaptation track of the shared task will investigate techniques for adapting current parsing technologies to domains other than those of the data on which they were trained. See the background page for further information on previous approaches.

Task Definition

The task in this track is to learn how to derive labeled dependency structures for English by means of a fully automatic, domain-independent dependency parser. The input format follows that of the Multilingual Track, which is described on the data format page. These data sets will contain no lemmas or morphological properties for the words, only part-of-speech tags. As in the Multilingual Track, the parser must output the head and the corresponding dependency relation for each token.
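As a rough illustration of the input and output described above, the sketch below reads one sentence in a tab-separated, ten-column layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). That layout is an assumption borrowed from the CoNLL-X convention; the data format page remains the authority. The sample sentence, its labels, and the names `Token` and `read_sentence` are made up for illustration.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    idx: int                # position of the token in the sentence, starting at 1
    form: str               # the word form
    postag: str             # part-of-speech tag
    head: Optional[int]     # index of the head token (0 = artificial root), None if unparsed
    deprel: Optional[str]   # dependency relation to the head, None if unparsed


def read_sentence(block: str) -> List[Token]:
    """Parse one sentence, one token per line, columns separated by tabs.

    Assumed column order (hypothetical here, see the data format page):
    ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL.
    """
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 columns, got {len(cols)}: {line!r}")
        # An underscore marks an empty field, e.g. HEAD/DEPREL in unparsed input.
        head = None if cols[6] == "_" else int(cols[6])
        deprel = None if cols[7] == "_" else cols[7]
        tokens.append(Token(int(cols[0]), cols[1], cols[4], head, deprel))
    return tokens


# Invented example: lemma and morphological columns are left empty ("_").
SAMPLE = "\n".join([
    "1\tProteins\t_\tNNS\tNNS\t_\t2\tSBJ\t_\t_",
    "2\tbind\t_\tVBP\tVBP\t_\t0\tROOT\t_\t_",
    "3\tDNA\t_\tNN\tNN\t_\t2\tOBJ\t_\t_",
])

if __name__ == "__main__":
    for tok in read_sentence(SAMPLE):
        print(tok.idx, tok.form, tok.postag, tok.head, tok.deprel)
```

In this sketch the head and dependency relation are the two fields a parser would be expected to fill in; everything else is given as input.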

The purpose of this track is to examine how well a parser trained in one domain performs in another. A single parsing framework must be used, but it is conceivable that the parsing models are trained with different parameters or unlabeled data depending on the domain in which they will be used.

Evaluation will be identical to the Multilingual Track.
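For orientation, here is a minimal sketch of a labeled attachment score: the fraction of tokens whose predicted head and dependency relation both match the gold standard. That this is the metric in use is an assumption here (the Multilingual Track's evaluation description is authoritative), and details such as the treatment of punctuation are not modeled. The function name and the toy data are hypothetical.

```python
from typing import List, Tuple

# One (head index, dependency relation) pair per token.
Arc = Tuple[int, str]


def labeled_attachment_score(gold: List[Arc], pred: List[Arc]) -> float:
    """Return the share of tokens with both the correct head and the correct label."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token sequence")
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


# Toy data: token 3 has the wrong label, token 4 has the wrong head.
gold = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (2, "P")]
pred = [(2, "SBJ"), (0, "ROOT"), (2, "NMOD"), (3, "P")]

if __name__ == "__main__":
    print(labeled_attachment_score(gold, pred))  # 0.5
```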

For the domain adaptation track, participants may submit to either a closed class or an open class, both of which are described below.

Closed class: Systems can be trained and developed **only** on the data provided by the organizers. This also prohibits the use of any additional taggers or other components that have been either trained or hand-developed on another set of data.

Open class: Permitted resources include additional annotations for the data provided, additional data, and additional system components that have been trained or developed on data not provided by the organizers. However, in the spirit of the shared task, systems should only use resources from WSJ-like domains (i.e., news) to avoid training or developing systems on the test domains. The only exception is unlabeled data, which may come from any domain. Participants who are unsure whether a particular resource is allowed can forward their inquiry to the organizers.

Participants can submit to either the closed class or the open class or both.

Data Sets

The training set will be derived from the Penn WSJ Treebank. It will be identical to the English training set from the Multilingual Track. It will contain roughly 500,000 tokens of parsed data.

The development set and each test set will be drawn from non-news sources, and each will contain roughly 5,000 tokens of parsed, human-curated data. The domains of the test data will not be known until the official release of that data. There will most likely be two test sets, from domains different from those of the training and development sets. The test data will include either gold standard part-of-speech tags or automatically derived part-of-speech tags.

**Note: There will be no in-domain training data provided for the development or test sets.**

In addition, large unlabeled corpora for each data set (training, development, test) will be available. This data will be tokenized, included as part of the download, and clearly indicated. The unlabeled data for the test domains will be released at the same time as the labeled test data.

Each data set comes from a different source. As a result, attempts were made to standardize tokenization and the label and part-of-speech tag sets, and to ensure as much consistency as possible in the annotated dependency relations.

English was chosen for this task due to the availability of resources in multiple domains.

The data can be downloaded as follows:

DomainAdaptationTrack (last edited 2007-03-26 15:02:41 by RyanMcDonald)