Note: this is no longer a wiki, only a static archive of the original!

Data Download

For some datasets, participants had to fill out licenses and send them to us. For others, licenses had to be sent to the LDC. The remaining datasets were, and still are, freely available.

Freely Available

The following datasets are freely available:

Dataset | Blind Test Data | Labeled Test Data
CHILDES (domain adaptation) | download | download

Note that the CHILDES test set is optional in the domain adaptation track. A newer, but unformatted, data set can be found here

Licensed by Treebank Providers (Not Valid Anymore!)

During the Shared Task 2007, it was possible to obtain some datasets by sending a collection of signed licenses to a single fax number. However, these licenses are no longer valid! We list them here for completeness. If you are still interested in the datasets mentioned here, you need to get in touch with the corresponding treebank providers.

Dataset | License | Training Data | Blind Test Data | Labeled Test Data
Basque | download | download | download | download
Catalan | download | download | download | download
Chinese | download | download | download | download
Greek | download | download | download | download
Hungarian | download | download | download | download
Italian | download | download | download | download
Turkish | download | download | download | download

Note that a new version of the Basque gold standard test set was released on April 23, 2007. The new version differs only with respect to one sentence.

Note that new versions of the English and Hungarian test sets were released on March 28, 2007. The only difference compared to the first release is the removal of trailing tab stops.
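For anyone still holding a copy of the first release, the trailing tab stops mentioned above can be removed locally. The following is only a minimal sketch, not the procedure the organizers used; the function name `strip_trailing_tabs` and the sample line are made up for illustration.

```python
def strip_trailing_tabs(lines):
    """Return the lines with the newline and any trailing tab characters removed.

    Tabs inside a line (the column separators) are left untouched.
    """
    return [line.rstrip("\n").rstrip("\t") for line in lines]


# Hypothetical sample: one token line with two stray trailing tabs,
# followed by the blank line that ends a sentence.
sample = ["1\tThe\tthe\tDT\t\t\n", "\n"]
cleaned = strip_trailing_tabs(sample)
```

To process a file, the same function can be applied to the result of `open(path, encoding="utf-8").readlines()`, where `path` points at a local copy of the test set.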

Note that a new version of the Chinese data set was released on February 6, 2007. In the new version, a small number of erroneous CPOSTAG and DEPREL labels have been corrected and seven errors involving collapsed rows or columns have been fixed. As a consequence, the number of tokens has increased from 337,162 to 337,175. (The number of sentences is the same as before.)

Note that a new version of the Basque data set was released on February 9, 2007. In the new version, a few features, which are really output features for a parser, have been removed from the FEATS field. Prior to this, a new version had been released on February 7, 2007, in which a number of erroneous CPOSTAG and DEPREL labels had been corrected. As a consequence, the number of distinct DEPREL labels decreased from 161 to 35.

Note that a new version of the Catalan data set was released on February 9, 2007. In the new version, a small number of errors have been corrected and the number of distinct DEPREL labels has been reduced from 161 to 35.
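The token, sentence, and DEPREL label counts quoted in the notes above can be checked against a local copy of a data set. The sketch below assumes the usual ten-column, tab-separated shared task layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences; that layout is not described on this page, and the names `conll_stats`, `Stats`, and `DEPREL_COLUMN` are hypothetical.

```python
from collections import namedtuple

Stats = namedtuple("Stats", "sentences tokens deprels")

# Assumption: DEPREL is the eighth column (zero-based index 7).
DEPREL_COLUMN = 7


def conll_stats(text):
    """Count sentences, tokens, and distinct DEPREL labels in tab-separated data."""
    sentences = 0
    tokens = 0
    deprels = set()
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        fields = line.split("\t")
        # Lines with too few columns (e.g. collapsed columns) contribute no label.
        if len(fields) > DEPREL_COLUMN:
            deprels.add(fields[DEPREL_COLUMN])
    return Stats(sentences, tokens, deprels)


# Made-up two-sentence sample, not taken from any of the treebanks.
sample = (
    "1\tShe\tshe\tPRP\tPRP\t_\t2\tSBJ\t_\t_\n"
    "2\tsleeps\tsleep\tVB\tVBZ\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tGo\tgo\tVB\tVB\t_\t0\tROOT\t_\t_\n"
)
stats = conll_stats(sample)
```

Comparing `stats.tokens` and `len(stats.deprels)` between two releases of the same data set is one way to confirm which version a local copy corresponds to.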

Licensed by the LDC

For Czech, Arabic and all English data a license needs to be faxed to the LDC.

Note that the English download contains data for both the multilingual track and the domain adaptation track, which includes unlabeled data for the latter.

Note that new versions of the Arabic and Czech data sets were released on February 5, 2007.

DataDownload (last edited 2008-03-08 14:32:06 by SebastianRiedel)