Treebank Troubles

Treebanks are necessary resources for data-driven parsing. However, they are sometimes not as perfect as one would wish for a gold standard. See the paper (link to be added) by Sabine Buchholz and Darren Green at the LREC 2006 Workshop on "Quality assurance and quality measurement for language and speech resources" for a general overview of problems one can encounter with treebanks.

If the treebank providers themselves do not fix or log the problems, every researcher who wants to work with the treebank has to discover and solve them independently. This page is meant to encourage the sharing of these discoveries and possible solutions, so that not everybody has to "reinvent the wheel".

You might also want to share other treebank-specific information, e.g. which parts of a treebank come from which text source or genre (if that information is not obvious from the annotation or documentation), or which parts you used as training or test material.

