Recommendations for treebank projects or How to let your users work for you :-)
- Explicitly describe the format you are using in as much
detail as possible. This refers to all levels of format,
e.g.
- tag set, function set, constituent label set, feature set (this is the most obviously needed information about a treebank and mostly present, although not always in electronic form)
- file naming conventions (do file names convey any information?)
- file encoding (Latin-1, -2, etc., UTF-8, ...)
- encoding of special characters (if not using Unicode), e.g. by entity name such as &quot; for double quotes; provide a full list of all such entities and a translation for each
- Note: XML does not allow ' inside a single-quoted string ('...') or " inside a double-quoted string ("..."), so you have to use &apos;/&quot; instead
- are features ordered in any way? If not, say so explicitly
- where is whitespace (tabs, spaces) allowed (e.g. at end of line?) or required (e.g. between features?)
- where are blank lines allowed/required?
- Write a format checker and use it after
each change or at least before each release
- for XML, this can be a DTD/Schema but make sure that the DTD/Schema does actually contain all the important details
- if you do not have a format checker/DTD/etc., encourage users to write one for you based on the above-mentioned detailed description
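As an illustration of what a small format checker might look like, here is a minimal Python sketch. It assumes a purely hypothetical format (one "token<TAB>tag" pair per line, exactly one blank line between sentences) and a made-up tag set; a real checker would encode whatever the treebank's own format description says.

```python
import re

# Hypothetical tag set and line format, for illustration only.
ALLOWED_TAGS = {"DET", "NOUN", "VERB", "PUNCT"}
LINE_PATTERN = re.compile(r"^(\S+)\t(\S+)$")


def check_format(lines):
    """Return a list of (line_number, message) pairs, one per violation."""
    errors = []
    previous_blank = True  # treat the start of the file like a sentence boundary
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line == "":
            # Only a single blank line is allowed between sentences.
            if previous_blank:
                errors.append((number, "unexpected blank line"))
            previous_blank = True
            continue
        previous_blank = False
        match = LINE_PATTERN.match(line)
        if match is None:
            # Covers missing tabs, extra fields, and trailing whitespace.
            errors.append((number, "expected 'token<TAB>tag'"))
        elif match.group(2) not in ALLOWED_TAGS:
            errors.append((number, "unknown tag " + match.group(2)))
    return errors


if __name__ == "__main__":
    sample = ["The\tDET\n", "dog\tNOUN\n", "\n", "\n", "barks VERB\n", "!\tPUNKT\n"]
    for number, message in check_format(sample):
        print(f"line {number}: {message}")
```

Reporting line numbers with each message makes it easy for users to turn checker output into error reports or patches.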
- Consider setting up a Wiki page where users can post information about software they have written for the treebank, or maintain a webpage yourself
- Encourage users to send information about
errors they think they have found in the treebank, then
correct them
- even easier: encourage users to send "patches", i.e. proposed corrections; then either reject them with an explanation (keep patch and explanation on web page) or accept and incorporate them into the next release (then delete from web page)
- Have a clear version numbering scheme, so that researchers, when reporting on experiments, can specify "We used version x.y of the treebank". Keep old versions for reference.
- If the treebank contains text from different genres: have a (preferably electronic) list of which files/sentences belong to which genre
- If your treebank consists of more than one file: state whether file boundaries correlate with any divisions in the original texts
- If file boundaries do not coincide with original text boundaries: have a (preferably electronic) list of which files belong to which original text
- If possible: preserve original tokenization
somehow/somewhere, e.g.
- keep the relation between annotation and text through pointers into the raw text instead of physically inserting the annotations into the text (missing: reference to such annotation schemes)
- have an additional field "original sentence"
- use special marks to indicate whether two tokens were originally joined (e.g. SUSANNE's "+", see Section 4.5) or one token was originally split (underscores for multiwords)
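The pointer-based option can be sketched in a few lines of Python. This is only an illustration under assumed conventions: annotations are stored as (start, end, tag) character offsets into the untouched raw text, and the tags and the helper names tokens and was_joined are hypothetical.

```python
raw_text = "Don't panic."

# Each annotation points into raw_text by (start, end) character offsets.
# The raw text itself is never modified; the tags are made up.
annotations = [
    (0, 2, "VERB"),     # "Do"
    (2, 5, "PART"),     # "n't"
    (6, 11, "VERB"),    # "panic"
    (11, 12, "PUNCT"),  # "."
]


def tokens(text, spans):
    """Recover (token, tag) pairs by slicing the raw text."""
    return [(text[start:end], tag) for start, end, tag in spans]


def was_joined(spans, index):
    """True if token `index` directly followed the previous token in the raw text."""
    return index > 0 and spans[index - 1][1] == spans[index][0]


if __name__ == "__main__":
    print(tokens(raw_text, annotations))
    print([was_joined(annotations, i) for i in range(len(annotations))])
```

Because the offsets record exactly where each token came from, the original tokenization (including which tokens were written together) can be reconstructed without extra marks.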
- For XML only: Most information should be put in tags, not in attributes of tags because much more format checking can be done for the former (to do: ask Erwin for reference...). It is particularly bad to have string attributes with internal structure.
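To make the point about structured string attributes concrete, here is a small Python sketch comparing two hypothetical encodings of the same word. The element names, attribute names, and the "key=value|key=value" convention are invented for illustration and do not come from any particular treebank.

```python
import xml.etree.ElementTree as ET

# Two made-up encodings of the same information.
packed = ET.fromstring('<w form="dogs" feats="pos=NOUN|num=pl"/>')
explicit = ET.fromstring("<w><form>dogs</form><pos>NOUN</pos><num>pl</num></w>")

# Packed attribute: every user needs an ad hoc mini-parser, and a DTD would
# typically see the whole value as one opaque string.
packed_feats = dict(item.split("=", 1) for item in packed.get("feats").split("|"))

# Explicit elements: each value is directly addressable by the XML tooling.
explicit_feats = {child.tag: child.text for child in explicit if child.tag != "form"}

if __name__ == "__main__":
    print(packed_feats)
    print(explicit_feats)
```

Both variants yield the same feature dictionary, but only the second exposes its structure to standard XML tools and validators.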
Comments/additions welcome,
Sabine Buchholz