Note: this is no longer a wiki, only a static archive of the original!

Recommendations for writers of treebank conversion scripts

Note: There are at least three different kinds of treebank conversion scripts:

Obviously, scripts can also combine several of these conversions. Most of the recommendations below apply to all these kinds of scripts. If not, we will explicitly say so.

These recommendations are inspired by Murphy's law "If anything can go wrong, it will." Don't let them depress you :-)

They are only recommendations and you might not have the time to implement all of them (now) but you should at least be aware of them and document which ones you did or did not follow (so somebody else might do the rest later).

Validate input format

Do not assume that the input treebank complies with any specifications unless you have proof to the contrary!

For example, if the treebank is in XML, check whether there is a DTD or something equivalent. If so, validate the treebank. If not, consider writing one first, ideally in collaboration with the treebank authors. If the treebank is in any other format, the treebank authors might still have validation software for it. Ask them for it. But do not assume that they have actually validated the version you received.

Try to understand the details of what the DTD/validator software checks. It might not check everything that is necessary. E.g. if the treebank contains internally structured attribute strings, such as postag="N(common,masc,pl)", which generally is a bad idea, the DTD might not check these.
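As a minimal sketch in Python, a converter could check such internally structured attribute strings itself when the DTD cannot. The pattern, the `ALLOWED_FEATURES` table and the function name `check_postag` are hypothetical; the real inventory of categories and features has to come from the treebank documentation or its authors.

```python
import re

# Assumed shape: one category letter followed by a parenthesised,
# comma-separated feature list, e.g. N(common,masc,pl).
POSTAG_RE = re.compile(r"([A-Z])\(([^()]*)\)")

# Hypothetical inventory: for each category, the allowed values per position.
ALLOWED_FEATURES = {
    "N": [{"common", "proper"}, {"masc", "fem", "neut"}, {"sg", "pl"}],
}


def check_postag(value):
    """Return a list of problems found in one postag string (empty if none)."""
    match = POSTAG_RE.fullmatch(value)
    if match is None:
        return [f"malformed postag {value!r}"]
    category, feature_text = match.groups()
    slots = ALLOWED_FEATURES.get(category)
    if slots is None:
        return [f"unknown category {category!r} in {value!r}"]
    features = feature_text.split(",")
    if len(features) != len(slots):
        return [f"expected {len(slots)} features in {value!r}, found {len(features)}"]
    # Compare each feature with the values allowed in its position.
    return [
        f"unexpected feature {feat!r} in position {i + 1} of {value!r}"
        for i, (feat, allowed) in enumerate(zip(features, slots))
        if feat not in allowed
    ]
```

Running `check_postag` over every postag value in the treebank before conversion gives a list of problems that can be passed on to the treebank authors.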

If no DTD or other validator exists, try to find a detailed format description. If that does not exist, write one, as detailed as possible, and check its content with the treebank authors. Not everything that looks like a mistake is one; it might just be an undocumented special case.

Try to be as specific as possible, e.g.

Warn informatively

If no validator exists, you could write one as a stand-alone application, give it to the treebank authors and hope that they correct the treebank until it is valid. However, that might be impractical, as it can take a long time. Alternatively, you have to write a conversion script that is robust enough to deal with buggy input. However, your script should never just silently deal with format problems. Instead, it should output a line (possibly to STDERR) with as much information as possible. This should include at least the following:
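A warning helper of this kind might look like the following Python sketch. The function name `warn` and the fields it reports (file name, line number, sentence identifier, offending input) are assumptions chosen for illustration, not a fixed list.

```python
import sys


def warn(message, *, file_name, line_number, sentence_id=None, text=None, stream=None):
    """Write one self-contained warning line to STDERR (or to `stream`)."""
    stream = sys.stderr if stream is None else stream
    # Location first, so warnings can be sorted and grouped by file and line.
    parts = [f"WARNING: {file_name}:{line_number}"]
    if sentence_id is not None:
        parts.append(f"sentence {sentence_id}")
    parts.append(message)
    if text is not None:
        # repr() makes stray whitespace and odd characters visible.
        parts.append(f"offending input: {text!r}")
    line = "; ".join(parts)
    print(line, file=stream)
    return line
```

Keeping each warning on a single line makes it easy to collect the messages in a file and send them to the treebank authors.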

Send warnings/error messages to treebank authors

They might just fix them eventually... Or at least warn other users about them...

Use options

In some cases, there are several alternative ways how to deal with recoverable problems. For example, some users might want to be strict and ignore any sentence that has a problem. Others might want to use either an existing value for missing information (e.g. anything without a POS could be called "name") or a special value (e.g. '???'). Provide options for choosing among these alternatives. Also consider options for the level of warnings (e.g. do or do not warn about incorrect whitespace). However, the default should always be to warn.
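The alternatives above could be exposed as command-line options, for example with Python's standard `argparse` module. This is only a sketch: the option names (`--on-problem`, `--fallback-pos`, `--unknown-pos`, `--no-whitespace-warnings`) and the helper `resolve_pos` are hypothetical, and the default policy chosen here is an assumption. Warnings stay switched on unless the user explicitly disables them.

```python
import argparse


def build_parser():
    """Build a parser for the (hypothetical) problem-handling options."""
    parser = argparse.ArgumentParser(description="Treebank conversion (sketch)")
    parser.add_argument(
        "--on-problem", choices=["skip", "existing", "special"], default="special",
        help="skip the sentence, use an existing value, or use a special value",
    )
    parser.add_argument("--fallback-pos", default="name",
                        help="existing POS value used with --on-problem existing")
    parser.add_argument("--unknown-pos", default="???",
                        help="special POS value used with --on-problem special")
    # Warnings are on by default; this flag only switches one kind off.
    parser.add_argument("--no-whitespace-warnings", dest="warn_whitespace",
                        action="store_false",
                        help="do not warn about incorrect whitespace")
    return parser


def resolve_pos(pos, args):
    """Return the POS to use for a token, or None if the sentence is to be skipped."""
    if pos:
        return pos
    if args.on_problem == "skip":
        return None
    if args.on_problem == "existing":
        return args.fallback_pos
    return args.unknown_pos
```

A script would call `build_parser().parse_args()` once at start-up and pass the result to every function that has to recover from a problem.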

Don't use hard-coded strings in the code

Don't write code like this:

if (some_condition_holds) {
   $pos = '???';
}

Instead have a constant for the string at the beginning of the script:

$unknownPosValue = '???'; # ... some useful comment ...


if (some_condition_holds) {
   $pos = $unknownPosValue;
}

Ideally, allow users to specify the value through an option.

Document your code

This should be obvious.

Validate output format

Your script might not be fool-proof. Or new versions/extensions of the original treebank might contain new unexpected cases.
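What an output check looks like depends entirely on the target format. The Python sketch below assumes a hypothetical tab-separated output with four columns per token (ID, FORM, POS, HEAD), where HEAD 0 marks the root. Neither the column layout nor the function name `validate_sentence` comes from a particular standard.

```python
N_COLUMNS = 4  # assumed layout: ID, FORM, POS, HEAD


def validate_sentence(lines):
    """Return a list of problems for one sentence given as a list of output lines."""
    problems = []
    rows = [line.rstrip("\n").split("\t") for line in lines]
    for number, row in enumerate(rows, start=1):
        if len(row) != N_COLUMNS:
            problems.append(f"token {number}: expected {N_COLUMNS} columns, found {len(row)}")
            continue
        token_id, form, pos, head = row
        if token_id != str(number):
            problems.append(f"token {number}: unexpected ID {token_id!r}")
        if not form or not pos:
            problems.append(f"token {number}: empty FORM or POS")
        # HEAD must be 0 (root) or the ID of another token in the sentence.
        if not head.isdecimal() or not 0 <= int(head) <= len(rows) or int(head) == number:
            problems.append(f"token {number}: invalid HEAD {head!r}")
    return problems
```

Running such a check on everything the script writes catches many cases that the conversion logic did not anticipate, before other users find them.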

For XML input/output

Use existing XML libraries. They exist for recent versions of most programming/scripting languages, e.g. Perl or Python. Parsing and writing XML "by hand" is error-prone.
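For instance, Python ships with `xml.etree.ElementTree`, which handles entity decoding on input and escaping on output. The sketch below assumes a hypothetical format with `sentence` and `w` elements and a `pos` attribute. The element names, the attribute name and both function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one sentence element containing word elements.
SAMPLE = '<sentence id="s1"><w pos="N">R&amp;D</w><w pos="V">works</w></sentence>'


def read_tokens(xml_text):
    """Parse one sentence and return a list of (form, pos) pairs."""
    root = ET.fromstring(xml_text)
    # The library decodes entities such as &amp; while parsing.
    return [(w.text or "", w.get("pos")) for w in root.iter("w")]


def write_tokens(sentence_id, tokens):
    """Serialise (form, pos) pairs to XML; escaping is done by the library."""
    root = ET.Element("sentence", {"id": sentence_id})
    for form, pos in tokens:
        attributes = {"pos": pos} if pos is not None else {}
        word = ET.SubElement(root, "w", attributes)
        word.text = form
    return ET.tostring(root, encoding="unicode")
```

Here `read_tokens(SAMPLE)` yields the form `R&D` with the entity decoded, and `write_tokens` escapes it again on output, which is the kind of detail that hand-written string handling tends to get wrong.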

Special case of phrase structure to dependency conversion

Comments/additions welcome,

Sabine Buchholz

(1) Based on an idea by Montserrat Civit (p.c.), although with a slightly different interpretation.

RecommendationsForScripts (last edited 2008-02-02 09:55:26 by 79-126-29-236)