Recommendations on how to split a treebank into training and test sets for machine learning experiments
- respect original text boundaries: don't put some sentences from the same original text into the training set and others into the test set, as this would artificially inflate the measured performance (the test sentences would share vocabulary and topic with the training data); see the first sketch below
- if the treebank contains different genres, the right split depends on the purpose of the experiment:
  - to estimate general parsing performance:
    - take some sentences from each genre into the training set and some into the test set (see the second sketch below)
      - don't just follow the usual Penn Treebank practice of taking one section as the test set; see Bikel, "A Statistical Model for Parsing and Word-Sense Disambiguation": "... we created a small test set, blindly choosing the last 117 sentences, or 1%, of our 220k word corpus, sentences which were, as it happens, from section r of the Brown Corpus. After some disappointing parsing results using both the regular parser and our WordNet extended version, we peeked in (Francis and Kučera, 1979) and discovered this was the humor writing section; our initial test corpus was literally a joke."
  - to compare parsing performance across genres:
    - do x-fold cross-validation (where x is the number of genres), each time taking one genre as the test set (see the third sketch below)
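A minimal sketch of a text-boundary-respecting split, assuming each sentence arrives as a (text_id, sentence) pair; the function name, the data format, and the 10% default test size are illustrative assumptions, not part of any particular toolkit:

```python
# Minimal sketch: assign whole source texts to train or test, so no original text
# is divided between the two sets (data format and names are illustrative).
import random
from collections import defaultdict

def split_by_text(sentences, test_fraction=0.1, seed=42):
    """sentences: iterable of (text_id, sentence) pairs -> (train, test) sentence lists."""
    by_text = defaultdict(list)
    for text_id, sentence in sentences:
        by_text[text_id].append(sentence)

    text_ids = sorted(by_text)
    random.Random(seed).shuffle(text_ids)

    # Move whole texts into the test set until roughly test_fraction of all sentences is reached.
    total = sum(len(sents) for sents in by_text.values())
    test_ids, n_test = set(), 0
    for text_id in text_ids:
        if n_test >= test_fraction * total:
            break
        test_ids.add(text_id)
        n_test += len(by_text[text_id])

    train = [s for t in text_ids if t not in test_ids for s in by_text[t]]
    test = [s for t in text_ids if t in test_ids for s in by_text[t]]
    return train, test
```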
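A minimal sketch of a genre-stratified split, assuming the treebank is available as a nested dict {genre: {text_id: [sentences]}}; that format and the at-least-one-test-text-per-genre rule are assumptions made for illustration:

```python
# Minimal sketch: within each genre, whole texts are assigned to train or test,
# so every genre contributes to both sets (nested-dict format is an assumption).
import random

def stratified_split(texts_by_genre, test_fraction=0.1, seed=42):
    """texts_by_genre: {genre: {text_id: [sentences]}} -> (train, test) sentence lists."""
    rng = random.Random(seed)
    train, test = [], []
    for genre in sorted(texts_by_genre):
        texts = texts_by_genre[genre]
        text_ids = sorted(texts)
        rng.shuffle(text_ids)
        # Reserve at least one whole text per genre for the test set.
        n_test_texts = max(1, round(test_fraction * len(text_ids)))
        for i, text_id in enumerate(text_ids):
            (test if i < n_test_texts else train).extend(texts[text_id])
    return train, test
```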
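A minimal sketch of leave-one-genre-out cross-validation; sentences_by_genre, train_parser, and evaluate are hypothetical placeholders for the actual data grouping, parser training, and evaluation functions:

```python
# Minimal sketch: x-fold cross-validation over genres, each fold holding out one
# genre as the test set; `train_parser` and `evaluate` are placeholder callables.
def leave_one_genre_out(sentences_by_genre, train_parser, evaluate):
    """sentences_by_genre: {genre: [sentences]} -> {held-out genre: score}."""
    scores = {}
    for held_out in sorted(sentences_by_genre):
        train = [s for genre, sents in sentences_by_genre.items()
                 if genre != held_out for s in sents]
        parser = train_parser(train)                                # train on all other genres
        scores[held_out] = evaluate(parser, sentences_by_genre[held_out])  # test on held-out genre
    return scores
```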