this directory contains various summary views on the LOGON development and
test corpus. the files in here are typically produced using a combination of
one or more of the scripts in `$LOGONROOT/uio/data/bin/' (kindly provided by
lars nygaard) and some of the standard Un*x text utilities. please read on to
see what the individual files are, and how they were constructed. where a
recipe for creating individual files is provided, the resulting files have
been compiled by me; all others (i.e. everything requiring access to the test
data) were provided by lars. (6-nov-06; oe)

+ jh.no.forms, ps.no.forms, tg.no.forms, jhpstg.no.forms

  these four files are lists of tokens (word forms) from the three
  development corpora, ordered by frequency (plus one combined list,
  `jhpstg.no.forms'). capitalization is preserved from the original texts,
  and sentence-initial forms are flagged with an asterisk (`*'). some
  punctuation marks have been removed.

    cat $LOGONROOT/uio/data/jh{0,1,2,3,4,5}.txt > $LOGONROOT/uio/data/jh.txt
    $LOGONROOT/uio/data/bin/extr_vocab_fan.pl < $LOGONROOT/uio/data/jh.txt \
      | sed -e 's/^\*\*/\*/g' -e 's/^\*+/\*/g' | sort | uniq -c | sort -nr \
      > $LOGONROOT/uio/data/lists/jh.no.forms
    $LOGONROOT/uio/data/bin/extr_vocab_fan.pl < $LOGONROOT/uio/data/ps.txt \
      | sed -e 's/^\*\*/\*/g' -e 's/^\*+/\*/g' | sort | uniq -c | sort -nr \
      > $LOGONROOT/uio/data/lists/ps.no.forms
    $LOGONROOT/uio/data/bin/extr_vocab_fan.pl < $LOGONROOT/uio/data/tg.txt \
      | sed -e 's/^\*\*/\*/g' -e 's/^\*+/\*/g' | sort | uniq -c | sort -nr \
      > $LOGONROOT/uio/data/lists/tg.no.forms
    cat $LOGONROOT/uio/data/{jh,ps,tg}.txt \
      | $LOGONROOT/uio/data/bin/extr_vocab_fan.pl \
      | sed -e 's/^\*\*/\*/g' -e 's/^\*+/\*/g' | sort | uniq -c | sort -nr \
      > $LOGONROOT/uio/data/lists/jhpstg.no.forms

+ jhk.no.forms, jhk.en.forms, psk.no.forms, tgk.no.forms

  word lists (most likely compiled using the same `extr_vocab_fan.pl' script
  as for the development corpus) for the known-vocabulary test segments of
  JH, PS,
  and TG.

+ jhk.no.new, psk.no.new, tgk.no.new

  set differences of, for example, `jhk.no.forms' minus `jhpstg.no.forms'.
  in other words, word forms found exclusively in one of the
  known-vocabulary test segments, but not anywhere in the development corpus.

    perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
      $LOGONROOT/uio/data/lists/{jhpstg,jhk}.no.forms \
      | sort > $LOGONROOT/uio/data/lists/jhk.no.new
    perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
      $LOGONROOT/uio/data/lists/{jhpstg,psk}.no.forms \
      | sort > $LOGONROOT/uio/data/lists/psk.no.new
    perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
      $LOGONROOT/uio/data/lists/{jhpstg,tgk}.no.forms \
      | sort > $LOGONROOT/uio/data/lists/tgk.no.new

+ old/jhpstg.no.forms

  a legacy copy of the former (incomplete) `$LOGONROOT/uio/data/lists/a.form'.

+ old/psk.no.forms, old/psk.en.forms, old/tgk.no.forms, old/tgk.en.forms

  legacy copies of the former word lists for the known-vocabulary held-out
  parts of PS and TG (which used to be in `$LOGONROOT/uio/data/test-vocab/').
  these are now superseded by files of the same name in the parent directory,
  because the PS and TG parts of the test corpus had to be reduced in size,
  in order to make the proportions of items from the three distinct sources
  parallel to the distribution in the development data. with a total of 200
  JH items held out, the test segments of PS and TG had to be limited to 30
  and 95 items, respectively (see `maintainers' email around 2-nov-06).

+ old/jhpstg.no.new

  the set difference of the current (complete) word list for the development
  parts of JHPSTG, minus the earlier (incomplete) list of JHPSTG word forms.

    perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
      $LOGONROOT/uio/data/lists/{old/jhpstg,jhpstg}.no.forms \
      | sort -u > $LOGONROOT/uio/data/lists/old/jhpstg.no.new

  note that this list (with 1348 entries) over-estimates the size of missing
  vocabulary in the original list.
  the original list was compiled downcasing all forms and not putting the
  asterisk flag on sentence-initial forms, hence quite a few of the gaps
  reported in `old/jhpstg.no.new' are non-issues.

+ old/jhpstg.no.surprise

  another take on the same set difference, attempting to wash out the effect
  of capitalization and initial asterisks.

    gawk '{ sub(/^\*/, "", $2); printf("%s %s\n", $1, tolower($2)); }' \
      $LOGONROOT/uio/data/lists/jhpstg.no.forms > /tmp/jhpstg.forms.new
    gawk '{ printf("%s %s\n", $1, tolower($2)); }' \
      $LOGONROOT/uio/data/lists/old/jhpstg.no.forms > /tmp/jhpstg.forms.old
    perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
      /tmp/jhpstg.forms.old /tmp/jhpstg.forms.new \
      | sort -u > $LOGONROOT/uio/data/lists/old/jhpstg.no.surprise

  this time we end up with 461 forms, and the list looks plausible to me
  (oe). however, this set could in principle under-estimate the number of
  forms that were missing in the earlier JHPSTG word list: once everything is
  downcased, it could happen that the proper name `Ås' gets conflated with
  the common noun `ås'. assuming we wanted both, if the former were in the
  incomplete list but not the latter, the common noun would not be in
  `old/jhpstg.no.surprise'.

+ handon.no.forms, handon.en.forms

    cat $LOGONROOT/uio/data/*.no.txt $LOGONROOT/uio/data/tg+.txt \
      | egrep '^\[[0-9]+\]' | grep -v '
' | egrep -v '^[\t]*$' \
      | $LOGONROOT/uio/data/bin/extr_vocab_fan.pl \
      | sort | uniq -c | sort -nr > $LOGONROOT/uio/data/lists/handon.no.forms
    cat $LOGONROOT/uio/data/*.en.txt \
      | egrep '^\[[0-9]+\]' | grep -v '
' | egrep -v '^[\t]*$' \
      | $LOGONROOT/uio/data/bin/extr_vocab_fan.pl \
      | sort | uniq -c | sort -nr > $LOGONROOT/uio/data/lists/handon.en.forms

+ handon.no.new, handon.en.new

    perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
      $LOGONROOT/uio/data/lists/{jhpstg,handon}.no.forms \
      | sort -u > $LOGONROOT/uio/data/lists/handon.no.new
    perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
      $LOGONROOT/uio/data/lists/{jhpstg,handon}.en.forms \
      | sort -u > $LOGONROOT/uio/data/lists/handon.en.new
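
  a note on the `.new' lists above: they are all produced by
  `diff_vocab_lists.pl', whose internals are not documented here. as a rough
  sketch only, and assuming (as the recipes above suggest) that the script
  prints the forms of its second argument that are absent from its first,
  with both files in `uniq -c' format (a count, then one form), a stand-in
  using awk might look like the following. the function name `vocab_diff' is
  made up for illustration and is not part of the LOGON tree.

```shell
# hypothetical stand-in for diff_vocab_lists.pl (not the actual script):
# print each form listed in file $2 that does not occur in file $1.
# assumes uniq -c style input: a count, whitespace, then one form.
vocab_diff() {
  awk 'FILENAME == ARGV[1] { seen[$2] = 1; next }   # remember reference forms
       !($2 in seen)       { print $2 }             # report unseen forms
      ' "$1" "$2"
}
```

  for example, `vocab_diff jhpstg.no.forms handon.no.forms | sort -u' would
  approximate `handon.no.new' under these assumptions. the comparison is
  exact, so capitalization and the sentence-initial asterisk both count as
  differences, which is the over-estimation effect discussed for
  `old/jhpstg.no.new'.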