LOGON training corpus - v 0.7.3


CAVEATS

The corpus has not reached a stable version:

* The non-sentence annotation is in fairly poor shape pending
clarification of annotation guidelines (ref. posting on the LOGON
workspace). There may also be other errors (alignment, segmentation)
but they should be relatively few.

* Word-aligned versions are not yet provided.


FORMAT

The file format is:
- one sentence per line
- each line identified by [filenumber-sentencenumber]
- headings, list items and other "non-sentences" are marked with a
  vertical bar:

	[1-2 |] Preikestolen
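
As a rough illustration, a line in this format could be split into its
parts as sketched below. The sketch assumes that ordinary sentences look
like "[1-3] text" (the same identifier without the bar), which is not
spelled out above; the names LINE_RE, CorpusLine and parse_line are
hypothetical and not part of the package.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape: "[<file>-<sentence>] text" for sentences and
# "[<file>-<sentence> |] text" for non-sentences (headings, list items).
LINE_RE = re.compile(r"^\s*\[(\d+)-(\d+)\s*(\|)?\]\s*(.*)$")


class CorpusLine(NamedTuple):
    file_id: int
    sentence_id: int
    is_sentence: bool  # False when the identifier carries a vertical bar
    text: str


def parse_line(line: str) -> Optional[CorpusLine]:
    """Parse one corpus line; return None if it has no identifier."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    file_id, sentence_id, bar, text = match.groups()
    return CorpusLine(int(file_id), int(sentence_id), bar is None, text)
```

Returning None for unrecognised lines, rather than raising, lets a caller
count or log malformed lines while the annotation is still unstable.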

The directories called "a" contain the Norwegian texts.
The directories called "b" contain English translations from Michael Brady.
The directories called "c" contain English translations from Tim Challman.
The directories called "d" contain the original English translations.

Text ids are from 0 to 27:
0-5: Jotunheimen texts
6: the Preikestolen text
7-8: "gruppe" texts
9-27: "turglede" texts 
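
When filtering or reporting per collection, the id ranges above can be
kept in a small lookup table. This is only a sketch; TEXT_GROUPS and
text_group are hypothetical names, not something shipped in the package.

```python
# Id ranges as listed above; range() excludes its upper bound.
TEXT_GROUPS = [
    (range(0, 6), "Jotunheimen"),
    (range(6, 7), "Preikestolen"),
    (range(7, 9), "gruppe"),
    (range(9, 28), "turglede"),
]


def text_group(text_id: int) -> str:
    """Return the collection name for a text id between 0 and 27."""
    for ids, name in TEXT_GROUPS:
        if text_id in ids:
            return name
    raise ValueError(f"unknown text id: {text_id}")
```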

In addition to the text and alignment files, the package includes:
- transformations to HTML
- transformations to a [incr tsdb()]-friendly format (.fan files)
- tagged versions of the texts (.tagged files)
- vocabulary counts (in the 'lists' folder) in various versions:
	- word forms only (.form files)
	- lemma forms only (.lemma files)
	- lemma forms and POS tag (.lemma+pos files)
	- word forms and full tags (.form+tag files)
	- word forms, lemma forms and full tags (.form+lemma+tag files)
	- tokens, types, instances, and POS use (.summary files)
	- lemmas not in the grammar lexica (.gramdiff files)
	- lemma-POS tuples not in the grammar lexica (.gramdiffpos files)

Note that the .gramdiff files for NorGram contain some frequent words
that have special treatment in NorGram (the lexical statistics program
currently only understands the NKL-derived lexicon files). The
.gramdiffpos files for ERG are also slightly misleading, since finding
the POS of words in the ERG is a non-trivial task.

Note, also, that only jotun/b and jotun/c are translations made for LOGON.
The other translations have numerous alignments that are not 1 to 1,
which will presumably yield some fairly weird BLEU scores.


TOOLS

Also included are the current (still preliminary, but useful)
versions of some scripts that can be used to create additional
training or testing sets. Some documentation for these scripts
will be included shortly. One hint, though: The most useful script,
preprocess.pl, can be run like this (for Norwegian):

$ bin/preprocess.pl --id=1 --abbr=dat/abbr-no.dat --verbs=dat/verbs-no.dat

The program reads from standard input and writes to standard output.
"id" refers to the file id.


COMMENTS/COMPLAINTS

lars.nygaard@iln.uio.no or bugs.emmtee.net