The ‘raw’ version of the Brown corpus was constructed by starting with the tagged version of the corpus available from http://archive.org/details/BrownCorpus and applying various (automatic and manual) transformations. Paragraph breaks from the tagged corpus were maintained. The Bergen Format I version of the data was used to inform the transformation decisions.

First, the automatic changes:

* normalise spacing
* drop the tags
* treat sentences tagged as headlines as one-line paragraphs
* remove extra spaces around punctuation, where this can be done automatically
* map LaTeX-style quotes back to double straight quotes, where this can be done automatically
* remove the double punctuation that wasn't in the raw text

These changes were applied with the following command, run from the directory containing the tagged corpus files:

  for x in c*[0-9]; do
    base=`echo $x|perl -pe 's/(..)\d+/$1/;'`
    if [ ! -d ../cooked/$base ]; then mkdir ../cooked/$base; fi
    cat $x|../scripts/normalisetagged.pl > ../cooked/$base/$x
  done

Text matching the following patterns was then manually corrected according to the Bergen Format I version:

  /^['"] /
  / ['"]$/
  / ['"] /
  /"' /
  / '[",?!;]/
  /''/
  /``/

Other errors were corrected opportunistically if they came up while searching, but no other systematic corrections were made.

To create the unsegmented.txt file, used in the segmentation experiments (the join leaves two or more spaces at each paragraph boundary, which the final substitution turns back into a blank line):

  cat cooked/*/* | perl -pe 's/ +/ /g;' | perl -pe 's/\n/ /;' | \
    perl -pe 's/  +/\n\n/g' > unsegmented.txt

And the segmented.txt file, used for evaluation:

  cat cooked/*/* | grep -v "^$" > segmented.txt
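
The automatic step is carried out by ../scripts/normalisetagged.pl, which is not reproduced here. As a rough illustration only, a minimal Perl sketch of the kinds of substitutions listed above might look like the following; it assumes the word/TAG token format of the tagged corpus and leaves out the headline and double-punctuation handling, so it is not the real script:

  #!/usr/bin/perl
  # Illustrative sketch only -- not the real normalisetagged.pl.
  # Assumes one stream of word/TAG tokens per input line; blank lines
  # (paragraph breaks) pass through as blank lines.
  use strict;
  use warnings;

  while (my $line = <>) {
      chomp $line;
      $line =~ s{(\S+)/[^\s/]+}{$1}g;   # drop the tags (word/TAG -> word)
      $line =~ s/``|''/"/g;             # map LaTeX-style quotes to double straight quotes
      $line =~ s/\s+/ /g;               # normalise spacing
      $line =~ s/ ([.,:;?!])/$1/g;      # remove extra space before punctuation
      $line =~ s/^\s+|\s+$//g;          # trim leading/trailing whitespace
      print "$line\n";
  }

Run as, say, cat ca01 | perl sketch.pl (sketch.pl being a placeholder name for the file above), it yields detagged, lightly normalised text with one output line per input line.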
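
Lines matching the manual-correction patterns listed above can be located with a single grep over the cooked files; the command below is only an illustration and was not part of the original procedure:

  grep -nE "^['\"] | ['\"]\$| ['\"] |\"' | '[\",?!;]|''|\`\`" cooked/*/*

The backslashes before ", $ and ` are only there to get the patterns through the shell's double quotes; the expression grep sees is the seven patterns above, joined with |.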