The segmentation experiments reported in Read et al. (2012) used four
sections of the recreated WSJ text in the ‘cooked’ directory. (See the
README in that directory for details on how the data was produced.)

To create the unsegmented.txt file, used in the segmentation experiments:

    cat cooked/wsj0{3,4,5,6}.txt|perl -pe 's/^\[\d+\] \|//;'|\
    perl -pe 's/ +/ /g;'|perl -pe 's/\n/ /;'|\
    perl -pe 's/ +/\n\n/g' > unsegmented.txt

And to create the segmented.txt file, used for evaluation:

    cat cooked/wsj0{3,4,5,6}.txt|perl -pe 's/^\[\d+\] \|//;'|\
    grep -v "^$" > segmented.txt