tokenizer, putzer, htmlEnt2Char -- three tools for corpus processing
===================================================================

tokenizer    -- a tokenizer with end-of-sentence detection
                (see "tokenizer -h")
putzer       -- removes unnecessary blanks etc.
                ("putzer" is German and means "cleaner")
                (see "putzer -h")
htmlEnt2Char -- converts HTML entities into characters
                (see "htmlEnt2Char -h")

Compile and install (see INSTALL):

  ./configure
  make && make install

I tried to write a fast, rule-based, and also to some extent robust
tokenizer and sentence segmenter.

Currently supported languages are:
* German (see also file LIESMICH)
* English (thanks also to Michaela Geierhos)
* Russian

For each language, the corresponding ISO and MS-Windows codepages are
supported, and partly UTF-8.

Features:

1. customizable through options
   - language and codepage
   - try to undo hyphenation
   - semantics of line breaks (paragraph separator or not)
   - etc.

2. problems and strategies for tokenization
   - hyphenated words are considered as one token
   - option -c concatenates words with a hyphen at end-of-line.
     This may cause errors, although a small exception list is defined

3. end-of-sentence (EOS) detection:
   - positive:
     * end-of-sentence marker followed by a blank and an uppercase letter
   - negative:
     * abbreviations (except for, e.g., "etc.", which often occurs at EOS)
     * dates
   - positive:
     * a negative case followed by a word usually used exclusively at
       the beginning of a sentence (BOS)
     * capitalized determiners, conjunctions, etc.
   - try to handle additional punctuation symbols following the
     full stop correctly (brackets, apostrophes etc.)
   - tests on the Brown corpus indicate an error rate of about 3%

Version history:

0.1  -- package with tokenizer, putzer, htmlEnt2Char
0.2  -- bug reported by js: when the input contains long words or many
        consecutive newlines, tokenizer stops with "input buffer
        overflow". To avoid this, use putzer as a filter with the newly
        introduced option -m!
0.3  -- optimization (inlines & macros): now about 10% faster
0.4  -- corrected some details in German EOS detection;
        changed behaviour with option -sx: when a newline is recognized,
        a space is printed on a separate line instead of an empty line.
0.5  -- ':' is no longer considered an EOS mark.
        Additions to the German abbreviation list.
0.6  -- added more German abbreviations and Roman numerals with point.
        Added rudimentary support for UTF-8 in German.
0.7  -- better EOS for English, thanks to Michaela Geierhos;
        rudimentary support for UTF-8 in English
0.8  -- fixed a bug in the Russian part which made the tokenizer hang
0.9  -- changes to German abbreviations;
        rudimentary support for UTF-8 in Russian
0.10 -- fixed a bug raising a segfault for the German language option;
        short sequences in parentheses are excluded from containing an
        end-of-sentence;
        additions to German abbreviations
0.11 -- fixed a bug with options -C and -c.
        Introduced positive rules for German EOS: i.e. if a capitalized
        article, conjunction or preposition follows an abbreviation or
        date, there should be an EOS.
0.12 -- better documentation (in English).
        Positive rules also for English: the text
        «The firm said it plans to sublease its current headquarters
        at 55 Water St. A spokesman declined to elaborate.»
        (Wall Street Journal)
        is now correctly split into two sentences
1.0  -- (almost) no changes;
        GPL licensed now
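
Sketch of the EOS rules:

The following Python fragment is only a minimal sketch of how the
positive and negative end-of-sentence rules listed under "Features"
could interact. It is not the code of tokenizer itself. The function
name split_sentences and the three tiny word lists are hypothetical and
chosen for illustration only. Dates and trailing brackets or
apostrophes are not handled here.

```python
import re

# Hypothetical, tiny word lists for illustration only; they are NOT the
# lists shipped with the tokenizer.
ABBREVIATIONS = {"St.", "Dr.", "Mr.", "e.g.", "etc."}
OFTEN_AT_EOS = {"etc."}  # abbreviations that often end a sentence
SENTENCE_STARTERS = {"A", "An", "The", "But", "And", "He", "She", "It"}

# Positive rule: EOS marker followed by blank(s) and an uppercase letter.
CANDIDATE = re.compile(r"[.!?](?=\s+[A-Z])")
NEXT_WORD = re.compile(r"\s+([A-Za-z]+)")


def split_sentences(text):
    """Split text into sentences using three simple rules (assumed sketch)."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1]          # token ending in the marker
        next_word = NEXT_WORD.match(text, end).group(1)  # word after the marker

        is_boundary = True  # positive rule already matched
        if last_word in ABBREVIATIONS and last_word not in OFTEN_AT_EOS:
            # Negative rule: an abbreviation normally blocks the boundary,
            # unless a typical sentence-initial word follows (second
            # positive rule).
            is_boundary = next_word in SENTENCE_STARTERS

        if is_boundary:
            sentences.append(text[start:end].strip())
            start = end

    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    wsj = ("The firm said it plans to sublease its current headquarters "
           "at 55 Water St. A spokesman declined to elaborate.")
    for sentence in split_sentences(wsj):
        print(sentence)
```

With the Wall Street Journal example from the 0.12 entry, the sketch
prints two sentences, because "A" follows the abbreviation "St.". An
input such as "Dr. Smith arrived. He left." is split only after
"arrived.", since "Smith" is not in the list of sentence starters.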