;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*-

;;;
;;; another shot at a finite-state language for preprocessing, normalization,
;;; and tokenization in LKB grammars. requires LKB version of 1-feb-09 or
;;; newer. note that the syntax is rigid: everything starting in column 2
;;; (i.e. right after the rule type marker) is used as the match pattern until
;;; the first `\t' (tabulator sign); one or more tabulators are considered the
;;; separator between the matching pattern and the replacement, but other
;;; whitespace will be considered part of the patterns. empty lines or lines
;;; with a semicolon in column 1 (i.e. in place of the rule type marker, this
;;; is not Lisp) will be ignored.
;;;
;;; this is a fresh attempt (as of September 2008) at input tokenization. for
;;; increased compatibility with existing tools (specifically taggers trained
;;; on the PTB), we now assume a PTB-like tokenization in pre-processing. the
;;; grammar includes token mapping rules (using the new chart mapping machinery
;;; in PET) to eventually adjust (i.e. correct, in some cases) tokenization to
;;; its needs. specifically, many punctuation marks will be re-combined with
;;; preceding or following tokens, reflecting standard orthographic convention,
;;; and are then analyzed as pseudo-affixes.
;;;
;;; this file is inspired by the PTB `tokenizer.sed' script, and by and large
;;; should yield very similar results. with the addition of token mapping as
;;; a separate step inside the parser, we want to restrict RE-based processing
;;; to pure string-level phenomena. however, to actually tokenize (following
;;; some set of principles), we need to do more than just break at whitespace.
;;; some punctuation marks give rise to token boundaries, but not all. also,
;;; inputs (in the 21st century) may contain some amount of mark-up, where XML
;;; character references have become relatively common.
;;; full UniCode support
;;; now makes it possible to represent a much larger range of characters, e.g.
;;; various types of quotes and dashes. we aim to map mark-up to corresponding
;;; UniCode characters, and preserve those in parsing, as much as possible.
;;;
;;; the original `tokenizer.sed' script actually cannot always yield the exact
;;; tokenization found in the PTB. the script unconditionally separates a set
;;; of punctuation or other non-alphanumeric characters (e.g. |&| and |!|) that
;;; may be part of a single token (say in |AT&T| or URLs). we aim to do better
;;; than the original script, here, conditioning on adjacent whitespace.
;;;

;;
;; preprocessor rules versioning; auto-maintained upon CVS (or SVN) check-in.
;;
@$Date: 2009-04-28 14:07:20 +0200 (tir, 28 apr 2009) $

;;
;; tokenization pattern: after normalization, the string will be broken up at
;; each occurrence of this pattern; the pattern match itself is deleted.
;;
:[ \t]+

;;;
;;; string rewrite rules: all matches, over the entire string, are replaced by
;;; the right-hand side; grouping (using `(' and `)' in the pattern) and group
;;; references (`\1' for the first group, et al.) carry over part of the match.
;;;

;;
;; pad the full string with trailing and leading whitespace; makes matches for
;; word boundaries a little easier down the road; also, squash multiple spaces
;; and replace tabulators with a space.
;;
!^(.+)$	 \1 
!  +	 
!\t	 

;;
;; a set of `mark-up modules', often replacing mark-up character entities
;; with actual UniCode characters (e.g. |&mdash;| or |---|), or just ditching
;; mark-up that has no bearing on parsing for now (e.g. most wiki mark-up).
;; these modules can be activated selectively by name in the REPP environment
;; or the top-level call into REPP.
;;
>xml
>latex
>ascii
>wiki

;;
;; two special cases involving periods: map ASCII ellipsis (|...|) to a single
;; UniCode character (|…|), and convert |..| between numbers into an n-dash,
;; i.e.
;; a numeric range (typically tokenized off, i.e. |42| |–| |43|). maybe
;; the latter can also occur between non-numbers? we could also just preserve
;; it, but always make it a token in its own right?
;;
;; _fix_me_
;; what about a sentence-final period following the ellipsis (as in cb/7060)?
;; (24-sep-08; oe)
;;
!([^.])\.\.\.+([^.])	\1 … \2
!\[\.\.\.\]	…
!([0-9]) *\.\. *([0-9])	\1 – \2

;;
;; some UniCode characters force token boundaries: m-dash, ellipsis. we used
;; to include n-dashes in this list, but some authors (e.g. ESR) use n-dashes
;; much like hyphens, attempting to be clever about bracketing in expressions
;; like |non–source-aware| [users].
;;
!([—…])	 \1 

;;
;; deviating from the PTB conventions, we use one-character double quote marks
;; (i.e. |“| and |"| instead of |``| and |''|); much like the PTB, however, we
;; aim to disambiguate neutral quotes (|"| and |''|) at the string level, i.e.
;; opening quotes are preceded by a token boundary (white space), with a small
;; number of additional, token-initial characters that can intervene. anything
;; else, we assume, is a closing quote. rather than the proper UniCode closing
;; quote (|”|), however, use a straight double quote (|"|), which can double as
;; a unit of measure (feet). do the same for single quotes, using apostrophes
;; (|'|) rather than proper closing quotes (|’|), to allow ambiguity with the
;; possessive marker, specifically when following |s|. to not create spurious
;; ambiguity, preserve UniCode closing quotes, if used in the input.
;;
;; convert quotes to single characters prior to tokenizing off other characters
;; (group #1 below) to make adjacent whitespace detection easier, as e.g. in
;; |``$20!''|.
;;
;; _fix_me_
;; in principle, i just discovered, there are separate prime and double prime
;; UniCode characters, intended for the units of measure. i doubt we see them
;; in any of the existing data sets, but in carefully edited documents, they
;; may show up eventually.
;; assuming these are never used as quotes, we should
;; probably preserve them here. but as for the distinction between straight
;; and closing quotes, i now suspect we might see the closing quotes as a unit
;; of measure too. hence, consider ditching straight quotes altogether.
;; (23-jan-09; oe)
;;
!``	“
!(^| [[({]*)("|'')	\1“
!''	”
!`	‘
!(^| [[({]*)'	\1‘

;;
;; normalize stylistic variance in (directional) quote marks. once these rules
;; are complete, we are down to only six quote marks: |“|, |”|, |"|, |‘|, |’|,
;; and |'|. of these, the straight ones (the traditional ASCII characters) are
;; ambiguous between being a closing quote and something else.
;;
![„«]	“
![»]	”
![‚‹]	‘
![›]	’

;;
;; remove space after initial |O'| and |L'|, i.e. irish and romance names, to
;; avoid stripping off their apostrophes.
;;
! ([OlL])['’] 	 \1'

;;
;; a new REPP facility: named groups and iterative group calls. there are a
;; number of characters that PTB tokenizes off (unconditionally, it seems, in
;; the original `tokenizer.sed'), though not when they are parts of names or
;; NE patterns, e.g. |AT&T| or |http://www.emmtee.net/?foo.php&bar=42|. thus,
;; we only want these as separate tokens when they are preceded or followed by
;; whitespace; this leaves a problem with, say, |http://www.emmtee.net/|, where
;; one would have to apply NE recognition (what used to be `ersatzing') _prior_
;; to tokenization.
;;
;; either way, because characters we want to tokenize off might be `clustered'
;; with each other, e.g. |(42%), |, the notion of adjacent whitespace needs to
;; apply transitively through such clusters. it seems an iterative group is
;; the most straightforward way of getting that effect. the rules from the
;; group will be applied repeatedly (in order) at the time the group is called
;; (by means of the `>' operator), until there are no further matches. we need
;; to be careful to avoid indefinite recursion within the group, i.e. not add
;; duplicate spaces.
;; thus, ditch multiple spaces initially.
;;
;; at this point, we exclude a few punctuation characters from this policy, in
;; part because that is the PTB approach (|-| and |/|), in part because they
;; can be prefixes or suffixes of one-token named entities, i.e. |<| and |>| in
;; URLs and email addresses. to work around these, we may need a string-level
;; `ersatzing' facility, associating a sub-string (that can be unambiguously
;; identified by surface properties, e.g. a URL) with an identifier of a token
;; class.
;;
;; like in the original PTB script, periods are only tokenized off in sentence-
;; final position, maybe followed only by closing quote marks or parentheses.
;;
;; _fix_me_
;; there is an issue with some of the characters that are asserted (at least in
;; whitespace adjacency) to constitute separate tokens, specifically the dollar
;; sign. inputs like |HK$ 7.8| will end up with a bogus token boundary (which
;; is the case too in the original PTB sed(1) script). (15-jul-09; oe)
;;
!  +	 
#1
!([^ ])([][(){}?!,;:@#$€¢£¥%&“”"‘’']) ([^ ]|$)	\1 \2 \3
!([^ ])\. ([])}”"’' ]*)$	\1 . \2
!(^|[^ ]) ([][(){}?!,;:@#$€¢£¥%&“”"‘’'])([^ ])	\1 \2 \3
#
>1

;;
;; any word-final apostrophe, by now, should be separated (e.g. |abrams'| -->
;; |abrams '|). which only leaves contracted forms, including the undesirable
;; PTB ones, e.g. |don't| --> |do n't|. but not |cannot| --> |can not| and the
;; more obscure ones: |gimme|, |lemme|, |'tis|, |wanna|, et al.
;;
;; _fix_me_
;; the |cannot| case, especially without characterization information, is a bit
;; challenging: it presumably is frequent enough so that for PTB compliance we
;; should pull it apart, but that would seem to introduce unwanted ambiguity.
;; i doubt that |she cannot participate on monday| has the reading of her being
;; able to `not participate' (stay out of the way) on monday. or does it?
;; (19-sep-08; oe)
!([^ ])['’]([dDmMsS]) 	\1 '\2 
!([^ ])['’](ll|LL|re|RE|ve|VE) 	\1 '\2 
!([^ ])(n['’]t|N['’]T) 	\1 \2 

;;
;; to allow parsing (of inputs involving basic punctuation) in the LKB, there
;; is a REPP module to undo PTB-style separation of tokens. this module will
;; only be activated for use within the LKB, not by preprocess-for-pet().
;;
>erg