# Copyright (c) 2002 by Palo Alto Research Center Incorporated. All rights reserved.

# *** Generic Upper casing, to be run after sourcing character definitions.
# This is sourced by the basic tok script.

# Environment for Upcasing initial letters: after
# certain punctuation marks and after a space
# only at beginning or preceded by certain punctuation marks.
# We don't have to worry about syntactic category labels here,
# since they are in different environments.

define Env [ Semicolon | Colon ] Space |
           [ LRB | LCB | LSQB | LDQ | LSQ ] (Space) |
           Hyphen |
           [Space]:[CAPSPACE];

# CAPSPACE above is a multicharacter symbol that acts as a parameter
# in the machine that this script produces. If a higher level script executes
#    substitute defined Space for CAPSPACE
# or if this is composed with a transducer that includes CAPSPACE:[Space]
# then the initial character of every word could be capitalized. This might be
# good for Eureka, where capitalization is messed up. If there is no
# substitution or we execute
#    substitute symbol NOTHING for CAPSPACE
# then the initial character only of words at the beginning or after defined
# punctuation marks can be capitalized. This means that Bush at the beginning
# of a sentence will be recognized as "^ bush" or "Bush" but only as "Bush"
# later on in the sentence. So we won't even guess the lower-case later
# in a sentence, and the guesser presumably will only produce the proper noun
# reading. This might be good for the WSJ or more standard texts.

# Capitalize the first letter if it is preceded by the initial
# cap mark InitialCapMark. Running backwards, this will optionally
# decapitalize, and if it does make the change, it will produce
# the AllCapMark.
# The \Alpha* at the beginning allows for funny parenthetic stuff
# a la Eureka.

define ICap InitialCapMark .x. 0 ;
define ACap AllCapMark .x. 0 ;

define CapInitial ~$CAPSPACE
                  .o.
                  \Alpha* (ICap Upcase) [$( Env (ICap Upcase) )]* ;

# CapAll deals with lower-case letters after the first.
# It provides for a literal output (the Alpha+) if
# the input is not marked as all-caps (AllCapMark). If the input
# has that mark, then it must contain at least one
# lower-case letter, and the output will be all-caps.
# The hyphen ensures that the string ends in
# a \Alpha, for the last iteration. The final one-plus
# ensures that the pattern stretches across each entire word.

define CapAll [?* 0:%- ]
              .o.
              [\Alpha* ( ACap [Alpha $LC .o. [Upcase | UC]^>1 ] | Alpha+ ) \Alpha+]*
              .o.
              [?* %-:0] ;

# MinimalCapAll allows a much more restricted set of all-upper-casing. It allows
# alphabetic strings to go through unchanged, it all-upper-cases words that are all
# lower-case to begin with, and it allows an all-lower-case tail of an initial-cap word
# to be upper-cased. Single upper-case words do not receive the allcaps mark.
# This variant is for efficiency: it produces many fewer results.
# A more liberal strategy would be to allow the CapAll for short words (e.g. fewer than
# 5 characters) and the MinimalCapAll for longer words.

define MinimalCapAll [?* 0:%- ]
                     .o.
                     [\Alpha* ( ACap [UC Upcase+ | Upcase^>1] | Alpha+ ) \Alpha+]*
                     .o.
                     [?* %-:0] ;

define Capitalize CapInitial
                  .o.
                  MinimalCapAll
                  .o.
                  ~$CapMarks ;
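
# Usage sketch (an assumption, not part of the original script): a higher-level
# script could choose one of the two CAPSPACE settings described in the
# CAPSPACE comment after sourcing this file. The commands are left commented
# out so that sourcing this file does not touch the stack. `push Capitalize`
# is assumed here as the way to put the network defined in this file on the
# stack; the two substitute commands are the ones quoted in that comment.
# Each variant leaves its result on top of the stack.
#
# Liberal setting (e.g. Eureka): the first letter of any word may be capitalized.
#    push Capitalize
#    substitute defined Space for CAPSPACE
#
# Strict setting (e.g. WSJ): only words at the beginning or after the
# punctuation listed in Env may be capitalized.
#    push Capitalize
#    substitute symbol NOTHING for CAPSPACE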