;;; -*- Mode: tdl; Coding: utf-8; indent-tabs-mode: nil; -*-
;;;
;;; Copyright (c) 2009 -- 2010 Stephan Oepen (oe@ifi.uio.no);
;;; copyright (c) 2009 -- 2010 Dan Flickinger (danf@stanford.edu);
;;; see `LICENSE' for conditions.
;;;
;;;
;;; upon completion of `lexical parsing' (i.e. application of lexical rules
;;; until a fix-point is reached), we can now filter lexical entries.  there
;;; is little point in attempting to do that earlier (as PET used to in its
;;; original `-default-les' mode, where generics were only activated where
;;; there seemed to be `gaps' in the _initial_ lexical chart, i.e. after
;;; lexical lookup).
;;;
;;; the main problem with that approach is the interaction with
;;; orthographemics: in the initial lexical chart, there will be an edge
;;; analysing |UPS| as the plural or 3sg present tense form of the preposition
;;; |up|.  it is only once lexical rules have been processed that we know such
;;; hypotheses have turned out invalid.  thus, the lexical filtering rules
;;; below operate on lexical edges, i.e. lexical entries that have gone
;;; through any number of lexical rules: everything that would ordinarily feed
;;; into syntactic rules.
;;;
;;; initially, our strategy is conservative: whenever there is a native entry,
;;; purge all generic entries in the same chart cell, unless there is a good
;;; reason to keep some.  for now, only capitalization is considered such a
;;; reason, and even there (i.e. for generic names), certain types of native
;;; entries will still filter them out (see the rules below).
;;;
;;; both on tokens and signs, the `native' vs. `generic' distinction is made
;;; in ONSET values: `con_or_voc' vs. `unk_onset'.
;;;

;;
;; throw out a generic whenever a native entry is available, unless the token
;; is a named entity (which now includes names activated because of mixed case
;; or non-sentence-initial capitalization).
;;
generic_non_ne+native_lfr := lexical_filtering_rule &
[ +CONTEXT < [ SYNSEM.PHON.ONSET con_or_voc ] >,
  +INPUT < [ SYNSEM.PHON.ONSET unk_onset, ORTH.CLASS non_ne ] >,
  +OUTPUT < >,
  +POSITION "I1@C1" ].

;;
;; a native name, however, should suppress generic names, even NE ones.  this
;; is restricted to singular native names, since otherwise we get unwanted
;; blocking for acronyms like |EDS|, given the native name |Ed|.
;; DPF 04-sept-09 - But we do want blocking for inherently plural proper names
;; like |Giants|.  So on balance, it seems better to try manually listing
;; the |EDS| instances, and make the blocking more aggressive.
;;
proper_ne+name_lfr := lexical_filtering_rule &
[ +CONTEXT < [ SYNSEM [ PHON.ONSET con_or_voc,
                        LOCAL.CAT.HEAD noun,
                        LKEYS.KEYREL.PRED abstr_named_rel ] ] >,
  +INPUT < [ SYNSEM [ PHON.ONSET unk_onset,
                      LKEYS.KEYREL.PRED named_rel ] ] >,
  +OUTPUT < >,
  +POSITION "I1@C1" ].

;;
;; mass nouns (both native and generic) also suppress generic names, even
;; NE ones.  this reflects what dan calls the `tyranny of mass nouns', i.e.
;; the assumption that there are no syntactic contexts where a proper name
;; would be needed for coverage (thus glossing over differences in the
;; associated semantics, for improved parsing efficiency).
;;
mass_noun+name_lfr := lexical_filtering_rule &
[ +CONTEXT < [ SYNSEM [ LOCAL [ CAT.HEAD noun,
                                AGR [ PNG.PN 3s,
                                      IND - ] ] ] ] >,
  +INPUT < [ SYNSEM [ PHON.ONSET unk_onset,
                      LKEYS.KEYREL.PRED named_rel ] ] >,
  +OUTPUT < >,
  +POSITION "I1@C1" ].
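
;;
;; purely for illustration (this is not an active rule of the grammar): all
;; rules in this file share the same anatomy.  +CONTEXT lists edges that must
;; be present in the chart (and are left in place), +INPUT lists edges that
;; are matched and, given the empty +OUTPUT list, deleted, and +POSITION
;; "I1@C1" requires the first input edge to occupy the same chart cell as the
;; first context edge.  the block comment below is a minimal, hedged sketch of
;; that shape; the rule name is made up, and the sketch is deliberately kept
;; commented out.
;;
#|
minimal_example_lfr := lexical_filtering_rule &
[ +CONTEXT < [ SYNSEM.PHON.ONSET con_or_voc ] >,
  +INPUT < [ SYNSEM.PHON.ONSET unk_onset ] >,
  +OUTPUT < >,
  +POSITION "I1@C1" ].
|#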
;;
;; avoid analyzing currency symbols (like |US$|), which appear capitalized, as
;; generic names.
;;
currency+name_lfr := lexical_filtering_rule &
[ +CONTEXT < [ SYNSEM [ PHON.ONSET con_or_voc,
                        LOCAL.CAT.HEAD noun & [ MINORS.MIN mnp_symb_rel ] ] ] >,
  +INPUT < [ SYNSEM [ PHON.ONSET unk_onset,
                      LKEYS.KEYREL.PRED named_rel ] ] >,
  +OUTPUT < >,
  +POSITION "I1@C1" ].

;;
;; discard generic names (even NE ones) for |I|, a pronoun that is standardly
;; capitalized.
;;
proper_ne+pronoun_lfr := lexical_filtering_rule &
[ +CONTEXT < [ SYNSEM [ PHON.ONSET con_or_voc,
                        LOCAL [ CAT.HEAD.CASE nom,
                                AGR.PNG.PN 1s ],
                        LKEYS.KEYREL.PRED pron_rel ] ] >,
  +INPUT < [ SYNSEM [ PHON.ONSET unk_onset,
                      LKEYS.KEYREL.PRED named_rel ] ] >,
  +OUTPUT < >,
  +POSITION "I1@C1" ].

;;
;; a named entity corresponding to a name kills a PoS-activated generic name,
;; unless the latter is a named entity itself.
;;
generic_name+ne_name_lfr := lexical_filtering_rule &
[ +CONTEXT < [ SYNSEM.PHON.ONSET unk_onset, ORTH.CLASS named_entity ] >,
  +INPUT < [ SYNSEM [ PHON.ONSET unk_onset,
                      LKEYS.KEYREL.PRED named_rel ],
             ORTH.CLASS non_ne ] >,
  +OUTPUT < >,
  +POSITION "I1@C1" ].

;;
;; generic entries followed by punctuation will typically admit two readings,
;; one of them including the punctuation marks as part of the generic, as e.g.
;; in (sentence-final) |oe@yy.com.|  these are rarely (if ever) desirable, so
;; delete edges whose tokens bear final punctuation if they have not undergone
;; punctuation affixation rule(s); and likewise for prefixed punctuation.
;;
;; DPF 09-nov-09 - These rules make reference to the FORM attribute, which
;; is in ORTH, propagated from the generic lexeme's TOKENS...+FORM attribute,
;; which is no longer visible at this stage, after lexical rules have applied.
;; As Woodley P. points out, if we wanted to cope with multi-token generics,
;; we should rather propagate into ORTH the +FORM of the first and the last
;; tokens, so left and right punctuation, respectively, would be checked on
;; the intended token of the MWE.  This revision should be straightforward if
;; we ever implement generic MWEs: simply introduce two new features in
;; `orthog' instead of the one feature `FORM', establish the relevant links in
;; `basic_word', then use them in revised versions of these two rules.
;;
generic_right_punct_lfr := lexical_filtering_rule &
[ +INPUT < [ ORTH.FORM ^.+[])}”",;.!?-]$,
             SYNSEM [ PHON.ONSET unk_onset,
                      PUNCT.RPUNCT no_punct ] ] >,
  +OUTPUT < > ].

generic_left_punct_lfr := lexical_filtering_rule &
[ +INPUT < [ ORTH.FORM ^[[({“‘].+$,
             SYNSEM [ PHON.ONSET unk_onset,
                      PUNCT.LPUNCT no_punct ] ] >,
  +OUTPUT < > ].
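
;;
;; sketch only, nothing below is implemented: one conceivable shape for the
;; MWE-aware revision outlined in the note above, assuming `orthog' carried
;; two features in place of the single FORM (the names FORM-FIRST and
;; FORM-LAST are hypothetical, as are the revised rule names), with
;; `basic_word' linking them to the +FORM of the first and the last token,
;; respectively.  kept in a block comment merely to record the idea.
;;
#|
orthog :+ [ FORM-FIRST string,
            FORM-LAST string ].

generic_right_punct_mwe_lfr := lexical_filtering_rule &
[ +INPUT < [ ORTH.FORM-LAST ^.+[])}”",;.!?-]$,
             SYNSEM [ PHON.ONSET unk_onset,
                      PUNCT.RPUNCT no_punct ] ] >,
  +OUTPUT < > ].

generic_left_punct_mwe_lfr := lexical_filtering_rule &
[ +INPUT < [ ORTH.FORM-FIRST ^[[({“‘].+$,
             SYNSEM [ PHON.ONSET unk_onset,
                      PUNCT.LPUNCT no_punct ] ] >,
  +OUTPUT < > ].
|#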