## macros to be used in RegExps rules
ALPHA      [^\]<>[(\.,";:?!¿¡«»'`)^@~|}{_/\\+=&$#*+%\s\d\-]
ALPHANUM   [^\]<>[(\.,";:?!¿¡«»'`)^@~|}{_/\\+=&$#*+%\s\-]
NOALPHANUM [\]<>[(\.,";:?!¿¡«»'`)^@~|}{_/\\+=&$#*+%\s\-]
PARTNUM    [^\]<>[(";:?!¿¡«»'`)^@~|}{\s]
OTHERS     [\]<>[(\.,";:?!¿¡«»'`)^@~|}{_/\\+=&$#*+%\-]

## Tokenization rules. They are applied in the order of definition.
## The first matching the *beginning* of the line is applied
## and a token built. The process is repeated until the line
## has been completely processed.
##    -The first field in the rule is the rule name. If it starts
##     with a "*", the RegExp will only produce a token if the
##     match is found in abbreviation list below.
##    -The second field in the rule is the substring to form the token/s with
##     It may be 0 (the match of the whole expression) or any number
##     from 1 to the number of substrings (up to 9). A token will be
##     created for each substring from 1 to the specified value.
##    -The third field is the regexp to match against the line
##
INDEX_SEQUENCE   0  (\.{4,}|-{2,}|\*{2,}|_{2,}|/{2,})
INITIALS1        1  ([A-Z](\.[A-Z])+)(\.\.\.)
INITIALS2        0  ([A-Z]\.)+
TIMES            0  (([01]?[0-9]|2[0-4]):[0-5][0-9])
NAMES_CODES      0  ({PARTNUM}*[0-9]{PARTNUM}*{ALPHANUM})
THREE_DOTS       0  (\.\.\.)
QUOTES           0  (``|<<|>>|'')
MAILS            0  {ALPHANUM}+([\._]{ALPHANUM}+)*@{ALPHANUM}+([\._]{ALPHANUM}+)*
URLS             0  ((mailto:|(news|http|https|ftp|ftps)://)\S+|^(www(\.\S+)+))
KEEP_COMPOUNDS   0  {ALPHA}+(['_\-]{ALPHA}+)+
*ABREVIATIONS1   0  (({ALPHA}+\.)+)(?!\.\.)
*ABREVIATIONS2   0  ({ALPHA}+\.)(?!\.\.)
WORD             0  {ALPHANUM}+
OTHERS_C         0  {OTHERS}

## Abbreviations. The dot is not tokenized separately
## in the cases listed below.
a.c.
aa.rr.
abrev.
adj.
adm.
admón.
afma.
afmas.
afmo.
afmos.
ag.
am.
ap.
apdo.
art.
arts.
arz.
arzbpo.
assn.
atte.
av.
avda.
bros.
bv.
cap.
caps.
cg.
cgo.
cia.
cit.
cl.
cm.
co.
col.
corp.
cos.
cta.
cte.
ctra.
cts.
cía.
d.c.
dcha.
dept.
depto.
dg.
dl.
dm.
doc.
docs.
dpt.
dpto.
dr.
dra.
dras.
dres.
dto.
dupdo.
ed.
ee.uu.
ej.
emma.
emmas.
emmo.
emmos.
entlo.
entpo.
esp.
etc.
ex.
excm.
excma.
excmas.
excmo.
excmos.
fasc.
fdo.
fig.
figs.
fol.
fra.
gb.
gral.
ha.
hnos.
hros.
hz.
ib.
ibid.
ibíd.
id.
ilm.
ilma.
ilmas.
ilmo.
ilmos.
iltre.
inc.
intr.
izq.
izqda.
izqdo.
jr.
kc.
kcal.
kg.
khz.
kl.
km.
kw.
lda.
ldo.
lib.
lim.
loc.
ltd.
ltda.
lám.
ma.
mg.
mhz.
min.
mons.
mr.
mrs.
ms.
mss.
mtro.
máx.
mín.
ntra.
ntro.
núm.
ob.
obpo.
op.
pd.
ph.
pje.
pl.
plc.
pm.
pp.
ppal.
pral.
prof.
prov.
pról.
ps.
pta.
ptas.
pte.
pts.
pza.
pág.
págs.
párr.
rda.
rdo.
ref.
reg.
rel.
rev.
revda.
revdo.
rma.
rmo.
rte.
s.
sdad.
sec.
secret.
seg.
sg.
sig.
smo.
sr.
sra.
sras.
sres.
srs.
srta.
ss.mm.
sta.
sto.
sust.
tech.
tel.
telf.
teléf.
ten.
tfono.
tlf.
tít.
ud.
uds.
vda.
vdo.
vid.
vol.
vols.
vra.
vro.
vta.
íd.
ít.
mm.
mms.
ms.
pulg.
yda.
mi.
Ha.
ac.
ml.
dl.
hl.
ac-pie.
oz.
qt.
gal.
pk.
bu.
cr.
crt.
tz.
pt.
mpa.
pa.
psi.
lb.
mmhg.
cmhg.
mhg.
mol.
mg.
gr.
grs.
kg.
kgs.
mgr.
oz.
lb.
ton.
tm.
milgal.
lt.
lps.
gps.
gpm.
gph.
gpd.
mgd.
gal.
gpcd.
mph.
lbf.
yb.
zb.
eb.
pb.
tb.
gb.
mb.
kb.
wb.
cd.
rad.
sr.
hz.
lm.
lx.
nq.
gy.
sv.
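
## Worked example (illustrative only, not part of the original rule set).
## It traces the rules by hand, assuming the semantics described in the
## "Tokenization rules" comments: rules are tried in definition order at
## the current position, the first one that matches builds a token, and
## a "*" rule only builds a token when its match is in the abbreviation
## list. It is also an assumption that whitespace between tokens is
## skipped and that matching is case-sensitive.
##
##   Input line:   El sr. Pérez llegó a las 10:30...
##
##   El       WORD             (no earlier rule matches at this position)
##   sr.      *ABREVIATIONS1   ("sr." is in the abbreviation list)
##   Pérez    WORD
##   llegó    WORD
##   a        WORD
##   las      WORD
##   10:30    TIMES            (defined before NAMES_CODES, so it wins)
##   ...      THREE_DOTS       (INDEX_SEQUENCE needs four or more dots)
##
##   Input line:   casa.
##
##   casa     WORD             (*ABREVIATIONS1 and *ABREVIATIONS2 match
##                              "casa." but it is not in the list, so
##                              they are expected to build no token)
##   .        OTHERS_C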