Release notes for MMT grammars, August 15, 2007

1. Background

The MMT system (Matrix Machine Translation or Massively Multilingual
Translation) is an experiment in adapting the LOGON MT architecture and
Grammar Matrix-derived grammars to create an NxN machine translation
system (where the current value of N is 10). The goals for the system
are to have all languages equally available as source or target
languages, and all language pairs equally functional. In addition, it
should be possible to add the N+1st language (as source and target)
without directly considering every new language pair that this adds.

While the grammars have relatively interesting coverage over a range of
phenomena (coordination, negation, polar questions, marking of
definiteness and demonstratives, clause-embedding verbs), the system
targets a tiny toy domain (sentences about dogs and cats chasing cars
and sleeping), and side-steps many important issues in transfer by
positing a pseudo-interlingua.

Even with this oversimplification, we still need to posit transfer
rules to handle residual mismatches between the MRSs for the different
languages. The two main sources of mismatch are the treatment of
pro-drop (as these grammars do not posit _pronoun_n_rels for dropped
pronouns) and complex predicates (with "hurt" in Italian and Farsi
being expressed as "make harm", and "chase" in Farsi as "make
pursuit").

Rather than writing N^2 transfer grammars, we create one transfer
grammar per target language. Each transfer grammar instantiates
transfer rules that accommodate the expectations of the target
language's (monolingual) grammar. For further generalization, the
transfer rule types are taken from the Transfer MatriX (mtr.tdl,
mrs.tdl). Specific transfer rules (e.g., pronoun insertion) are also
defined as types, in a single shared file (acm.tdl). Particular
transfer grammars then instantiate only the transfer rules in acm.tdl
that are required.
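
The difference between one transfer grammar per language pair and one
per target language can be made concrete with a small sketch. It is
illustrative only and not part of the MMT distribution: the function
names are hypothetical, and the "<code>-acm" naming pattern is an
assumption generalized from the single name eng-acm given in these
notes. Identity pairs (source equal to target) are counted so that the
total matches the N^2 figure used above; whether such pairs are
actually run is not stated here.

```python
from itertools import product

# Three-letter ISO codes of the ten MMT languages, as listed in these notes.
LANGUAGES = ["hye", "eng", "epo", "fas", "fin",
             "hau", "heb", "isl", "ita", "zul"]


def transfer_grammar_for(target):
    """Name of the accommodation transfer grammar for a target language.

    Assumption: every transfer grammar follows the "eng-acm" pattern.
    """
    return f"{target}-acm"


def language_pairs(languages):
    """All ordered (source, target) pairs, identity pairs included (N^2)."""
    return list(product(languages, repeat=2))


def grammars_needed(languages):
    """Distinct transfer grammars used across all language pairs."""
    return sorted({transfer_grammar_for(tgt)
                   for _src, tgt in language_pairs(languages)})


pairs = language_pairs(LANGUAGES)
grammars = grammars_needed(LANGUAGES)
print(len(pairs), "ordered language pairs")
print(len(grammars), "transfer grammars")
```

With ten languages this reports 100 ordered language pairs served by
10 transfer grammars. Adding an eleventh language would add 21 pairs
but, under this design, only one new transfer grammar.
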
At this point, it remains an open question whether this strategy can be
scaled, or whether N^2 language pairs require N^2 transfer grammars.

2. MMT set up

The file setup.lisp defines the variables *mmt-languages* and
*mmt-transfer-grammars*. The former lists the languages handled, and
the latter associates each target language with its accommodation
transfer grammar. Each language is identified by its three-letter ISO
code.

A single language pair can be invoked with the following command
(issued in the $LOGONROOT directory):

  ./batch --binary --from src --to tgt --ascii ./uw/mmt/test_sentences/src2tgt.txt

where `src' and `tgt' are replaced with the three-letter codes for the
source and target languages respectively (each occurs twice in the
command).

The script all_lg_test will do a test run of all of the language pairs.
It invokes format_results.py to output a PDF file with a table
summarizing coverage over the 17 test sentences.

3. Provenance of grammars

With the exception of the English grammar, the MMT monolingual grammars
all began as course projects for Linguistics 471/567 at the University
of Washington. In this class (renumbered to 567 in 2005), each student
develops a grammar for a different language over the 10-week quarter,
according to lab instructions highlighting different phenomena each
week. The students begin with the Grammar Matrix (and since 2006, with
a starter grammar configured from the Grammar Matrix customization
system) and build out from there. In many cases, students work with
languages that were previously unfamiliar to them, using reference
grammars and in some cases native speaker consultants to assess the
facts of the language as they attempt to model it.

All of the grammars use ASCII transliteration, which may or may not
correspond to any standard transliteration. In addition, many of the
grammars assume (but do not include) a morphophonological analyzer, and
so parse and generate strings of regularized forms.

For the MMT system, nine of these grammars were selected and then
updated for consistency with the current version of the Matrix and
current conventions for the MRS representations. In some cases, the
grammar coverage needed to be extended in order to handle our toy
domain. In general, the grammars from earlier years required more
modifications than the more recent grammars. These updates were done by
Scott Drellishak, Margalit Zabludowski, and Emily M. Bender.

The English grammar was created specifically for the MMT system,
beginning with a starter grammar from the Grammar Matrix customization
system.

Language   Code  Orig. Author    Orig. Dev Date  Modifications
---------  ----  --------------  --------------  -------------------------------
Armenian   hye   S. Drellishak   2004            Drellishak, Bender
English    eng   S. Drellishak   2007            Drellishak, Bender
Esperanto  epo   J. Pool         2005            Drellishak, Bender
Farsi      fas   W. McNeill      2004            Drellishak, Bender
Finnish    fin   R. Mattson      2005            Drellishak, Bender
Hausa      hau   K. Hutchins     2007            Bender, Drellishak
Hebrew     heb   M. Zabludowski  2006            Zabludowski, Bender, Drellishak
Icelandic  isl   K. Sickles      2007            Bender, Drellishak
Italian    ita   J. Johanson     2006            Zabludowski, Bender, Drellishak
Zulu       zul   K. O'Hara       2007            Bender, Drellishak

4. Files

In the mmt directory, there is a subdirectory for each monolingual
grammar, identified by the three-letter language codes given above.
Inside each grammar directory, there is a subdirectory called "doc"
which contains student write-ups from the course in which the grammars
were developed, as well as the instructor's responses to those
write-ups.

Also in the mmt directory are the transfer grammars (eng-acm et al),
the shared files for the transfer grammars (mrs.tdl, mtr.tdl, and
acm.tdl), a directory called test_sentences, and a directory called
tsdb.

test_sentences stores the input sentences for each language (again
identified by the three-letter code), as well as bitexts for each
language pair.
The bitexts themselves can be created by the script create_bitexts.py.

The tsdb/home directory contains (untreebanked) profiles for each
monolingual grammar over the test suites created by the students as
they developed their grammars, and separate profiles for the MMT
sentences. tsdb/skeletons provides skeletons for the general test
suites and MMT sentences for each language. (For heb and eng, only MMT
sentences are available.)

5. Acknowledgments

The initial work of adapting these grammars for the MMT system was
supported by a gift from the Utilika Foundation to the Turing Center at
the University of Washington. Development of the Grammar Matrix is
currently supported by NSF grant BCS-0644097.