EPE 2017: Extrinsic Parser Evaluation
Shared Task at DepLing & IWPT 2017

Raw Parser Inputs
Version 1.5; June 4, 2017


Overview
========

This archive contains the training and development texts for the
downstream applications in EPE 2017: ‘negation/’, ‘events/’, and
‘opinion/’.  For general background on the task set-up, please see:

  http://epe.nlpl.eu

All parser inputs for the task are ‘clean’ running text files, encoded in
UTF-8.  Thus, there may be minor variation in, for example, newline
conventions and the use of ASCII vs. Unicode punctuation symbols, notably
for quote marks and apostrophes.

As of version 1.2, this package provides pre-processed variants of all
texts, using the analysis stack of Velldal et al. (2012; CL); see below
for details.  For use with EPE 2017, we maintain the data in two formats:
(a) as ‘raw’, running text (with file extension ‘.txt’) and (b) in
segmented and morphologically analyzed form (see below), using a simple
tab-separated, CoNLL-lookalike file format (with file extension ‘.tt’).


System Submissions
==================

For instructions on how to package parser outputs for submission to EPE
2017, please see the information for task participants on the task web
site:

  http://epe.nlpl.eu/index.php?page=5


Event Extraction
================

The training and development texts originate from the 2009 Shared Task on
Event Extraction at the BioNLP workshop (held in association with the
Conference of the North American Chapter of the Association for
Computational Linguistics; NAACL).  There are 800 ‘.txt’ files for
training and 150 files for development, for a total of 176,146 and 33,827
whitespace-separated tokens, respectively.  Both the ‘raw’ files and the
split into training and development data remain unchanged from BioNLP
2009.
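
The token counts quoted in this file are over whitespace-separated tokens.
The following is a minimal sketch (not the script used by the organizers)
of how such a count could be reproduced for one directory of ‘.txt’ files.
The function name and the example directory path are hypothetical, and
Python's notion of whitespace may differ slightly from that of other tools,
so small deviations from the figures given here are possible.

```python
from pathlib import Path


def count_tokens(directory):
    """Return (number of '.txt' files, total whitespace-separated tokens).

    Hypothetical helper: only files directly inside `directory` are read.
    """
    files = sorted(Path(directory).glob("*.txt"))
    total = 0
    for path in files:
        # str.split() without arguments splits on any run of whitespace
        # (including newlines) and ignores leading or trailing whitespace.
        total += len(path.read_text(encoding="utf-8").split())
    return len(files), total
```

For example, ‘count_tokens("events/training")’ would return the number of
files and tokens for that directory, assuming the archive was unpacked with
that (assumed) layout in the current working directory.
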
For additional information, please see the ‘LICENSE’ and ‘README’ files
in each of the sub-directories, together with an archive of the shared
task web site at the following address:

  http://www.nactem.ac.uk/tsujii/GENIA/SharedTask/


Opinion Analysis
================

The training and development data are taken from the MPQA Opinion Corpus
(version 2.0) and have been moderately revised for use with EPE 2017.  In
particular, a few files have been omitted (e.g. because they contained
multi-lingual, parallel text), and other files have been edited to replace
mark-up (e.g. tags like ) with whitespace, to preserve character offsets.
File preparation and the split into training vs. development (and
eventually evaluation) data were provided by Richard Johansson.  For
general background on the MPQA data, please see the file ‘README’ inside
the ‘opinion/’ sub-directory and the corpus web page at:

  http://mpqa.cs.pitt.edu/corpora/mpqa_corpus/

The file names of the opinion analysis ‘raw’ texts correspond to the
original MPQA document names, but for uniformity with other EPE parser
inputs a common suffix ‘.txt’ was appended; also, the ‘parent’ directory
structure from MPQA is not preserved in our flattened ‘training/’,
‘development/’, and ‘evaluation/’ directory tree, as the file names by
themselves are unique.  There are 449 and 90 training and development
files, respectively, for a total of 214,289 and 44,277
whitespace-separated tokens.


Negation Analysis
=================

The training and development (and eventually evaluation) data for this
downstream application originate with the Shared Task at the 2012 *SEM
Conference (Morante & Blanco, 2012).  The underlying text, segmentation,
and basic linguistic analysis were originally prepared by Stephan Oepen as
part of the Oslo Conan Doyle Corpus (CDC; http://www.delph-in.net/cdc/).
There are 55,029 whitespace-separated tokens of training data and 11,379
tokens of development data.
The split into training and development (and evaluation) sections and the
actual negation annotation were designed and implemented by the 2012 task
organizers: Roser Morante and Eduardo Blanco.  For general background on
the 2012 *SEM task, please see:

  http://www.clips.ua.ac.be/sem2012-st-neg/

For use with EPE 2017, we maintain the data in a moderately extended
variant of the original *SEM 2012 file format.  In particular, these are
the modifications in the so-called ‘*sem+’ file format:

+ inserting new columns #3 and #4 (START and END), with character ranges;
+ inserting new column #7 (FEATURES) for morphological annotations;
+ replacing the original column #7 (PTB) with #8 and #9 (HEAD and DEPREL).


Pre-Processing
==============

EPE 2017 is conceptually set up as an end-to-end task, i.e. as parsing
‘raw’, running texts into dependency representations, which are then fed
into the various downstream applications.  However, to enable
participation by parser developers who are not readily set up to process
‘raw’ text (and to provide a common reference point, maybe), this package
makes available pre-processed versions of the texts: sentence splitting,
tokenization, part of speech tagging, and lemmatization have been applied
using the analysis stack of Velldal et al. (2012; CL), as implemented in
the so-called LOGON environment (which is also available for download by
participants; please see the web page).

In particular, the following tools were used: (a) the CIS Tokenizer (as
found to give premium sentence splitting accuracy by Read et al. (2012;
COLING)); (b) the PTB-compliant REPP tokenizer of Dridan & Oepen (2012;
ACL); (c) the TnT tagger of Brants (2000; ANLP); overlaid with (d) the
GENIA Tagger of Tsuruoka et al. (2005; Panhellenic Conference on
Informatics).  Please see Velldal et al. (2012; CL) for the specifics of
how the two taggers were interleaved, but note that the bulk of the part
of speech and lemma information is contributed by GENIA.
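
The tabular files described above can be read with a few lines of code.
The following is a minimal sketch under stated assumptions, not the
organizers' own tooling: it assumes tab-separated columns, one token per
line, and blank lines between sentences (as is common in CoNLL-style
files), and it assumes that columns #3 and #4 of the ‘*sem+’ format hold
integer character offsets.  The function names are hypothetical, and the
exact column layout of the ‘.tt’ files is not assumed here.

```python
def read_sentences(lines):
    """Group tab-separated token lines into sentences.

    Yields one list per sentence; each token is a list of column strings.
    Assumption: blank lines separate sentences.
    """
    sentence = []
    for line in lines:
        # Strip both LF and CRLF endings, as newline conventions may vary.
        line = line.rstrip("\r\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def token_span(columns, start_col=3, end_col=4):
    """Return (START, END) as integers, using 1-based column numbers.

    The defaults follow the '*sem+' description (columns #3 and #4).
    """
    return int(columns[start_col - 1]), int(columns[end_col - 1])
```

A file could then be processed with, for example,
‘with open(path, encoding="utf-8") as stream: ...’ and
‘for sentence in read_sentences(stream)’, calling ‘token_span()’ on each
token to obtain its character range.
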


Communication
=============

While you are looking at this data, please self-subscribe to the mailing
list for the shared task:

  http://lists.nlpl.eu/mailman/listinfo/epe-users


Known Errors
============

None, for the time being.


Release History
===============

[Version 1.5; June 4, 2017]

+ Inclusion of evaluation data; simplified approach to creation of ‘.tt’
  files.

[Version 1.4; May 11, 2017]

+ Inclusion of gold-standard negation annotations in ‘extended’ *SEM
  format; further corrections to the character spans in pre-processed
  ‘.tt’ files.

[Version 1.3; April 18, 2017]

+ Correct character spans in pre-processed ‘.tt’ files; add missing
  sentences.

[Version 1.2; April 13, 2017]

+ Include pre-processed versions of all ‘raw’ texts as additional ‘.tt’
  files.

[Version 1.1; April 9, 2017]

+ Re-packaging in new top-level directory and a somewhat tighter selection
  of files.

[Version 1.0; March 28, 2017]

+ Re-release of the ‘raw’ texts, now with inputs for opinion analysis
  included.

[Version 0.9; March 13, 2017]

+ Initial release of the training and development texts for two
  applications.


Contact
=======

For questions or comments, please do not hesitate to email the task
organizers at: ‘epe-organizers@nlpl.eu’.

  Jari Björne
  Filip Ginter
  Richard Johansson
  Emanuele Lapponi
  Joakim Nivre
  Stephan Oepen (chair)
  Anders Søgaard
  Erik Velldal
  Lilja Øvrelid