EPE 2017: Extrinsic Parser Evaluation Shared Task at DepLing & IWPT 2017

Raw Parser Inputs

Version 1.5; June 4, 2017


Overview
========

This archive contains the training and development texts for the downstream
applications in EPE 2017: ‘negation/’, ‘events/’, and ‘opinion/’.

For general background on the task set-up, please see:

  http://epe.nlpl.eu

All parser inputs for the task are ‘clean’ running text files, encoded in
UTF-8.  Even so, there may be minor variation in, for example, newline
conventions and the use of ASCII vs. Unicode punctuation symbols, notably for
quote marks and apostrophes.

As of version 1.2, this package provides pre-processed variants of all texts,
using the analysis stack of Velldal et al. (2012; CL); see below for details.

For use with EPE 2017, we maintain the data in two formats: (a) as ‘raw’,
running text (with file extension ‘.txt’) and (b) in segmented and
morphologically analyzed form (see below), using a simple tab-separated,
CoNLL-lookalike file format (with file extension ‘.tt’).
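
As a rough illustration of how a ‘.tt’ file might be consumed, the following
Python sketch groups token lines into sentences.  It assumes, as in most
CoNLL-style formats, one token per line, tab-separated columns, and an empty
line between sentences; the function name ‘read_tt’ is hypothetical, and no
particular column inventory is assumed.

```python
from typing import Iterable, Iterator, List


def read_tt(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences, each a list of token rows (lists of column strings)."""
    sentence: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # Assumption: an empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Assumption: columns are separated by single tab characters.
        sentence.append(line.split("\t"))
    if sentence:
        # Emit a final sentence that is not followed by an empty line.
        yield sentence
```

A file can be passed in directly, e.g. ‘read_tt(open(path, encoding="utf-8"))’,
since the function only iterates over lines.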


System Submissions
==================

For instructions on how to package parser outputs for submission to EPE 2017,
please see information for task participants on the task web site:

  http://epe.nlpl.eu/index.php?page=5


Event Extraction
================

The training and development texts originate from the 2009 Shared Task on Event
Extraction at the BioNLP workshop (held in association with the Conference of
the North American Chapter of the Association for Computational Linguistics;
NAACL).  There are 800 ‘.txt’ files for training and 150 files for development,
for a total of 176,146 and 33,827 whitespace-separated tokens, respectively.
Both the ‘raw’ files and the split into training and development data remain
unchanged from BioNLP 2009.  For additional information, please see the
‘LICENSE’ and ‘README’ files in each of the sub-directories, together with an
archive of the shared task web site at the following address:

  http://www.nactem.ac.uk/tsujii/GENIA/SharedTask/
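
The token counts quoted above are based on whitespace separation.  A minimal
Python sketch of such a count is given below; the function names and the
directory layout are assumptions for illustration, and the sketch is not
claimed to be the script that produced the published figures.

```python
from pathlib import Path


def count_tokens(text: str) -> int:
    """Count maximal runs of non-whitespace characters."""
    # str.split() without arguments splits on any whitespace run.
    return len(text.split())


def count_directory(directory: str, suffix: str = ".txt") -> int:
    """Sum whitespace-separated token counts over all matching files."""
    total = 0
    for path in sorted(Path(directory).glob("*" + suffix)):
        total += count_tokens(path.read_text(encoding="utf-8"))
    return total
```

Calling ‘count_directory’ on a directory of ‘.txt’ files (the path is up to the
user) yields a single total for that directory.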


Opinion Analysis
================

The training and development data are taken from the MPQA Opinion Corpus
(version 2.0) and have been moderately revised for use with EPE 2017.  In
particular, a few files have been omitted (e.g. because they contained
multi-lingual, parallel text), and other files have been edited to replace
mark-up (e.g. tags like <LU_ANNOTATE>) with whitespace, to preserve character
offsets.  File preparation and the split into training vs. development (and
eventually evaluation) data were provided by Richard Johansson.  For general
background on the MPQA data, please see the file ‘README’ inside the ‘opinion/’
sub-directory and the corpus web page at:

  http://mpqa.cs.pitt.edu/corpora/mpqa_corpus/
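
The idea of replacing mark-up with whitespace, so that character positions in
the remaining text stay unchanged, can be sketched in Python as follows.  The
regular expression, the function name ‘blank_markup’, and the closing tag in
the example are assumptions for illustration; this is not the procedure
actually used to prepare the files, and the simple pattern would also blank
any other ‘<...>’ sequence in running text.

```python
import re

# Assumption: mark-up consists of simple angle-bracket tags.
TAG = re.compile(r"<[^<>]+>")


def blank_markup(text: str) -> str:
    """Replace each tag by spaces of equal length, keeping offsets stable."""
    return TAG.sub(lambda match: " " * len(match.group(0)), text)


# Hypothetical example input; only the tag name is taken from the text above.
original = "He <LU_ANNOTATE>said</LU_ANNOTATE> no."
cleaned = blank_markup(original)
```

Because every tag is replaced by a run of spaces of the same length, ‘cleaned’
has the same length as ‘original’, and the word ‘said’ starts at the same
character position in both strings.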

The file names of the opinion analysis ‘raw’ texts correspond to the original
MPQA document names, but for uniformity with other EPE parser inputs a common
suffix ‘.txt’ was appended; also, the ‘parent’ directory structure from MPQA is
not preserved in our flattened ‘training/’, ‘development/’, and ‘evaluation/’
directory tree, as the file names by themselves are unique.  There are 449 and
90 training and development files, respectively, for a total of 214,289 and
44,277 whitespace-separated tokens.


Negation Analysis
=================

The training and development (and eventually evaluation) data for this
downstream application originate from the Shared Task at the 2012 *SEM
Conference (Morante & Blanco, 2012).  The underlying text, segmentation, and
basic linguistic analysis were originally prepared by Stephan Oepen as part of
the Oslo Conan Doyle Corpus (CDC; http://www.delph-in.net/cdc/).
There are 55,029 whitespace-separated tokens of training data and 11,379 tokens
of development data.  The split into training and development (and evaluation)
sections and actual negation annotation were designed and implemented by the
2012 task organizers: Roser Morante and Eduardo Blanco.  For general background
on the 2012 *SEM task, please see:

  http://www.clips.ua.ac.be/sem2012-st-neg/

For use with EPE 2017, we maintain the data in a moderately extended
variant of the original *SEM 2012 file format.  In particular, these are the
modifications in the so-called ‘*sem+’ file format:

+ inserting new columns #3 and #4 (START and END), with character ranges;
+ inserting new column #7 (FEATURES) for morphological annotations;
+ replacing the original column #7 (PTB) with #8 and #9 (HEAD and DEPREL).


Pre-Processing
==============

EPE 2017 is conceptually set up as an end-to-end task, i.e. as parsing ‘raw’,
running texts into dependency representations, which are then fed into the
various downstream applications.  However, to enable participation by parser
developers who are not readily set up to process ‘raw’ text (and to provide a
common reference point, maybe), this package makes available pre-processed
versions of the texts: sentence splitting, tokenization, part of speech
tagging, and lemmatization have been applied using the analysis stack of
Velldal et al. (2012; CL), as implemented in the so-called LOGON environment
(which is also available for download by participants; please see the web page).

In particular, the following tools were used: (a) the CIS Tokenizer (as found
to give premium sentence splitting accuracy by Read et al. (2012; COLING)); (b)
the PTB-compliant REPP tokenizer of Dridan & Oepen (2012; ACL); (c) the TnT
tagger of Brants (2000; ANLP); overlaid with (d) the GENIA Tagger of Tsuruoka
et al. (2005; Panhellenic Conference on Informatics).  Please see Velldal et
al. (2012; CL) for the specifics of how the two taggers were interleaved, but
note that the bulk of the part of speech and lemma information is contributed
by GENIA.


Communication
=============

While you are looking at this data, please self-subscribe to the mailing list
for the shared task:

  http://lists.nlpl.eu/mailman/listinfo/epe-users


Known Errors
============

None, for the time being.


Release History
===============

[Version 1.5; June 4, 2017]

+ Inclusion of evaluation data; simplified approach to creation of ‘.tt’ files.

[Version 1.4; May 11, 2017]

+ Inclusion of gold-standard negation annotations in ‘extended’ *SEM format;
  further corrections to those character spans in pre-processed ‘.tt’ files.

[Version 1.3; April 18, 2017]

+ Correct character spans in pre-processed ‘.tt’ files; add missing sentences.

[Version 1.2; April 13, 2017]

+ Include pre-processed versions of all ‘raw’ texts as additional ‘.tt’ files.

[Version 1.1; April 9, 2017]

+ Re-packaging in new top-level directory and a bit tighter selection of files.

[Version 1.0; March 28, 2017]

+ Re-release of the ‘raw’ texts, now with inputs for opinion analysis included.

[Version 0.9; March 13, 2017]

+ Initial release of the training and development texts for three applications.


Contact
=======

For questions or comments, please do not hesitate to email the task organizers
at: ‘epe-organizers@nlpl.eu’.

Jari Björne
Filip Ginter
Richard Johansson
Emanuele Lapponi
Joakim Nivre
Stephan Oepen (chair)
Anders Søgaard
Erik Velldal
Lilja Øvrelid