PDFTOXML
========

Extracts text from (scientific) papers in PDF format and converts it
to the simple paper XML format (../dtd/paperxml-simple.dtd).

A modified and extended PDFBox (version 0.7.3) plus a Python wrapper
for generating the paper XML format.

Python and Java *source* code provided by DFKI in this directory
is under the LGPL and the PDFBox license (modified PDFBox classes only).

Parts of the software in this directory come with different
copyrights and licenses, e.g. the libraries in the lib/ directory:

- FontBox-0.1.0-dev.jar: http://www.pdfbox.org
- PDFBox-0.7.3-dev.jar: http://www.pdfbox.org
- jung-1.6.0.jar: http://jung.sourceforge.net
- log4j-1.2.9.jar: http://www.apache.org
- commons-collections-3.1.jar: http://www.apache.org

The Python code for generating paperxml output additionally
requires BibTeX2xml from http://bibtexml.sourceforge.net (not
provided in this directory).

The current version there seems to be an (extended) Java
reimplementation.
The version we used was a single Python file containing two
main functions, bibtexwasher and bibtexdecoder.


Build Java library pdfextract.jar (from src under TextExtractor/):
==================================================================

cd TextExtractor
mkdir -p classes
javac -classpath ../lib/PDFBox-0.7.3-dev.jar:../lib/FontBox-0.1.0-dev.jar:../lib/commons-collections-3.1.jar:../lib/jung-1.6.0.jar:../lib/log4j-1.2.9.jar -d classes -target 5 de/dfki/lt/extracttext/*.java
cd classes
jar cf ../../lib/pdfextract.jar de
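
The long -classpath argument above can also be assembled from the
jars in lib/ instead of being typed out. A minimal sketch, assuming a
POSIX shell; the helper name classpath_from is hypothetical and not
part of this distribution:

```shell
# Hypothetical helper: join all *.jar files in directory $1 into a
# colon-separated classpath string (Unix path separator assumed).
classpath_from() {
    cp=
    for jar in "$1"/*.jar; do
        [ -e "$jar" ] || continue      # glob matched nothing
        cp=${cp:+$cp:}$jar             # add ':' only between entries
    done
    printf '%s\n' "$cp"
}

# Possible use from within TextExtractor/ (not run here):
#   javac -classpath "$(classpath_from ../lib)" -d classes -target 5 \
#       de/dfki/lt/extracttext/*.java
```

Note that this picks up every jar in lib/, including a previously
built pdfextract.jar.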


Running the program:
====================

./extract [-b bibtexfile] [-w wordlistfile] pdf-inputfile xml-outputfile

e.g.
./extract -b P09-1001.bib -w en-wordlist P09-1001.pdf P09-1001.xml
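
To convert several papers in one go, the call above can be wrapped in
a loop. A minimal sketch, assuming a POSIX shell, that ./extract is in
the current directory, and that a BibTeX file, where available, has
the same base name as the PDF (as in the example above). The function
names xml_name_for and convert_all are hypothetical:

```shell
# Hypothetical helper: derive the output name by replacing the
# .pdf suffix with .xml.
xml_name_for() {
    printf '%s\n' "${1%.pdf}.xml"
}

# Hypothetical helper: convert every PDF in directory $1 (default:
# current directory). A same-named .bib file is passed via -b if it
# exists (an assumed naming convention).
convert_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # glob matched nothing
        xml=$(xml_name_for "$pdf")
        bib="${pdf%.pdf}.bib"
        if [ -f "$bib" ]; then
            ./extract -b "$bib" "$pdf" "$xml"
        else
            ./extract "$pdf" "$xml"
        fi
    done
}

# Possible use (not run here):
#   convert_all papers
```

The -w wordlist option is left out here; it could be added to both
./extract calls in the same way as in the example above.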


Required versions:
==================

Python 2.5 or 2.6
Java 1.5 (may also compile and run with 1.4; not tested)


2009-10-30  ulrich.schaefer@dfki.de