PDFTOXML
========

Extract text from (scientific) papers in PDF format and convert it to the
simple paper XML format (../dtd/paperxml-simple.dtd).

A modified and extended PDFBox (v0.7.3) plus a Python wrapper for generating
the paper XML format.

The Python and Java *source* code provided by DFKI in this directory is under
the LGPL and the PDFBox license (modified PDFBox classes only).
Parts of the software in this directory come with different copyrights and
licenses, e.g. the libraries in the lib/ directory:

- FontBox-0.1.0-dev.jar: http://www.pdfbox.org
- PDFBox-0.7.3-dev.jar: http://www.pdfbox.org
- jung-1.6.0.jar: http://jung.sourceforge.net
- log4j-1.2.9.jar: http://www.apache.org
- Apache commons-collections-3.1.jar: http://www.apache.org

The Python code for generating paperxml output additionally requires
BibTeX2xml from http://bibtexml.sourceforge.net (not provided in this
directory). The current version there seems to be an (extended) Java
reimplementation. The version we used was a single Python file containing
two main functions, bibtexwasher and bibtexdecoder.

Build Java library pdfextract.jar (from src under TextExtractor/):
==================

  cd TextExtractor
  javac -classpath ../lib/PDFBox-0.7.3-dev.jar:../lib/FontBox-0.1.0-dev.jar:../lib/commons-collections-3.1.jar:../lib/jung-1.6.0.jar:../lib/log4j-1.2.9.jar -d classes -target 5 de/dfki/lt/extracttext/*.java
  cd classes
  jar cf ../../lib/pdfextract.jar de

Running the program:
===================

  ./extract [-b bibtexfile] [-w wordlistfile] pdf-inputfile xml-outputfile

e.g.

  ./extract -b P09-1001.bib -w en-wordlist P09-1001.pdf P09-1001.xml

Required versions:
=================

Python 2.5 or 2.6
Java 1.5 (may also compile and run with 1.4; not tested)

2009-10-30 ulrich.schaefer@dfki.de
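
Example: calling extract from a Python script
=================

The extract usage line shown in this README can also be driven from a small
script when several papers need converting. The following is only a minimal
sketch, not part of the distributed code. The helper names
build_extract_command and convert_all are hypothetical, as is the convention
of looking for a .bib file with the same base name as the PDF. It also assumes
that it is run from the directory containing the extract script and that
extract reports failure through a non-zero exit status, which this README
does not state.

```python
import os
import subprocess


def build_extract_command(pdf_path, xml_path=None, bibtex=None, wordlist=None,
                          script="./extract"):
    """Return the argument list for one run of the extract script.

    Mirrors the documented usage:
        ./extract [-b bibtexfile] [-w wordlistfile] pdf-inputfile xml-outputfile
    """
    if xml_path is None:
        # Hypothetical convention: write the XML file next to the PDF.
        xml_path = os.path.splitext(pdf_path)[0] + ".xml"
    cmd = [script]
    if bibtex:
        cmd += ["-b", bibtex]
    if wordlist:
        cmd += ["-w", wordlist]
    cmd += [pdf_path, xml_path]
    return cmd


def convert_all(pdf_paths, wordlist=None):
    """Run extract on each PDF and return the PDFs whose run exited non-zero."""
    failed = []
    for pdf in pdf_paths:
        # Pass a BibTeX file only if one with the same base name exists
        # (assumed naming convention, e.g. P09-1001.pdf / P09-1001.bib).
        bib = os.path.splitext(pdf)[0] + ".bib"
        if not os.path.exists(bib):
            bib = None
        cmd = build_extract_command(pdf, bibtex=bib, wordlist=wordlist)
        # Assumption: a non-zero exit status means the conversion failed.
        if subprocess.call(cmd) != 0:
            failed.append(pdf)
    return failed
```

For instance, convert_all(["P09-1001.pdf"], wordlist="en-wordlist") would run
the same command as the example in "Running the program", provided
P09-1001.bib is present in the working directory.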