Extraction from PDF
Libraries that give access to the content of a PDF document:
- Poppler http://poppler.freedesktop.org/
  - Part of the freedesktop.org initiative
  - Poorly documented
  - Python bindings are available
  - The Poppler command-line tools, such as pdfimages, seem to fail on most scientific articles
- gnupdf http://gnupdf.org/
  - Intended to provide the features missing from Poppler (such as forms and multimedia support, which might be relevant)
  - Does not seem to have Python bindings (though it might be usable from Python after generating bindings automatically)
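To check how the Poppler tool pdfimages behaves on a given article, it can be driven from Python through a subprocess. The following is only a minimal sketch, assuming that the pdfimages executable is installed and on the PATH; the function names, the "img" output prefix, and the file names are hypothetical and chosen for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, output_prefix, keep_jpeg=True):
    """Build the argument list for: pdfimages [-j] PDF-file image-root.

    Hypothetical helper; -j asks pdfimages to write JPEG-encoded
    images as .jpg files instead of converting them.
    """
    cmd = ["pdfimages"]
    if keep_jpeg:
        cmd.append("-j")
    cmd += [str(pdf_path), str(output_prefix)]
    return cmd


def extract_images(pdf_path, output_dir):
    """Run pdfimages on one PDF and return the files it wrote.

    Assumes pdfimages is on the PATH; raises RuntimeError otherwise.
    """
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    prefix = output_dir / "img"  # output files are named img-NNN.<ext>
    subprocess.run(build_pdfimages_command(pdf_path, prefix), check=True)
    return sorted(output_dir.glob("img-*"))
```

A call such as extract_images("article.pdf", "extracted") returns the list of image files written. An empty list means that no raster images were extracted from that document, which makes it easy to count how many articles in a sample are affected.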
A web page describing many useful PDF tools and libraries:
http://pdf-house.blogspot.com/
Problems
- Sometimes (as in 1007.0043), images are included in the form of PDF files
--
PiotrPraczyk - 07-Sep-2010