Extraction from PDF

Libraries that give access to the content of a PDF document:

  • Poppler http://poppler.freedesktop.org/ - part of the freedesktop.org initiative
    • poorly documented
    • has Python bindings
    • Poppler tools such as pdfimages seem to fail on most scientific articles
  • gnupdf http://gnupdf.org/
    • is supposed to provide the features missing from Poppler (such as forms and multimedia support, which might be relevant)
    • does not seem to have Python bindings (but it might be usable from Python after automatically generating a wrapper library)
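
To check how pdfimages behaves on a given article, it can be driven from Python through the standard subprocess module. The following is only a minimal sketch: it assumes the pdfimages binary from poppler-utils is installed and on PATH, and the function names and the "img" output prefix are illustrative choices, not something prescribed by Poppler.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, output_prefix, as_png=True):
    """Return the argument list for a pdfimages call (nothing is executed)."""
    command = ["pdfimages"]
    if as_png:
        # -png writes PNG files instead of the default PPM/PBM output.
        command.append("-png")
    command += [str(pdf_path), str(output_prefix)]
    return command


def extract_images(pdf_path, output_dir):
    """Run pdfimages and return the files it wrote.

    An empty list means no raster image was extracted from the document.
    """
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler-utils) was not found on PATH")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical prefix: pdfimages names its output <prefix>-NNN.<ext>.
    prefix = output_dir / "img"
    subprocess.run(build_pdfimages_command(pdf_path, prefix), check=True)
    return sorted(output_dir.glob("img-*"))
```

Calling extract_images("paper.pdf", "out") on a locally available article and checking whether the returned list is empty gives a quick indication of whether the tool found any raster images in that file.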

A web page describing many useful PDF tools and libraries: http://pdf-house.blogspot.com/

Problems

  • sometimes (e.g. in 1007.0043) images are included in the form of PDF files
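
One plausible reading of this problem is that figures included as PDF end up in the article as Form XObjects (vector content) rather than Image XObjects, so an image extractor has nothing to pull out. The sketch below is a rough heuristic under that assumption: it counts both kinds of XObject by scanning the raw bytes of the file. It only sees dictionaries stored uncompressed, so objects packed into compressed object streams are missed, and the function name is hypothetical.

```python
import re

# Name tokens that mark the two XObject kinds in a PDF object dictionary.
IMAGE_PATTERN = re.compile(rb"/Subtype\s*/Image\b")
FORM_PATTERN = re.compile(rb"/Subtype\s*/Form\b")


def count_xobject_kinds(pdf_bytes):
    """Count Image and Form XObject markers in raw PDF bytes.

    Heuristic only: dictionaries inside compressed object streams
    are not visible to this scan.
    """
    return {
        "image": len(IMAGE_PATTERN.findall(pdf_bytes)),
        "form": len(FORM_PATTERN.findall(pdf_bytes)),
    }
```

A document that reports many "form" entries and few or no "image" entries would be a candidate for this problem and may need rendering of the figure regions instead of image extraction.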

-- PiotrPraczyk - 07-Sep-2010

Topic revision: r2 - 2010-09-07 - PiotrPraczykExCern