OCRing Project

Information for Optical Character Recognition (OCR) of scanned documents.

Links

Grid links

Invenio links

OCR links

Grid Interface for OCRing

OCRing tasks todo

  • Test grid job submission using the D4Science Process Engine
  • Ship OCRopus with the grid job
  • Use smaller (compressed) images in intermediate OCRing stages

  • Write a Python program for OCRing (based on websubmit_file_converter.py) that includes only the minimal set of required Invenio components. The tool should accept as input either a text file containing a list of URLs pointing to PDF files or a directory containing PDF files. It should be able to send jobs to the D4Science Process Engine (JDLAdaptor and GridAdaptor) and to lxbatch. Other ways of executing jobs (direct gLite job submission, other grids, clouds?) can be added later. Good documentation is important.
  • Measure the OCRing error rate using a set of documents from cds.cern.ch
  • Recognition of math formulas (this is a low priority task)
  • Improve recognition of Cyrillic and Greek characters
  • Extract figures (and other elements like tables?) from PDFs and recognize figure captions
  • Test the OCRing process using e.g. 100 CERN annual reports, yellow reports, etc.
  • Use Castor for storing input and output files (like gsiftp://srm.cern.ch/castor/cern.ch/user/u/userName)
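
The input handling described for the planned Python program (a text file listing PDF URLs, or a directory of PDF files) could be sketched roughly as follows. This is only an illustration under assumptions: the function name collect_pdf_inputs, the one-URL-per-line file layout, and the '#' comment convention are hypothetical and not taken from websubmit_file_converter.py or Invenio. Job submission to the D4Science Process Engine or lxbatch is deliberately left out.

```python
from pathlib import Path


def collect_pdf_inputs(source):
    """Return the PDF inputs named by `source`.

    Hypothetical helper for illustration only. `source` is either a
    directory containing PDF files or a text file with one URL per line.
    """
    path = Path(source)
    if path.is_dir():
        # Directory input: PDF files directly inside it, sorted so the
        # job order is reproducible between runs.
        return sorted(
            str(p) for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    if path.is_file():
        # Text-file input: assume one URL per line; blank lines and
        # lines starting with '#' are skipped (an assumed convention).
        urls = []
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
        return urls
    raise ValueError(f"{source!r} is neither a directory nor a file")
```

Returning a plain list of strings in both cases would let a later submission step treat local files and URLs uniformly, whichever execution backend is chosen.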
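
For the error-rate measurement task, one common choice is the character error rate: the edit distance between a trusted reference transcription and the OCR output, divided by the length of the reference. The page does not say which metric is intended, so the sketch below is an assumption, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

As a usage example, comparing the reference "CERN annual report" with the OCR output "CERN annua1 report" gives one substitution over 18 reference characters. Averaging this value over a document set would give a single figure for comparing OCR configurations.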

-- JukkaKlem - 30-Apr-2010

Topic revision: r3 - 2010-06-28 - JukkaKlem