OCRing Project
Information for Optical Character Recognition (OCR) of scanned documents.
Links
Grid links
Invenio links
OCR links
Grid Interface for OCRing
OCRing tasks todo
- Test grid job submission using the D4Science Process Engine
- Ship Ocropus with the grid job
- Use smaller (compressed) images in intermediate OCRing stages
- Write a Python program for OCRing (based on websubmit_file_converter.py) that includes the minimal set of required Invenio components. The tool should accept as input either a text file listing URLs of PDF files or a directory containing PDF files. It should be able to submit jobs to the D4Science Process Engine (JDLAdaptor and GridAdaptor) and to lxbatch. Other ways of executing jobs (direct gLite job submission, other grids, clouds?) can be added later. Good documentation is important.
- Measure the OCRing error rate using a set of documents from cds.cern.ch
- Recognition of math formulas (this is a low priority task)
- Improve recognition of Cyrillic and Greek characters
- Extract figures (and other elements like tables?) from PDFs and recognize figure captions
- Test the OCRing process using e.g. 100 CERN annual reports, yellow reports, etc.
- Use Castor for storing input and output files (like gsiftp://srm.cern.ch/castor/cern.ch/user/u/userName)
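
The input handling described for the planned Python program (a text file of URLs or a directory of PDF files) could be sketched roughly as follows. This is only an illustration: the function name `collect_pdf_inputs`, the one-URL-per-line file format, and the treatment of `#` lines as comments are assumptions, not part of websubmit_file_converter.py or Invenio.

```python
import os


def collect_pdf_inputs(source):
    """Return a list of PDF locations from a directory or a URL list file.

    `source` is either a directory containing PDF files or a text file
    with one URL per line. The function name and the file format are
    assumptions made for this sketch.
    """
    if os.path.isdir(source):
        # Directory mode: pick up every file ending in .pdf (case-insensitive).
        return sorted(
            os.path.join(source, name)
            for name in os.listdir(source)
            if name.lower().endswith(".pdf")
        )
    # File mode: skip blank lines and lines starting with '#'.
    with open(source) as handle:
        return [
            line.strip()
            for line in handle
            if line.strip() and not line.lstrip().startswith("#")
        ]
```

The returned list could then be split into chunks, one per job, before being handed to whichever submission backend (Process Engine, lxbatch) is chosen.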
-- JukkaKlem - 30-Apr-2010