Ingestion Workflow (*)

Overview of the workflow that will govern ingestion of information into INSPIRE. The workflow is operational, but it must at least be mirrored in the software. Sophisticated workflow design is probably not needed at the software level; however, queuing items in RT is expected to be useful for this purpose.

* Not to be confused with Indigestion Workflow....

Workflow diagram:


In the diagram, green marks a manual step. The blue and orange comments are left over from a 2008 discussion, and I'm not sure what all of them mean, but in many cases they name the module responsible for that box. Some module names appear to be missing, but for the most part the diagram matches our expectations.

Needs for OAI/non-OAI harvesting

  • SystemDesignBibCategorize: new module, very simple; its needs are described on its own page. A manual interface would be a bit more complicated, but it can be postponed for a while, as BibEdit could be used in a pinch.
  • New Needs for existing modules:
    • BibHarvest:
      • Needs to have a per-source configuration that provides the following:
        • Specify the harvester to use (OAI-PMH for OAI sources; for non-OAI sources, a scraping script, RSS/Atom, etc.)
        • Specify the mapping from the harvested data to INSPIRE MARC (non-OAI scrapers should produce output that can be converted to MARC easily; some may produce SPIRES output)
        • Specify collection based on source (i.e. visible or invisible HP)
        • Specify whether fulltext should be downloaded (and if so, whether to use only for indexing or to display)
        • Specify the source code to be used for the $$9 in the harvested data
        • [optional] specify a batch mode harvest or an individual mode
          • Batch mode would suit a conference site, where many records get similar treatment and an inputter would like to look at them all at once (possibly creating one ticket)
          • Individual mode would suit eprints, where each record creates its own ticket and is handled completely independently.
        • [optional] specify any additional information to be added to the harvest (like conference info, journal info) that might not be found in the data, but should be part of the metadata of all records
          • This would work well with Batch mode, and also with manually triggered harvests, i.e. one might like to trigger a specific harvest and specify a few fields to be put into all the records.

      • Needs to create RT tickets in BibCatalog for new harvests that require processing
    • BibMerge needs to add an internal note to the record showing that it was merged with another record. What happens to the removed record?
    • BibMerge needs to be used in both BibHarvest/Holding pen and the BibEdit history interface for comparing revisions of records
    • BibMerge needs several UI improvements/fixes
      • BibMerge should be able to merge reference lists. This is tricky...
    • BibClassify: note that it is run twice, once for quick keywording on the abstract and again as part of the input to BibCategorize
    • SystemDesignBibEditSpecialModes - These are potential blockers. See their own page for reqs/specs.
      • $$9 is updated on completion to show the new source (keywords, cites, etc.); there is no need to save the different sources, since we have history
      • See note about reference list merging.
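
The per-source configuration listed under BibHarvest, together with the batch/individual distinction, could be sketched roughly as follows. This is only an illustration and not an existing BibHarvest interface: `HarvestSource`, `plan_tickets`, every field name, and the example values are hypothetical, chosen to mirror the bullets above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HarvestSource:
    """Hypothetical per-source harvesting configuration (all names are assumptions)."""
    name: str
    harvester: str                   # e.g. "oai-pmh", "rss", or a scraping script
    marc_mapping: str                # identifier of the mapping to INSPIRE MARC
    collection: str                  # e.g. "visible" or "invisible"
    source_code: str                 # value to use for $$9 in the harvested data
    download_fulltext: bool = False
    fulltext_display: bool = False   # False: fulltext is used for indexing only
    mode: str = "individual"         # optional: "batch" or "individual"
    extra_fields: Dict[str, str] = field(default_factory=dict)  # optional additions

    def __post_init__(self):
        if self.mode not in ("batch", "individual"):
            raise ValueError("mode must be 'batch' or 'individual'")
        if self.fulltext_display and not self.download_fulltext:
            raise ValueError("cannot display fulltext that is not downloaded")


def plan_tickets(source: HarvestSource, record_ids: List[str]) -> List[List[str]]:
    """Group harvested records into tickets: one per batch, or one per record."""
    if not record_ids:
        return []
    if source.mode == "batch":
        return [list(record_ids)]
    return [[rid] for rid in record_ids]


# Made-up example sources; none of these values come from a real configuration.
conference = HarvestSource(
    name="example-conference-site",
    harvester="scrape_conference.py",
    marc_mapping="conference-to-marc",
    collection="invisible",
    source_code="EXAMPLE-CONF",
    mode="batch",
    extra_fields={"conference": "Example Conference 2009"},
)
eprints = HarvestSource(
    name="example-eprint-server",
    harvester="oai-pmh",
    marc_mapping="oai-to-marc",
    collection="visible",
    source_code="EXAMPLE-EPRINT",
    download_fulltext=True,
)

print(plan_tickets(conference, ["r1", "r2", "r3"]))  # one ticket for the batch
print(plan_tickets(eprints, ["r1", "r2"]))           # one ticket per record
```

Keeping the ticket grouping separate from the configuration means a manually triggered harvest could reuse the same source definition while overriding only `mode` or `extra_fields`.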

Needs for User input/corrections

Another major source of data is user corrections. These will come directly to RT; however, we have the opportunity to make this update process significantly smoother.

Minimal corrections system:

  • User submits RT tickets to a queue; the tickets are not associated with a given record and are in free-text form
  • Editor reads the ticket, searches in BibEdit for the given record, and makes changes using BibEdit, possibly with BibEdit special modes (possibly extracting cites from a newer version and adding them to the record via a special mode/BibMerge sort of package)
  • Editor closes ticket

Medium corrections system:

  • User clicks on "Correct this record" -> a form that collects data about missing citations in a structured way and collects other corrections as free text, just to standardize the input.
  • The form creates an RT ticket bound to that record
  • Editor follows the link from the ticket to the record and makes corrections as before.
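
As a rough sketch of the medium system, the form data could be turned into a ticket description that carries the record link. The names `CorrectionForm` and `build_ticket`, the dictionary keys, and the example URL are assumptions for illustration; the actual call that files the ticket in RT is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CorrectionForm:
    """Hypothetical payload of the "Correct this record" form."""
    record_id: int
    missing_citations: List[str] = field(default_factory=list)  # structured part
    other_corrections: str = ""                                  # free-text part


def build_ticket(form: CorrectionForm, record_url_base: str) -> dict:
    """Build a ticket description bound to one record (does not contact RT)."""
    lines = ["Record: %s/%d" % (record_url_base.rstrip("/"), form.record_id)]
    cites = [c.strip() for c in form.missing_citations if c.strip()]
    if cites:
        lines.append("Missing citations:")
        lines.extend("  - " + c for c in cites)
    if form.other_corrections.strip():
        lines.append("Other corrections:")
        lines.append(form.other_corrections.strip())
    return {
        "subject": "Correction for record %d" % form.record_id,
        "record_id": form.record_id,
        "text": "\n".join(lines),
    }


# Made-up submission; the citation string and URL are placeholders.
form = CorrectionForm(
    record_id=12345,
    missing_citations=["Example J. 1 (2009) 1", "   "],
    other_corrections="The title has a typo.",
)
ticket = build_ticket(form, "https://example.org/record/")
print(ticket["subject"])
print(ticket["text"])
```

Because the record link is always the first line of the ticket text, the editor can go straight from the ticket to the record.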

Nice corrections system:

  • User clicks on "Correct this record" -> a stripped-down BibEdit interface appears
    • Doesn't show all fields; maybe some extra-nice UI, with most buttons removed
  • User submits a corrected record -> it goes to the holding pen and creates a ticket
  • Editor treats the correction in a workflow similar to that for harvests, checking the differences and applying them or not
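
The "checking differences, applying or not" step might look like the sketch below. It assumes a deliberately simplified record model (a dict mapping a MARC tag to a list of values), which is not how BibEdit or BibMerge store records; `diff_records` and `apply_accepted` are hypothetical helper names.

```python
from typing import Dict, Iterable, List, Tuple

# Simplified stand-in for a record: MARC tag -> list of field values.
Record = Dict[str, List[str]]
Changes = Dict[str, Tuple[List[str], List[str]]]


def diff_records(current: Record, submitted: Record) -> Changes:
    """Return {tag: (old_values, new_values)} for every tag that differs."""
    changes: Changes = {}
    for tag in sorted(set(current) | set(submitted)):
        old = current.get(tag, [])
        new = submitted.get(tag, [])
        if old != new:
            changes[tag] = (old, new)
    return changes


def apply_accepted(current: Record, changes: Changes,
                   accepted_tags: Iterable[str]) -> Record:
    """Apply only the changes the editor accepted; leave the rest untouched."""
    result = {tag: list(values) for tag, values in current.items()}
    for tag in accepted_tags:
        if tag not in changes:
            continue
        new_values = changes[tag][1]
        if new_values:
            result[tag] = list(new_values)
        else:
            result.pop(tag, None)  # the user removed this field entirely
    return result


# Made-up record and user submission.
current = {"100": ["Smith, J."], "245": ["A study of quarks"]}
submitted = {
    "100": ["Smith, J."],
    "245": ["A study of quarks and gluons"],
    "700": ["Doe, A."],
}

changes = diff_records(current, submitted)
merged = apply_accepted(current, changes, accepted_tags=["700"])
print(sorted(changes))
print(merged)
```

Here the editor accepts the added 700 field but not the title change, so the merged record keeps the original 245 value.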

Ultra-nice corrections system:

  • As above, but with a preview for users and other interactive checks for common mistakes.
  • Editor can decide to trust certain fields to users (Maybe certain users...maybe authors...)
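
One way to picture the "trust certain fields to certain users" idea is a small policy table that splits changed fields into those applied automatically and those sent to an editor. Everything in this sketch is an assumption: the role names, the tags assigned to each role, and the `split_by_trust` helper are invented for illustration and do not reflect any agreed policy.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical policy: which MARC tags each kind of user may change unreviewed.
TRUST_POLICY: Dict[str, Set[str]] = {
    "author": {"700", "856"},
    "registered": {"856"},
    "anonymous": set(),
}


def split_by_trust(changed_tags: Iterable[str], user_role: str,
                   policy: Dict[str, Set[str]] = TRUST_POLICY
                   ) -> Tuple[List[str], List[str]]:
    """Return (auto_apply, needs_review) tag lists for one user's submission."""
    trusted = policy.get(user_role, set())  # unknown roles are trusted with nothing
    tags = sorted(set(changed_tags))
    auto_apply = [t for t in tags if t in trusted]
    needs_review = [t for t in tags if t not in trusted]
    return auto_apply, needs_review


auto, review = split_by_trust(["245", "700", "856"], "author")
print("auto-apply:", auto)
print("needs review:", review)
```

Anything not explicitly trusted falls back to the editor-reviewed workflow, so an incomplete policy errs on the side of review.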

Topic attachments

  • workflow.dia (r1, 6.6 K, 2009-06-20 14:53, TravisBrooks): dia file of the workflow
Topic revision: r5 - 2009-09-23 - TravisBrooks