System Design: BibExtract

1. Introduction

BibExtract is a tool that automatically extracts bibliographic data from the documents themselves, rather than from metadata feeds such as OAI.

The extracted data is generally references or authors and affiliations, but could potentially include other things.

The existing CERN RefExtract will be subsumed by this module, as will the existing SPIRES "Giva".

Most output of these extractors will be checked for accuracy by a human (possibly through BibEdit/BibCheck), except possibly for non-core papers or in cases of extreme trust.

2. Use cases

2.1 arXiv download

The daily download from arXiv (OAI) also includes the PDF and LaTeX files for the papers. These are automatically fed through BibExtract, which generates proposed lists of references and authors/affiliations for each paper. In some cases these proposed lists may move directly into the record (visible/countable/searchable); in others they may stay in a holding pen for human checking/addition.

2.2 Adding missed references

A user complains that her reference is missing from our version of a journal paper. Using the BibEdit interface, a maintainer goes to the record and requests that BibExtract re-extract the references from the journal PDF (not stored locally, but found by hand by the maintainer after using the DOI in the record to go to the journal splash page). A new reference list is generated, which can then (in BibEdit?) be compared to the existing list so that all new references are added.

2.3 Updating references

An author resubmits a new version of a paper to arXiv after adding several new references. The OAI feed alerts us to a change in the paper. BibExtract runs through the new version, checks the new proposed reference list against the existing list and adds any new references, flagging the record for additional work if it detects a possible conflict.
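The compare-and-add step described in 2.2 and 2.3 could look roughly like the sketch below. It is an illustration only, not the BibExtract design: the names normalize, MergeResult and merge_references are hypothetical, references are treated as plain strings, and a "possible conflict" is assumed to mean an existing reference that no longer appears in the newly extracted list.

```python
from dataclasses import dataclass


def normalize(ref: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace,
    so that trivially different strings compare equal (hypothetical rule)."""
    cleaned = "".join(c if c.isalnum() else " " for c in ref.lower())
    return " ".join(cleaned.split())


@dataclass
class MergeResult:
    merged: list        # existing references followed by the newly added ones
    added: list         # references found only in the proposed list
    needs_review: bool  # True if the record should be flagged for a human


def merge_references(existing, proposed):
    """Add proposed references that are not already in the existing list."""
    seen = {normalize(r) for r in existing}
    merged = list(existing)
    added = []
    for ref in proposed:
        key = normalize(ref)
        if key not in seen:
            seen.add(key)
            merged.append(ref)
            added.append(ref)
    # Assumption: an existing reference missing from the new extraction
    # counts as a possible conflict and is left for a maintainer to resolve.
    proposed_keys = {normalize(r) for r in proposed}
    missing = [r for r in existing if normalize(r) not in proposed_keys]
    return MergeResult(merged, added, needs_review=bool(missing))
```

Nothing is ever deleted automatically in this sketch; existing references are kept and the record is only flagged, which matches the idea of leaving conflicts to a maintainer.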

2.4 Author lists

CMS publishes a paper in an instrumentation journal without sending it to arXiv. The journal does not send the 2000-person author list in its feed, but we have a PDF of the paper, sent by email by one of the authors. The maintainer directs BibExtract to get the author list from the PDF and then checks for possible missed extractions by:

  • Comparing with previous CMS papers
  • Making sure all extracted authors have appeared before in the database
  • Checking for affiliations that aren't in the institution knowledge base (KB)

The maintainer then adds the author list to the paper.
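The three checks above can be sketched as simple set comparisons. This is a minimal sketch under stated assumptions: Author and check_author_list are hypothetical names, authors are matched by exact name string, and the previous-paper names, known-author names and known affiliations are assumed to be available as plain collections of strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    name: str
    affiliation: str


def check_author_list(extracted, previous_paper_names,
                      known_author_names, known_affiliations):
    """Return the items a maintainer should look at before accepting the list."""
    names = {a.name for a in extracted}
    affiliations = {a.affiliation for a in extracted}
    return {
        # On a previous collaboration paper but not extracted this time.
        "missing_vs_previous": sorted(set(previous_paper_names) - names),
        # Extracted names that have never appeared in the database.
        "unknown_authors": sorted(names - set(known_author_names)),
        # Extracted affiliations that are not in the institution KB.
        "unknown_affiliations": sorted(affiliations - set(known_affiliations)),
    }
```

An empty report suggests the extraction is probably complete; any non-empty entry is only a hint for the maintainer, since collaboration membership changes between papers.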

3. Workflow

4. Mock-up screenshots

5. Architecture

6. API

-- TravisBrooks - 11 Jul 2008

Topic revision: r1 - 2008-07-11 - TravisBrooks