System Design: BibExtract
1. Introduction
BibExtract is a tool to automagically extract bibliographic data from the documents themselves, rather than from metadata feeds such as OAI.
The data extracted is generally references or authors and affiliations, but could potentially include other things.
The existing CERN RefExtract will be subsumed by this module, as will the existing SPIRES "Giva".
Most output of these extractors will be checked for accuracy by a human (possibly via
BibEdit/BibCheck), except possibly for non-core papers or in cases of extreme trust.
2. Use cases
2.1 arXiv download
The daily download from arXiv (OAI) also includes the PDF and LaTeX files for the papers. These are automatically fed through
BibExtract, which generates proposed lists of references and authors/affiliations for each paper. In some cases these proposed lists may move directly into the record (visible/countable/searchable); in others they may stay in a holding pen for human checking/addition.
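The routing decision in this use case could be sketched as follows. This is only an illustration: the function name, the two flags, and the return values are hypothetical, and the rule itself is an assumption taken from the introduction's remark that non-core papers and highly trusted cases might skip human checking.

```python
def route_extraction(is_core, trusted_source):
    """Decide where a proposed reference/author list goes.

    Assumed policy (not a settled design): output for non-core papers,
    or from a source we trust highly, goes straight into the record;
    everything else waits in the holding pen for a human check.
    """
    if not is_core or trusted_source:
        return "record"
    return "holding_pen"


if __name__ == "__main__":
    # A core paper from an ordinary source waits for a maintainer.
    print(route_extraction(is_core=True, trusted_source=False))
```

Keeping the policy in one small function would make it easy to tighten or loosen later without touching the extractors themselves.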
2.2 Adding missed references
A user complains that her reference is missing from our version of a journal paper. Using the BibEdit interface, a maintainer goes to the record and requests that
BibExtract re-extract the references from the journal PDF (not stored locally, but found by hand by the maintainer, who follows the DOI in the record to the journal splash page). A new reference list is generated, which can then (in BibEdit?) be compared with the existing list, and any new references added.
2.3 Updating references
An author resubmits a new version of a paper to arXiv after adding several new references. The OAI feed alerts us to a change in the paper.
BibExtract runs through the new version, checks the new proposed reference list against the existing list, and adds any new references, flagging the record for additional work if it detects a possible conflict.
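A minimal sketch of this comparison, assuming references are held as plain strings. The names `normalize` and `merge_references` are hypothetical, and the definition of a "possible conflict" used here (an existing reference that no longer appears in the new extraction) is an assumption, since the text does not define one.

```python
import re


def normalize(ref):
    """Lowercase and collapse punctuation/whitespace so that trivially
    different spellings of the same reference string compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", ref.lower()).strip()


def merge_references(existing, proposed):
    """Append proposed references that are not already in the record.

    Assumption: an existing reference that is absent from the proposed
    list counts as a possible conflict and flags the record for review.
    """
    seen = {normalize(r) for r in existing}
    proposed_keys = {normalize(r) for r in proposed}

    additions = []
    for ref in proposed:
        key = normalize(ref)
        if key not in seen:
            additions.append(ref)
            seen.add(key)

    missing = [r for r in existing if normalize(r) not in proposed_keys]
    return {
        "merged": list(existing) + additions,
        "additions": additions,
        "missing": missing,
        "needs_review": bool(missing),
    }


if __name__ == "__main__":
    # Placeholder reference strings, for illustration only.
    existing = ["Journal A 10 (1974) 2445", "Journal B 72 (1974) 461"]
    proposed = ["journal a 10 (1974) 2445", "Journal C 07 (2008) 001"]
    print(merge_references(existing, proposed))
```

String normalization is deliberately crude here; a real comparison would more likely match on parsed journal, volume, and page fields.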
2.4 Author lists
CMS publishes a paper in an instrumentation journal without sending it to arXiv. The journal does not send the 2000-person author list in its feed, but we have a PDF of the paper, sent by email by one of the authors. The maintainer directs
BibExtract to get the author list from the PDF and then checks for possible missed extractions by:
- Comparing with previous CMS papers
- Making sure all extracted authors have appeared before in the database
- Checking for affiliations that aren't in the institution KB
The maintainer then adds the author list to the paper.
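The three checks above could be sketched roughly as below. Everything here is hypothetical: the function name, the `(name, affiliation)` pair layout, and the idea that the previous author list, the known authors, and the institution KB are available as simple collections of strings.

```python
def check_author_list(extracted, previous_authors, known_authors, known_affiliations):
    """Report possible missed or bad extractions in an author list.

    extracted: list of (name, affiliation) pairs taken from the PDF.
    The other three arguments are collections of strings (assumed to be
    exported from earlier collaboration papers, the database, and the
    institution KB respectively).
    """
    names = {name for name, _ in extracted}
    affiliations = {aff for _, aff in extracted}
    return {
        # Authors on previous collaboration papers but not found this time.
        "missing_vs_previous": sorted(set(previous_authors) - names),
        # Extracted names that have never appeared in the database.
        "unknown_authors": sorted(names - set(known_authors)),
        # Extracted affiliations that the institution KB does not know.
        "unknown_affiliations": sorted(affiliations - set(known_affiliations)),
    }


if __name__ == "__main__":
    # Placeholder names and institutions, for illustration only.
    report = check_author_list(
        extracted=[("Author, A.", "Inst One"), ("Author, B.", "Inst Two")],
        previous_authors=["Author, A.", "Author, C."],
        known_authors=["Author, A.", "Author, C."],
        known_affiliations=["Inst One"],
    )
    print(report)
```

An empty report on all three keys would suggest the list is ready to add; any non-empty entry is something for the maintainer to look at, not necessarily an error.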
3. Workflow
4. Mock-up screenshots
5. Architecture
6. API
--
TravisBrooks - 11 Jul 2008