arXiv Harvesting

This page describes how the workflow for harvesting metadata from arXiv, including plot extraction, is configured on INSPIRE. The harvest will be scheduled daily, ahead of other metadata updates (such as SPIRES updates).

Setting up OAI Harvest

The harvesting of metadata from arXiv uses the OAI-PMH protocol, so it must be set up through the Invenio OAI Harvest Admin Interface. When adding a new OAI source, set the following parameters:

  • Base URL: http://export.arxiv.org/oai2
  • Metadata prefix: arXiv
  • Postprocess: harvest, convert, extract plots/references, attach full-text, filter, upload
  • BibConvert configuration file: /opt/invenio/etc/bibconvert/config/oaiarXiv2inspire.xsl
  • BibFilter program: /opt/invenio/bin/bibfilter_oaiarXiv2inspire.py

As the list above shows, two files are needed for the harvesting to work properly. First, an XSLT stylesheet is required to transform the OAI-style XML into proper Inspire MARCXML. Second, a Python script is required to filter incoming record updates. Both are installed when the Inspire sources are installed.
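As a rough illustration of what the Base URL and metadata prefix amount to at the protocol level, the sketch below builds a standard OAI-PMH ListRecords request URL for one arXiv set. The function name build_list_records_url and the example date are hypothetical, chosen only for illustration; in practice the requests are issued by the Invenio OAI Harvest machinery, not by hand.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/oai2"  # "Base URL" from the parameter list
METADATA_PREFIX = "arXiv"                  # "Metadata prefix" from the parameter list


def build_list_records_url(set_spec, from_date=None):
    """Return an OAI-PMH ListRecords URL for one arXiv set (illustrative helper)."""
    params = {
        "verb": "ListRecords",              # standard OAI-PMH verb
        "metadataPrefix": METADATA_PREFIX,
        "set": set_spec,                    # e.g. "physics:hep-th"
    }
    if from_date is not None:
        # Optional OAI-PMH "from" argument for selective (incremental) harvesting
        params["from"] = from_date
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Example date is an assumption, not a value taken from the INSPIRE setup.
    print(build_list_records_url("physics:hep-th", from_date="2011-01-20"))
```

A daily run would typically pass the date of the previous harvest as from_date, so that only new or changed records are returned.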

Selected categories

This section describes which sets/categories are currently harvested from arXiv.

Main sets where all records are accepted:

physics:astro-ph (Astrophysics)
physics:gr-qc (General Relativity and Quantum Cosmology)
physics:hep-ex (High Energy Physics - Experiment)
physics:hep-lat (High Energy Physics - Lattice)
physics:hep-ph (High Energy Physics - Phenomenology)
physics:hep-th (High Energy Physics - Theory)
physics:nucl-ex (Nuclear Experiment)
physics:nucl-th (Nuclear Theory)

Secondary sets, where only some records are accepted:

??

Update sets, where only updates are accepted (outside of the main ones):

physics:astro-ph 
physics:cond-mat
physics:math-ph
physics:nlin
physics  (Physics)
physics:quant-ph
math
cs
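The difference between main sets and update sets can be sketched as a small decision function. This is only an illustration of the policy described above, not the logic of the actual BibFilter script: the names MAIN_SETS, UPDATE_ONLY_SETS and decide, and the already_in_inspire flag, are hypothetical. Secondary sets are left out because their selection rules are not specified here, and physics:astro-ph, which appears in both lists, is treated as a main set.

```python
# Set specs copied from the lists above.
MAIN_SETS = {
    "physics:astro-ph", "physics:gr-qc", "physics:hep-ex", "physics:hep-lat",
    "physics:hep-ph", "physics:hep-th", "physics:nucl-ex", "physics:nucl-th",
}
UPDATE_ONLY_SETS = {
    "physics:cond-mat", "physics:math-ph", "physics:nlin",
    "physics", "physics:quant-ph", "math", "cs",
}


def decide(set_specs, already_in_inspire):
    """Return "accept" or "reject" for an incoming record.

    set_specs: the set names the record belongs to.
    already_in_inspire: True if the record is an update to an existing
    record (assumed to be determined elsewhere).
    """
    specs = set(set_specs)
    if specs & MAIN_SETS:
        return "accept"      # main sets: all records are accepted
    if specs & UPDATE_ONLY_SETS and already_in_inspire:
        return "accept"      # update sets: only updates are accepted
    return "reject"


if __name__ == "__main__":
    print(decide(["physics:hep-th"], already_in_inspire=False))
    print(decide(["math"], already_in_inspire=False))
    print(decide(["math"], already_in_inspire=True))
```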

Sub-Categories in the future

Including cross-listings:

physics:astro-ph.HE (High Energy Astrophysical Phenomena)
physics:physics.acc-ph (Accelerator Physics)
physics:physics.ins-det (Instrumentation and Detectors)

To be selected by main category only, if that is possible (otherwise DESY selects manually):

physics:physics.data-an (Data Analysis, Statistics and Probability)

Convert step: XSLT stylesheet

During the conversion step, the following files are needed:

  • oaiarXiv2inspire.xsl - main stylesheet to transform arXiv OAI to Inspire MARCXML. This is the path you put in the configuration.
  • oaiarXiv2inspire_categories.xml - the stylesheet uses this file to map various category-names.

Both files are installed by running make install from the bibconvert folder of the Inspire repo.

Filtering step: BibFilter script

The bibfilter step also consists of two files:

  • bibfilter_oaiarXiv2inspire.py - main script run by BibHarvest to filter records. This is the path you put in the configuration.
  • oaiarXiv_bibfilter_actions.cfg - configuration script for the filtering process

Both files are installed by running make install from the bibharvest folder of the Inspire repo.

-- JanLavik - 21-Jan-2011

Topic revision: r4 - 2013-07-17 - KirstenSachs