Journal harvesting in INSPIRE - a developer's view

We harvest content from several sources (Elsevier, APS, etc.) via different protocols and services:

  • OAI-PMH from sources such as Proceedings of Science (PoS)
  • Atom feeds from the CONSYN (Elsevier) system
  • REST-API services like the ones offered by American Physical Society (APS)

INSPIRE runs on Invenio, which supports some of these services out of the box; for the others we have written custom scripts.


When working on the current content ingestion scripts there are two repositories of interest:

  • Harvesting Kit: a Python package shared with SCOAP3, containing scripts and transformations of XML records into MARCXML. This is done purely in Python and replaces the legacy XSLT stylesheets.
  • The INSPIRE overlay: contains the INSPIRE-specific glue code, such as the apsharvest module and the BibFilter scripts described below.

Harvesting Kit


Harvesting Kit consists of a mix of utility functions and source-specific "packages".

These packages are mainly used to convert one metadata format to MARCXML.
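To illustrate the kind of transformation these packages perform, here is a minimal, self-contained sketch (not the actual Harvesting Kit code) that maps two JATS-like fields into MARCXML datafields; the input sample and field choices are illustrative:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal input; real JATS records are far richer.
JATS = """
<article>
  <article-meta>
    <article-title>A sample title</article-title>
    <article-id pub-id-type="doi">10.1000/xyz123</article-id>
  </article-meta>
</article>
"""

def jats_to_marcxml(jats_string):
    """Map a few JATS-like fields to MARCXML (illustrative only)."""
    src = ET.fromstring(jats_string)
    record = ET.Element("record")

    # Title -> MARC 245 $a
    title = src.findtext(".//article-title")
    if title:
        df = ET.SubElement(record, "datafield", tag="245", ind1=" ", ind2=" ")
        ET.SubElement(df, "subfield", code="a").text = title

    # DOI -> MARC 024 7_ $a with $2 DOI
    doi = src.findtext(".//article-id[@pub-id-type='doi']")
    if doi:
        df = ET.SubElement(record, "datafield", tag="024", ind1="7", ind2=" ")
        ET.SubElement(df, "subfield", code="a").text = doi
        ET.SubElement(df, "subfield", code="2").text = "DOI"

    return ET.tostring(record, encoding="unicode")
```

The real packages handle authors, abstracts, collaborations, references and much more, but the shape of the work is the same: walk the source XML, emit MARC datafields.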

If we take a look at the repository, the most interesting files are inside the harvestingkit folder.



To start contributing to Harvesting Kit, it is recommended that you fork the repository on GitHub and then clone your fork.

If you already have a virtual environment with Invenio v1.x.x installed (see how), you can simply install Harvesting Kit there:

workon master  # or whatever you named your Invenio v1.x.x environment
cdvirtualenv src
git clone
cd harvesting-kit
pip install -e . --process-dependency-links --allow-all-external

If you do not want to use a virtualenv, the procedure is exactly the same, minus the workon step:

cd ~/src
git clone
cd harvesting-kit
pip install -e . --process-dependency-links --allow-all-external

Using a virtualenv setup is, however, highly recommended.


OAI-PMH harvesting is active for PoS, among other sources. We use the Invenio module OAIHarvest to harvest metadata from these repositories.


Harvesting American Physical Society (APS)


APS provides a REST API for listing records updated within a date range, or for fetching specific papers by DOI.

The API documentation is available here.
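The endpoint and parameter names below are placeholders (consult the linked API documentation for the real ones); the sketch just shows how a date-range query URL could be assembled:

```python
from urllib.parse import urlencode

# Placeholder base URL -- the real one is in the APS API documentation.
APS_API_BASE = "http://example.org/aps/v2/journals/articles"

def build_aps_query(from_date=None, until_date=None, per_page=100):
    """Build a (hypothetical) APS REST query URL for a date range."""
    params = {"per_page": per_page}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    # Sort the parameters so the generated URL is deterministic.
    return "%s?%s" % (APS_API_BASE, urlencode(sorted(params.items())))

url = build_aps_query(from_date="2014-02-21", until_date="2014-02-28")
```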


For the APS ingestion we currently have the following layers:

First point of entry is the BibTasklet which is run using the following command:

bibtasklet -T bst_apsharvest

That is the most basic invocation; a more realistic example, harvesting a range of records, looks like this:

bibtasklet -T bst_apsharvest -a "from_date=2014-02-21" -a "until_date=2014-02-28" -a "threshold_date=2012-01-01"

The current CLI options:

   Task to download APS metadata + fulltext given a list of arguments.

    Operates in two ways:

        1. Harvesting of new/updated metadata+fulltext from APS via REST API

           This means that new records are looked for on the APS servers.
           Active when from_date and until_date are given, or when
           a DOI not already in the system is given.

           If the value "last" is given to from_date the harvester will harvest
           any new records since the last run.

           If match is set to "yes" the records harvested will be matched against
           the database and split into "new" and "updated" records.

        2. Attachment of fulltext only from APS for existing records

           When the records to be processed already exist in the system, the
           task only harvests the fulltexts themselves and attaches them
           to the records.


    Get full update for existing records via record identifier:
    >>> bst_apsharvest(recids="13,513,333")

    Get full update for existing records via a search query and unhide fulltext:
    >>> bst_apsharvest(query="find j prstab", hidden="no")

    Get metadata only update for an existing doi:
    >>> bst_apsharvest(dois="10.1103/PhysRevB.87.235401", fulltext="no")

    Get fulltext only update for a record and append to record:
    >>> bst_apsharvest(recids="11139", metadata="no", update_mode="append")

    Get new records from APS, send update to holding pen and email new records
    >>> bst_apsharvest(from_date="last", update_mode="o")

    Get records from APS updated between given dates, insert new and correct
    >>> bst_apsharvest(from_date="2013-06-03", until_date="2013-06-04",
                       new_mode="insert", update_mode="correct")

    @param dois: comma-separated list of DOIs to download fulltext/metadata for.
    @type dois: string

    @param recids: comma-separated list of recids of record containing
                   a DOI to download fulltext for.
    @type recids: string

    @param query: an Invenio search query of records to download fulltext for.
    @type query: string

    @param records: get any records modified, created or both since last time
                    in the database to download fulltext for, can be either:
                    "new" - fetches all new records added
                    "modified" - fetches all modified records added
                    "both" - both of the above
    @type records: string

    @param new_mode: which mode should the new records be submitted in:
                "email" - does NOT run bibupload and sends an email instead. Default.
                "insert" - inserts the records into the database
                "append" - appends the fulltext to the existing attached files
                "correct" - corrects existing attached fulltext files, or adds new
                "replace" - replaces all attached files with new fulltext file

                The fulltext is appended by default to new records.
    @type new_mode: string

    @param update_mode: which mode should the updated records be submitted in:
                "email" - does NOT run bibupload and sends an email instead. Default.
                "insert" - inserts the records into the database
                "append" - appends the fulltext to the existing attached files
                "correct" - corrects existing attached fulltext files, or adds new
                "replace" - replaces all attached files with new fulltext file

                The fulltext is appended by default to new records.
    @type update_mode: string

    @param from_date: ISO date for when to harvest records from. Ex. 2013-01-01
                      If the value is "last" it means to get records since the
                      last run.
    @type from_date: string

    @param until_date: ISO date for when to harvest records until. Ex. 2013-01-01
    @type until_date: string

    @param fulltext: should the record have fulltext attached? "yes" or "no"
    @type fulltext: string

    @param hidden: should the fulltext be hidden when attached? "yes" or "no"
    @type hidden: string

    @param match: should a simple match with the database be done? "yes" or "no"
    @type match: string

    @param reportonly: only report number of records to harvest, then exit? "yes" or "no"
    @type reportonly: string

    @param threshold_date: ISO date for when to harvest records since. Ex. 2013-01-01
    @type threshold_date: string

    @param devmode: Activate devmode. Full verbosity and no uploads/mails.
    @type devmode: string


We query the APS REST API as detailed in the attached file.

Example query to fetch records:

$ curl ''

We then receive a JSON response:


We fetch the apsxml and store it as the fulltext, and also check the BagIt format consistency with checksums, etc.
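A simplified sketch of the checksum verification step, assuming a standard BagIt layout with a manifest-md5.txt (this is illustrative, not the actual apsharvest code):

```python
import hashlib
import os

def verify_bagit_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Recompute the MD5 checksums listed in a BagIt payload manifest.

    Returns a list of payload files whose checksum does not match
    (an empty list means the bag is consistent). A production harvester
    would also check bagit.txt, tag manifests and payload completeness.
    """
    failures = []
    manifest = os.path.join(bag_dir, manifest_name)
    with open(manifest) as f:
        for line in f:
            # Each manifest line is "<checksum> <relative path>".
            expected, relpath = line.strip().split(None, 1)
            digest = hashlib.md5()
            with open(os.path.join(bag_dir, relpath), "rb") as payload:
                for chunk in iter(lambda: payload.read(8192), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected:
                failures.append(relpath)
    return failures
```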

To do this, the BibTasklet bst_apsharvest uses code from the apsharvest module inside the INSPIRE overlay:


It also depends on Harvesting Kit for converting the JATS XML received from APS into MARCXML.


If the XML received is not JATS, we fall back to the legacy XSLT 2.0 stylesheets via a Java call.
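The dispatch between the two conversion paths can be sketched like this; the root-element heuristic and the return values are illustrative placeholders, not the real implementation:

```python
import xml.etree.ElementTree as ET

def looks_like_jats(xml_string):
    """Heuristic check: JATS full-text records use <article> as root."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    # Strip any namespace prefix before comparing the tag name.
    return root.tag.rsplit("}", 1)[-1] == "article"

def convert(xml_string):
    """Dispatch: Python converter for JATS, legacy XSLT otherwise (sketch)."""
    if looks_like_jats(xml_string):
        return "python-converter"   # would call the Harvesting Kit converter
    return "xslt-fallback"          # would shell out to the Java/XSLT 2.0 path
```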

Harvesting Proceedings of Science (PoS)


Since the PoS records are harvested through OAI-PMH, we make use of the OAIHarvest module of Invenio. The module harvests the records in the Dublin Core format supplied by PoS, and we then run a "BibFilter" script on them.

This is all done automatically when running the harvest.

This filtering script lives inside the INSPIRE overlay:


It is a Python command line script that takes one argument: the path to an XML file with PoS harvested records.

python path_to.xml

The output is then saved in the folder determined by the variable CFG_POS_OUT_DIRECTORY.
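A bibfilter-style script of this shape could look roughly like the following; the names, the stub conversion step and the output layout are illustrative, and the real script calls Harvesting Kit for the conversion:

```python
import os
import sys

# Illustrative stand-in for the real Invenio configuration variable.
CFG_POS_OUT_DIRECTORY = "/tmp/pos-harvest"

def filter_records(input_path, out_dir=CFG_POS_OUT_DIRECTORY):
    """Read harvested XML, convert it, and write the result to out_dir.

    The conversion is stubbed out here; the real filter calls Harvesting
    Kit to turn the Dublin Core records into MARCXML.
    """
    with open(input_path) as f:
        raw = f.read()
    marcxml = raw  # placeholder: the real code converts DC -> MARCXML here
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    out_path = os.path.join(out_dir, os.path.basename(input_path))
    with open(out_path, "w") as out:
        out.write(marcxml)
    return out_path

if __name__ == "__main__" and len(sys.argv) > 1:
    print(filter_records(sys.argv[1]))
```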


By reading the XML, the filtering script then calls Harvesting Kit to convert the records into MARCXML.



Proceedings of Science has an OAI-PMH server.

OAI-PMH url:

XSD schema:

I'm going to use the conference "IHEP-LHC" in the following examples:

The OAI base URL is:

Each record describes a proceeding (i.e. a single contribution to a conference):

The record identifier is of this form: PoS($conference-short-name)$numeral, e.g.: PoS(IHEP-LHC)001

If you prepend the base string to it, you end up with a sort of "stable URL" for the record, which points to a minimal landing page. If a PDF file is available, you should find it there.
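The identifier pattern above can be parsed with a few lines of Python; the regular expression is an assumption based on the PoS(...)N examples on this page, not on an official PoS specification:

```python
import re

# Pattern for identifiers like "PoS(IHEP-LHC)001"; assumed from the
# examples on this page, not from an official PoS specification.
POS_ID_RE = re.compile(r"^PoS\((?P<conf>[^)]+)\)(?P<num>\w+)$")

def parse_pos_identifier(identifier):
    """Split e.g. 'PoS(IHEP-LHC)001' into (conference short name, numeral)."""
    m = POS_ID_RE.match(identifier)
    if m is None:
        raise ValueError("not a PoS identifier: %r" % identifier)
    return m.group("conf"), m.group("num")
```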

There are two orthogonal "sets" by which each record is classified:

the "conference", which represents all proceedings of a conference, such as: (should give 29 records: all the accepted contributions of that conference)

and the "group", which represent keywords, such as (should give 4702 records: all the contributions to all conferences that have the "group:6" keyword, which is for "High Energy Physics")

As already mentioned, the keywords are assigned to the conference, so each contribution of that conference will share the same keywords.

When a new conference is published, we may send you the conference short name, which you can use to collect the metadata records:$conference-short-name
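A ListRecords request restricted to one of these sets can be assembled like this; the base URL below is a caller-supplied placeholder, since the real PoS endpoint is not reproduced here:

```python
from urllib.parse import urlencode

def oai_listrecords_url(base_url, set_spec, prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL restricted to one set.

    base_url is the OAI-PMH base URL of the repository; set_spec is the
    conference or group set to harvest (e.g. a conference short name set).
    """
    params = {"verb": "ListRecords", "metadataPrefix": prefix, "set": set_spec}
    # Sort for a deterministic query string; OAI-PMH ignores the order.
    return "%s?%s" % (base_url, urlencode(sorted(params.items())))
```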

Harvesting from Elsevier (CONSYN)




-- JanLavik - 14 May 2014

Topic revision: r5 - 2014-05-15 - JanLavik