Journal harvesting in INSPIRE - a developer's view

We harvest content from several sources (arXiv.org, Elsevier, APS, etc.) via different protocols and services:

  • OAI-PMH from arXiv.org and Proceedings of Science (PoS)
  • Atom feeds from CONSYN (Elsevier) systems
  • REST API services such as the one offered by the American Physical Society (APS)

INSPIRE runs on Invenio, which supports some of these services. For the others we have written custom scripts.

Repositories

When working on the current content ingestion scripts there are two repositories of interest:

  • Harvesting Kit: a Python package, shared with SCOAP3, containing scripts and transformations that convert XML records into MARCXML. The conversion is done purely in Python and replaces the legacy XSLT stylesheets.
  • INSPIRE overlay: the INSPIRE-specific code, which holds the apsharvest module and the bibharvest BibFilter scripts referred to later on this page.

Harvesting Kit

Overview

Harvesting Kit consists of a mix of utility modules (such as ftp_utils.py) and source-specific "packages" (such as aps_package.py and elsevier_package.py).

These packages are mainly used to convert one metadata format to MARCXML.
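
To give a feel for the target format, here is a minimal sketch of building a MARCXML record with the Python standard library. It is not Harvesting Kit's own API: the function name make_marcxml and the choice of fields (024 for the DOI, 245 for the title) are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def make_marcxml(title, doi):
    """Build a minimal MARCXML record (hypothetical helper, not part of Harvesting Kit)."""
    record = ET.Element("record")

    # 024 7_: standard identifier, with the identifier type given in subfield 2
    doi_field = ET.SubElement(
        record, "datafield", {"tag": "024", "ind1": "7", "ind2": " "}
    )
    ET.SubElement(doi_field, "subfield", {"code": "a"}).text = doi
    ET.SubElement(doi_field, "subfield", {"code": "2"}).text = "DOI"

    # 245: title statement
    title_field = ET.SubElement(
        record, "datafield", {"tag": "245", "ind1": " ", "ind2": " "}
    )
    ET.SubElement(title_field, "subfield", {"code": "a"}).text = title

    return ET.tostring(record, encoding="unicode")


print(make_marcxml("An example title", "10.1103/PhysRevA.87.050301"))
```

The real packages handle many more fields (authors, journal references, abstracts and so on); the sketch only shows the datafield/subfield structure that every conversion has to produce.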

The most interesting files in the repository are inside the harvestingkit folder:

harvestingkit/
  aps_package.py
  elsevier_package.py
  pos_package.py
  ftp_utils.py
  springer_crawler.py
  ...

Contribution

To start contributing to the Harvesting Kit repository, we recommend that you fork the repository on GitHub and then clone your fork.

If you already have a virtual environment with Invenio v1.x.x installed, you can simply install Harvesting Kit there:

workon master  # or whatever you named your Invenio v1.x.x environment
cdvirtualenv src
git clone git@github.com:yourusername/harvesting-kit.git
cd harvesting-kit
pip install -e . --process-dependency-links --allow-all-external

If you do not want to use a virtualenv, the procedure is the same apart from the virtualenv-specific commands:

cd ~/src
git clone git@github.com:yourusername/harvesting-kit.git
cd harvesting-kit
pip install -e . --process-dependency-links --allow-all-external

Using a virtualenv setup is nevertheless highly recommended.

OAI-PMH

OAI-PMH harvesting is active for arXiv.org and PoS. We use the Invenio module OAIHarvest to harvest metadata from these repositories.

Read more in the OAIHarvest admin guide: http://invenio-demo.cern.ch/help/admin/oaiharvest-admin-guide
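
For orientation, an OAI-PMH ListRecords response is an XML document with one header per record and, when the result is paginated, a resumptionToken. The following sketch shows how such a response could be read with the standard library. It is not what OAIHarvest does internally; the sample response, its identifiers and the function name parse_list_records are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response, trimmed to the parts used below
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:0001</identifier>
        <datestamp>2014-05-14</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:0002</identifier>
        <datestamp>2014-05-15</datestamp>
      </header>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing token means there are no further pages
    has_token = token_element is not None and token_element.text
    return identifiers, (token_element.text if has_token else None)


identifiers, token = parse_list_records(SAMPLE_RESPONSE)
print(identifiers, token)
```

A harvester would repeat the request with the returned token until no token comes back.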

Harvesting American Physical Society (APS)

Background

APS provides a REST API to get a list of the records updated within a date range, or to get specific papers by DOI.

The API documentation is available here.

Usage

For the APS ingestion we currently have the following layers:

The first point of entry is the BibTasklet bst_apsharvest.py, which is run using the following command:

bibtasklet -T bst_apsharvest

That is the most basic example. A more realistic example, harvesting a range of records, looks like this:

bibtasklet -T bst_apsharvest -a "from_date=2014-02-21" -a "until_date=2014-02-28" -a "threshold_date=2012-01-01"
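
If the task has to be launched from another script, the command line above can be assembled programmatically. This is only a sketch: the helper build_apsharvest_command is hypothetical, and it assumes nothing beyond the -T and -a options shown in the commands above.

```python
import shlex


def build_apsharvest_command(**task_args):
    """Return the argv list for running bst_apsharvest via bibtasklet (hypothetical helper)."""
    argv = ["bibtasklet", "-T", "bst_apsharvest"]
    # Each task argument is passed as a separate -a "name=value" option
    for name, value in sorted(task_args.items()):
        argv.extend(["-a", "{0}={1}".format(name, value)])
    return argv


argv = build_apsharvest_command(
    from_date="2014-02-21",
    until_date="2014-02-28",
    threshold_date="2012-01-01",
)
# Print a shell-safe version of the command for inspection
print(" ".join(shlex.quote(part) for part in argv))
```

The resulting list can be handed to subprocess.call on a machine where bibtasklet is installed.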

The current options, as documented in the task's docstring:

   Task to download APS metadata + fulltext given a list of arguments.

    Operates in two ways:

        1. Harvesting of new/updated metadata+fulltext from APS via REST API

           This means that new records are being looked for at APS servers.
           Active when from_date and until_date are given, and also when
           a DOI not already in the system is given.

           If the value "last" is given to from_date the harvester will harvest
           any new records since last run.

           If match is set to "yes" the records harvested will be matched against
           the database and split into "new" and "updated" records.

        2. Attachment of fulltext only from APS for existing records

           When the records to be processed already exist in the system, the
           task only harvests the fulltexts themselves and attaches them
           to the records.


    Examples:

    Get full update for existing records via record identifier:
    >>> bst_apsharvest(recids="13,513,333")

    Get full update for existing records via a search query and unhide fulltext:
    >>> bst_apsharvest(query="find j prstab", hidden="no")

    Get metadata only update for an existing doi:
    >>> bst_apsharvest(dois="10.1103/PhysRevB.87.235401", fulltext="no")

    Get fulltext only update for a record and append to record:
    >>> bst_apsharvest(recids="11139", metadata="no", update_mode="append")

    Get new records from APS, send update to holding pen and email new records
    >>> bst_apsharvest(from_date="last", update_mode="o")

    Get records from APS updated between given dates, insert new and correct
    >>> bst_apsharvest(from_date="2013-06-03", until_date="2013-06-04",
                       new_mode="insert", update_mode="correct")


    @param dois: comma-separated list of DOIs to download fulltext/metadata for.
    @type dois: string

    @param recids: comma-separated list of recids of record containing
                   a DOI to download fulltext for.
    @type recids: string

    @param query: an Invenio search query of records to download fulltext for.
    @type query: string

    @param records: get any records modified, created or both since last time
                    in the database to download fulltext for, can be either:
                    "new" - fetches all new records added
                    "modified" - fetches all modified records added
                    "both" - both of the above
    @type records: string

    @param new_mode: which mode should the fulltext files be submitted in:
                "email" - does NOT run bibupload and sends an email instead. Default.
                "insert" - inserts the records into the database
                "append" - appends the fulltext to the existing attached files
                "correct" - corrects existing attached fulltext files, or adds new
                "replace" - replaces all attached files with new fulltext file

                The fulltext is appended by default to new records.
    @type new_mode: string


    @param update_mode: which mode should the fulltext files be submitted in:
                "email" - does NOT run bibupload and sends an email instead. Default.
                "insert" - inserts the records into the database
                "append" - appends the fulltext to the existing attached files
                "correct" - corrects existing attached fulltext files, or adds new
                "replace" - replaces all attached files with new fulltext file

                The fulltext is appended by default to new records.
    @type update_mode: string

    @param from_date: ISO date for when to harvest records from. Ex. 2013-01-01
                      If the value is "last" it means to get records since last
                      harvest.
    @type from_date: string

    @param until_date: ISO date for when to harvest records until. Ex. 2013-01-01
    @type until_date: string

    @param fulltext: should the record have fulltext attached? "yes" or "no"
    @type fulltext: string

    @param hidden: should the fulltext be hidden when attached? "yes" or "no"
    @type hidden: string

    @param match: should a simple match with the database be done? "yes" or "no"
    @type match: string

    @param reportonly: only report number of records to harvest, then exit? "yes" or "no"
    @type reportonly: string

    @param threshold_date: ISO date for when to harvest records since. Ex. 2013-01-01
    @type threshold_date: string

    @param devmode: Activate devmode. Full verbosity and no uploads/mails.
    @type devmode: string

Implementation

We query the APS REST API detailed in the attached file.

Example query to fetch the articles updated from a given date onwards:

$ curl 'http://harvest.aps.org/content/journals/articles?from=2013-04-15'

We then receive a JSON response:

{
 "doi":"10.1103/PhysRevA.87.050301",
 "metadata_last_modified_at":"2013-05-13T10:11:48-0400",
 "last_modified_at":"2014-05-15T08:06:21-0400",
 "bagit_urls":
    {"apsxml":
      "http://harvest.aps.org/bagit/articles/10.1103/PhysRevA.87.050301/apsxml"
    }
}
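
As an illustration of how such an entry can be consumed, the sketch below parses the sample above and picks out the DOI, the apsxml BagIt URL and the modification time. The function name summarize_entry is an assumption and this is not the code used in apsharvest.

```python
import json
from datetime import datetime

# The sample entry shown above, as a JSON string
SAMPLE_ENTRY = """
{
 "doi": "10.1103/PhysRevA.87.050301",
 "metadata_last_modified_at": "2013-05-13T10:11:48-0400",
 "last_modified_at": "2014-05-15T08:06:21-0400",
 "bagit_urls": {
   "apsxml": "http://harvest.aps.org/bagit/articles/10.1103/PhysRevA.87.050301/apsxml"
 }
}
"""


def summarize_entry(entry):
    """Pick out the fields a harvester would need (hypothetical helper)."""
    return {
        "doi": entry["doi"],
        # .get() keeps the sketch tolerant of entries without an apsxml package
        "apsxml_url": entry.get("bagit_urls", {}).get("apsxml"),
        "last_modified": datetime.strptime(
            entry["last_modified_at"], "%Y-%m-%dT%H:%M:%S%z"
        ),
    }


summary = summarize_entry(json.loads(SAMPLE_ENTRY))
print(summary["doi"], summary["apsxml_url"])
```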

We fetch the apsxml and store it as the fulltext, in addition to checking the consistency of the BagIt package (checksums etc.).
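
The checksum part of such a consistency check can be sketched as follows. This is an assumption-laden illustration, not the apsharvest implementation: it assumes the package has already been unpacked into a directory and follows the general BagIt layout, where a manifest-md5.txt file lists one "checksum path" pair per payload file. The function name verify_bag_manifest is hypothetical.

```python
import hashlib
import os


def verify_bag_manifest(bag_dir, algorithm="md5"):
    """Return the payload paths whose checksum differs from the manifest (hypothetical helper)."""
    manifest_path = os.path.join(bag_dir, "manifest-{0}.txt".format(algorithm))
    mismatches = []
    with open(manifest_path) as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue
            # Each manifest line is "<checksum> <relative path>"
            expected, relative_path = line.split(None, 1)
            digest = hashlib.new(algorithm)
            with open(os.path.join(bag_dir, relative_path), "rb") as payload:
                # Read in chunks so large fulltext files do not fill memory
                for chunk in iter(lambda: payload.read(8192), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                mismatches.append(relative_path)
    return mismatches
```

An empty list means every file listed in the manifest matched its recorded checksum.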

To do this, the BibTasklet bst_apsharvest uses code from the apsharvest module inside the INSPIRE overlay:

apsharvest/
  apsharvest_config.py
  apsharvest_dblayer.py
  apsharvest_tests.py
  apsharvest_utils.py

It also depends on Harvesting Kit for converting the JATS XML received from APS into MARCXML:

harvestingkit/
  aps_package.py

As a fallback, in case the XML received is not JATS, we use the old XSLT 2.0 stylesheets via a Java call.

Harvesting Proceedings of Science (PoS)

Usage

Since the PoS records are harvested through OAI-PMH, we use the OAIHarvest module of Invenio. The module harvests the records in the Dublin Core format supplied by PoS, and we then run a "BibFilter" script on the records.

This is all done automatically when running the harvest.

This filtering script is named bibfilter_oaipos2inspire.py and lives inside the INSPIRE overlay:

bibharvest/
  bibfilter_oaipos2inspire.py

It is a Python command-line script that takes one argument: the path to an XML file with harvested PoS records.

python bibfilter_oaipos2inspire.py path_to.xml

The output is then saved in the folder determined by the variable CFG_POS_OUT_DIRECTORY.

Implementation

The filtering script reads the XML and calls Harvesting Kit to convert it into MARCXML using pos_package.py:

harvestingkit/
  pos_package.py

Background

Proceedings of Science has an OAI-PMH server.

OAI-PMH URL: http://pos.sissa.it/cgi-bin/oai/oai-script-spires-extended.cgi?verb=ListRecords&metadataPrefix=pos-ext_dc&set=conference:NIC%20X

XSD schema: http://pos.sissa.it/pos-ext_dc/pos-ext_dc.xsd

I'm going to use the conference "IHEP-LHC" in the following examples: http://pos.sissa.it/cgi-bin/reader/conf.cgi?confid=186

The OAI base URL is: http://pos.sissa.it/cgi-bin/oai/oai-script.cgi

Each record describes a proceeding (i.e. a single contribution to a conference): http://pos.sissa.it/cgi-bin/oai/oai-script.cgi?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai%3Apos.sissa.it%3AIHEP-LHC%2F001

The element in the section is in this form: PoS($conference-short-name)$numeral e.g.: PoS(IHEP-LHC)001

If you prepend the string http://pos.sissa.it/contribution?id= to it, you end up with a sort of "stable URL" for the record, which points to a minimal landing page: http://pos.sissa.it/contribution?id=PoS(IHEP-LHC)001 (if a PDF file is available, you should find it there).
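
The identifier form and the URL construction described above can be sketched in a few lines. The regular expression and the function names are assumptions for illustration; in particular, the sketch assumes the conference short name never contains a closing parenthesis.

```python
import re

# PoS($conference-short-name)$numeral, e.g. PoS(IHEP-LHC)001
POS_ID_RE = re.compile(r"^PoS\((?P<conference>[^)]+)\)(?P<number>\d+)$")
STABLE_URL_PREFIX = "http://pos.sissa.it/contribution?id="


def parse_pos_identifier(identifier):
    """Split a PoS identifier into (conference short name, contribution number)."""
    match = POS_ID_RE.match(identifier.strip())
    if match is None:
        raise ValueError("Not a PoS identifier: {0!r}".format(identifier))
    return match.group("conference"), match.group("number")


def stable_url(identifier):
    """Build the landing-page URL by prepending the stable URL prefix."""
    parse_pos_identifier(identifier)  # raises ValueError on malformed input
    return STABLE_URL_PREFIX + identifier.strip()


print(parse_pos_identifier("PoS(IHEP-LHC)001"))
print(stable_url("PoS(IHEP-LHC)001"))
```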

There are two orthogonal "sets" by which each record is classified: http://pos.sissa.it/cgi-bin/oai/oai-script.cgi?verb=ListSets

the "conference", which represents all proceedings of a conference, such as: http://pos.sissa.it/cgi-bin/oai/oai-script.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=conference:IHEP-LHC (should give 29 records: all the accepted contributions of that conference)

and the "group", which represents keywords, such as http://pos.sissa.it/cgi-bin/oai/oai-script.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=group:6 (should give 4702 records: all the contributions to all conferences that have the "group:6" keyword, which is for "High Energy Physics")

The keywords are assigned to the conference, so all contributions of that conference share the same keywords.

When a new conference is published, we may send you the conference short name, which you can use to collect the metadata records: http://pos.sissa.it/cgi-bin/oai/oai-script.cgi?verb=ListRecords&metadataPrefix=oai_dc&set=conference:$conference-short-name
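
Given a conference short name, the ListRecords URL can be built as in the sketch below. The function name list_records_url is hypothetical. Note that the standard library percent-encodes the colon in the set name (conference%3A...), which is equivalent to the literal colon in the URLs above once the server decodes the query string.

```python
from urllib.parse import quote, urlencode

POS_OAI_BASE_URL = "http://pos.sissa.it/cgi-bin/oai/oai-script.cgi"


def list_records_url(conference, metadata_prefix="oai_dc"):
    """Build the ListRecords URL for one conference set (hypothetical helper)."""
    params = [
        ("verb", "ListRecords"),
        ("metadataPrefix", metadata_prefix),
        ("set", "conference:" + conference),
    ]
    # quote_via=quote encodes spaces as %20 rather than "+"
    return POS_OAI_BASE_URL + "?" + urlencode(params, quote_via=quote)


print(list_records_url("IHEP-LHC"))
```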

Harvesting from Elsevier (CONSYN)

Usage

Background

Implementation

-- JanLavik - 14 May 2014

Topic revision: r5 - 2014-05-15 - JanLavik
 