Journal harvesting in INSPIRE - a developer's view

We harvest content from several sources (Elsevier, APS, etc.) via different protocols and services:

  • OAI-PMH from sources such as Proceedings of Science (PoS)
  • Atom feeds from the CONSYN (Elsevier) system
  • REST-API services like the ones offered by American Physical Society (APS)

INSPIRE runs on Invenio, which supports some of these services out of the box; for the others we have written custom scripts.


When working on the current content ingestion scripts there are two repositories of interest:

  • Harvesting Kit: a Python package shared with SCOAP3, containing scripts and transformations of XML records into MARCXML. This is done purely in Python and replaces the legacy XSLT stylesheets.
  • The INSPIRE overlay: contains the INSPIRE-specific glue code, such as the apsharvest module and the BibFilter scripts described below.

Harvesting Kit


Harvesting Kit consists of a mix of utility functions and source-specific "packages".

These packages are mainly used to convert one metadata format to MARCXML.
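To illustrate the kind of transformation these packages perform, here is a minimal, self-contained sketch (not the actual Harvesting Kit code) that maps two JATS-like fields into MARCXML datafields; the input sample and field choices are illustrative:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal input; real JATS records are far richer.
JATS = """
<article>
  <article-meta>
    <article-title>A sample title</article-title>
    <article-id pub-id-type="doi">10.1000/xyz123</article-id>
  </article-meta>
</article>
"""

def jats_to_marcxml(jats_string):
    """Map a few JATS-like fields to MARCXML (illustrative only)."""
    src = ET.fromstring(jats_string)
    record = ET.Element("record")

    # Title -> MARC 245 $a
    title = src.findtext(".//article-title")
    if title:
        df = ET.SubElement(record, "datafield", tag="245", ind1=" ", ind2=" ")
        ET.SubElement(df, "subfield", code="a").text = title

    # DOI -> MARC 024 7_ $a with $2 DOI
    doi = src.findtext(".//article-id[@pub-id-type='doi']")
    if doi:
        df = ET.SubElement(record, "datafield", tag="024", ind1="7", ind2=" ")
        ET.SubElement(df, "subfield", code="a").text = doi
        ET.SubElement(df, "subfield", code="2").text = "DOI"

    return ET.tostring(record, encoding="unicode")
```

The real packages handle authors, abstracts, collaborations, references and much more, but the shape of the work is the same: walk the source XML, emit MARC datafields.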

If we take a look at the repository, the most interesting files are inside the harvestingkit folder.



To start contributing to Harvesting Kit, it is recommended that you fork the repository on GitHub and then clone your fork.

If you already have a virtual environment with Invenio v1.x.x installed (see how), you can simply install Harvesting Kit there:

workon master  # or whatever you named your Invenio v1.x.x environment
cdvirtualenv src
git clone
cd harvesting-kit
pip install -e . --process-dependency-links --allow-all-external

If you do not want to use a virtualenv, the procedure is exactly the same, minus the workon step:

cd ~/src
git clone
cd harvesting-kit
pip install -e . --process-dependency-links --allow-all-external

Using a virtualenv setup is, however, highly recommended.


OAI-PMH harvesting is active for PoS, among other sources. We use the Invenio module OAIHarvest to harvest metadata from these repositories.


Harvesting American Physical Society (APS)


APS provides a REST API for listing records updated within a date range, or for fetching specific papers by DOI.

The API documentation is available here.
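The endpoint and parameter names below are placeholders (consult the linked API documentation for the real ones); the sketch just shows how a date-range query URL could be assembled:

```python
from urllib.parse import urlencode

# Placeholder base URL -- the real one is in the APS API documentation.
APS_API_BASE = "http://example.org/aps/v2/journals/articles"

def build_aps_query(from_date=None, until_date=None, per_page=100):
    """Build a (hypothetical) APS REST query URL for a date range."""
    params = {"per_page": per_page}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    # Sort the parameters so the generated URL is deterministic.
    return "%s?%s" % (APS_API_BASE, urlencode(sorted(params.items())))

url = build_aps_query(from_date="2014-02-21", until_date="2014-02-28")
```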


For the APS ingestion we currently have the following layers:

First point of entry is the BibTasklet which is run using the following command:

bibtasklet -T bst_apsharvest

That is the most basic invocation; a more realistic example, harvesting a range of records, looks like this:

bibtasklet -T bst_apsharvest -a "from_date=2014-02-21" -a "until_date=2014-02-28" -a "threshold_date=2012-01-01"

The current CLI options:

   Task to download APS metadata + fulltext given a list of arguments.

    Operates in two ways:

        1. Harvesting of new/updated metadata+fulltext from APS via REST API

           This means that new records are looked for on the APS servers.
           Active when from_date and until_date are given, or when
           a DOI not already in the system is given.

           If the value "last" is given to from_date the harvester will harvest
           any new records since the last run.

           If match is set to "yes" the records harvested will be matched against
           the database and split into "new" and "updated" records.

        2. Attachment of fulltext only from APS for existing records

           When the records to be processed already exist in the system, the
           task only harvests the fulltexts themselves and attaches them
           to the records.


    Get full update for existing records via record identifier:
    >>> bst_apsharvest(recids="13,513,333")

    Get full update for existing records via a search query and unhide fulltext:
    >>> bst_apsharvest(query="find j prstab", hidden="no")

    Get metadata only update for an existing doi:
    >>> bst_apsharvest(dois="10.1103/PhysRevB.87.235401", fulltext="no")

    Get fulltext only update for a record and append to record:
    >>> bst_apsharvest(recids="11139", metadata="no", update_mode="append")

    Get new records from APS, send update to holding pen and email new records
    >>> bst_apsharvest(from_date="last", update_mode="o")

    Get records from APS updated between given dates, insert new and correct
    >>> bst_apsharvest(from_date="2013-06-03", until_date="2013-06-04",
                       new_mode="insert", update_mode="correct")

    @param dois: comma-separated list of DOIs to download fulltext/metadata for.
    @type dois: string

    @param recids: comma-separated list of recids of record containing
                   a DOI to download fulltext for.
    @type recids: string

    @param query: an Invenio search query of records to download fulltext for.
    @type query: string

    @param records: get any records modified, created or both since last time
                    in the database to download fulltext for, can be either:
                    "new" - fetches all new records added
                    "modified" - fetches all modified records added
                    "both" - both of the above
    @type records: string

    @param new_mode: which mode should the new records be submitted in:
                "email" - does NOT run bibupload and sends an email instead. Default.
                "insert" - inserts the records into the database
                "append" - appends the fulltext to the existing attached files
                "correct" - corrects existing attached fulltext files, or adds new
                "replace" - replaces all attached files with new fulltext file

                The fulltext is appended by default to new records.
    @type new_mode: string

    @param update_mode: which mode should the updated records be submitted in:
                "email" - does NOT run bibupload and sends an email instead. Default.
                "insert" - inserts the records into the database
                "append" - appends the fulltext to the existing attached files
                "correct" - corrects existing attached fulltext files, or adds new
                "replace" - replaces all attached files with new fulltext file

                The fulltext is appended by default to new records.
    @type update_mode: string

    @param from_date: ISO date for when to harvest records from. Ex. 2013-01-01
                      If the value is "last" it means to get records since the
                      last run.
    @type from_date: string

    @param until_date: ISO date for when to harvest records until. Ex. 2013-01-01
    @type until_date: string

    @param fulltext: should the record have fulltext attached? "yes" or "no"
    @type fulltext: string

    @param hidden: should the fulltext be hidden when attached? "yes" or "no"
    @type hidden: string

    @param match: should a simple match with the database be done? "yes" or "no"
    @type match: string

    @param reportonly: only report number of records to harvest, then exit? "yes" or "no"
    @type reportonly: string

    @param threshold_date: ISO date for when to harvest records since. Ex. 2013-01-01
    @type threshold_date: string

    @param devmode: Activate devmode. Full verbosity and no uploads/mails.
    @type devmode: string


We query the APS REST API as detailed in the attached file.

Example query to fetch records:

$ curl ''

We then receive a JSON response:


We fetch the apsxml and store it as the fulltext, and also check the BagIt format consistency with checksums, etc.
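A simplified sketch of the checksum verification step, assuming a standard BagIt layout with a manifest-md5.txt (this is illustrative, not the actual apsharvest code):

```python
import hashlib
import os

def verify_bagit_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Recompute the MD5 checksums listed in a BagIt payload manifest.

    Returns a list of payload files whose checksum does not match
    (an empty list means the bag is consistent). A production harvester
    would also check bagit.txt, tag manifests and payload completeness.
    """
    failures = []
    manifest = os.path.join(bag_dir, manifest_name)
    with open(manifest) as f:
        for line in f:
            # Each manifest line is "<checksum> <relative path>".
            expected, relpath = line.strip().split(None, 1)
            digest = hashlib.md5()
            with open(os.path.join(bag_dir, relpath), "rb") as payload:
                for chunk in iter(lambda: payload.read(8192), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected:
                failures.append(relpath)
    return failures
```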

To do this, the BibTasklet bst_apsharvest uses code from the apsharvest module inside the INSPIRE overlay:


It also depends on Harvesting Kit for converting the JATS XML received from APS into MARCXML.


If the XML received is not JATS, we fall back to the legacy XSLT 2.0 stylesheets via a Java call.
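The dispatch between the two conversion paths can be sketched like this; the root-element heuristic and the return values are illustrative placeholders, not the real implementation:

```python
import xml.etree.ElementTree as ET

def looks_like_jats(xml_string):
    """Heuristic check: JATS full-text records use <article> as root."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    # Strip any namespace prefix before comparing the tag name.
    return root.tag.rsplit("}", 1)[-1] == "article"

def convert(xml_string):
    """Dispatch: Python converter for JATS, legacy XSLT otherwise (sketch)."""
    if looks_like_jats(xml_string):
        return "python-converter"   # would call the Harvesting Kit converter
    return "xslt-fallback"          # would shell out to the Java/XSLT 2.0 path
```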

Harvesting Proceedings of Science (PoS)


Since the PoS records are harvested through OAI-PMH, we make use of the OAIHarvest module of Invenio. The module harvests the records in the Dublin Core format supplied by PoS, and we then run a "BibFilter" script on them.

This is all done automatically when running the harvest.

This filtering script lives inside the INSPIRE overlay:


It is a Python command line script that takes one argument: the path to an XML file with PoS harvested records.

python path_to.xml

The output is then saved in the folder determined by the variable CFG_POS_OUT_DIRECTORY.
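A bibfilter-style script of this shape could look roughly like the following; the names, the stub conversion step and the output layout are illustrative, and the real script calls Harvesting Kit for the conversion:

```python
import os
import sys

# Illustrative stand-in for the real Invenio configuration variable.
CFG_POS_OUT_DIRECTORY = "/tmp/pos-harvest"

def filter_records(input_path, out_dir=CFG_POS_OUT_DIRECTORY):
    """Read harvested XML, convert it, and write the result to out_dir.

    The conversion is stubbed out here; the real filter calls Harvesting
    Kit to turn the Dublin Core records into MARCXML.
    """
    with open(input_path) as f:
        raw = f.read()
    marcxml = raw  # placeholder: the real code converts DC -> MARCXML here
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    out_path = os.path.join(out_dir, os.path.basename(input_path))
    with open(out_path, "w") as out:
        out.write(marcxml)
    return out_path

if __name__ == "__main__" and len(sys.argv) > 1:
    print(filter_records(sys.argv[1]))
```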


By reading the XML, the filtering script then calls Harvesting Kit to convert the records into MARCXML.



Proceedings of Science has an OAI-PMH server.

OAI-PMH url:

XSD schema:

I'm going to use the conference "IHEP-LHC" in the following examples:

The OAI base URL is:

Each record describes a proceeding (i.e. a single contribution to a conference):

The record identifier is of this form: PoS($conference-short-name)$numeral, e.g.: PoS(IHEP-LHC)001

If you prepend the base string to it, you end up with a sort of "stable URL" for the record, which points to a minimal landing page. If a PDF file is available, you should find it there.
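The identifier pattern above can be parsed with a few lines of Python; the regular expression is an assumption based on the PoS(...)N examples on this page, not on an official PoS specification:

```python
import re

# Pattern for identifiers like "PoS(IHEP-LHC)001"; assumed from the
# examples on this page, not from an official PoS specification.
POS_ID_RE = re.compile(r"^PoS\((?P<conf>[^)]+)\)(?P<num>\w+)$")

def parse_pos_identifier(identifier):
    """Split e.g. 'PoS(IHEP-LHC)001' into (conference short name, numeral)."""
    m = POS_ID_RE.match(identifier)
    if m is None:
        raise ValueError("not a PoS identifier: %r" % identifier)
    return m.group("conf"), m.group("num")
```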

There are two orthogonal "sets" by which each record is classified:

the "conference", which represents all proceedings of a conference, such as: (should give 29 records: all the accepted contributions of that conference)

and the "group", which represent keywords, such as (should give 4702 records: all the contributions to all conferences that have the "group:6" keyword, which is for "High Energy Physics")

As already mentioned, the keywords are assigned to the conference, so each contribution of that conference will share the same keywords.

When a new conference is published, we may send you the conference short name, which you can use to collect the metadata records:$conference-short-name
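A ListRecords request restricted to one of these sets can be assembled like this; the base URL below is a caller-supplied placeholder, since the real PoS endpoint is not reproduced here:

```python
from urllib.parse import urlencode

def oai_listrecords_url(base_url, set_spec, prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL restricted to one set.

    base_url is the OAI-PMH base URL of the repository; set_spec is the
    conference or group set to harvest (e.g. a conference short name set).
    """
    params = {"verb": "ListRecords", "metadataPrefix": prefix, "set": set_spec}
    # Sort for a deterministic query string; OAI-PMH ignores the order.
    return "%s?%s" % (base_url, urlencode(sorted(params.items())))
```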

Harvesting from Elsevier (CONSYN)




-- JanLavik - 14 May 2014

Topic revision: r5 - 2014-05-15 - JanLavik