SLAC Inputting - Inspire

Introduction

This document covers the process of enhancing and correcting the metadata in an Inspire record that is created from the OAI Harvest from http://arxiv.org . SystemOperationsArXivHarvesting describes that harvest. The harvest creates a record in Inspire with most of the data needed to make the record meaningful and useful. The harvest is excellent, but it is not yet able to attach standardized affiliations to author names, link papers to conferences, reliably recognize an experiment or collaboration, or properly translate the entire reference list written out by authors into linkable cites to other Inspire papers.

The same functions performed in this process are also performed when adding new records from journals or preprints that did not come from arxiv.org. These records come in through journal feeds curated by DESY and/or Fermilab, or through individual requests coming through RT (see SystemOperationsRT).

Inputting: What do you need to start.

This is SLAC specific

  • SLAC Unix and Windows accounts. This is handled by account-services@slacNOSPAMPLEASE.stanford.edu and part of the general orientation process. A SLAC-sponsored computer security course will be required to initially get these accounts, and will have to be repeated regularly.
    The Unix account is used for logging into local SLAC hosts.
    The Windows account is used for mail, calendaring, departmental shared fileserver space, etc.

    The SLAC Unix and Windows accounts enforce a password change every six months. If you miss the password change and are locked out, you must go back to account-services@slacNOSPAMPLEASE.stanford.edu to get your password restored, which at this point means a trip to building 050.
  • an RT account. RT is Request Tracker, our work ticket tracking application, used to organize user requests and record transactions. Accounts are set up by our INSPIRE RT administrators at SLAC (Mike or Bernard). RT sends notifications to your SLAC email account.
  • an account on INSPIRE. This account is needed for editing INSPIRE records. The host is at CERN, and the account needs to be set up by someone on our team with CERN system management permissions (typically Thorsten or Jan). This allows you to use BibEdit, the Inspire record editor.

General Overview

After getting your accounts, it helps to browse the following working environments.

  • RT (Request Tracker) https://inspirert.cern.ch (you will be prompted for your RT account and password). Browse the various queues to see the types of requests and who sends them.
  • INSPIRE http://inspirehep.net Search for records, look at author profiles, and look at the different formats and record presentations. Browse through the other collections besides HEP: Institutions, Conferences, Journals, Jobs, etc. Some of these collections are "authority" records for data stored in HEP records, or link to useful data in that collection.
  • INSPIRE admin area http://inspirehep.net/youraccount/login (then you can proceed to http://inspirehep.net/youraccount/display ). Introduce yourself to the basic data content, Inspire syntax, and the range of available tools.
  • INSPIRE BibEdit https://inspirehep.net/record/edit/ This is BibEdit without a record to modify. You can search for a record within BibEdit, or create a new record from there. Hovering over the icons gives a brief description of each function.
  • Cataloguer's RT Claim http://tislnx3.slac.stanford.edu/cgi-bin/rtbibedit.py This is the generic version of a specific link each cataloguer will have for opening a set of Inspire records to curate. It automatically changes the RT ticket status for those records, and presents a link to those records in BibEdit.

Inputting: An Inspire Record

Inspire records are stored in MARC format. MARC is an acronym for MAchine Readable Cataloguing. It is a long-standing library cataloguing standard for the storage and formatting of library records. Each field is keyed by a number called a Marc field, and by a variety of subfield codes that can be alphabetic or numeric. Specific Marc fields are assigned according to the type of data in the record, following Marc standards.
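
As a rough illustration of the field-plus-subfield structure, the sketch below splits one line of text marc into its field tag, indicators, and subfields. The line layout used here (a three-character tag, two indicator characters, a space, then subfields each introduced by `$$` and a one-character code) and the function name `parse_marc_line` are assumptions made for illustration; this is not a description of how BibEdit works internally.

```python
def parse_marc_line(line):
    """Split a text-marc line such as '100__ $$aDoe, Jane$$uSLAC' into parts.

    Assumed (hypothetical) layout: 3-character tag, 2 indicator characters,
    one space, then subfields each introduced by '$$' plus a 1-character code.
    """
    head, _, body = line.partition(" ")
    tag, indicators = head[:3], head[3:5]
    subfields = []
    for chunk in body.split("$$"):
        if chunk:  # skip the empty piece before the first '$$'
            subfields.append((chunk[0], chunk[1:]))
    return tag, indicators, subfields


# Example with made-up author data.
print(parse_marc_line("100__ $$aDoe, Jane$$uSLAC"))
```

A subfield value that itself contains `$$` would confuse this simple split, so it is only suitable for reading examples, not for processing records.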

DevelopmentRecordMarkup is a detailed Twiki on Inspire's MARC. A very helpful section is: DevelopmentRecordMarkup#Literature_MARC

Another approach is to take a rather full Inspire record such as:

https://inspirehep.net/record/478176?ln=en on the bottom of the record is a link to 'marc' format which gives you:

https://inspirehep.net/record/478176/export/hm - this you can compare with the detailed display, and with the MARC documentation above to see what sort of values go into the record

https://inspirehep.net/record/edit/?ln=en#state=edit&recid=478176 is the same record in bibedit. If you are already logged into Inspire, there's an "Edit this record" Link on the bottom of each record for you.

Inputting: BibEdit

Documentation for inputting into BibEdit in general: https://inspirehep.net/help/admin/bibedit-admin-guide

This guide can be reached from the Admin Area in http://inspirehep.net/youraccount

For detailed system design: SystemDesignBibEdit

BibEdit is the primary tool for SLAC inputters to add and modify Inspire records. There's a lot there and it's also a work in progress, so we should feel free to suggest improvements and fixes.

General Workflow

  1. Cataloguer has bookmarked link to RT Claim. http://tislnx3.slac.stanford.edu/cgi-bin/rtbibedit.py
  2. Open a couple more search windows in http://inspirehep.net for Institutions and Conferences, plus Google searches and a window with marc examples. The number and type depend on personal preference.
  3. Claim 5 papers to work. This opens a BibEdit screen with 1/5 in the upper corner of the sidebar, meaning the first of the five records you are curating (inputting).
  4. Open a PDF for that record by clicking on the PDF icon. This is the authoritative information for the record.
  5. Input author info, other metadata, and references. You can select different displays for that work. Curator Display shows the primary marc fields the inputting process should cover.
  6. When ready to submit the record: preview the record using the preview icon. This shows what our physicist clients will see when the record is sent, and is useful for checking that you've made linkable references. Then close the RT tickets on the left sidebar by clicking on X. Then SUBMIT. The record will be scheduled for update; the update should happen that day, often within the minute if you are working at SLAC from the morning until mid-afternoon.
  7. The next Inspire record will be presented in BibEdit and the record count above incremented. Double-check that for sanity. Once all 5 have been completed, go back to RT Claim for another set. You can choose 1, 5, or 10 records.

The most common record modifications will be adding standardized affiliations for each author, checking author and title spellings, adding a Conference number, Experiment, or Collaboration where applicable, and creating linkable references.

Inputting: Authors

Author fields are contained in the 100 (first author) and 700 fields in BibEdit. The marc fields are laid out in numeric order, so the standard BibEdit screen has the 100 and 700 fields separated. For easier inputting of authors, you can click the Authors check box in the bottom sidebar and see only the 100/700 marc fields.

Long Author List: If there is a list of more than a dozen or so authors without affiliations, click on the blue link [new ticket] under tickets, select Long Author List in the drop-down box, and create a ticket to curate the long author list separately. If there is a long author list with affiliations already in, it has likely been processed already by the collaboration submitting an Authors_XML file with the paper, and won't need a Long Author List ticket.

What to look for?

  • Author order: This is a rare problem. Sometimes an author field gets populated by a Collaboration name; remove it and put it in the 710a field.
  • Author spelling: Sometimes authors with double last names, paper editors, or names with suffixes (Jr, III, etc.) cause problems. See DevelopmentRecordMarkup#Literature_MARC under the 100 Marc field to see where the data goes. Repair mistaken first/last names by putting them in LastName 2ndLastName, FirstName Middle format.
  • Affiliation: This is 99% of the author inputting. If the field is populated, check that it uses the standardized spelling; detailed rules are listed at CataloguingRulesInstitutions. For 95% of the affiliations, you can do a google-like search in Institutions and use the abbreviated ICN (Inst Catch Name) for that institution. Include any punctuation and pay attention to case.

Affiliation guessing and autocomplete

The affiliation fields have an autocomplete function that searches for the correct ICN to use. A key sequence, CTRL-SHIFT-g, is now deployed that supplies an "AUTHOR GUESS" into the affiliation fields (100u and 700u). The guessed fields are highlighted in yellow. In the meantime, Mike runs a job to preload HEP_curation records in Inspire with the guessed fields, marked by a vertical bar "|" at the beginning of the guessed field. If the guessed field is correct, remove the "|"; otherwise correct the field using the autocomplete or lookup function.
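
The handling of the "|" marker on preloaded guesses can be sketched as follows. This is only an illustration of the rule described above, assuming affiliation values are available as plain strings; the function names and the sample values are hypothetical, and the actual preload job is not shown here.

```python
GUESS_MARK = "|"  # marker placed at the start of a preloaded guess


def pending_guesses(affiliations):
    """Return the affiliation values that still carry the guess marker."""
    return [value for value in affiliations if value.startswith(GUESS_MARK)]


def accept_guess(value):
    """Drop the leading marker from a guess the curator judged correct."""
    if value.startswith(GUESS_MARK):
        return value[len(GUESS_MARK):]
    return value


# Sample 100u/700u values (made up for illustration).
affiliations = ["|SLAC", "Fermilab", "|CERN"]
print(pending_guesses(affiliations))
print(accept_guess("|SLAC"))
```

A guess that turns out to be wrong is not handled by `accept_guess`; in that case the whole value is replaced using the autocomplete or lookup function.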

Inputting: References

Because references seem to be the most eye-catching part of the record for our physicist users, and because it's a fairly massive job to keep up with this work, the cataloguing rules are very simple. Make linkable references, as many as you can find. DO NOT try to make a clean-looking reference list in the marc record; that is the task of the format, and of the paper itself. The primary task for us is this: if someone gives us a nearly usable link to an Inspire record, make it work. References are contained in 999C5 fields.

There are currently four types of data in a record that Inspire can use to make a link: arxiv-id (037, 035), report-number (037), DOI (0247), and pub-note (773). Each of these values is unique to that record, but many papers are not published, many are not submitted to arXiv, and many do not have report-numbers. Most papers will have at least one of these identifiers. In the future, Inspire will be able to link by recid (001) and texkey (035a), so you can use one of those in the absence of a valid arxiv-id, report-number, or pub-note in the record. There are two ways to see whether a reference has been linked to another Inspire paper.

  1. A recid (up to seven digit number) in the $$0 subfield of that particular 999C5 marc field
  2. in the Preview page, or in the References Format, an underlined title for a paper is shown

Sometimes the arXiv harvest ingests non-reference data. It doesn't hurt anything, but it can make the reference list confusing to look at, so it's OK to delete the mess if you like, but only in the interest of making your task easier, quicker, or more accurate.
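
The first of the two checks above (a recid in the $$0 subfield of a 999C5 field) can be sketched as a small helper. It assumes each reference is represented as a list of (subfield code, value) pairs, which is a representation chosen for this example only; the recid shown is made up.

```python
def unlinked_references(references):
    """Return the positions of references that have no '0' (recid) subfield."""
    missing = []
    for index, subfields in enumerate(references):
        codes = {code for code, _value in subfields}
        if "0" not in codes:
            missing.append(index)
    return missing


# Two 999C5 fields as (code, value) pairs; the recid is a made-up example.
references = [
    [("r", "arXiv:1307.4749"), ("0", "1234567")],  # linked: has $$0
    [("s", "Phys.Rev.,D87,012001")],               # no $$0 yet
]
print(unlinked_references(references))
```

The positions returned are the references worth a second look in the Preview page.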

Tools

At the bottom of the left sidebar is a set of checkboxes. By checking only the References box, you can curate references without seeing the rest of the marc record. On the toolbar above the editor box there are 3 icons on the left side, which are tools specifically for reference curation, and 4 on the right that are used for all curation including references.

Icons on the upper left:

  1. Reference Extractor. Icon with cardboard box and arrow. On hover it reads: "run reference extractor on this record". This tool extracts the references from the local PDF (typically from arXiv.org).
  2. Text Extractor. Icon with clipboard. On hover it reads: "Extract references from text". This tool allows you to paste a reference list from a PDF or from a publisher and extract the references into BibEdit.
  3. URL RefExtract. Icon with a globe. On hover it reads: "Extract references from URL". Put a link to the PDF into the provided box and BibEdit will extract the references. Note that if an article has been published in a journal, the publisher's final version of the PDF is considered authoritative, and should be used for reference extraction.

Icons on the upper right:
  1. Marc Text editor. Icon with capital "A". On hover it reads: "Text Marc". This converts the record into a marc form that can be directly edited with a text editor.
  2. Print. Icon of printer
  3. PDF: Icon of PDF. This is the latest PDF from http://arxiv.org . The PDF is the authority for correct data.
  4. Preview. Icon of magnifying glass. This allows you to preview the record to see how the detailed display and References format will look. This helps see if references are linked to Inspire records.

Reference Workflow

This workflow is specific to the ref_curation queue, where it is likely that this is the first time we have touched the record.
  1. Check to see if record curated already.
  2. Check record for linked references using Preview, visually check alongside PDF ref list.
  3. If 2 looks very good, skip the next 3 steps. Otherwise...
  4. Re-extract references using reference extractor icon. Compare to current ref list and apply or not depending on which list is better.
  5. Fix marc fields that don't link.
  6. Repeat 2. If good then submit.
Any references that you changed in bibedit will be marked with a $$9CURATOR tag upon submission.

What to look for:

The system creates links from cited references to their records in INSPIRE based on the following identifiers.

  1. arxiv-id This is the first choice for a reference, and is entered into the $$r subfield. The format is consistent and comes in two flavors. The first is arXiv:YYMM.NNNN, where YYMM is the four-digit year and month the paper hit arXiv.org, and NNNN is an incremental number for that month, padded with zeros on the left. EG: arXiv:1307.4749. "arxiv" is not case sensitive; upper, lower, and mixed case all work. The other format is AAA-AA/NNNNNNN, where AAA-AA is an alpha phrase with the physics field on the left of the dash and the discipline (e.g. ex, th or ph) on the right, and NNNNNNN is a sequential number assigned by arXiv.org. EG: hep-ex/0702005 or cond-mat/0302050. If you have a linkable arxiv-id, there is no need to go to a lot of trouble formatting an existing pub-note; fix it only if it's easy and recognizable.
  2. pub-note (journal reference) if there is one. This goes in the $$s subfield of the same marc field. Formatting of the pub-note is described below.
  3. report-number This is a mixed alphanumeric text string, almost always using "-" as a separator. Report numbers are generally recognizable by having the publishing institution's abbreviation as part of the number.
  4. DOI goes into $$a, and/or recid goes into $$0.
If a reference has its linkable data scattered across more than one 999 field, put the subfields into the same 999 field and remove the others.
  1. recid The key of the Inspire record (001). Enter it into the $$0 subfield in 999C5 for a link. This doesn't count in citation counts.
  2. ISBN Located in the 020a field of the referenced record. Enter it into the $$i subfield in 999C5 for a link. This doesn't count in citation counts.
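
The two arxiv-id flavors described above can be told apart with a pair of regular expressions. The sketch below is a rough check for curators' own scripts, not the matching logic Inspire itself uses. Accepting either 4 or 5 digits after the dot and accepting an optional "arXiv:" prefix on both flavors are assumptions of this sketch, as is the function name.

```python
import re

# First flavor: optional 'arXiv:' prefix (any case), YYMM, a dot, then the
# sequence number. Allowing 4 or 5 digits is an assumption of this sketch.
NEW_STYLE = re.compile(r"^(?:arxiv:)?\d{4}\.\d{4,5}$", re.IGNORECASE)

# Second flavor: field-discipline/NNNNNNN, e.g. hep-ex/0702005.
OLD_STYLE = re.compile(r"^(?:arxiv:)?[a-z]+-[a-z]+/\d{7}$", re.IGNORECASE)


def classify_arxiv_id(text):
    """Return 'new', 'old', or None for a candidate $$r value."""
    value = text.strip()
    if NEW_STYLE.match(value):
        return "new"
    if OLD_STYLE.match(value):
        return "old"
    return None


for candidate in ["arXiv:1307.4749", "hep-ex/0702005", "cond-mat/0302050", "SLAC-PUB"]:
    print(candidate, classify_arxiv_id(candidate))
```

A value that returns None is not necessarily wrong; it may be a report-number, which also goes through the linking process but follows no single pattern.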

Pub-note formatting

A pub-note has the following format: Journal Short Title,Volume,Page

  • Journal Short Title: This can be found in the Journals database http://inspirehep.net/collection/journals Do a google-type search on the journal name or abbreviation listed, and use the resulting short title that comes back in bold on the record. Include any punctuation. A list of the more common journal titles can be found here: http://www.slac.stanford.edu/~slaclib/table.popular.journ

  • Volume: Volume is an alphanumeric field, usually 2-3 characters long. If there is a volume letter associated with the volume, the letter precedes the numeric code.

  • Page: The starting page of the article. Some publications have article IDs and no page numbers. In such cases, use the article ID in place of the page number.
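
The pub-note layout above (short title, volume, page, joined by commas) can be sketched as a small formatter. The function names are hypothetical, and rejecting a pub-note with an empty part is a choice made for this sketch rather than a documented Inspire rule.

```python
def volume_with_letter(number, letter=""):
    """Put the volume letter, if any, in front of the numeric volume."""
    return f"{letter}{number}"


def format_pubnote(short_title, volume, page):
    """Join short title, volume, and page (or article ID) with commas."""
    parts = [short_title.strip(), volume.strip(), page.strip()]
    if not all(parts):
        raise ValueError("short title, volume, and page or article ID are all required")
    return ",".join(parts)


# Example: volume 87 of section D, article ID used in place of a page number.
print(format_pubnote("Phys.Rev.", volume_with_letter("87", "D"), "012001"))
```

The short title must still be taken from the Journals collection; the formatter only arranges the parts.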

Inputting: Metadata

The remaining data to check in the record:

Title

Check against the PDF. It's possible that the title changed: the greyscale sidebar of the PDF will have a .V2 or .V3 if it's a later version of the paper. It's also likely that a title has changed in the published paper from the original arXiv submission. There are 4 separate title MARC fields:

  • 245a The current title of record.
  • 247a Old title of the arXiv paper, if 245 now holds the published version. The published title goes into 245 if it differs from the original arXiv title.
  • 210a Title variation. If a later arXiv version has a changed title, put the previous title, or the key phrases of the previous title that differ from the current title, into this marc field.
  • 242a Translated Title, a LaTeX version of the title.

Conference

If the paper was presented at a conference, the PDF will likely say "Presented at" with a description of the conference proceedings, dates, and location. This is very likely to have been harvested and put into the 500a field. Use that text, or phrases from it, to search in the Conferences collection http://inspirehep.net/collection/conferences to find a conference number. That number is put in 773w. The conference number has the format CYY-MM-DD.I; for example, C12-10-05.2 is the 2nd conference that begins on Oct 5, 2012.

If there is no existing conference, click on the blue link [new ticket] and select the CONF add_user queue in the selection box. Paste the 500a field into the ticket. Also include an identifier for the paper you were curating. Then submit that ticket. When the record for the conference is created, it is assigned a "conference number" or "cnum". That cnum will then be added to the paper, to tie the paper (and any others) reliably to that conference.
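
The conference number layout can be checked with a short parser before it is typed into 773w. This is a sketch under stated assumptions: the name `parse_cnum` is hypothetical, the two-digit year is returned as written rather than expanded to a century, and treating the ".I" suffix as optional is an assumption not stated above.

```python
import re

# C, two-digit year, month, day, then an optional '.I' sequence number.
CNUM = re.compile(r"^C(\d{2})-(\d{2})-(\d{2})(?:\.(\d+))?$")


def parse_cnum(cnum):
    """Split a conference number into its date parts and sequence number."""
    match = CNUM.match(cnum.strip())
    if match is None:
        raise ValueError(f"not a conference number: {cnum!r}")
    yy, mm, dd, seq = match.groups()
    return {
        "year2": int(yy),   # two-digit year, left unexpanded
        "month": int(mm),
        "day": int(dd),
        "sequence": int(seq) if seq is not None else None,
    }


print(parse_cnum("C12-10-05.2"))
```

The parser only checks the shape of the number; whether the conference exists still has to be confirmed in the Conferences collection.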

Date, Pages.

The preprint date in 269c and the number of pages in 300a are rarely missed, but scan for them just in case.

Pub-Note and DOI

Pub-note information and Conference information both go into a 773 field. It's OK to use different subfields of the same 773.

If a paper you're curating is already published, put all of the publication information into the following 773 subfields:

  • 773p Journal short title (this matches the short title in the References and can be found by the same method)
  • 773v Journal volume as in References (generally 3 characters, with any letter as the first character)
  • 773c Page range (firstpage-lastpage) or article ID
  • 773y Year (as part of the pub-note, not necessarily the year in 269c)
  • 773n Issue (optional)

DOI is an acronym for Digital Object Identifier. DOIs are assigned by the publishers; we do not create them. We get them from the published documents, or from the arXiv paper if arXiv has the DOI and we do not. Enter the DOI in the 0247 marc field, with the text DOI in the $$2 subfield and the DOI itself in $$a.
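
To keep the subfield assignments straight, the sketch below arranges publication data into the 773 and 0247 subfields listed above. The function name, the (code, value) representation, and the sample DOI are all made up for illustration; nothing here talks to BibEdit.

```python
def publication_fields(short_title, volume, pages, year, issue=None, doi=None):
    """Arrange publication data as (subfield code, value) pairs per marc field."""
    field_773 = [
        ("p", short_title),  # journal short title
        ("v", volume),       # volume, letter first if there is one
        ("c", pages),        # page range or article ID
        ("y", str(year)),    # publication year
    ]
    if issue:
        field_773.append(("n", issue))  # issue is optional
    fields = {"773": field_773}
    if doi:
        # $$2 holds the text 'DOI', $$a holds the identifier itself.
        fields["0247"] = [("2", "DOI"), ("a", doi)]
    return fields


# The DOI below is a placeholder, not a real identifier.
print(publication_fields("Phys.Rev.", "D87", "012001", 2013, doi="10.1234/placeholder"))
```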

A paper can have two dates, a date it first appeared as a preprint either in arXiv, report, or conference presentation, and a published date, for when it appeared in a journal. Published date is entered in 260c, the preprint date in 269c.

Experiments, Collaboration, PACS, other stuff.

If a paper is part of a Collaboration or Experiment, it will be stated on the PDF. There isn't a separate collection of Collaboration names, so use the text as presented by the paper. Similarly, use the same text for Experiment as given in the paper, but check in the Experiments collection for standardization. For a collaboration, look it up in HEP to see how that particular collaboration is noted.

PACS codes are also listed in the PDF. Enter them into separate 084 marc fields, with PACS in subfield $$2 and the PACS code in $$a.

Delete the 500 field if it has Temporary Entry or Brief Entry in it. Leave any other note field.

Inputting: Curator Dashboard

The SLAC curators have an application to set up a dashboard for more efficient curation of arxiv harvested records via the HEP_Curation RT queue. The url for the application is https://tislnx3.slac.stanford.edu/cgi-bin/grpbibedit.py

The code lives at /afs/slac.stanford.edu/g/library/cgi-bin/grpbibedit.py and is symlinked from tislnx3:/var/www/cgi-bin/grpbibedit.py, allowing for AFS backup of the actual code.

The Python script takes the cataloger's RT account, RT password, and choice of number of records (25, 50) and changes ownership of the oldest "NEW" tickets in HEP_Curation to that cataloger, leaving the status as NEW. The recid is pulled from each RT ticket, and the recids are assembled in groups of 5 and linked in an html page to BibEdit with a url like the following:

https://inspirehep.net/record/edit/?ln=en#state=search&p=recid:1347134%20or%20recid:1347135%20or%20recid:1347136%20or%20recid:1347137%20or%20recid:1347138

which opens 5 records in a group in bibedit. The curator closes the RT ticket as each Inspire record is cleaned up.
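
The grouping step can be sketched as follows. This is not the code of grpbibedit.py, only an illustration of how a list of recids could be split into groups of 5 and turned into BibEdit search links of the form shown above; the function names are hypothetical.

```python
BIBEDIT_SEARCH = "https://inspirehep.net/record/edit/?ln=en#state=search&p="


def chunk(items, size=5):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def bibedit_group_url(recids):
    """Build one BibEdit search link that opens the given records together."""
    query = "%20or%20".join(f"recid:{recid}" for recid in recids)
    return BIBEDIT_SEARCH + query


def dashboard_links(recids, group_size=5):
    """Return one BibEdit link per group of records."""
    return [bibedit_group_url(group) for group in chunk(list(recids), group_size)]


for link in dashboard_links(range(1347134, 1347141)):
    print(link)
```

With seven recids this prints two links: one for the first five records and one for the remaining two.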

The finished dashboard is written to tislnx3:/var/www/html/ and looks like this: http://www.slac.stanford.edu/~slaclib/sulbibedit.html

The Apache server on tislnx3 serves its html documents from /var/www/html --> /afs/slac.stanford.edu/g/library/intweb/www-tisint

There are library apps that use this page, so httpd.conf was left alone. Instead there are symlinks to the tislnx3 machine in that directory, for example mhuangbibedit --> /var/www/html/mhuangbibedit

To check the queue status in RT, the curator uses the following link: https://inspirert.cern.ch/Search/Results.html?Query=Queue%20%3D%20%27HEP_curation%27%20AND%20Status%20%3D%20%27new%27

If there are any records in the queue older than 4 weeks, they can load another set of curator records.
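
The RT link above is a percent-encoded query string. As a minimal sketch, the same link can be rebuilt for any queue and status with Python's standard `urllib.parse.quote`; the function name `rt_search_url` is hypothetical, and the query wording is copied from the decoded link.

```python
from urllib.parse import quote

RT_RESULTS = "https://inspirert.cern.ch/Search/Results.html?Query="


def rt_search_url(queue, status):
    """Build an RT search link for tickets in `queue` with the given status."""
    query = f"Queue = '{queue}' AND Status = '{status}'"
    # safe="" makes quote() encode every reserved character, including '/'.
    return RT_RESULTS + quote(query, safe="")


print(rt_search_url("HEP_curation", "new"))
```

Decoding the query in the other direction (`urllib.parse.unquote`) is a quick way to read what a saved RT link actually searches for.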

There is a test script with the same directory links at: https://tislnx3.slac.stanford.edu/cgi-bin/testgrpbibedit.py

This script pulls RT records from inspire-vm2 but links to PROD Inspire.

Inputting: SLAC Spires HEP records to Inspire

Document services ingests SLAC papers into the SlacSciDoc system. Place holder to link to this documentation.

Once records have been updated into Spires HEP, we need to select out the HEP papers of interest to Inspire, merge new information that Inspire doesn't have, then add complete records to Inspire that don't exist.

Until SlacSciDoc is replaced, Spires is used to expose records to OSTI. The SlacSciDoc replacement will have a new workflow utilizing its own formats and Inspire's Holding Pen.

Until then, log on to sunspi4 and enter the following commands:

  • spires
  • set xeq newproto
  • ..slac2inspire

There are 4 choices, the first 3 steps are necessary for the work flow of making sure we don't duplicate records, current Inspire records get relevant information, and we select the proper records for Inspire HEP.

Step 1 Searches in the Spires HEP file for SLAC papers that have been updated by Arsella; this search is accomplished by 'find dupd'. The protocol searches by eprint and report-num in INSPIRE to see if the record exists. This presents two web pages:

  • http://www.slac.stanford.edu/~slaclib/slacdash.html This page presents the metadata in Spires HEP that might be merged into Inspire. Each record metadata is presented, author/aff report-num, pub-note, doi, cnum, title and a link to the Inspire bibedit record. Curator should then update each record.

Step 2 After the text file is returned from Annette and Kirsten, IRN (keys) of the selected records can be pasted into the prompt one at a time. The result will be the following page:

  • http://www.slac.stanford.edu/~slaclib/slaccheck.html This has the metadata from the selected records. The curator should paste phrases from the title into the Inspire search box, and compare title, author, etc. to see if there is a match. A matched record should then be opened and curated. Records not found should be noted for Step 3.

Step 3 IRNs for papers in the selected list not in Inspire are entered one at a time at the prompt. The app will generate MARCxml and present it on this web page:

Choice 4 If SciDoc is not replaced by the time holding pen is fully functional, this is the option to take all of the SLAC records in a date range, create a marcxml file and upload into holding pen. Just enter the dupd search at the prompt, and the following web page will be presented:

Inputting: Holding Pen

Holding Pen is documented here: SystemDesignBibHoldingPen

Holding Pen is where records can be created and stored to be added or merged into Inspire collections. Currently there is no Holding Pen function for HEP_Curation.

Harvesting: arXiv

  • Figure: arXiv_harvest.png

Harvesting: General Data Sources and Inputs

  • Overview of the data curation process for Inspire HEP hepinputs.png

Slac Inputting - Spires Legacy

The former SLAC processes for Spires inputting are kept in SlacInputtingSpires and at the SLAC Mwiki. You need to VPN or be at SLAC and use a SLAC logon to see the SLAC wiki documents.

Topic revision: r33 - 2015-06-15 - BernardHecker