Enrichment scripts

Place to discuss the procedures needed to port SLAC/DESY/Fermilab enrichment scripts to Inspire


From the October meetings we have an idea of how to build these tools:

  • 1a [Oct07] {2 weeks} - summary of existing SLAC/FERMI/CERN/DESY enrichment+maintenance scripts: free-text description | parameters | interface type | example
  • 1b {1 week} - compare with existing Invenio facilities
  • 1c [Dec07-Feb08] {6 weeks} - additional requirements from inputters
  • 1d [Feb08] {8 weeks} - analysis of the future maintenance framework; toolset design
  • 1e [Q4 08] {40 ext weeks} - development
  • (Invenio: bibupload priority implementation/curator project at EPFL)


No list should be considered complete until discussion with the enrichment staff.

SLAC Automated

These tasks are instigated automatically, but occasionally create worklists for humans. In those cases there are scripts to assist the human in dealing with the list. A brief description is given, along with the key to the record in the spirestasks database, which contains the information about which scripts/SPIRES protocols/Perl modules do the work. To find out about the language of an existing script, go to SPIRES: sel spirestasks, then dis

This list may be lacking a few, but this is the bulk; see below for more comments. All are CLI unless there is no human interaction at all. Fully automated ones are noted.

| *ID* | *Name* | *Description* | *Proposed Solution* | *Fully Automated* |
| S1 | Create the SLAC leaklist | Checks SLAC items in HEP which are leaks or misses, moves them to the SLAC file used by TechPub | Rebuild at SLAC - Inst Repository | Yes |
| S2 | Check old temps | Check old temporary entries from core archives | Workflow | |
| S5 | PBN from arXiv | Add PBN from arXiv | Build | |
| S6 | Missing topcited papers | Missing HEP papers with near or more than 100 citations | Build | |
| S7 | Physrev | Employs a protocol in subfile PHYSREV | Build (holding pen) | |
| S8 | Spellcheck for CONF file | Run CONF.SPELLCHECK on the CONF file | CernBibCheck?? | |
| S9 | Add pbn to CONF file | Add published proceedings info to CONF file | Not to migrate yet | |
| S10 | Physrev cites | Adding the cites from Physrev papers | Build (holding pen) | |
| S11 | SPIRES email | User email | No need to migrate; needs an editor... | |
| S12 | URL for SciDir journals | Find papers without URL, add the URL | GoDirect/DOI | |
| S13 | URL for other journals | Find papers without URL, add the URL | GoDirect/DOI | |
| S14 | URL for Phys.Rev. papers | Find papers without URL, add the URL | GoDirect/DOI | |
| S15 | Add to HEP | Add documents to HEP from batch input file; protocol checks for duplicates | Build (holding pen) | |
| S16 | Add HEP meeting note | Add MN to records with CNUM but no MN | CernBibCheck | |
| S17 | PPF title spellcheck | Spellchecking PPF titles | CernBibCheck?? | |
| S18 | Check new PBN | Catch suspicious PBN, especially those with two or more journal references | CernBibCheck | |
| S19 | Fix NONE PBN | Fix incorrectly coded journal entries (PBN, DPBN) | CernBibCheck | |
| S20 | Fix bad dates | Update preprints with bad or no date | CernBibCheck | |
| S21 | Fix PBN bad page range | Fix PBN with bad page range | CernBibCheck | |
| S22 | Correct wrong affiliations | Get a list of incorrect AFF and correct them | CernBibCheck (with inst. authority file) | |
| S23 | URL problems | HEP URLs with typo in URL field | Link checker? | |
| S27 | SLAC TechPub | Updates SLAC queue status that describes actions needed | SLAC-specific solution, must be built (inst. repository harvest...) | |
| S28 | SLAC coding | SLAC items needing coding | SLAC-specific | |
| S29 | Fix HEP duplicates | Check duplicate report-num or spicite | Build | |
| S31 | Audit PPF papers | Audit randomly chosen papers | Workflow | |
| S33 | Check new authors | Check authors in the newauth file, put there by inputters | Build | |
| S34 | Check exp element | Check records with cn but no exp, add exp | Build | |
| S35 | Giva | Retrieve lists of authors from PDF or copy from another report (see CERN C1, C2) | Build | |
| S36 | Add field-code | Check for missing field-code and add it to HEP | Build | |
| S37 | PPF input duplicates | Duplicates discovered during PPF input | Build/Workflow | |
| S38 | Title changed in arXiv | Puts arXiv title as main title, puts old HEP title into OLD-TITLE | Build/holding pen | |
| S39 | Duplicate texkeys | Removes duplicate texkeys | CernBibCheck?? Not sure | Yes |
| S40 | Duplicate cites | Removes duplicate citations to journal and eprint | CernBibCheck | Yes |
| S41 | SCL | Fix missing and incorrect SCL | CernBibCheck | Yes |
| S42 | TranslateSCLFCTC | Translate SCLs to FC and TC as they come through | CernBibCheck | Yes |
| S43 | Inst affiliation change | Change affiliations from inst changes submitted on the web | Build; fixed with ids...? | Yes |
| S44 | Process | Processes all files | Not needed (bibindex) | Yes |
| S45 | Topcite | Check for new topcited papers, only promotions; short job | Build? | Yes |
| S46 | Post updates to afs | Old routine to post updates for mirrors (deprecated) | Not needed | Yes |
| S47 | Removes | Populate removes db before processing | Not needed | Yes |
| S48 | Backups | Make backup copies on sunspi5, where a separate job archives to TSM | Not needed | Yes |
| S49 | AddCites | Add cites from citesj to the database (expand to cites2?) | Build (holding pen) | |
| S50 | Full topcite | Check for new topcited papers, incl. brand new; long job | Build (same as S45; may be easier in Invenio) | Yes |
| S51 | SPIJOBS daily web listing | Daily listing of HEP labs jobs | Not yet | Yes |
| S52 | Prepare CERN table | Lists the PPF report numbers and the related IRN | ??? | Yes |
| S54 | Weeding FNAL and DESY hardcopy-at notes | Strips hardcopy-at notes when url, pbn, etc. appear | Formatting?? | Yes |
| S55 | DK related to SLAC | Prepares a list of papers with DK related to SLAC without SLAC coding, mails to Ann and Travis | Not needed | Yes |
| S56 | Institutions list for the web | Prepares the list of HEP institutions for web listing | Not yet | Yes |
| S58 | Experiments list for the web | Prepares experiments list for the web | Not yet | Yes |
| S59 | Add theses | Add theses to HEP from UMI | Build (holding pen) | |
| S?? | Conferences | ..conf (only uses 13 and 5); addition of conferences | No migration?? | |
| S?? | ..hep.correct.citation | Makes certain mass citation changes... | ??? | |
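Several of the jobs in the table (S29, S39, S40) reduce to de-duplicating values in a record field. A minimal sketch of an S40-style pass, assuming citations arrive as plain strings; the function name and the normalisation rule are illustrative, not the existing script's:

```python
def dedupe_citations(citations):
    """Remove duplicate citation strings, keeping the first occurrence
    (S40-style). Comparison is case-insensitive with whitespace collapsed,
    so 'Phys.Rev. D60' and 'phys.rev.  d60' count as the same citation."""
    seen = set()
    kept = []
    for cite in citations:
        key = " ".join(cite.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(cite)
    return kept
```

The same skeleton would serve S39 with texkeys as input; only the normalisation rule changes.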


Many of these procedures are described in detail in SlacEnrichmentScripts

Note that there are several broad categories of solutions:

  • Build new
  • Use CernBibCheck - Means I hope CernBibCheck will suffice, though we still need a good interface to CernBibCheck to make it easy to run and correct errors (especially correcting similar errors at the same time)
  • No need to migrate (jobs/expts/inst files) - Either these are SLAC-specific and will be needed as an interface between Inspire and SLAC's institutional repository (work we need to do, but in a different stream), or they are for databases that, for the moment, remain in SPIRES.
  • Workflow - These need building, but primarily need to exist as part of a workflow module, not as an interface to records.

From Inputter conversations:

  • S7, S15, S27, S35 are all large components of daily work (in addition to the "Piecework" below)
    • S7, S15, S27 are all things currently harvested via OAI (or similar), dumped to a holding pen, followed by duplicate checking, record editing and addition
    • S35 is Giva, for authors of long author lists. A big job, needs computing power.
  • Also, in Piecework below there is mention of CLI. This is important for several reasons:
    • inputters use it for lookups because it allows easier narrowing of searches (i.e. find this... and this... also this)
    • CLI is faster and more reliable than SPIRES web
    • allows use of "also" (search by elements that aren't indexed - search only in the current result, i.e. more like a grep on the result set... can be faster and more powerful)
    • Georgia uses it for mass changes, but others do not.
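The "also" behaviour described above, filtering the current result set by fields that need not be indexed, is essentially a grep over already-fetched records. A minimal sketch, assuming records are in memory as dicts of field name to value (all names here are hypothetical, not a SPIRES or Invenio API):

```python
def also(result_set, field, substring):
    """Narrow an in-memory result set by substring match on any field,
    indexed or not -- the grep-like 'also' of the SPIRES CLI."""
    substring = substring.lower()
    return [rec for rec in result_set
            if substring in str(rec.get(field, "")).lower()]
```

Calls chain naturally, mirroring "find this... and this... also this": `also(also(hits, "title", "neutrino"), "note", "to be published")`.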

SLAC Piecework

These are tasks that are instigated by a human when prompted, either by user email, an eprint to input/check after harvesting, etc. A description is given, along with the name of the SPIRES protocol/Perl script and the relative frequency of the action. Note that these are described in more detail in SlacInputting

| *ID* | *Name (script)* | *Description* | *Solution* | *Automation* | *Frequency* | *Comments* |
| SP1 | authors.pl | Adds authors/affiliations and a few other items on input (ppfin option "b") | Build | Perl, manual CLI (with SPIRES ..ppfin.bb) | Every eprint as it arrives | Uses extensive Perl modules |
| SP2 | ppfin.check.cites | Manual check on extracted citations (ppfin option "c") | Build | SPIRES, manual CLI | Every eprint | Allows re-extraction of bad cites, manual addition, etc. |
| SP3 | cites.add | Semi-manual check of updated cites | Build | SPIRES, manual CLI | Eprints with noticed revised cites | Merges new cites with old in an intelligent way (combine this with work with journals/holding pens) |
| SP4 | ppfin.checks | Manual check on anything inputters changed that wasn't downloaded / double check on consistency (ppfin option "p") | CernBibCheck? | SPIRES, manual CLI | Every eprint | Philosophy is: if it is downloaded, the inputter checks, then this looks for suspicious stuff. If the inputter adds, this makes them double-check. Separated in time... |
| SP5 | spires command line | Our data staff use the SPIRES command line for many tasks; the main function that needs to be reproduced is making a change to a large set of records | Build | None | Daily/weekly usage | |

DESY specifics

Most of our programs have SLAC equivalents and will thus be obsolete (e.g. the menu organizing the workflow of our inputters). Additional requirements are mostly related to keywording, journal input, and selection of HEP-relevant material.

| *ID* | *Script* | *Description* | *To do* | *Comments* |
| D1 | springer.py, iop.py, many Perl programs | Download journal TOCs or take publisher feeds as input, create HEP records (DESY format), check for duplicates, store abstract and references | Add write module for Inspire format, into holding pen | No urgent need to convert |
| D2 | aipaut.pl | Extracts ASTR from AIP abstracts | | Not part of aip.pl because of access restrictions |
| D3 | checker.pl | Checks HEP records (DESY format) against authority files for author names, affiliations, report nrs, keywords, journal titles, field and type codes, PACS, exp nr; expands keyword abbreviations | Add keyword check to CernBibCheck, immediate expansion of keyword abbreviations | |
| D4 | getps (Perl) | Download a single arXiv paper, convert to PS, add barcode, print | | |
| D5 | addbarcode (Perl) | Inserts barcodes (arXiv nr) on the first page of arXiv PS files | | Useful when adding keywords to HEP records |
| D6 | eprepget (Perl) | Calls getps for daily eprints of selected archives | | Printed copies for keyworders |
| D7 | cronret.pl | Retrieves most recent papers from HEP for authors of last week's eprints | | Help for keyworders |
| D8 | inspacs.pl | Replaces PACS nrs by verbal descriptions | | Help for keyworders |
| D9 | dowlist (Perl) | Creates HTML file for the weekly list of accessions (preprints, articles, books), sorted according to type and field codes | | |
| D10 | inpuspi.pl | Converts DESY to SPIRES format, including some syntax checks | | Obsolete |
| D11 | akwli (zsh) | Creates and prints automatic keywords for eprints of last week | Needs tag | Suggestions for keyworders |
| D12 | akwins.pl | Inserts automatic keywords (from title, abstract, author keywords) into a file of journal records (DESY format) | | Only very quick human check |
| D13 | jnlcheck.pl | Finds new issues on publishers' web sites | | To be replaced more and more by publisher feeds |
| D14 | doipdf.pl | Extracts DOIs from a file with journal records, retrieves the PDF URL from the abstract file, downloads and prints the PDF | Store PDF URL for keywording | |
| D15 | doiref.pl | Extracts DOIs from journal records; checks first for an existing reference file in refdir, then looks for a PDF file in pdfdir, then looks for a PDF URL in the abstract file to download the PDF and extract references from it; converts to SPIRES format (raffle) | | |
| D16 | doira2spi.pl | Extracts DOIs from journal records, concatenates corresponding abs and refs files as input for SPIRES | | Obsolete |
| D17 | | Select papers from TOC, email lists..., tag (HEP relevant, to be keyworded ...) | Holding pen | Flush unselected rest |

-- AnnetteHoltkamp - 19 Oct 2007

Fermilab specifics

| *ID* | *Script* | *Description* |
| F1 | aff.pl | Reads affiliation from input batch file, uses it as argument in a "fin add" search of the INST subfile, and writes the input batch file to an output file with the affiliation changed to the ICN if there is a result for the search. |
| F2 | author.pm | Reads author from input batch file, finds the author in HEPNAMES and returns the author's name as found in HEPNAMES or, if zresult, writes a CHECKAUTH message next to the name in the output file. |
| F3 (= S10) | physrev.pl | Goes out and gets metadata from APS and loads it into our system. |
| F4 | title.pl | Processes titles that come from Phys. Rev. |
| F5 | textitle.pl | Converts a regular SPIRES title into a correctly formatted TeX title. |
| F6 | doi.pl | Runs DOIs from CrossRef into |

CERN specifics

CERN-specific scripts and procedures used by the library staff.

Jocelyne Jerdelet, Catherine Cart, DSU/SI, 24/10/2007

| *ID* | *Name* | *Description* | *To do* | *Comments* |
| C1 | agiv500 | Gives the list of authors from a paper in PostScript format, extracts the author pages, gives a file of authors and affiliations in Aleph500 format | Improve extraction of affiliations; accents are deleted, to be improved | The result file has to be edited with the sysno of the Aleph record and checked for errors. Affiliations have to be cleaned because they are not well extracted (done only for CERN papers) |
| C2 | agiv500pdf | Gives the list of authors from a paper in PDF format, extracts the author pages, gives a file of authors and affiliations in Aleph500 format | Improve extraction of affiliations; accents are deleted, to be improved | Same as above |
| C3 | Aleph authority | Authority database to standardise and add accents on author names | | Previously existed also for periodical titles, but replaced by a knowledge file used by the uploader tool and several other scripts. |
| C4 | upload22.x | The uploader tool is a semi-automatic system that transforms bibliographic records from different sources into the structure supported by the local database. The upload process involves the following steps: download of bibliographic data from the source; extraction and transformation of the downloaded records into the local structure; matching of records with the current contents of CDS. Every source has to be configured (currently 190 configurations for 190 different sources), and lots of knowledge files are integrated to format the data (http://doc.cern.ch/uploader/KB/). These configurations are written and maintained by the library staff according to their needs. | Improve the matching; problems with matching on title + first author. No problem when the matching can be done on the arXiv number. | 4 types of result file after matching: a source.correct file to update an existing record and correct existing fields; a source.append file to update an existing record and add new fields; a source.new file to upload new records; a source.nc file to be checked manually because the result of the matching was ambiguous. This tool is the first version of bibconvert but is still used by the library staff. Uses many KBs. |
| C5 | bibconvert | Used for converting arXiv XML metadata to CDS MARCXML format | | To be improved by using the journal title knowledge file to clean publication references |
| C6 | Electronic submissions | Used by secretaries or authors to submit CERN divisional papers, CERN theses, CERN internal notes, scientific committee papers, conference announcements... | | Submission templates do some metadata enrichment during author submission. |
| C7 | check_format500 | Checks a data file for format errors before upload into Aleph | | |
| C8 | chkenc | Checks a data file for accent errors (alert if not in UTF-8) before upload into Aleph | | |
| C9 | reportcheck.py | Prints a list of erroneous, missing or duplicate report numbers found in the input file (containing a list of report numbers) that start with the given department pattern | | Done by Kyriakos; checks the numbers of a series |
| C10 | dptmtsReportCheck.py | Takes a list of department patterns (like CERN-TH-) and, for every pattern, creates a file named after the department containing all errors associated with that department. Uses reportcheck.py to find the errors | | Idem |
| C11 | fft_aps.py | Downloads fulltext from an input file of APS URLs corresponding to CERN-affiliated papers. | | |
| C12 | fft_jhep.py | Downloads fulltext from an input file of JHEP URLs corresponding to Open Access papers. | | |
| C13 | ALEPH Library Staff Menu | Aleph utils used for extracting data from the database. These data are then used for global changes, to add new fields or correct existing fields, according to very specific searches in CDS or in Aleph | | ALEPH used mostly for multi-record editing. Otherwise CDS queries (see C14) are mostly used instead of ALEPH's utils. |
| C14 | CDS query language | Lots of searches (saved as favorites) are made in CDS to detect cataloguing incompatibilities and for checking. Example: checking whether all papers from a special issue of JHEP have been entered in the database | | Using the search engine web API |
| C15 | CernBibCheck, Config AU | Deletes special signs in author names | | Example: "Cart, &C" corrected to "Cart, C" |
| C16 | CernBibCheck, ConfigRNba14et21pr260 | Cleans tag 088, and tag 260 with tag 088, for BAS 14 (theses) and BAS 21 (books) | | General CernBibCheck comment: more multi-field condition checks wanted. |
| C17 | CernBibCheck, ConfigRNba11-16 | Cleans tag 088, and tag 269 with tag 088, for BAS 11 (preprints) and BAS 16 (scientific committee) | | |
| C18 | CernBibCheck, ConfigBA13-773et260 | Updates tag 260c (year) with the year of publication (tag 773y) | | |
| C19 | CernBibCheck, ConfigLatexCERN | Detects LaTeX errors in the title (like unpaired $) in tags 245 & 246 | | |
| C20 | CernBibCheck, Configsujetsxx | Corrects the XX subjects with the journal titles from the knowledge `773p---65017a' | | |
| C21 | CernBibCheck, Configlkravider | Detects all records with an empty LKR tag (which make noise with nchkall) | | |
| C22 | CernBibCheck, Config8564uniqETmanquants | Obtains a list of records with non-unique or missing URLs | | |
| C23 | script3digitrncern (on top of CernBibCheck) | Formats a series of CERN document report numbers automatically with 3 digits, to obtain a complete ordering on the web (example: EP-1981-59 becomes EP-1981-059) | | |
| C24 | CernBibCheck, ConfigDoublonscernep | Obtains a list of similar records with the same tags: 037, 088, 100, 245, 246 | | |
| C25 | CernBibCheck, ConfigBA11 | Cleans all the tags of the preprints | | |
| C26 | CernBibCheck, ConfigBA13 | Cleans all the tags of the articles | | |
| C27 | CernBibCheck, ConfigBA14 | Cleans all the tags of the theses | | |
| C28 | CernBibCheck, ConfigBA16SCICOM | Cleans all the tags of the scientific committee papers | | |
| C29 | CernBibCheck, 48 different knowledges | For different configurations, just in the KB/PREPRINTS directory, and a lot of others elsewhere | | Wanted: a management tool with links between the various KBs (one field appears in many) |
| C30 | Autocheck | Works with an Excel file of around 100 formulas (that I defined; I can also add new items or delete others). This program checks the formulas on CDS and automatically imports the results, files of errors, into my directory | | |
| C31 | cleantitle | nchkall doesn't run with the big knowledge of journal titles, so this script checks the journal titles of a file against the knowledge to update them. Example: JHEP---J. High Energy Phys. | | |
| C32 | Xenu | Detects link errors in the URLs from a file of URLs extracted from the database. It checks whether the file exists and is not empty (0 bytes). But it is very difficult to prepare the file of URLs: we have to format all the different series of URLs to be recognised by Xenu, because of the setlink mechanism, and we have a lot of different forms of URL, different sorts of barcodes, ... | | Link checker with a manual workaround for setlink URLs. Invenio's fulltext indexing and fulltext document checking tools could be used instead. |
| C33 | Doublon | Detects duplicate records (similar tags: titles, authors, abstracts, ...) | | Common Lisp based, just like CernBibCheck. To be integrated into Invenio. |
| C34 | correct_journals.py | Detects errors in a knowledge: 2 different good forms for the same journal title | | Small script to ensure a KB is okay. To be integrated into Invenio. |
| C35 | findsimilar.py | Detects similar good forms in the journal knowledge | | Small script to ensure a KB is okay. To be integrated into Invenio. |

Ad C14: Example of checking


Ad C29: Example knowledge file 1

Knowledge file 'SISUC-773p---65017a.kbr' (to clean the subjects)

Phys. Lett. A---General Theoretical Physics
Phys. Lett. B---Particle Physics
Phys. Lett.---General Theoretical Physics

Ad C29: Example knowledge file 2

Knowledge file 'SISUC-693e---693a.kbr' (to clean the accelerators)

AD-1---$$aCERN PS
AD-2---$$aCERN PS
AD-3---$$aCERN PS

Ad C29: Example knowledge file 3

Knowledge file 'SISUC-773P---773p.kbr' (7492 lines, to clean the titles of journals)
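All of these knowledge files share the `old form---replacement` line format. A minimal sketch of loading such a file and using it to clean a value (the function names and the in-memory handling are assumptions, not how CernBibCheck actually reads KBs):

```python
def load_kb(lines):
    """Parse 'old form---replacement' lines into a lookup dict,
    skipping blank or malformed lines."""
    kb = {}
    for line in lines:
        line = line.strip()
        if not line or "---" not in line:
            continue
        old, new = line.split("---", 1)
        kb[old.strip()] = new.strip()
    return kb

def clean(value, kb):
    """Replace a value by its canonical form if the KB knows it."""
    return kb.get(value, value)
```

In practice `lines` would be an open file, e.g. `load_kb(open('SISUC-773P---773p.kbr'))`.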


Ad C30: Example autocheck 1

500__a:'not held by the cern library' and 8564_y:'fulltext'*

000723989 500__ $$aNot held by the CERN library
000723989 8564_ $$uhttp://preprints.cern.ch/cgi-bin/setlink?base=preprint&categ=CM-P&id=CM-P00047994$$yFulltext

This formula detects two incoherent tags

Ad C30: Example autocheck 2

8564_y:published and 960:11

This formula detects the incoherence between 'published' and BAS 11: 'published' means it is an article in BAS 13, not a preprint in BAS 11.

Ad C30: Example autocheck 3

690C_a:article and 690C_a:preprint

This formula detects records with two incoherent indicators
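Each autocheck formula is a query whose hits are, by construction, inconsistent records. The same idea can be sketched offline, with each formula expressed as a predicate over records modelled as tag-to-values dicts (the record model and all names are illustrative, not the Autocheck implementation):

```python
def autocheck(records, checks):
    """Run named consistency predicates over records; report the ids
    of records that match each predicate, i.e. that are inconsistent."""
    report = {}
    for name, predicate in checks.items():
        report[name] = [rec["id"] for rec in records if predicate(rec)]
    return report

# Example 3 above as a predicate: flagged both 'article' and 'preprint'.
checks = {
    "article-and-preprint":
        lambda r: {"article", "preprint"} <= set(r.get("690C_a", [])),
}
```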

Ad C32: Examples xenu

http://preprints.cern.ch/cgi-bin/setlink?base=preprint&categ=hep-ex&id=9908008 must be corrected by: http://doc.cern.ch//archive/electronic/hep-ex/9908/9908008.pdf

http://documents.cern.ch/cgi-bin/setlink?base=generic&categ=public&id=cer-000052995 must be corrected by: http://doc.cern.ch//archive/electronic/other/generic/public/cer-000036351.pdf

http://preprints.cern.ch/cgi-bin/setlink?base=preprint&categ=kek-scan&id=197903169 must be corrected by : http://doc.cern.ch//archive/electronic/kek-scan//197903169.pdf

Here is the result from Xenu:

Broken links, ordered by link:
error code : 404 (not found), linked from page(s):
In this URL there is an error in the barcode: it must be 9908008 instead of 908008

error code:404 (not found), linked from page(s):
In this URL there are 2 slashes instead of 1 (scan// instead of scan/)
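Both Xenu findings above are detectable from the URL string alone, before any link check runs. A minimal sketch of such a pre-filter; the 7-digit threshold is a heuristic guessed from these two examples (old-style arXiv numbers are 7 digits), not a documented rule:

```python
from urllib.parse import urlsplit, parse_qs

def suspicious_url(url):
    """Flag problems detectable without fetching the URL: doubled
    slashes in the path, and numeric setlink ids that look truncated."""
    problems = []
    parts = urlsplit(url)
    # Ignore leading slashes (doc.cern.ch//archive is tolerated in practice).
    if "//" in parts.path.lstrip("/"):
        problems.append("doubled slash in path")
    ident = parse_qs(parts.query).get("id", [""])[0]
    if "setlink" in parts.path and ident.isdigit() and len(ident) < 7:
        problems.append("id looks truncated")
    return problems
```

Run over the extracted URL file, this would catch both error classes above and leave Xenu only the genuinely dead links.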

Tools needed

  • MARCXML editor
    • It appears clear that one of the main tools needed is a powerful MARCXML editor that is general (like BibEdit) but includes features like global search and replace
    • This editor can (and probably should) be Invenio-neutral, i.e. it only deals with the XML and talks to the DB for fetching and updating (locking), and possibly also for lookups, guesses, etc.
    • This would ideally (imho) be an AJAX web app. The lookups would work well as AJAX calls to fetch XML, and AJAX would probably be responsive enough for power users.
    • If not web/AJAX, then command line is the next best thing. Possibly the hooks into Invenio all remain the same...
  • Beyond editing
    • The power to manipulate large chunks of records, as in a Perl-script type thing, but without knowing Perl/Python. There are several people who make global changes in SPIRES data using rather simplified procedures...
    • Live lookups, guessing, etc
  • Workflow - Many of the above scripts are using some sort of workflow techniques. Mostly these are ad hoc, we need a framework for workflow within the system.
    • Tag records for certain types of enrichment (the current main ones are: initial inputting, Giva for authors, suspicious fields failing BibCheck, adding from holding pens, and a few others)
    • Should be extensible to other workflows
    • Record and/or element level (currently have nothing at element level...)
    • Should eliminate duplicates (i.e. don't tell me to do that same thing to the same record twice...)
    • Should preserve tasks (i.e. do tell me to do different things to the same record twice...)
    • Does not need concept of individual workers, though it could help. Can just create queues, and then workers pull from queues
    • Needs to be able to cross things off (when initially pulled, then can be put back later if changes fail? currently don't think too much about this?)
  • Holding Pen - A separate, hidden collection(?) that contains records not yet checked for duplication and/or not complete or correct enough to be released into the public collection. This is usually for journal feeds, but also for other types of feeds (inst. repositories, theses, etc.)
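The global search-and-replace asked of the MARCXML editor above could be sketched as a pure XML-in/XML-out pass, keeping it Invenio-neutral as suggested (namespace handling omitted; the function name, return shape, and preview-count idea are assumptions, not BibEdit's interface):

```python
import xml.etree.ElementTree as ET

def replace_subfields(marcxml, tag, code, old, new):
    """In every datafield with the given tag, replace subfield values
    equal to `old` (subfield code `code`) by `new`. Returns the new XML
    and the number of changes, so a caller can preview before committing."""
    root = ET.fromstring(marcxml)
    changed = 0
    for field in root.iter("datafield"):
        if field.get("tag") != tag:
            continue
        for sub in field.iter("subfield"):
            if sub.get("code") == code and sub.text == old:
                sub.text = new
                changed += 1
    return ET.tostring(root, encoding="unicode"), changed
```

Because it only touches the XML, the same function works whether records come from the DB, a holding pen, or a batch file.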

Guesses about Invenio exposed functions needed

These are just rough guesses at what functions I would like to be able to access from Invenio when writing an enrichment script. Most of these would define the curation interface, on top of which tools are built, but others might be built into the tools themselves. Based on the stuff I've written for SPIRES in the last few years, I think these account for most of the basic components. Of course that was obvious already, but it is nice to define them.

  • append_element(element, value, record_key)
  • replace_element(element, value, record_key) -> might be nice if both of these could accept chunks of records
  • get_record(record_key) -> give me the MARCXML of the record, and optionally lock it for editing
  • update_record(xml) -> takes full XML and completely replaces any existing record with its contents
  • search -> returns the number or an array of records (or keys) for a given search
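Sketched as a toy Python API (Invenio's language), with an in-memory dict standing in for the database, a simplified record model, and a record_key added to update_record for illustration; none of this is an existing Invenio interface:

```python
class RecordStore:
    """Toy stand-in for the curation functions listed above."""

    def __init__(self):
        self._records = {}  # record_key -> {element: [values]}

    def get_record(self, record_key):
        """Return the record (here a dict; in practice MARCXML)."""
        return self._records[record_key]

    def update_record(self, record_key, record):
        """Completely replace any existing record with the new contents."""
        self._records[record_key] = record

    def append_element(self, element, value, record_key):
        """Add a repeatable value to one element of one record."""
        self._records[record_key].setdefault(element, []).append(value)

    def replace_element(self, element, value, record_key):
        """Overwrite all values of one element of one record."""
        self._records[record_key][element] = [value]

    def search(self, element, value):
        """Return the keys of records whose element contains the value."""
        return [k for k, r in self._records.items()
                if value in r.get(element, [])]
```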

In fact, here is the documentation for the current Perl SPIRES API from which new enrichment tools are built:



For building toolsets the following architectures should be considered:
  • Python API to Invenio
    • Ideal for building a backend for very common tasks: lookups, guesses, CRUD, etc.
  • Perl API to Invenio
    • Deprecated choice, to retain a consistent codebase, unless this makes things much easier...
  • REST/SOAP/Web Serv. etc
    • Best choice for end-user tools to interact with Inspire (can be AJAXified, etc.)
    • Language neutral
    • Note that to preserve user (maintainer) comfort, a CLI interface that almost exactly duplicates current tools would be nice. Web calls could easily be embedded in command-line scripts to make comfortable tools; the tools could then be rewritten as web screens using the same backend web calls as part of phase 3 (or phase 2.5, etc.). Not all scripts would need to be handled the same way. Porting author and citation checking to web screens via AJAX is something we have thought about before now; it is not a small project, and it is easy to slow down the input process tremendously just by getting it a bit wrong. It would be nice to have CLI tools as backups, or first tries.
Topic revision: r20 - 2008-02-06 - TravisBrooks