Converting Insts to Inspire

Introduction

SPIRES has a database, institutions, that contains the allowed entities that can be used as affiliations on author lists. This db also contains information about each inst., including address, director, etc. At the very least, the authority aspect of this db must be ported over so that input can work correctly. At best, the entire db can be moved and maintained in Inspire (but it is my contention that this should come later...). This db also contains the lookups commonly used by DESY and by authors themselves to refer to insts, so it serves as a knowledge base for the process of determining affiliations, and a nicely maintainable knowledge base at that.

Purpose

Some possible uses of inst:

  • authority for affiliations in hep
  • affiliation history on author page
    • finer granularity gives more meaningful profile
  • authoritative list of HEP institutes
    • CERN list not well maintained, probably to be replaced by Inspire inst
    • granularity on institutional level at least for HEP inst
  • mailing lists
    • recent example: LHC bible
    • conference poster etc
    • institutional level advisable since mail to general university address might get dumped
  • publication list / annual report for core HEP institutions
  • institutional metrics
  • SCOAP3 accounting

Requirements:

  • granularity
    • departmental level at least for HEP institutes, preferably as well for closely related disciplines (astrophysics, nuclear physics...) to facilitate exchange
  • core tag
  • field code
  • hierarchical structure
    • country -> univ -> department
  • history
    • predecessor, successor (e.g. for mergers)
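
The hierarchy and history requirements could be modelled roughly as follows. This is only a minimal sketch: the `Institution` class, its field names, and the sample records are hypothetical and are not taken from any SPIRES or Inspire schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Institution:
    """Hypothetical record shape; field names are illustrative only."""
    name: str
    level: str                                   # "country", "univ" or "department"
    parent: Optional["Institution"] = None       # hierarchical structure
    predecessors: list = field(default_factory=list)  # history, e.g. after a merger
    core: bool = False                           # the "core tag" requirement
    field_codes: list = field(default_factory=list)


def hierarchy_path(inst):
    """Return the names from the top of the hierarchy down to inst."""
    path = []
    while inst is not None:
        path.append(inst.name)
        inst = inst.parent
    return list(reversed(path))


# Made-up sample data
country = Institution("Exampleland", "country")
univ = Institution("Example U.", "univ", parent=country)
dept = Institution("Example U., Phys. Dept.", "department", parent=univ, core=True)
```

Walking the `parent` links gives the country -> univ -> department chain for any record, which is what an author-page affiliation history or an institutional mailing list would need.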

-- AnnetteHoltkamp - 05-Dec-2010

First Steps

In order to facilitate the creation of author-list inputting tools, Marko and I are working on porting the db to an xmlMARC format like that used here:

http://cdsweb.cern.ch/collection/HEP%20Institutes

I have created a SPIRES format (xmlinspire) for insts that spits out basic XML. I will probably need to modify this to unravel complicated lookups like country name (we use an indirect country code, which then looks up a name, so that names of countries can change easily (political reality...)). However, for the moment I will simply take this XML output at face value; more should eventually be done to capture these subtleties... See the field mapping below for a blow-by-blow account of the mapping.
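
To illustrate the indirect country lookup described above, here is a minimal sketch. The codes, names, and the `country_name` helper are all made up for illustration; the real SPIRES country table is not shown on this page.

```python
# Hypothetical code-to-name table; the real SPIRES codes and names differ.
COUNTRY_NAMES = {
    "XA": "Exampleland",
    "XB": "Republic of Samplia",
}


def country_name(code, table=COUNTRY_NAMES):
    """Resolve a stored country code to its current display name."""
    return table.get(code, code)  # fall back to the raw code if unknown


# The record stores only the code, never the name.
record = {"inst": "Example U.", "country.code": "XA"}
print(country_name(record["country.code"]))
```

Because records carry only the code, renaming a country means changing one table entry rather than touching every inst record.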

Don't forget to convert a couple of Latin-1 encoded chars to UTF-8:

iconv -fLatin1 -tUTF8, as in the HEP case (using a Linux box...)
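
For anyone without iconv at hand, the same transcoding step can be sketched in Python. The function name and sample bytes are illustrative assumptions, not part of the actual conversion scripts.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Decode Latin-1 bytes and re-encode them as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


sample = b"Universit\xe9"          # "Université" encoded as Latin-1
converted = latin1_to_utf8(sample)
print(converted.decode("utf-8"))
```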

XSL

Created XSL to do the mapping and created an /inspire/inst directory in CVS to contain this work. That has a makefile similar to the one in bibconvert, and the XML file of the inst. dump is in a similar ftp location. Travis tested the conversion, though he has not yet gotten the upload to work and create a new collection...

Field Mapping (Travis' suggestion)

There are far fewer elements here than in HEP, so this is relatively straightforward. Using just a few examples from the CDS site, I've guessed the following mapping:

| SPIRES Name | MARC field | Notes |
| INST | 970b | Our key, used in HEP records. Could go in 001 possibly. Made up the "b" subfield. |
| IC | --- | Unknown use at the moment. |
| IMC | 270z | Mailing/postal code. Chose "z" subfield at random for "zip". LOC: 270e |
| country.code | 270c | Our code for the country name. I can replace it with the name from the lookup, but better to do this indirectly... |
| inst.catch.name | 110a | Our std name for the inst |
| address | 270a | Our free-form address, includes free-form name (multiply occurring) |
| State.Code | 270s | Our state/province code. LOC: 270e |
| Report.code | ---- | Prefix on report numbers???? |
| DEPartment | --- | Would be nice in 110b, but we didn't really use this... |
| city | 270b | City |
| desy.aff | 595a | DESY name for this aff, 9=DESY |
| type | 980a | We tag various insts of note... just guessing where this goes... |
| director | 270p | I think you have contact, we have director... |
| director-note | 270n | |
| director-date | 270d | |
| xtra-indexi | 595b | These are extra words that should help find the inst, but often aren't in the record, i.e. common searches |
| desylookup | 595a | Ditto above; these are other ways of writing the name, useful for lookups |
| oaff | 595a | Ditto above, not sure what the difference is here |
| phone.number | 270l | Phone. LOC: 270k. 270l for fax. Still needed? |
| email.contact | 270m | Email (of contact or director...) |
| date-updated | 961c | Date we last touched it... |
| date-added | 961x | Added date |
| note1 | 500a | Random extra information, human readable... |
| district | ??? | Lat/long coordinates used to create maps |
| url | 856u | URL for entity (usually inst, not dept) |
| time-zone-info | ??? | Time zone relative to UTC |
| street-address | 270a | Street address |
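
As a rough illustration of how such a mapping could be applied (the actual conversion is done in XSL, which is not reproduced here), the sketch below turns a dict of SPIRES element values into MARCXML-style datafields. It uses only a subset of the suggested mapping, and the input dict, the `to_marc_record` helper, and the sample values are assumptions; the real xmlinspire output will look different.

```python
import xml.etree.ElementTree as ET

# Subset of the suggested mapping: SPIRES element -> (MARC tag, subfield code)
FIELD_MAP = {
    "inst.catch.name": ("110", "a"),
    "city": ("270", "b"),
    "country.code": ("270", "c"),
    "url": ("856", "u"),
}


def to_marc_record(spires):
    """Build a MARCXML-style <record> from a dict of SPIRES element values."""
    record = ET.Element("record")
    fields = {}  # one datafield per tag, so 270b and 270c share a field
    for name, value in spires.items():
        if name not in FIELD_MAP:
            continue  # unmapped elements are simply skipped in this sketch
        tag, code = FIELD_MAP[name]
        if tag not in fields:
            attrs = {"tag": tag, "ind1": " ", "ind2": " "}
            fields[tag] = ET.SubElement(record, "datafield", attrs)
        sub = ET.SubElement(fields[tag], "subfield", {"code": code})
        sub.text = value
    return record


# Made-up input values
rec = to_marc_record({
    "inst.catch.name": "Example U.",
    "city": "Sampleville",
    "country.code": "XA",
    "IC": "ignored",
})
print(ET.tostring(rec, encoding="unicode"))
```

Keeping the mapping in one table makes it easy to revise individual rows (for example, moving the address fields from 270 to another tag) without touching the conversion logic.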

Field Mapping (final)

Based on CDS usage and the LoC MARC documentation.

| MARC field | Subfield | Content | SPIRES field | Note |
| 110 | a | Institution | part of address | corporate name in native language, well-known acronym in brackets behind |
| | b | Department | part of address | subordinate unit in native language, well-known acronym in brackets behind |
| | t | newICN | | HEP affil following new standards |
| | u | inst.catch.name | ICN | HEP affiliation (spires name) |
| | x | | obsolete ICN | ICN of obsolete inst for which this inst should be used instead |
| 371 | a | address | part of address | street etc, city with postal code + additions (native language) |
| | b | city | part of address | in English |
| | c | state or province | part of address | |
| | d | country | country.name | in English |
| | e | postal code | part of address | bare form |
| | g | country code | | |
| | x | | | "secondary" actually was used only once! for Castel Gandolfo http://inspirehep.net/record/905456 |
| 034 | d | longitude | district | |
| | f | latitude | district | |
| | 2 | source | | e.g. bibcheck geocode including version |
| | q | type | | match type/quality |
| 035 | 9 | external identifier schema | | "HAL" or "GRID" |
| | a | external identifier | | e.g. "grid.12345.1" |
| 043 | t | time zone | | |
| 372 | a | field of activity | | University, Research center, Company |
| 410 | a | name variants | DLU, DESYAFF | + standard acronyms |
| | 9 | source of name variant | | desy... |
| 410 | g | xtra words | xtra-index | to help find insts |
| 510 | a | name of related inst | ICN 110__u | |
| | w | type of relation | | a predecessor, b successor, t parent inst, r otherwise related |
| | 0 | record nr of related inst | | |
| | i | specify relation if $$w"r" | | |
| 595 | a | hidden note | | |
| 65017 | a | content classification | | same as HEP categories, FC |
| 667 | a | nonpublic note | note1 | |
| 6781 | a | historical data | | administrative history |
| 680 | i | public note | | |
| 8564 | u | url | | inst website |
| 961 | c | | date-updated | |
| 961 | x | | date-added | |
| 980 | a | tags | type | various tags (which are still useful?) |
| 980 | a | CORE | | PPF/NON-PPF should move to 690C, don't use NONCORE (default) |
| | b | DEAD | | |
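
The 510 relation codes listed in the table above could be decoded along these lines. This is a sketch under assumptions: the `describe_relation` helper and the plain-dict representation of a field are hypothetical and not an Inspire API, and the sample values are made up.

```python
# Relation codes as listed for 510 $w in the table above
RELATION_TYPES = {
    "a": "predecessor",
    "b": "successor",
    "t": "parent inst",
    "r": "otherwise related",
}


def describe_relation(subfields):
    """Turn the subfields of one 510 field into a readable string.

    `subfields` is a plain dict of subfield code -> value.
    """
    code = subfields.get("w")
    label = RELATION_TYPES.get(code, "unknown relation")
    if code == "r" and subfields.get("i"):
        label = subfields["i"]  # $i specifies the relation when $w is "r"
    return f"{label}: {subfields.get('a', '?')}"


example = {"a": "Example Inst.", "w": "a", "0": "12345"}
print(describe_relation(example))
```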

Questions

  • Do we want to maintain director information?
  • Do we need report.code (report nr prefix)?
  • Do we still need phone/fax? needs maintenance
  • Where to store historical ICNs? Also in 595 with $9 Inspire?

Elements Not mapped

Several elements above were not mapped because I wasn't sure what to do with them, but they seemed important (except where noted). Below are ones that might be useful, but probably aren't...

 Opt  Sing       String  00/09       REMOVAL.DATE, RD
 Opt  Sing       Hex     00/0A       LIST.CODE, LC, NOT-USED
 Opt  Sing       String  00/0C       INSTCODE, ICODE
 Opt  Mult       String  00/12       ACCELERATOR, ACC
 Opt  Mult       Hex     00/13       DESY-DATEUPD, ACC-NOTE, AN, DDATE
 Opt  Mult       String  00/14       DESY-ACCTUPD, ACC-DATE, AD, DACCT|
 Opt  Mult       String  00/18       PO.BOX, PHONE-NOTE, PNOTE
 Opt  Mult       String  00/1A       FTS-NUMBER, FTSN
 Opt  Mult       String  00/1B       TELEX, TX
 Opt  Mult       String  00/1C       CABLE, CA, EXP, EXPERIMENT.CODE
 Opt  Mult       Struc   00/1D       COMPUTER-STR
 Req  Sing       String  01/00  key  . COMPUTER-NETWORK, NET, NETWORK
 Opt  Mult       Struc   01/01       . NODE-STR
 Req  Sing       String  02/00  key  . . NODE-ADDRESS, NODE, NODE-ID
 Opt  Mult       String  02/01       . . COMPUTER, COMP
 Opt  Mult       String  02/02       . . DEPARTMENT, DEPT
 Opt  Mult       String  02/03       . . MAIL-CONTACT, MCON
 Opt  Mult       String  00/1F       TIME-ZONE-INFO, TZ
 Opt  Mult       Struc   00/20       TELECOPIER-STR, TELESTR
 Req  Sing       String  03/00  key  . TELECOPIER-NUM, FAX, TCOP
 Opt  Sing       String  03/01       . TELECOPIER-NOTE, TNOT
 Opt  Mult       Hex     00/22       BITNET.NODE, BN, TIME-UPDATED, TU
 Opt  Mult       String  00/24       ACCTADD, AA
 Opt  Mult       String  00/25       ACCTUPD, AUP
 Opt  Mult       String  00/26       DAFF.UPLOW, DAFFU
 Opt  Mult       String  00/27       CERN.AFF, CAFF

-- TravisBrooks - 24 Nov 2007

Topic revision: r22 - 2018-11-22 - KirstenSachs
 