CERN Inputting

1. Introduction

This page summarizes the current inputting workflow practices at CERN. The page is not exhaustive; it shows typical examples only.

For similar practices at SLAC, see the SlacInputting page.

2. Harvesting from arXiv

The workflow is somewhat complex because harvested records must be stored both in CDS and in ALEPH500.

The overall schema is presented below. For more details, please see the dedicated CDS wiki page CDS.CERNDocumentServerArxivHarvesting.


3. CERN CLIC Notes: an example of collection cleaning

This example shows how the enrichment of a CERN series such as the CERN CLIC Notes is typically performed by the CERN library.

3.1. CLIC Notes: overview of actions and actors

This diagram shows an overview of various procedures and actors involved in the cleaning and enrichment of the CERN CLIC Notes collection. The procedures will be described in detail below.

Note that at peak times the library can have up to ten students and apprentices available.


3.2. CLIC Notes: detailed action list

Here is a detailed overview of actions being performed during the CLIC Notes collection enrichment and cleanup. The library maintains this as an MS Excel sheet. (This version is based on Cath's 2008-01-31 file.) Note that the sheet contains a status column that tracks the progress.


| Formula | Sep-07 | Jan-08 | Status | Persons |
| Formula to find the whole collection: reportnumber:'CLIC-Note-*' | 688 | 817 | ? | Cath |
| Formula to find the missing URLs: reportnumber:'CLIC-Note-*' not 8564_u:'0->zz' | 58 | 0 | done | Cath |
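
The second formula is the first one plus a negated clause on the URL field; the span '0->zz' appears to serve as a "has any value" test. As a minimal sketch (the helper names are hypothetical and not part of any CDS tool), the two formulas can be composed like this:

```python
def collection_formula(series="CLIC-Note"):
    """Formula matching every record of a report series."""
    return f"reportnumber:'{series}-*'"


def missing_url_formula(series="CLIC-Note"):
    """Same formula, minus the records that already carry a value in 8564_u."""
    return f"{collection_formula(series)} not 8564_u:'0->zz'"


print(collection_formula())
print(missing_url_formula())
```

Composing the formulas from one base string keeps the two counts comparable, since both are guaranteed to cover the same series.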

Actions for CLIC-Notes

| Action ID | Action description | Amount | Status | Persons |
| 1/ | To make a list of the report numbers of the whole collection & to run Kyriakos' script to detect missing report numbers | | done | Cath |
| 2/ | To find the missing paper copies in the library, to catalogue &/or to scan | 55 | done | Cath |
| 3/ | To find the missing paper copies in the archive, to catalogue &/or to scan | 3 | done | Cath |
| 4/ | To find the missing paper copies in the secretariat concerned (Sonia Escaffre, 7 doc), to catalogue &/or to scan | 7 | done | Cath & Sonia |
| 5/ | To add the barcodes & the URLs before scanning | 65 | done | Cath |
| 6/ | To scan (or to rescan: 1) & to verify the scanning | 65 | done | Cath & Scanning Service |
| 7/ | To catalogue in base 29 (blind base) the never existing documents (to maintain a complete list by report numbers) | 2 | done | Cath |
| 8/ | To add 595__a:SIS ARCCLICNOTE-2007 | all | done | Cath |
| 9/ | To add 084$$a:CLIC-Note-XXX$$2CERN Library --> to have a list in the right order | all | done | Cath |
| 10/ | To compare the number of documents on CDS & ALEPH, to verify and to resolve any synchronization problems | | done | Cath |
| 11/ | To extract a file of all the URLs, to check with Xenu, to verify the URLs (if bad URLs, but difficulties with set links so file to reformat & Xenu doesn't accept gz files) | | to do | Cath |
| 12/ | To run the specific configuration to find the references of publication from SLAC. Formula: reportnumber:'CLIC-Note-*' and 960:11 not 595__a:'sis:2008*'. If references of publication found, to add 773 or LKR & conferences to catalogue; & to correct in BAS13, with 690C_a:ARTICLE; & to add: 595__a:SIS:2008XX PR/LKR added (from SLAC) (XX is the month) | 3 | done | Cath |
| 13/ | To search the references of publication from INSPEC, KEK, Google. Formula: reportnumber:'CLIC-Note-*' and 960:11 not 595__a:'sis:2008*' | | on going | Vanessa & Sara |
| 13/ a- | If references of publication found, to add 773 or LKR & conferences to catalogue; & to correct in BAS13, with 690C_a:ARTICLE; & to add: 595__a:SIS:2008XX PR/LKR added (from INSPEC) or KEK, Google, or from the journal of the proceedings ?????? (XX is the month) | 401? | on going | Vanessa & Sara |
| 13/ b- | If no references of publication found, to add: 595__a:SIS:2008XX PR/LKR not found (from INSPEC, KEK or Google or the journal of the proceedings) ?????? | ? | on going | Vanessa & Sara |
| 14/ | To create a file of documents without CERN affiliation and to check on the full text if CERN affiliation, & to add it if present on the full text | | | Cath |
| | 1st formula: reportnumber:'CLIC-Note-*' not affiliation:' CERN*' and 960:13 and 773__p:'phys. rev.*' | null | | |
| | 2nd formula: reportnumber:'CLIC-Note-*' not affiliation:'cern*' and 960:13 and 773__p:'rev. mod. phys.*' | null | | |
| | 3rd formula: reportnumber:'CLIC-Note-*' not affiliation:'cern*' and 960:13 and 773__p:'ieee*' | null | | |
| | 4th formula: reportnumber:'CLIC-Note-*' not affiliation:'cern*' and 960:13 and 773__p:'jhep*' | null | | |
| | 5th formula: reportnumber:'CLIC-Note-*' not affiliation:'cern*' and 960:13 and 773__p:'jinst*' | null | | |
| 15/ | To create a file of documents without CERN affiliation and to check on the full text if CERN affiliation, & to add it if present on the full text. Formula for the whole collection, only if time: reportnumber:'CLIC-Note-*' not affiliation:CERN | 123 | | Summer student |
| 16/ | After that, to run a program to import in ALEPH the accepted full texts from the publishers | | | Joce |
| 17/ | To add: 088$$9 barcode (from 8564$$u) to find the file again if lost on ALEPH | | to do | Cath |
| 18/ | To run the script "clean title" to format the titles of journals with the KB of journals | | to do | Cath |
| 19/ | To open manually all scanned documents (with barcodes SCAN, CM-P, P?), to check whether forbidden publisher URLs exist | | to do | Cath |
| 20/ | To create a program to detect forbidden URLs (to check Elsevier, Springer, ScienceDirect, copyright) for scanned documents | | to do | Rado |
| 21/ | To open manually all non scanned documents, to check whether forbidden publisher URLs exist | | done | Students |
| 22/ | To create a program to detect forbidden URLs (to check Elsevier, Springer, ScienceDirect, copyright) for non scanned documents | | to do | Tibor |
| 23/ | To add 595$$a:giva a faire & to run the program GIVA to add authors from a collaboration (if missing). Formula: reportnumber:'clic-note-*' and 710__g:'* Collaboration' not 700__a:'0->zz' | 1 | done | Cath |
| 24/ | To clean experiments, accelerators, collaborations (new knowledge to create) | | to do | Cath |
| 25/ | To run chakau, chkall, autocheck | | to do | Cath |
| 26/ | To move all the paper collection from the library to the archive | | to do | Cath |
| 27/ | To correct all the holdings | | to do | Vanessa |
| 28/ | To make an Excel file for statistics | | to do | Cath |

Out of these, the "core" actions are (i) the author and affiliation completion with Giva (23), (ii) the publication reference completion (12-13), (iii) the fulltext file completion (14-16). Every core action is specified with a dedicated workflow diagram below.

Note that some actions use programs and scripts described on the ComparisonSlacFermilabDesyCernEnrichmentScripts page. An overview of the tools used:

| Action ID | Uses script |
| 1/ | reportcheck (C9) |
| 11/ | Xenu (C32) |
| 12/ | uploader (C4), config SPIRESBIBTEX |
| 14/ | manually, or Giva |
| 15/ | manually, or Giva |
| 16/ | uploader (C4) |
| 18/ | cleantitle (C31) |
| 23/ | Giva |
| 25/ | CernBibCheck, autocheck (C30) |
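
To illustrate action 1/, here is a minimal sketch of how gaps in a report number sequence could be detected. It is not the actual reportcheck (C9) script; the function name is hypothetical, and it assumes that report numbers have the form CLIC-Note-N with N an integer and that the series has no intentional gaps.

```python
import re


def missing_report_numbers(report_numbers, prefix="CLIC-Note-"):
    """Return the report numbers absent between the lowest and highest seen."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    seen = set()
    for report_number in report_numbers:
        match = pattern.match(report_number.strip())
        if match:  # silently ignore values that do not follow the pattern
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [
        f"{prefix}{n}"
        for n in range(min(seen), max(seen) + 1)
        if n not in seen
    ]


print(missing_report_numbers(["CLIC-Note-1", "CLIC-Note-2", "CLIC-Note-5"]))
```

The numbers reported as missing are then the candidates for actions 2/ to 4/ (hunting for paper copies) or for action 7/ (cataloguing never existing documents).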

3.3. CLIC Notes: adding publication references

This diagram shows how the publication references are completed during the CLIC Notes collection enrichment. (Actions 12/-13/ from the above list.)

Note that an "internal note" tag (595) is used to track success/failure of the enrichment process.
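
As a minimal sketch of this tracking convention, the following builds the 595 note text in the form used by actions 12/-13/ (SIS:2008XX PR/LKR added / not found) and tests whether a record already carries such a note, mirroring the "not 595__a:'sis:2008*'" clause of the formula. The function names are hypothetical and the exact note wording used by the cataloguers may vary.

```python
def tracking_note(year, month, source, found):
    """Build the internal note recording the outcome of a reference search."""
    outcome = "added" if found else "not found"
    return f"SIS:{year:04d}{month:02d} PR/LKR {outcome} (from {source})"


def already_treated(notes, year):
    """True if any 595__a note of the record starts with 'sis:<year>'."""
    prefix = f"sis:{year:04d}"
    return any(note.lower().startswith(prefix) for note in notes)


print(tracking_note(2008, 1, "SLAC", True))
print(already_treated(["SIS ARCCLICNOTE-2007"], 2008))
```

Because the note is written whether or not a reference was found, a record drops out of the formula after one pass and is not searched twice.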


3.4. CLIC Notes: adding fulltext files

This diagram shows how the fulltext files are collected during the CLIC Notes collection enrichment. (Actions 14/-16/ from the above list.)

Note that the collection is not considered completely cleaned unless all articles have a fulltext file: either an electronic file is found and submitted or a paper copy is scanned. Hunting for fulltext is done in three ways: searching Google etc., retrieving the paper copy from the CERN physical archives and rescanning it, or asking the author by email. The choice is ad hoc and depends on which way is most likely to succeed (e.g. judging by the year of publication).

The email communication process with authors is currently not assisted by any tool. Note that for some collections such as CERN Theses the communication with authors amounted to about 500 emails. A tracking tool would be useful here.


3.5. CLIC Notes: maintenance, the day after

Once the collection of CLIC Notes is enriched and cleaned, further maintenance of this collection is performed by the autocheck tool C30 (see SystemDesignBibExport), which alerts cataloguers automatically when problems are detected.

4. Importing JHEP publication references and OA fulltext files

For SISSA/IOP, all JHEP and JINST articles are taken with the uploader tool (C4, aka SystemDesignBibConvert). The fulltext file is attached for all OA articles; the link is "SISSA/IOP Open Access article".

This workflow is representative for all importations using the uploader tool (C4).


5. Importing from publishers: Elsevier, APS, IEEE

In the past the uploader tool (C4) was often used to import publication references (e.g. from arXiv). Now the process usually starts at the publishers' web sites and is highly manual (screen-scraping).

For the publishers Elsevier, APS and IEEE, we search the publishers' sites for "affiliation CERN" and the current year across all their journals. We input new articles and update the records we already have in CDS with the publication references. We then attach the published file for CERN-affiliated articles published by APS and IEEE (authorisation exists for CERN papers); the link is "APS Published version, local copy" or "IEEE Published version, local copy". Other metadata are updated at the same time; for example, there is no affiliation in APS, so we need to add it.
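
The attachment rule can be sketched as a small lookup. This is only an illustration of the rule as described here (the function and variable names are hypothetical); in practice the decision is taken by the cataloguer.

```python
# Link labels quoted on this page; publishers without an entry get no local copy.
LINK_LABELS = {
    "APS": "APS Published version, local copy",
    "IEEE": "IEEE Published version, local copy",
}


def fulltext_link_label(publisher, has_cern_affiliation):
    """Return the link label for the attached file, or None if no file is attached."""
    if not has_cern_affiliation:
        return None
    return LINK_LABELS.get(publisher)


print(fulltext_link_label("APS", True))
print(fulltext_link_label("Elsevier", True))
```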

We harvest the publishers four times a year. We have to eliminate manually what we have already done, but since we always run the same search with the same sorting ("most recent first"), this is not too difficult. We harvest and input manually, which is more reliable than uploader matching because of problems with LaTeX and formulas.
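
The elimination step is done by hand, but its logic can be sketched as follows, assuming (hypothetically) that each search result carries a stable identifier under the key "id" and that the identifiers already input were noted during the previous harvests.

```python
def select_new_articles(search_results, already_input):
    """Keep only the results not yet input, preserving the search order."""
    seen = set(already_input)
    new_articles = []
    for article in search_results:
        if article["id"] not in seen:
            new_articles.append(article)
            seen.add(article["id"])  # also drops duplicates within one result list
    return new_articles


results = [{"id": "c"}, {"id": "b"}, {"id": "a"}]  # most recent first
print(select_new_articles(results, already_input=["a", "b"]))
```

With a "most recent first" sorting, the new articles normally sit at the top of the list, which is why the manual elimination stays manageable.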

Once per year, usually in January, we check all the articles for the previous year, to be sure nothing was missed.

6. Importing conferences via uploader: JACoW, AIP, etc

Many conference proceedings are imported by using the same uploader tool (C4) as for the JHEP publication references and fulltext files described above, so it is not necessary to present the workflow here.

For example, conferences published by JACoW are imported via a dedicated C4 configuration that handles the "Open Archive Format", and all EPAC, PAC, DIPAC, FEL, ICALEPCS, etc. contributions are taken.

As another example, for electronic conference proceedings published by AIP, the Library Physicist makes a selection by subject, and then the uploader tool is again used to import them.

7. Importing conferences via hard copy: BEAUTY 2006, etc

For conference proceedings that we receive in hard copy, such as BEAUTY 2006, a manual process is needed. We look for contributions with a CERN affiliation and input new CERN articles, and we update all contributions that we already have (CERN or not CERN) attached to this conference.
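
The decision taken for each contribution can be sketched as below. The process itself is manual; the record layout (keys "id" and "affiliations") and the function name are hypothetical and only serve to make the rule explicit.

```python
def triage_contributions(contributions, existing_ids):
    """Split contributions into those to input, those to update, and the rest."""
    to_input, to_update, skipped = [], [], []
    for contribution in contributions:
        if contribution["id"] in existing_ids:
            # Already in the database: always updated, CERN or not CERN.
            to_update.append(contribution)
        elif any("CERN" in aff for aff in contribution["affiliations"]):
            # Not yet in the database and CERN-affiliated: input as new.
            to_input.append(contribution)
        else:
            skipped.append(contribution)
    return to_input, to_update, skipped


talks = [
    {"id": 1, "affiliations": ["CERN"]},
    {"id": 2, "affiliations": ["SLAC"]},
    {"id": 3, "affiliations": ["DESY"]},
]
print(triage_contributions(talks, existing_ids={2}))
```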

8. Inputting authors and affiliations with Giva

For collaboration papers written by many authors, the following procedure is used to extract the author names together with their affiliations:


