CERN Inputting
1. Introduction
This page summarizes the current inputting workflow practices at CERN.
The page is not exhaustive; it shows typical examples only.
For similar practices at SLAC, see the SlacInputting page.
2. Harvesting from arXiv
The workflow is a bit complex due to the requirement to store
harvested records both in CDS and in ALEPH500.
The overall schema is presented below. For more details, please see
the dedicated CDS wiki page
CDS.CERNDocumentServerArxivHarvesting.
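Because harvested records must exist in both systems, a consistency check between the two is a natural companion to the harvest (the CLIC Notes action list further down includes such a comparison as action 10/). The following is a minimal sketch only: the function name and the sample identifiers are hypothetical, and it assumes that a list of record identifiers can be exported from each system.

```python
def compare_holdings(cds_ids, aleph_ids):
    """Return the identifiers that are present in only one of the two systems."""
    cds, aleph = set(cds_ids), set(aleph_ids)
    return {
        "only_in_cds": sorted(cds - aleph),
        "only_in_aleph": sorted(aleph - cds),
    }


# Hypothetical identifiers, for illustration only.
report = compare_holdings(
    ["arXiv:0801.0001", "arXiv:0801.0002", "arXiv:0801.0003"],
    ["arXiv:0801.0001", "arXiv:0801.0003", "arXiv:0801.0004"],
)
print(report)
```

An empty list on both sides would mean that the two systems hold the same set of harvested records.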
3. CERN CLIC Notes: an example of collection cleaning
This example shows how the enrichment of a CERN series such as the
CERN CLIC Notes is typically performed by the CERN library.
3.1. CLIC Notes: overview of actions and actors
This diagram shows an overview of various procedures and actors
involved in the cleaning and enrichment of the CERN CLIC Notes
collection. The procedures will be described in detail below.
Note that at peak times the library can have up to ten students and
apprentices available.
3.2. CLIC Notes: detailed action list
Here is a detailed overview of actions being performed during the CLIC
Notes collection enrichment and cleanup. The library maintains this
as an MS Excel sheet. (This version is based on Cath's 2008-01-31
file.) Note that the sheet contains a status column that tracks the
progress.
| CERN CLIC Notes | Sep-07 | Jan-08 | Status | Persons |
| Formula to find all the collection: reportnumber:'CLIC-Note-*' | 688 | 817 | ? | Cath |
| Formula to find the missing URLs: reportnumber:'CLIC-Note-*' not 8564_u:'0->zz' | 58 | 0 | done | Cath |
Actions for CLIC-Notes
| Action ID | Action description | Amount | Status | Persons |
| 1/ | To make a list of the report numbers of the whole collection & to run Kyriakos' script to detect missing report numbers | | done | Cath |
| 2/ | To find the missing paper copies in the library, to catalogue &/or to scan | 55 | done | Cath |
| 3/ | To find the missing paper copies in the archive, to catalogue &/or to scan | 3 | done | Cath |
| 4/ | To find the missing paper copies in the secretariat concerned (Sonia Escaffre, 7 doc), to catalogue &/or to scan | 7 | done | Cath & Sonia |
| 5/ | To add the barcodes & the URLs before scanning | 65 | done | Cath |
| 6/ | To scan (or to rescan: 1) & to verify the scanning | 65 | done | Cath & Scanning Service |
| 7/ | To catalogue in base 29 (blind base) the documents that never existed (to maintain a complete list by report numbers) | 2 | done | Cath |
| 8/ | To add 595__a:SIS ARCCLICNOTE-2007 | all | done | Cath |
| 9/ | To add 084$$a:CLIC-Note-XXX$$2CERN Library --> to have a correctly ordered list | all | done | Cath |
| 10/ | To compare the number of documents on CDS & ALEPH, to verify, and to resolve any synchronization problems | | done | Cath |
| 11/ | To extract a file of all the URLs, to check with Xenu, to verify the URLs | | to do | Cath |
| | (if bad URLs, but difficulties with set links so file to reformat & Xenu doesn't accept gz files) | | | |
| 12/ | To run the specific configuration to find the publication references from SLAC | | | |
| | Formula: reportnumber:'CLIC-Note-*' and 960:11 not 595__a:'sis:2008*' | | | |
| | If publication references found, to add 773 or LKR & conferences to catalogue | 3 | done | Cath |
| | & to correct in BAS13, with 690C_a:ARTICLE | | | |
| | & to add: 595__a:SIS:2008XX PR/LKR added (from SLAC) XX is the month | | | |
| 13/ | To search the publication references from INSPEC, KEK, Google | | on going | Vanessa & Sara |
| | Formula: reportnumber:'CLIC-Note-*' and 960:11 not 595__a:'sis:2008*' | | | |
| a- | If publication references found, to add 773 or LKR & conferences to catalogue | 401? | on going | Vanessa & Sara |
| | & to correct in BAS13, with 690C_a:ARTICLE | | | |
| | & to add: 595__a:SIS:2008XX PR/LKR added (from INSPEC) or KEK, Google, or from the journal of the proceedings ?????? (XX is the month) | | | |
| b- | If no publication references found, to add: | ? | on going | Vanessa & Sara |
| | 595__a:SIS:2008XX PR/LKR not found (from INSPEC, KEK or Google or the journal of the proceedings) ?????? | | | |
| 14/ | To create a file of documents without CERN affiliation, to check the full text for a CERN affiliation, & to add it if present in the full text | | | Cath |
| | 1st formula: reportnumber:'CLIC-Note-*' not affiliation:' CERN*' and 960:13 and 773__p:'phys. rev.*' | null | | |
| | 2nd formula: reportnumber:'CLIC-Note-*' not affiliation:'cern*' and 960:13 and 773__p:'rev. mod. phys.*' | null | | |
| | 3rd formula: reportnumber:'CLIC-Note-*' not affiliation:'cern*' and 960:13 and 773__p:'ieee*' | null | | |
| | 4th formula: reportnumber:'CLIC-Note-*' not affiliation:'cern*' and 960:13 and 773__p:'jhep*' | null | | |
| | 5th formula: reportnumber:'CLIC-Note-*' not affiliation:'cern*' and 960:13 and 773__p:'jinst*' | null | | |
| 15/ | To create a file of documents without CERN affiliation, to check the full text for a CERN affiliation, & to add it if present in the full text | 123 | | Summer student |
| | Formula for the whole collection, only if time: reportnumber:'CLIC-Note-*' not affiliation:CERN | | | |
| 16/ | After that, to run a program to import into ALEPH the accepted full texts from the publishers | | | Joce |
| 17/ | To add: 088$$9 barcode (from 8564$$u) to find the file again if lost on ALEPH | | to do | Cath |
| 18/ | To run the script "clean title" to format the journal titles with the KB of journals | | to do | Cath |
| 19/ | To open manually all scanned documents (with barcodes SCAN, CM-P, P?), to check whether forbidden publisher URLs exist | | to do | Cath |
| 20/ | To create a program to detect forbidden URLs (to check Elsevier, Springer, ScienceDirect, copyright) for scanned documents | | to do | Rado |
| 21/ | To open manually all non-scanned documents, to check whether forbidden publisher URLs exist | | done | Students |
| 22/ | To create a program to detect forbidden URLs (to check Elsevier, Springer, ScienceDirect, copyright) for non-scanned documents | | to do | Tibor |
| 23/ | To add 595$$a:giva a faire | | | |
| | & to run the program GIVA to add authors from a collaboration (if missing) | 1 | done | Cath |
| | Formula: reportnumber:'clic-note-*' and 710__g:'* Collaboration' not 700__a:'0->zz' | | | |
| 24/ | To clean experiments, accelerators, collaborations (new knowledges to create) | | to do | Cath |
| 25/ | To run chakau, chkall, autocheck | | to do | Cath |
| 26/ | To move all the paper collection from the library to the archive | | to do | Cath |
| 27/ | To correct all the holdings | | to do | Vanessa |
| 28/ | To make an Excel file for statistics | | to do | Cath |
Out of these, the "core" actions are (i) the author and affiliation
completion with Giva (23), (ii) the publication reference completion
(12-13), and (iii) the fulltext file completion (14-16). Every core
action is specified with a dedicated workflow diagram below.
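The five search formulas listed under action 14/ differ only in the journal title prefix. As a minimal sketch (the helper name is hypothetical, and the affiliation term uses the lowercase 'cern*' spelling of formulas 2 to 5), they could be generated rather than typed by hand:

```python
# Fixed parts of the formula, copied from the action list.
COLLECTION = "reportnumber:'CLIC-Note-*'"
JOURNAL_PREFIXES = ["phys. rev.", "rev. mod. phys.", "ieee", "jhep", "jinst"]


def missing_affiliation_query(journal_prefix):
    """Build the action 14/ formula for one journal title prefix."""
    return (
        f"{COLLECTION} not affiliation:'cern*' "
        f"and 960:13 and 773__p:'{journal_prefix}*'"
    )


queries = [missing_affiliation_query(prefix) for prefix in JOURNAL_PREFIXES]
for query in queries:
    print(query)
```

Adding another journal to the check then only requires one more entry in the prefix list.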
Note that some actions use programs and scripts described on the
ComparisonSlacFermilabDesyCernEnrichmentScripts page. Overview of
the tools used:
| Action ID | Uses script |
| 1/ | reportcheck (C9) |
| 11/ | Xenu (C32) |
| 12/ | uploader (C4), config SPIRESBIBTEX |
| 14/ | manually, or Giva |
| 15/ | manually, or Giva |
| 16/ | uploader (C4) |
| 18/ | cleantitle (C31) |
| 23/ | Giva |
| 25/ | CernBibCheck, autocheck (C30) |
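To illustrate the kind of check that action 1/ performs, here is a minimal sketch of gap detection in a list of report numbers. It is not the reportcheck (C9) script itself: the function name is hypothetical, and it assumes that report numbers have the form CLIC-Note-<integer>.

```python
import re

# Assumed report number shape: "CLIC-Note-" followed by digits.
PATTERN = re.compile(r"^CLIC-Note-(\d+)$", re.IGNORECASE)


def missing_report_numbers(report_numbers):
    """Return the integers missing between the lowest and highest number seen."""
    seen = set()
    for report_number in report_numbers:
        match = PATTERN.match(report_number.strip())
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


gaps = missing_report_numbers(
    ["CLIC-Note-1", "CLIC-Note-2", "CLIC-Note-5", "clic-note-6"]
)
print(gaps)  # [3, 4]
```

Each reported gap is then either a document to hunt for (actions 2/ to 4/) or a number that was never used (action 7/).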
3.3. CLIC Notes: adding publication references
This diagram shows how the publication references are completed
during the CLIC Notes collection enrichment. (Actions 12/-13/ from
the above list.)
Note that an "internal note" tag (595) is used to track
success/failure of the enrichment process.
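The tracking notes in the action list follow a fixed pattern: a SIS:YYYYMM stamp, the outcome, and the source that was searched. A minimal sketch of a helper that produces such 595__a values is given below; the function name and signature are hypothetical, and only the text pattern is taken from the action list.

```python
def tracking_note(year, month, found, source):
    """Build the 595__a value that records the outcome of a reference search."""
    outcome = "added" if found else "not found"
    return f"SIS:{year:04d}{month:02d} PR/LKR {outcome} (from {source})"


note_found = tracking_note(2008, 2, True, "SLAC")
note_missing = tracking_note(2008, 3, False, "INSPEC")
print(note_found)    # SIS:200802 PR/LKR added (from SLAC)
print(note_missing)  # SIS:200803 PR/LKR not found (from INSPEC)
```

Because every note starts with the same stamp, a search term such as 595__a:'sis:2008*' selects the records that were already processed in that year.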
3.4. CLIC Notes: adding fulltext files
This diagram shows how the fulltext files are collected during
the CLIC Notes collection enrichment. (Actions 14/-16/ from the above
list.)
Note that the collection is not considered completely cleaned unless
all articles have a fulltext file: either an electronic file is found
and submitted, or a paper copy is scanned. Hunting for fulltext is
done in three ways: searching Google and similar services, locating
the paper copy in the CERN physical archives and rescanning it, or
asking the author by email. The choice is ad hoc and depends on which
way is most likely to succeed (judged, e.g., by the year of publication).
The email communication with authors is currently not assisted
by any tool. Note that for some collections, such as CERN Theses, the
communication with authors amounted to ~500 emails. A
tracking tool would be useful here.
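As a rough idea of what such a tracking tool would need to record, here is a minimal sketch. No such tool exists in the current workflow: the record fields, the status names and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FulltextRequest:
    report_number: str
    author_email: str
    # Hypothetical states: "requested", "reminded", "received".
    status: str = "requested"


def summarize(requests):
    """Count the requests per status."""
    return Counter(request.status for request in requests)


def pending(requests):
    """List the report numbers still waiting for a fulltext file."""
    return [r.report_number for r in requests if r.status != "received"]


# Made-up sample data, for illustration only.
requests = [
    FulltextRequest("CLIC-Note-101", "author1@example.org", "received"),
    FulltextRequest("CLIC-Note-102", "author2@example.org"),
    FulltextRequest("CLIC-Note-103", "author3@example.org", "reminded"),
]
print(summarize(requests))
print(pending(requests))
```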
3.5. CLIC Notes: maintenance, the day after
Once the collection of CLIC Notes is enriched and cleaned, further
maintenance of this collection is performed by the autocheck tool C30
(see SystemDesignBibExport), which alerts cataloguers automatically
when problems are detected.
4. Importing JHEP publication references and OA fulltext files
For SISSA/IOP, all JHEP and JINST articles are imported with the uploader
tool (C4, aka SystemDesignBibConvert). The fulltext file is attached
for all OA articles; the link is "SISSA/IOP Open Access article".
This workflow is representative of all imports that use the
uploader tool (C4).
5. Importing from publishers: Elsevier, APS, IEEE
In the past the uploader tool (C4) was often used to import
publication references (e.g. from arXiv). Now the process usually
starts at the publishers' web sites and is highly manual
(screen-scraping).
For the publishers Elsevier, APS and IEEE, we search the publishers'
sites for "affiliation CERN" and the current year across all the
journals from these publishers. We input new articles and update what
we already have in CDS with the publication references. After that we
attach the file of the publication for CERN-affiliated articles
published by APS and IEEE (authorisation for CERN papers); the link is
"APS Published version, local copy" or "IEEE Published version, local
copy". Other metadata are updated at the same time; e.g. there is no
affiliation in APS, so we need to add it.
We harvest the publishers four times a year. We have to eliminate
manually what was already done before, but since we always run the
same search with the same sorting ("most recent first"), this is not
too difficult. We harvest and input manually, which is more reliable
than uploader matching because of problems with LaTeX and formulas.
Once per year, usually in January, we check all the articles for the
previous year, to be sure nothing was missed.
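The elimination of already handled articles, and likewise the January check, amounts to filtering a publisher's result list against what has been processed so far. The step is done by hand today; the sketch below only illustrates the logic, with a hypothetical function name, record layout and identifiers.

```python
def not_yet_processed(results, processed_ids):
    """Keep the results not handled in an earlier pass.

    The order of the publisher's listing ("most recent first") is preserved.
    """
    return [result for result in results if result["id"] not in processed_ids]


# Made-up search results, most recent first.
results = [
    {"id": "pub-0004", "title": "Hypothetical article D"},
    {"id": "pub-0003", "title": "Hypothetical article C"},
    {"id": "pub-0002", "title": "Hypothetical article B"},
    {"id": "pub-0001", "title": "Hypothetical article A"},
]
processed = {"pub-0001", "pub-0002"}

todo = not_yet_processed(results, processed)
print([result["id"] for result in todo])
```

For the yearly check, the same filter would be applied to the full result list of the previous year; an empty outcome means nothing was missed.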
6. Importing conferences via uploader: JACoW, AIP, etc
Many conference proceedings are imported by using the same uploader
tool (C4) as for the JHEP publication references and fulltext files
described above, so it is not necessary to present the workflow here.
For example, conferences published by JACoW are imported via a
dedicated C4 configuration that handles the "Open Archive Format", and
all EPAC, PAC, DIPAC, FEL, ICALEPCS, etc. contributions are taken.
As another example, for electronic conference proceedings published by
AIP, the Library Physicist makes a selection by subject, and the
uploader tool is then used to import them.
7. Importing conferences via hard copy: BEAUTY 2006, etc
For conference proceedings that we receive in hard copy, such as
BEAUTY 2006, a manual process is needed. We look for contributions
with affiliation CERN and input new CERN articles, and we update all
contributions attached to this conference that we already have (CERN
or not CERN).
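The rule applied to each contribution of a hard-copy proceedings volume can be written down as a small decision function. This is a sketch of the manual rule only: the function name, the record fields and the sample data are hypothetical, and the "skip" outcome for non-CERN contributions that are not yet in CDS is our reading of the rule rather than something stated explicitly.

```python
def action_for(contribution, known_ids):
    """Decide what to do with one contribution of the proceedings."""
    if contribution["id"] in known_ids:
        return "update"  # already in CDS: update it, CERN or not
    if contribution["cern_affiliated"]:
        return "input"   # new CERN article: catalogue it
    return "skip"        # new non-CERN contribution: not catalogued


# Made-up contributions, for illustration only.
known_ids = {"talk-01", "talk-02"}
contributions = [
    {"id": "talk-01", "cern_affiliated": True},
    {"id": "talk-02", "cern_affiliated": False},
    {"id": "talk-03", "cern_affiliated": True},
    {"id": "talk-04", "cern_affiliated": False},
]
decisions = {c["id"]: action_for(c, known_ids) for c in contributions}
print(decisions)
```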
8. Inputting authors and affiliations with Giva
For collaboration papers written by many authors, the following
procedure is used to extract the author names together with
their affiliations: