How to guides for INSPIRE
This is a small collection of operational HOW-TOs for INSPIRE. It is a work in progress; many more guides are available in the INSPIRE Admin logbooks in the INSPIRE RT Admin queue.
General
How to load big (> 1000) record updates to INSPIRE
When the number of records to load into INSPIRE reaches the thousands, the system may suffer major latency in indexing and re-formatting the new updates. This walkthrough is a reference for the steps needed to make updating many records on INSPIRE less painful (for both ourselves and our users).
Using BibUpload -n parameter (--notimechange) for record updates
Many of the daemons in INSPIRE look at the latest modification date of each record to determine whether something is out of date and needs re-processing. This can turn bad when thousands of records are re-processed after a big record update that touched only some parts of each record rather than the whole record.
BibUpload has a CLI parameter, -n (--notimechange), that forces BibUpload not to update the records' modification date.
$ bibupload --correct -n huge_update.xml
WARNING: Using this functionality leaves it up to the user to manually update the necessary indexes and format caches
For example:
If you update the references of a record, launch a bibindex job to index the records:
$ bibindex -w references -i id,id,id
See this page for available index names:
https://inspirehep.net/admin/bibindex/bibindexadmin.py?ln=en
And possibly also a bibreformat job to update the hdref cache:
$ bibreformat -o HDREF -N HDREF -i id,id,id
The other cached formats which need updating (if affected) are HA (very brief format?), HB (brief format) and HX (BibTeX). To avoid conflicts with the regular bibsched jobs, it is best to run each format individually and use the same name as the regular job, e.g.
sudo -u apache /opt/cds-invenio/bin/bibreformat -o HX -N HX -i <recidlist>
sudo -u apache /opt/cds-invenio/bin/bibreformat -o HB -N HB -i <recidlist>
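For a large update, typing the record ID list by hand is impractical. The following is a minimal sketch, not part of the INSPIRE tooling, that reads the record IDs from the 001 controlfields of the uploaded MARCXML and prints the follow-up commands described above. The function names and the sample XML are hypothetical; the command-line flags are only those already shown in this guide.

```python
import xml.etree.ElementTree as ET


def recids_from_marcxml(xml_text):
    """Return the record IDs found in controlfield 001, in document order."""
    root = ET.fromstring(xml_text)
    recids = []
    for elem in root.iter():
        # Compare local names so this works with or without the MARC21 namespace.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name == 'controlfield' and elem.get('tag') == '001':
            recids.append(int(elem.text.strip()))
    return recids


def followup_commands(recids, index='references', formats=('HDREF', 'HB', 'HX')):
    """Build the bibindex and bibreformat command lines for the given records."""
    id_list = ','.join(str(recid) for recid in recids)
    commands = ['bibindex -w %s -i %s' % (index, id_list)]
    for fmt in formats:
        # One job per format, named after the format, as recommended above.
        commands.append('bibreformat -o %s -N %s -i %s' % (fmt, fmt, id_list))
    return commands


# Hypothetical two-record update, standing in for huge_update.xml.
sample = """<collection>
  <record><controlfield tag="001">860907</controlfield></record>
  <record><controlfield tag="001">860908</controlfield></record>
</collection>"""

for command in followup_commands(recids_from_marcxml(sample)):
    print(command)
```

Which index and which formats are actually affected depends on the fields the update touched, so the defaults here are only placeholders.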
Finding the right time for big updates
When uploading huge updates, it is wise to choose an appropriate time in order to avoid long queues of record updates that might frustrate curators. Since INSPIRE is a worldwide service, it can be difficult to find a window for big updates that does not slow down the ingestion of curator updates (BibEdit). One possible window is between 22:00 CET and 04:00 CET (before the arXiv updates).
There might be possibilities in the future to automatically restrict (or re-schedule) big upload jobs to this time-window.
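Until such a restriction exists, a wrapper script could check the clock before submitting a big job. This is a minimal sketch of that check, assuming the example window of 22:00 to 04:00 CET given above; the function name is hypothetical and the caller is expected to pass the current time already converted to CET.

```python
from datetime import time

WINDOW_START = time(22, 0)  # 22:00 CET, taken from the example window above
WINDOW_END = time(4, 0)     # 04:00 CET


def in_upload_window(now, start=WINDOW_START, end=WINDOW_END):
    """Return True if `now` (a datetime.time in CET) falls inside the window."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, so it is the union of two ranges.
    return now >= start or now < end


print(in_upload_window(time(23, 30)))
print(in_upload_window(time(12, 0)))
```

The end of the window is treated as exclusive, so a job checked at exactly 04:00 would be held back.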
Big Uploads affecting citation counts
Big uploads correcting publication notes, report numbers, ISBNs or DOIs might stop bibrank because of citation losses. In that case it is better to proactively run:
$ bibrank --disable-citation-losses-check -i id,id,id
How to upload Springer PDF full-texts from AFS
Springer PDF full-texts are located in the /afs/cern.ch/project/inspire/springer/ folder on AFS.
Inside this folder there are these folders:
already_uploaded
new_uploads
not_uploaded
The not_uploaded folder is for files that have not yet been matched with an INSPIRE record.
All in all, this gives the following workflow:
1. New full-texts are placed in ..springer/new_uploads/ using the already defined folder and name conventions.
2. Run the script on this folder in the following way:
$ cd /afs/cern.ch/project/inspire/springer/
$ python springer_import.py /afs/cern.ch/project/inspire/springer/new_uploads > upload_marcxml.xml
This script will leave you with 3 files:
/afs/cern.ch/project/inspire/springer/new_uploads_ambig_recordsQ9oKvL.dat
/afs/cern.ch/project/inspire/springer/new_uploads_missing_recordsMY9vTd.dat
/afs/cern.ch/project/inspire/springer/upload_marcxml.xml
3. After running the script, all the full-texts that did not match (listed in the missing_records* file) should be moved from ..springer/new_uploads/ to ..springer/not_uploaded/
4. Once the MARCXML is well formed and the full-texts have been uploaded to PROD, the remaining full-text files inside ..springer/new_uploads/ can be moved to ..springer/already_uploaded/
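Step 3 of the workflow above can be scripted. The sketch below is only an illustration: it ASSUMES the missing_records* file lists one file name per line, which this guide does not confirm, so check the real layout of the .dat file before using anything like it. The function name is hypothetical.

```python
import os
import shutil


def move_unmatched(missing_list_path, new_uploads_dir, not_uploaded_dir):
    """Move the files named in the missing-records list; return the names moved."""
    with open(missing_list_path) as listing:
        # ASSUMPTION: one file name per line; the real .dat layout may differ.
        names = [line.strip() for line in listing if line.strip()]
    moved = []
    for name in names:
        filename = os.path.basename(name)
        source = os.path.join(new_uploads_dir, filename)
        # Skip entries that are not present instead of failing half-way.
        if os.path.isfile(source):
            shutil.move(source, os.path.join(not_uploaded_dir, filename))
            moved.append(filename)
    return moved
```

It could be called with the missing_records* file and the new_uploads and not_uploaded folders as arguments; files not named in the list stay in new_uploads for step 4.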
How to fix erroneous plots
This guide tells you how to update plots attached to INSPIRE records when they contain errors (such as a corrupted image).
(TODO: Update this guide when original files are also attached to INSPIRE records.)
Using invenio.plotextractor:
The simplest way is to use the plotextractor:
1. You can use your locally installed Invenio to retrieve and convert the plots:
$ plotextractor -a "arXiv:1006.3575" -l "http://inspirehep.net"
2. Then do a search/replace in the resulting xml file to fix the file paths (as we need to upload these files to the public AFS space). One good location is /afs/cern.ch/project/inspire/uploads/
3. Move images and XML to /afs/cern.ch/project/inspire/uploads/
4. Launch a bibupload --correct job with the corrected XML
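The search/replace in step 2 can be done in any editor; as a minimal sketch it could also look like the following. The local directory /tmp/plots is a hypothetical example of where plotextractor wrote the images, and the function name is not part of Invenio.

```python
UPLOAD_DIR = '/afs/cern.ch/project/inspire/uploads/'


def repoint_paths(xml_text, local_dir, upload_dir=UPLOAD_DIR):
    """Replace the local directory prefix with the upload directory.

    Returns the rewritten text and the number of paths that were changed.
    """
    old_prefix = local_dir.rstrip('/') + '/'
    new_prefix = upload_dir.rstrip('/') + '/'
    return xml_text.replace(old_prefix, new_prefix), xml_text.count(old_prefix)


# Hypothetical fragment of the plotextractor output.
sample = '<subfield code="a">/tmp/plots/fig1.png</subfield>'
fixed, replaced = repoint_paths(sample, '/tmp/plots')
print(fixed)
print(replaced)
```

This is a plain prefix swap, so any sub-folders below the local directory are kept in the new paths and the images must be moved with the same layout.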
Using BibDocfile:
1. Download original images from the arXiv record:
http://arxiv.org/e-print/1007.1727
2. (Prerequisite: ImageMagick installed) Extract images into a local folder and run:
$ convert image.eps image.png
Important: Do not run this on the server nodes, but on a newer installation of Linux.
3. Move images to /afs/cern.ch/project/inspire/uploads/
4. Log into PROD and run:
$ sudo -u apache /opt/cds-invenio/bin/bibdocfile -r RECID --get-info | grep '.png:name='
RECID:DOCID:VERSION:FORMAT:KEY=VALUE
------------------------------------------------------------
860907:495246:1:.png:name=figs_Sensitivity
860907:495249:1:.png:name=figs_medsig
860907:495245:1:.png:name=figs_mu_up_dist
860907:495244:1:.png:name=figs_pseudo_exp
860907:495248:1:.png:name=figs_q01
860907:495251:1:.png:name=figs_qmu_up+1_dist
860907:495247:1:.png:name=figs_qmu_up_dist
860907:495250:1:.png:name=figs_qtevDist
5. Identify the affected images and use the docname to revise each image:
$ sudo -u apache /opt/cds-invenio/bin/bibdocfile --revise /afs/cern.ch/project/inspire/uploads/figs_pseudo_exp.png -r 860907 --with-docname='figs_pseudo_exp' --set-doctype=Plot
Which creates the following MARCXML:
<record>
<controlfield tag="001">860907</controlfield>
<datafield tag ="FFT" ind1=" " ind2=" ">
<subfield code="a">/afs/cern.ch/project/inspire/uploads/figs_pseudo_exp.png</subfield>
<subfield code="n">figs_pseudo_exp</subfield>
<subfield code="f">.png</subfield>
<subfield code="t">Plot</subfield>
<subfield code="d">KEEP-OLD-VALUE</subfield>
<subfield code="z">KEEP-OLD-VALUE</subfield>
<subfield code="r">KEEP-OLD-VALUE</subfield>
</datafield>
</record>
6. Put all the required MARCXML changes into one file and upload it using BibUpload:
sudo -u apache /opt/cds-invenio/bin/bibupload -c input.xml -N FFT -S2
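When several plots of one record need revising, steps 4 to 6 can be partly scripted. The sketch below is an illustration only: it parses the `--get-info` lines in the RECID:DOCID:VERSION:FORMAT:KEY=VALUE layout shown in step 4 and builds FFT records shaped like the MARCXML in step 5. Both function names are hypothetical and nothing here calls Invenio itself.

```python
from xml.sax.saxutils import escape


def parse_docnames(get_info_output, fmt='.png'):
    """Return {docname: docid} for the 'name=' lines of the given format."""
    docnames = {}
    for line in get_info_output.splitlines():
        parts = line.strip().split(':', 4)
        if len(parts) != 5 or not parts[0].isdigit():
            continue  # skip the header and the separator line
        recid, docid, version, file_format, key_value = parts
        if file_format == fmt and key_value.startswith('name='):
            docnames[key_value[len('name='):]] = int(docid)
    return docnames


def fft_record(recid, docname, path, doctype='Plot', fmt='.png'):
    """Build one revision record with the same subfields as in step 5."""
    subfields = [('a', path), ('n', docname), ('f', fmt), ('t', doctype),
                 ('d', 'KEEP-OLD-VALUE'), ('z', 'KEEP-OLD-VALUE'),
                 ('r', 'KEEP-OLD-VALUE')]
    lines = ['<record>',
             '<controlfield tag="001">%d</controlfield>' % recid,
             '<datafield tag="FFT" ind1=" " ind2=" ">']
    lines += ['<subfield code="%s">%s</subfield>' % (code, escape(value))
              for code, value in subfields]
    lines += ['</datafield>', '</record>']
    return '\n'.join(lines)


listing = """RECID:DOCID:VERSION:FORMAT:KEY=VALUE
------------------------------------------------------------
860907:495244:1:.png:name=figs_pseudo_exp
860907:495248:1:.png:name=figs_q01"""

for name in sorted(parse_docnames(listing)):
    print(fft_record(860907, name,
                     '/afs/cern.ch/project/inspire/uploads/%s.png' % name))
```

The printed records still need to be wrapped in a single collection element and checked by hand before they are uploaded.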
How to edit webdocs
Note: this is intended for use by people without shell access to the worker nodes, non-developers, etc. for corrections and content additions without requiring sysadmin support.
The automatic webdoc styling with Inspire header/footer only applies
to files with .webdoc extension below /webdoc, in particular info
material in /opt/cds-invenio/lib/webdoc/invenio/info/ . A symlink to a
standalone .html file works for the header/footer styling, but it
generates slightly incorrect html (redundant header sections, etc.),
so preferably this shortcut is not used.
A file must already exist before it can be edited; you cannot create a new
file with the webdoc edit interface.
- So the first step is to scp a stub file into place under /afs/cern.ch/project/inspire/info or create one with a regular editor
- Log into your inspire account http://inspirehep.net/youraccount/login
- Under admin tasks at the bottom click on "Run Info Space Manager" https://inspirehep.net/info/manage?ln=en
- Navigate to the correct location, see image below.
- Use the built-in editor (scroll to the bottom) to make changes
- Click "Save changes", or simply leave the page. There is no "Cancel" button (I had asked Javier to add one), so you just have to navigate away from the page to abort changes
_Note: webdoc/ space does not provide directory indices, so a URL to an
info/ directory will give a 404. All links to webdoc space need to point
to actual files._
The content of /info is periodically autocommitted to a git repository
(several times a day), so we have complete version history of all
files should we need to revert, etc. Using the repo requires sysadmin action, though.
Thorsten S., 2013.01.16
Operations
How to load a PROD DB dump on your local machine or server
1. If you already have Invenio installed, the simplest way to load a dump is to load it directly using dbexec:
$ zcat dumpfile.sql.gz | sudo -u apache /opt/invenio/bin/dbexec
or in case the dumpfile is not compressed:
sudo -u apache /opt/invenio/bin/dbexec < dumpfile.sql
However, this may cause issues with duplicate keys. An alternative method is:
$ sudo -u apache /opt/invenio/bin/dbexec -i
(for the interactive shell)
mysql> set foreign_key_checks = 0;
mysql> source dumpfile.sql
mysql> set foreign_key_checks = 1;
With the latest dump, this will probably take a few hours.
2. Clean up attached files:
Attached files will no longer be in sync if the dump originated from another instance. To remove all attached files and references to them:
$ rm -rf /opt/invenio/var/data/files/*
$ sudo -u apache /opt/invenio/bin/dbexec -i
mysql> TRUNCATE bibrec_bibdoc;
mysql> TRUNCATE bibdoc;
3. Clean up formats
Cached formats sometimes reference the site the dump originated from, unless relative links are used. INSPIRE production servers use relative links, so for INSPIRE production dumps this step can be skipped.
mysql> DELETE FROM bibfmt WHERE format LIKE 'h_';
4. Final cleanup:
Remove old sessions:
mysql> TRUNCATE session;
Then you can delete all scheduled BibSched tasks such as dbdump, bibrank, oaiharvest, refextract etc. from BibSched.
For good measure, it is wise to refresh page caches, and restart apache.
$ sudo -u apache /opt/invenio/bin/webcoll -u admin -f
$ sudo /etc/init.d/httpd restart
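The SQL statements from steps 1 to 4 can be collected into one script so that none of them is forgotten. This is a minimal sketch with a hypothetical function name. It only assembles text and is meant to be reviewed and then pasted into the interactive dbexec shell. It does not remove the attached files on disk, and it does not touch BibSched, Apache or the page caches.

```python
def build_load_script(dumpfile, truncate_files=True, clear_formats=False):
    """Return the SQL from the steps above as one newline-separated script."""
    statements = [
        'SET foreign_key_checks = 0;',
        'source %s' % dumpfile,        # mysql client command, as in step 1
        'SET foreign_key_checks = 1;',
    ]
    if truncate_files:
        # Step 2: drop references to attached files from another instance.
        statements += ['TRUNCATE bibrec_bibdoc;', 'TRUNCATE bibdoc;']
    if clear_formats:
        # Step 3: only needed when the cached formats use absolute links.
        statements.append("DELETE FROM bibfmt WHERE format LIKE 'h_';")
    # Step 4: remove old sessions.
    statements.append('TRUNCATE session;')
    return '\n'.join(statements) + '\n'


print(build_load_script('dumpfile.sql'))
```

The format cleanup is off by default because, as noted in step 3, it can be skipped for INSPIRE production dumps.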
How to update BibDoc doctype (to INSPIRE-PUBLIC)
On INSPIRE we use the BibDoc variable doctype in a special way to restrict access. Any public file needs to be of the INSPIRE-PUBLIC doctype. Sometimes a file is uploaded with the wrong or default doctype. Here is how to change the doctype of existing documents from `Main' or `' to `INSPIRE-PUBLIC':
(Python script courtesy of Tibor Simko. Run under the apache identity.)
import os

from invenio.bibdocfile import BibDoc
from invenio.dbquery import run_sql

atypes = ('Main', '')     # initial doctypes to be changed from
btype = 'INSPIRE-PUBLIC'  # final doctype value to be changed into

for atype in atypes:
    res = run_sql("SELECT id_bibrec,id_bibdoc FROM bibrec_bibdoc WHERE type=%s",
                  (atype,))
    for row in res:
        id_bibrec, id_bibdoc = row
        abibdoc = BibDoc(recid=id_bibrec, docid=id_bibdoc)
        abibdoc_type_pathname = os.path.join(abibdoc.get_base_dir(), '.type')
        # update DB value:
        run_sql("""UPDATE bibrec_bibdoc SET type=%s WHERE type=%s
                   AND id_bibrec=%s AND id_bibdoc=%s""",
                (btype, atype, id_bibrec, id_bibdoc))
        # update file value:
        fdesc = open(abibdoc_type_pathname, 'w')
        fdesc.write(btype)
        fdesc.close()
        # print info
        print("I: updated record %s file %s" % (id_bibrec, abibdoc_type_pathname))
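Before running a script that rewrites doctypes, it may help to see how many documents it would touch. The following is a minimal sketch with a hypothetical function name. It takes the query function as an argument so that it can be tried without Invenio; on a real installation one would pass run_sql from invenio.dbquery instead of the stand-in.

```python
def count_by_doctype(run_sql, doctypes=('Main', '')):
    """Return {doctype: number of bibrec_bibdoc rows with that type}."""
    counts = {}
    for doctype in doctypes:
        rows = run_sql("SELECT COUNT(*) FROM bibrec_bibdoc WHERE type=%s",
                       (doctype,))
        counts[doctype] = int(rows[0][0])
    return counts


def fake_run_sql(query, params):
    # Stand-in for invenio.dbquery.run_sql so the sketch runs without Invenio.
    return ((3,),) if params[0] == 'Main' else ((0,),)


print(count_by_doctype(fake_run_sql))
```

The stand-in mimics the tuple-of-rows result that the script above iterates over; the numbers it returns are made up.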
How to re-configure HAProxy
In order to re-configure HAProxy (for example to put a node out of rotation), edit haproxy.cfg (/etc/haproxy/haproxy.cfg) on the proxy node (PCUDSSW1503). After making the necessary changes to the haproxy.cfg file, you can reload it using the following command:
sudo /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
Note that this actually kills the connections to HAProxy.
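Since a reload with a broken configuration affects all traffic through the proxy, it can be worth validating the file first. The sketch below only builds the two command lines. The check step uses HAProxy's -c (check mode) flag, which is an addition not mentioned in this guide, and the function name is hypothetical.

```python
def haproxy_commands(cfg='/etc/haproxy/haproxy.cfg',
                     pidfile='/var/run/haproxy.pid', old_pids=()):
    """Return (check_command, reload_command) as argument lists."""
    # -c only parses the configuration and reports errors.
    check = ['/usr/sbin/haproxy', '-c', '-f', cfg]
    reload_cmd = ['/usr/sbin/haproxy', '-f', cfg, '-p', pidfile]
    if old_pids:
        # Same -sf hand-over as in the reload command above.
        reload_cmd += ['-sf'] + [str(pid) for pid in old_pids]
    return check, reload_cmd


check, reload_cmd = haproxy_commands(old_pids=[1234])  # 1234 is a made-up PID
print(' '.join(check))
print(' '.join(reload_cmd))
```

Both commands would be run under sudo, for instance with subprocess.check_call, and the reload would only be issued if the check exits with status 0.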
How to unbalance a node in HAProxy
This is now done using our Fabric script from the inspire-scripts repository, which connects to the HAProxy socket to put nodes in maintenance mode.
$ fab disable:prod2
$ fab enable:prod2
--
JanLavik - 28-Sep-2012