How-to guides for INSPIRE

This is a small collection of operational HOW-TOs for INSPIRE. It is a work in progress; many more guides are available in the INSPIRE Admin logbooks in the INSPIRE RT Admin queue.

General

How to load big (> 1000) record updates to INSPIRE

When the number of records to load into INSPIRE reaches the thousands, the system may suffer major latency in indexing and re-formatting the new updates. This walkthrough is therefore a reference for the steps needed to make updating many records on INSPIRE less painful (for both ourselves and our users).

Using BibUpload -n parameter (--notimechange) for record updates

Many of the daemons in INSPIRE look at the latest modification date of each record to determine whether something is out of date and needs re-processing. This can turn bad when thousands of records are re-processed after a big record update in which only some parts of each record were touched, not the whole record.

BibUpload has a CLI parameter, -n, that forces BibUpload not to update the records' modification date.

$ bibupload --correct -n huge_update.xml

WARNING: Using this functionality leaves it up to the user to manually update the necessary indexes and format caches

For example:

If you update the references of a record, launch a bibindex job to index the records: $ bibindex -w references -i id,id,id

See this page for available index names: https://inspirehep.net/admin/bibindex/bibindexadmin.py?ln=en

And possibly also a bibreformat job to update the hdref cache: $ bibreformat -o HDREF -N HDREF -i id,id,id

The other cached formats that need updating (if affected) are HA (very brief format?), HB (brief format) and HX (BibTeX). To avoid conflicts with the regular bibsched jobs, it is best to run each format individually and use the same name as the regular job, e.g.

 sudo -u apache /opt/cds-invenio/bin/bibreformat -o HX -N HX -i <recidlist>
 sudo -u apache /opt/cds-invenio/bin/bibreformat -o HB -N HB -i <recidlist>
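
The manual follow-up after a -n upload can be sketched as a small command builder. This is only an illustration: the function names (recid_arg, followup_commands) are hypothetical, it uses only the flags shown above, and it prints the commands rather than running them.

```python
# Hypothetical helper: builds the manual follow-up commands for a -n upload.
# Only the flags shown in this guide are used; nothing is executed here.


def recid_arg(recids):
    """Join record IDs into the comma-separated form passed to -i."""
    return ",".join(str(recid) for recid in sorted(set(recids)))


def followup_commands(recids, index="references", formats=("HDREF",)):
    """Return one bibindex command plus one bibreformat command per format."""
    ids = recid_arg(recids)
    commands = ["bibindex -w %s -i %s" % (index, ids)]
    for fmt in formats:
        # The job name (-N) matches the format, as the regular jobs do.
        commands.append("bibreformat -o %s -N %s -i %s" % (fmt, fmt, ids))
    return commands


if __name__ == "__main__":
    for command in followup_commands([12, 7, 12], formats=("HDREF", "HB", "HX")):
        print(command)
```

Generating the commands from one list of record IDs keeps the index and format jobs in step with the records that were actually uploaded.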

Finding the right time for big updates

When uploading huge updates, it is wise to choose an appropriate time, in order to avoid long queues of record updates that might frustrate curators. Since INSPIRE is a worldwide service, it can be difficult to find a window of opportunity for big updates that slow down the ingestion of curator updates (BibEdit). One possible window is between 22:00 CET and 04:00 CET (before the arXiv updates).

In the future it may be possible to automatically restrict (or re-schedule) big upload jobs to this time window.
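
Until then, a wrapper script could check the window itself before submitting a big upload. The sketch below is not part of INSPIRE: the function name in_upload_window is hypothetical, and the caller is assumed to pass a time that is already in CET.

```python
from datetime import time

# Window taken from the example in this guide: 22:00 to 04:00 CET.
WINDOW_START = time(22, 0)
WINDOW_END = time(4, 0)


def in_upload_window(now, start=WINDOW_START, end=WINDOW_END):
    """Return True if `now` (a datetime.time, already in CET) is in the window.

    The window wraps past midnight, so it is the union of
    [start, 24:00) and [00:00, end).
    """
    if start <= end:
        # Ordinary window that does not cross midnight.
        return start <= now < end
    return now >= start or now < end
```

The end of the window is treated as exclusive, so an upload attempted at exactly 04:00 would be refused.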

Big Uploads affecting citation counts

Big uploads correcting publication notes, report numbers, ISBNs or DOIs might stop bibrank because of citation losses. In that case it is better to proactively run $ bibrank --disable-citation-losses-check -i id,id,id

How to upload Springer PDF full-texts from AFS

Springer PDF full-texts are located in the /afs/cern.ch/project/inspire/springer/ folder on AFS.

Inside this folder there are these folders:

already_uploaded
new_uploads
not_uploaded

The not_uploaded folder is for files that have not yet been matched with an INSPIRE record.

All in all, this will create the following workflow:

1. New full-texts are placed in ..springer/new_uploads/ using already defined folder and name conventions

2. Run the script on this folder. First change into the Springer folder:

$ cd /afs/cern.ch/project/inspire/springer/

Then run the script:

$ python springer_import.py  /afs/cern.ch/project/inspire/springer/new_uploads > upload_marcxml.xml

This script will leave you with 3 files:

/afs/cern.ch/project/inspire/springer/new_uploads_ambig_recordsQ9oKvL.dat
/afs/cern.ch/project/inspire/springer/new_uploads_missing_recordsMY9vTd.dat
/afs/cern.ch/project/inspire/springer/upload_marcxml.xml

3. After running the script, all full-texts that did not match a record (listed in the missing_records* file) should be moved from ..springer/new_uploads/ to ..springer/not_uploaded/

4. Once the MARCXML is well formed and the full-texts have been uploaded to PROD, the remaining full-text files inside ..springer/new_uploads/ can be moved to ..springer/already_uploaded/
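
Step 3 of this workflow could be scripted roughly as follows. This is a sketch under assumptions this guide does not confirm: the missing_records* file is assumed to list one file name (or path) per line, and the full-texts are assumed to sit directly inside new_uploads rather than in subfolders. The function name move_unmatched is hypothetical.

```python
import os
import shutil


def move_unmatched(missing_list_path, new_dir, not_uploaded_dir):
    """Move every file named in the missing-records list out of new_dir.

    ASSUMPTION: the list has one file name (or path) per line, and only
    the base name is used to find the file in new_dir.
    Returns the names that were actually moved.
    """
    moved = []
    with open(missing_list_path) as handle:
        for line in handle:
            name = os.path.basename(line.strip())
            if not name:
                continue  # skip blank lines
            source = os.path.join(new_dir, name)
            if os.path.isfile(source):
                shutil.move(source, os.path.join(not_uploaded_dir, name))
                moved.append(name)
    return moved
```

Names in the list that have no file in new_dir are skipped silently, so the returned list is worth comparing against the input before moving on to step 4.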

How to fix erroneous plots

This guide tells you how to update plots attached to INSPIRE records that contain errors (such as a corrupted image). (TODO: Update this guide when original files are also attached to INSPIRE records.)

Using invenio.plotextractor:

The simplest way is to use the plotextractor:

1. You can use your locally installed Invenio to retrieve and convert the plots:

$ plotextractor -a "arXiv:1006.3575" -l "http://inspirehep.net"

2. Then do a search/replace in the resulting XML file to fix the file paths (as we need to upload these files to the public AFS space). One good location is /afs/cern.ch/project/inspire/uploads/

3. Move images and XML to /afs/cern.ch/project/inspire/uploads/

4. Launch a bibupload --correct job with the corrected XML
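
The search/replace in step 2 could also be done with a short script instead of by hand. The sketch below assumes the plotextractor output is un-namespaced MARCXML in which each FFT field carries the local file path in subfield a; check the actual file before relying on it. The function name repoint_fft_paths is hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET

UPLOADS_DIR = "/afs/cern.ch/project/inspire/uploads/"


def repoint_fft_paths(xml_text, target_dir=UPLOADS_DIR):
    """Point every FFT subfield 'a' path at target_dir, keeping the file name.

    ASSUMPTION: un-namespaced MARCXML with the file path in FFT $a.
    """
    root = ET.fromstring(xml_text)
    for field in root.iter("datafield"):
        if field.get("tag") != "FFT":
            continue
        for sub in field.findall("subfield"):
            if sub.get("code") == "a" and sub.text:
                name = posixpath.basename(sub.text.strip())
                sub.text = posixpath.join(target_dir, name)
    return ET.tostring(root, encoding="unicode")
```

Only the directory part of each path changes, so the images still need to be moved to the same location, as in step 3.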

Using BibDocfile:

1. Download original images from the arXiv record: http://arxiv.org/e-print/1007.1727

2. (Prerequisite: ImageMagick installed) Extract images into a local folder and run:

$ convert image.eps image.png

Important: Do not run this on the server nodes, but on a newer installation of Linux.

3. Move images to /afs/cern.ch/project/inspire/uploads/

4. Log into PROD and run:

$ sudo -u apache /opt/cds-invenio/bin/bibdocfile -r RECID --get-info | grep '.png:name='

RECID:DOCID:VERSION:FORMAT:KEY=VALUE
------------------------------------------------------------
860907:495246:1:.png:name=figs_Sensitivity
860907:495249:1:.png:name=figs_medsig
860907:495245:1:.png:name=figs_mu_up_dist
860907:495244:1:.png:name=figs_pseudo_exp
860907:495248:1:.png:name=figs_q01
860907:495251:1:.png:name=figs_qmu_up+1_dist
860907:495247:1:.png:name=figs_qmu_up_dist
860907:495250:1:.png:name=figs_qtevDist

5. Identify the affected images and use the docname to revise the image:

$ sudo -u apache /opt/cds-invenio/bin/bibdocfile --revise /afs/cern.ch/project/inspire/uploads/figs_pseudo_exp.png -r 860907 --with-docname='figs_pseudo_exp' --set-doctype=Plot

Which creates the following MARCXML:

<record>
        <controlfield tag="001">860907</controlfield>
        <datafield tag="FFT" ind1=" " ind2=" ">
                <subfield code="a">/afs/cern.ch/project/inspire/uploads/figs_pseudo_exp.png</subfield>
                <subfield code="n">figs_pseudo_exp</subfield>
                <subfield code="f">.png</subfield>
                <subfield code="t">Plot</subfield>
                <subfield code="d">KEEP-OLD-VALUE</subfield>
                <subfield code="z">KEEP-OLD-VALUE</subfield>
                <subfield code="r">KEEP-OLD-VALUE</subfield>
        </datafield>
</record>

6. Put all the required MARCXML changes into one file and upload it using BibUpload:

$ sudo -u apache /opt/cds-invenio/bin/bibupload -c input.xml -N FFT -S2
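
When several plots of one record need revising, the input file for step 6 could be assembled with a script like the one below. It is a sketch, not the way bibdocfile itself produces the MARCXML: it parses the --get-info lines shown in step 4 and emits FFT fields with the same subfields as the example in step 5. The function names (parse_docnames, fft_record) are hypothetical, and the images are assumed to be named docname plus extension inside the uploads folder.

```python
import xml.etree.ElementTree as ET

UPLOADS_DIR = "/afs/cern.ch/project/inspire/uploads/"


def parse_docnames(get_info_output):
    """Return (recid, docname) pairs from RECID:DOCID:VERSION:FORMAT:name=... lines."""
    pairs = []
    for line in get_info_output.splitlines():
        parts = line.strip().split(":", 4)
        # Header and separator lines fail these checks and are skipped.
        if len(parts) == 5 and parts[0].isdigit() and parts[4].startswith("name="):
            pairs.append((parts[0], parts[4][len("name="):]))
    return pairs


def fft_record(recid, docnames, extension=".png", doctype="Plot"):
    """Build one <record> with an FFT field per docname to revise."""
    record = ET.Element("record")
    ET.SubElement(record, "controlfield", tag="001").text = str(recid)
    for docname in docnames:
        field = ET.SubElement(record, "datafield", tag="FFT", ind1=" ", ind2=" ")
        values = [
            ("a", UPLOADS_DIR + docname + extension),  # assumed file location
            ("n", docname),
            ("f", extension),
            ("t", doctype),
            ("d", "KEEP-OLD-VALUE"),
            ("z", "KEEP-OLD-VALUE"),
            ("r", "KEEP-OLD-VALUE"),
        ]
        for code, value in values:
            ET.SubElement(field, "subfield", code=code).text = value
    return ET.tostring(record, encoding="unicode")
```

For example, fft_record(860907, ["figs_pseudo_exp", "figs_q01"]) returns a single record element with two FFT fields, which can be saved as input.xml.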

How to edit webdocs

Note: this is intended for use by people without shell access to the worker nodes, non-developers, etc. for corrections and content additions without requiring sysadmin support.

The automatic webdoc styling with the INSPIRE header/footer only applies to files with a .webdoc extension below /webdoc, in particular the info material in /opt/cds-invenio/lib/webdoc/invenio/info/ . A symlink to a standalone .html file also gets the header/footer styling, but it generates slightly incorrect HTML (redundant header sections, etc.), so this shortcut is best avoided.

A file must already be present before it can be edited; you cannot create a new file with the webdoc edit interface.

  1. First, scp a stub file into place under /afs/cern.ch/project/inspire/info, or create one there with a regular editor
  2. Log into your INSPIRE account http://inspirehep.net/youraccount/login
  3. Under admin tasks at the bottom click on "Run Info Space Manager" https://inspirehep.net/info/manage?ln=en
  4. Navigate to the correct location, see image below.
  5. Use the built-in editor (scroll to the bottom) to make changes
  6. "Save changes" or simply leave the page. There is no "Cancel" button (I had asked Javier to add one), so to abort changes you just have to navigate away from the page.

info-space.png

_Note: the webdoc/ space does not provide directory indices, so a URL pointing to an info/ directory will give a 404. All links to the webdoc space need to point to actual files._

The content of /info is periodically autocommitted to a git repository (several times a day), so we have complete version history of all files should we need to revert, etc. Using the repo requires sysadmin action, though.

Thorsten S., 2013.01.16

Operations

How to load a PROD DB dump on your local machine or server

1. If you already have Invenio installed, the simplest way to load a dump is to pipe it directly into dbexec:

$ zcat dumpfile.sql.gz | sudo -u apache /opt/invenio/bin/dbexec

or in case the dumpfile is not compressed:

$ sudo -u apache /opt/invenio/bin/dbexec < dumpfile.sql

However, this may cause issues with duplicate keys. An alternative method is:

$ sudo -u apache /opt/invenio/bin/dbexec -i

(for the interactive shell)

mysql> set foreign_key_checks = 0;
mysql> source dumpfile.sql
mysql> set foreign_key_checks = 1;

With the latest dump this will probably take some hours.

2. Clean up attached files:

Attached files will no longer be in sync if the dump originated from another instance. To remove all attached files and references to them:

$ rm -rf /opt/invenio/var/data/files/*

$ sudo -u apache /opt/invenio/bin/dbexec -i

mysql> TRUNCATE bibrec_bibdoc;
mysql> TRUNCATE bibdoc;

3. Clean up formats

Cached formats sometimes reference the site the dump originated from, unless relative links are used. INSPIRE production servers do use relative links, so this step can be skipped for dumps from PROD. Otherwise:

mysql> DELETE FROM bibfmt WHERE format LIKE 'h_';

4. Final cleanup:

Remove old sessions:

mysql> TRUNCATE session;

Then you can delete all scheduled BibSched tasks such as dbdump, bibrank, oaiharvest, refextract etc. from BibSched.

For good measure, it is wise to refresh page caches, and restart apache.

$ sudo -u apache /opt/invenio/bin/webcoll -u admin -f
$ sudo /etc/init.d/httpd restart
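
The SQL from the steps above can be collected in one place so that nothing is forgotten when pasting into the interactive shell. This is only a sketch: the function and parameter names (load_statements, cleanup_statements, foreign_dump, relative_links) are hypothetical, and the statements are exactly those listed in this guide. Removing the files under /opt/invenio/var/data/files/ and the BibSched cleanup remain manual.

```python
def load_statements(dump_path):
    """Statements for loading a dump with foreign key checks disabled (step 1)."""
    return [
        "set foreign_key_checks = 0;",
        "source %s" % dump_path,
        "set foreign_key_checks = 1;",
    ]


def cleanup_statements(foreign_dump=True, relative_links=True):
    """Statements for steps 2 to 4, in the order given in this guide.

    foreign_dump: the dump came from another instance, so references
        to attached files are dropped.
    relative_links: cached formats use relative links (as on INSPIRE
        production), so the format cache is kept.
    """
    statements = []
    if foreign_dump:
        statements += ["TRUNCATE bibrec_bibdoc;", "TRUNCATE bibdoc;"]
    if not relative_links:
        statements.append("DELETE FROM bibfmt WHERE format LIKE 'h_';")
    statements.append("TRUNCATE session;")
    return statements


if __name__ == "__main__":
    for statement in load_statements("dumpfile.sql") + cleanup_statements():
        print(statement)
```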

How to update BibDoc doctype (to INSPIRE-PUBLIC)

On INSPIRE we use the BibDoc variable doctype in a special way to restrict access. Any public file needs to be of the INSPIRE-PUBLIC doctype. Sometimes a file is uploaded with the wrong or default doctype. Here is how to change the doctype of existing documents from `Main' or `' to `INSPIRE-PUBLIC':

(Python script courtesy of Tibor Simko. Run under apache identity)

import os
from invenio.bibdocfile import BibDoc
from invenio.dbquery import run_sql

atypes = ('Main', '')     # initial doctypes to be changed from
btype = 'INSPIRE-PUBLIC'  # final doctype value to be changed into

for atype in atypes:
    res = run_sql("SELECT id_bibrec,id_bibdoc FROM bibrec_bibdoc WHERE type=%s",
                  (atype,))
    for row in res:
        id_bibrec, id_bibdoc = row
        abibdoc = BibDoc(recid=id_bibrec, docid=id_bibdoc)
        abibdoc_type_pathname = os.path.join(abibdoc.get_base_dir(), '.type')
        # update DB value:
        run_sql("""UPDATE bibrec_bibdoc SET type=%s WHERE type=%s
                   AND id_bibrec=%s AND id_bibdoc=%s""",
                (btype, atype, id_bibrec, id_bibdoc))
        # update file value:
        fdesc = open(abibdoc_type_pathname, 'w')
        fdesc.write(btype)
        fdesc.close()
        # print info
        print("I: updated record %s file %s" % (id_bibrec, abibdoc_type_pathname))

How to re-configure HAProxy

To re-configure HAProxy (for example, to take a node out of rotation), edit haproxy.cfg (/etc/haproxy/haproxy.cfg) on the proxy node (PCUDSSW1503). After making the necessary changes, reload the configuration using the following command (as described here):

sudo /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)

Note that this actually kills the existing connections to HAProxy.

How to unbalance a node in HAProxy

This is now done using our Fabric script from the inspire-scripts repository, which connects to the HAProxy socket to put nodes in maintenance mode.

$ fab disable:prod2
$ fab enable:prod2
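
For orientation, talking to the HAProxy socket directly looks roughly like the sketch below. It is not the Fabric script from inspire-scripts. It relies on HAProxy's "disable server" and "enable server" socket commands, which need a stats socket configured with admin level. The backend name and socket path used in the usage comment are hypothetical and must be taken from haproxy.cfg.

```python
import socket


def server_command(action, backend, server):
    """Build an HAProxy socket command such as 'disable server <backend>/<server>'."""
    if action not in ("enable", "disable"):
        raise ValueError("action must be 'enable' or 'disable'")
    return "%s server %s/%s\n" % (action, backend, server)


def send_command(command, socket_path):
    """Send one command to the HAProxy admin socket and return its reply."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(socket_path)
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            # HAProxy closes the connection after answering one command.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks).decode("ascii", "replace")


# Usage on the proxy node (backend name and socket path are assumptions):
# send_command(server_command("disable", "inspire", "prod2"), "/var/run/haproxy.sock")
```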

-- JanLavik - 28-Sep-2012

Topic revision: r21 - 2017-06-22 - FlorianSchwennsen
 