INSPIRE Admin Page
The INSPIRE admins are the people who monitor the technical operations of the INSPIRE service. If you are an admin, read this entry in full to get a clear picture of your responsibilities.
The shift calendar is available on
SystemOperationsAdminRota
The INSPIRE admin support phone is:
+41 764879224. The mobile phone should be passed on to the next shift holder.
Servers
For a list of INSPIRE servers visit
https://twiki.cern.ch/twiki/bin/viewauth/CDS/CERNDocumentServerMachines#INSPIRE_servers_5
Duties
- Make sure to pick up the INSPIRE admin support cellphone from the last person on shift or from office 3-1-038. Keep it with you at all times, or alternatively forward it to your personal (Swiss) phone.
- Keep yourself up to date with the latest list of deployed patches and fixes specified in the RT Admin logbook.
- For emergency hot-fixes, or to get an idea of the code running live, see the
invenio-inspire-ops
repository for a snapshot of the current Invenio code running on the PROD servers. For INSPIRE-specific customizations such as configuration, formats, submission and styling, see the inspire
repo. The OPS team should have push access to these repositories in case anything needs to be patched.
- Check the Sentry instance
- Look for Emergencies and act accordingly
- Triage bugs and feature requests
- Take tickets to fix small bugs and formatting changes in the INSPIRE codebase
- Create Asana tickets (Rock Solid project) for Invenio fixes with a link to the Sentry exception, and liaise with the responsible Invenio developer (link to the Asana ticket from Sentry in a comment)
- Occasionally interact with Users to gain more information about a bug/request
- Monitor BibSched queue on production server, restart/manage if jobs are hung/failed. Try to keep the queue running at all times, unless there are serious errors or fixes needed.
- Note that you need to log on to INSPIRE production worker #1 as shown in the server list (as of this writing that is inspire05.cern.ch)
- That is the production INSPIRE instance, so please do not erase it, bring it down, etc.
- If the /opt partition is running out of space (a warning is sent to the RT Admin queue), inspect the /opt space of pcudssw1506 (PROD1). Cooperate with the OPS team to see what can be removed safely. Log example
Log example #2
(Manually running inveniogc might also help in this case.)
- Inspect attached file folders in
/opt/cds-invenio/var/data/files
to see if there are enough folders for further file attachment Log example
- Inspect the size of the various log files located in
/opt/cds-invenio/var/log
and rotate them if the file size is getting too big. (Integrating logrotate daemons may help with this.)
- Your last duty is to update the SystemOperationsAdminRota with your next shift and return the cellphone to office 3-1-014 (or next admin on duty).
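The disk space and log size checks in the list above can be partly scripted. The following is a minimal sketch, not an existing INSPIRE tool: the function names and the 500 MiB threshold are assumptions made for illustration, and only the /opt and /opt/cds-invenio/var/log paths come from this page.

```python
import os
import shutil


def free_fraction(path):
    """Return the fraction of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def large_files(directory, min_bytes):
    """Return (size, name) pairs for regular files in `directory`
    that are at least `min_bytes` big, largest first."""
    found = []
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
            if size >= min_bytes:
                found.append((size, entry.name))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    LOG_DIR = "/opt/cds-invenio/var/log"  # log directory named on this page
    THRESHOLD = 500 * 1024 * 1024         # assumed threshold: 500 MiB
    if os.path.isdir(LOG_DIR):
        print("free on /opt: {:.0%}".format(free_fraction("/opt")))
        for size, name in large_files(LOG_DIR, THRESHOLD):
            print("{:>8.0f} MiB  {}".format(size / 2**20, name))
```

The script only reports; deciding what to rotate or remove should still be agreed with the OPS team.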
Log your actions
Whenever you have intervened on the production servers or run commands on them, you should log your actions as a new ticket in the RT Admin queue (just send an email to
admin@inspirehep.net). Prefix the subject line of these tickets with
logbook:
or
INSPIRE logbook
. Explain the reasons for the action you had to perform and detail the exact commands used. These logs can be useful for solving recurring problems more quickly.
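To keep logbook entries consistent, the message can be assembled by a small helper. This is only a sketch: `build_logbook_entry` and its parameters are hypothetical names, the addresses in the usage example are placeholders to be replaced with your own address and the RT Admin queue address given above, and actually sending the message is left out.

```python
from email.message import EmailMessage


def build_logbook_entry(sender, queue_address, summary, reason, commands):
    """Build a logbook email: prefixed subject, reason, and exact commands."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = queue_address
    # The "logbook:" subject prefix is the convention described on this page.
    msg["Subject"] = "logbook: " + summary
    body = ["Reason:", reason, "", "Commands run:"]
    body.extend("  " + command for command in commands)
    msg.set_content("\n".join(body))
    return msg


if __name__ == "__main__":
    entry = build_logbook_entry(
        sender="operator@example.org",           # placeholder address
        queue_address="rt-queue@example.org",    # placeholder, use the RT Admin queue address
        summary="restarted hung bibindex task",  # illustrative summary
        reason="Task was hung and blocking the BibSched queue.",
        commands=["bibindex -f50000 -s5m"],
    )
    print(entry)
```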
Deployment
The installation/deployment scripts we use for PROD are located in Tibor's own invenio-devscripts repo:
https://github.com/tiborsimko/invenio-devscripts.git
Very soon a prototype Fabric script will be committed to the INSPIRE repo for everyone's use. For now a few assumptions have been made:
- Use local development machine to deploy from
- Use personal AFS space to host the code/branches (i.e. you push to your AFS repos from your devbox)
Important: Go carefully through each step of the recipe, as things may not be 100% correct. Some improvements to this script are needed, for example using cp with backup and handling new/deleted files.
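As an illustration of the "cp with backup" improvement mentioned above, here is a minimal sketch of a copy step that keeps the previous version of a file before overwriting it. It is not part of invenio-devscripts or of the deploy recipe; `copy_with_backup` and the `.bak-<timestamp>` suffix are assumptions made for this example.

```python
import os
import shutil
import time


def copy_with_backup(src, dst, suffix=None):
    """Copy `src` to `dst`. If `dst` already exists, first save a copy of
    it next to the original. Returns the backup path, or None."""
    backup = None
    if os.path.exists(dst):
        # Assumed naming scheme: <dst>.bak-YYYYmmddHHMMSS
        suffix = suffix or time.strftime(".bak-%Y%m%d%H%M%S")
        backup = dst + suffix
        shutil.copy2(dst, backup)
    shutil.copy2(src, dst)
    return backup
```

A backup made this way allows a bad hot-fix to be reverted by copying the backup file back over the deployed one.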
How to setup
Note: The script 'invenio-create-deploy-recipe' is the one we've been using to create the "deploy recipe" we use. This script is located here:
/afs/cern.ch/project/inspire/repo/invenio-create-deploy-recipe
What to do about something
- Check the Invenio error log files located in
/opt/cds-invenio/var/log
. Files of interest may be:
-
invenio.err
or invenio.log
- any Invenio exceptions caught are listed in these files. Beware that these files can be huge, so take care when choosing the tool to read them (e.g. use grep
or tail
)
-
bibsched.err
- Errors related to executed BibTasks. Each BibTask also has its own error and log files; look for the right task id for files prefixed bibsched_task_
-
bibsched.log
- This file contains the log of which BibTask was executed when, including start and stop times.
- Check CDS page
- Check INSPIRE Admin Guides
- When dealing with exceptions in RT look at the first one, and if it is a new issue, change the subject of the ticket and delete all others relating to that issue (Bulk Update/Update Multiple tickets -> Status-> Deleted)
- When there is an issue that requires more than simple admin changes, i.e. from web interface, or simple formatting etc, create a Trac ticket.
- If the ticket relates to anything in Invenio codebase (as opposed to INSPIRE overlay) make sure it goes in Trac. If unsure, assign it to the relevant person and they can determine correct status, or to Joe/Tibor/Jan.
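Since the error logs mentioned above can be huge, it helps to read only their tail. The sketch below does in Python roughly what tail does, reading the file backwards in blocks instead of loading it whole. `tail_lines` is a hypothetical helper name, and the file in the usage example is one of the logs named on this page.

```python
import os


def tail_lines(path, count=50, block_size=64 * 1024):
    """Return the last `count` lines of a file without reading all of it."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        position = handle.tell()
        data = b""
        # Read blocks from the end until more than `count` newlines are
        # buffered, so that the last `count` lines are complete.
        while position > 0 and data.count(b"\n") <= count:
            step = min(block_size, position)
            position -= step
            handle.seek(position)
            data = handle.read(step) + data
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines[-count:]


if __name__ == "__main__":
    LOG_FILE = "/opt/cds-invenio/var/log/invenio.err"
    if os.path.isfile(LOG_FILE):
        for line in tail_lines(LOG_FILE, count=20):
            print(line)
```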
BibSched and system monitoring
Currently we have
BibSched active on 3 nodes: PCUDSSW1506 (PROD #1) PCUDSSW1507 (PROD #2) and PCUDSSX1506 (PROD #3).
These should
always be in automatic mode during normal operations.
Going in manual mode
When you need to change/acknowledge/remove some tasks in the
BibSched queue, you need to enter
manual mode. Hit key 'a' and you will be prompted with a dialog asking you for how long you want to stay in manual mode. Usually 5 minutes is enough.
In the lower left corner of the monitoring display you will then see a timer counting down until the queue
automatically changes back to automatic mode.
You should also leave a message in the MOTD column in order to inform other operators why the queue is in manual mode. Hit key 'e' to change the message. Your username is automatically appended to this message.
- To see the bibsched.log when inside the monitor, hit key 'b'. Useful to see who has done certain actions, what has been running and generally see what has been going on.
Periodical BibSched tasks
Here is a listing of the INSPIRE specific periodical housekeeping tasks that should be present in the
BibSched queue at all times.
- BibIndex: the indexing job taking care of keeping the indices up-to-date on record updates. 5 minute timer.
- Command:
bibindex -f50000 -s5m
- WebColl: updates the bibrec_collection table and cached web-pages. 5 minute timer.
- Command:
webcoll -v0 -s5m
- BibReformat: updates cached record formats, such as all the brief formats shown in search results. 5 minute timer.
- Command:
bibreformat -o HA,HB,HX,WAPAFF,WAPDAT,HDREF -s5m
- dbdump: Nightly task that dumps the entire database to file. Currently runs for ~4 hours. As this time increases we should start looking into alternative ways of taking backups.
- Command:
dbdump -n2 -o/afs/cern.ch/project/inspire/PROD/dbdumps -L 02:00-05:00 -s21h
(Currently runs on a 21-hour rotation, which is not ideal as the starting time shifts. Awaiting a patch to be able to run at a fixed time.)
- Priority: 8
- inveniogc: Invenio Garbage Collector. Removes unnecessary tempfiles and cleans bibxxx tables.
- Command:
inveniogc -b -p -L 01:00-04:00 -s20h
- Priority: 6
- oaiharvest: OAI-PMH harvesting task that performs a nightly harvest of newly added or updated arXiv records (the main source of record ingestion).
- Command:
oaiharvest -r arxiv -s24h
- Priority: 7
- bibauthorid: keeps the bibauthorid tables up to date (adds missing/modified authors on records, removes stale references). Should be run after every bibupload, or as often as possible (minimum twice a day). Failing to run it often enough may result in loss of claiming data, so please treat it with care!
- Command:
bibauthorid --update-person-id -s12h
- webauthorprofile: precaches the contents of /author/ pages in order to greatly reduce realtime load on the servers. (Author pages require on average at least 20 seconds each, and are likely to time out for authors with many (~1000) papers during high server load.)
- Command:
webauthorprofile -s20h
- Notes: This task only reads data and fills cache tables. In an emergency it can be safely stopped or killed, but please remember to schedule it again afterwards!
- Notes: This can run in parallel with virtually any other task.
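A quick way to verify that all of the periodical tasks listed above are present is to compare the commands visible in the queue against the expected task names. The sketch below assumes the queued command lines have been copied by hand from the BibSched monitor into a list; it does not query BibSched itself, and `missing_periodic_tasks` is a hypothetical helper.

```python
# Task names taken from the list of periodical tasks on this page.
EXPECTED_TASKS = (
    "bibindex", "webcoll", "bibreformat", "dbdump",
    "inveniogc", "oaiharvest", "bibauthorid", "webauthorprofile",
)


def missing_periodic_tasks(queued_commands, expected=EXPECTED_TASKS):
    """Return the expected task names that have no command in the queue."""
    present = set()
    for command in queued_commands:
        parts = command.split()
        if parts:
            # The first word of each command line is the task name.
            present.add(parts[0].lower())
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    queue = [
        "bibindex -f50000 -s5m",
        "webcoll -v0 -s5m",
        "oaiharvest -r arxiv -s24h",
    ]
    print("missing:", ", ".join(missing_periodic_tasks(queue)))
```

Any task reported as missing should be rescheduled with the command given for it in the list above.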
Current INSPIRE Issues
Failing BibIndex jobs >> NEW <<
We are seeing lots of
BibIndex tasks failing with no error logs. We are looking into this issue. For now, just ACK the task, or re-initialize it if it is a periodical task. I think I fixed this by optimizing the load times, so we should not see any more failing bibindex tasks (otherwise I will have to look for more optimizations).
Failing BibUploads from OAI harvesting
Currently, there are nightly errors from
BibUpload tasks after arXiv harvesting due to DOI collisions. This is a known issue and the actions to be taken are the following:
- Open the task error log (key 'l' in BibSched) and copy the line containing the error message. Ex: Failed: DOI(s) ['10.1007/s00601-012-0567-z'] found in this record (#1185716) already exist(s) in another other record (#1245794)
- Open the associated Asana task for these types of errors: https://app.asana.com/0/2345936208876/5012433024625
- Add the error message as a new subtask if it does not already exist.
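When there are several such errors in one night, extracting the DOIs and record IDs from the copied lines makes it easier to check whether a collision is already listed as a subtask. The following sketch parses the message format shown in the example above; the pattern is inferred from that single example, so other variants of the message may not match, and `parse_doi_collision` is a hypothetical helper.

```python
import re

# Pattern inferred from the example error message on this page.
DOI_COLLISION = re.compile(
    r"DOI\(s\) \[(?P<dois>[^\]]*)\] found in this record \(#(?P<record>\d+)\) "
    r"already exist\(s\) in another (?:other )?record \(#(?P<other>\d+)\)"
)


def parse_doi_collision(line):
    """Return the DOIs and the two record IDs from an error line,
    or None if the line is not a DOI collision message."""
    match = DOI_COLLISION.search(line)
    if match is None:
        return None
    return {
        "dois": re.findall(r"'([^']+)'", match.group("dois")),
        "record": int(match.group("record")),
        "other_record": int(match.group("other")),
    }


if __name__ == "__main__":
    example = (
        "Failed: DOI(s) ['10.1007/s00601-012-0567-z'] found in this record "
        "(#1185716) already exist(s) in another other record (#1245794)"
    )
    print(parse_doi_collision(example))
```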
Temporary bibauthorid changes
I removed bibauthorid from the non-concurrent list so it will run at the same time as bibuploads etc. This rarely leads to problems, but it is possible, so be warned. bibauthorid should be put back once we understand why it is not going to sleep in a reasonable amount of time.
Other INSPIRE specific notes
- Nightly OAI harvesting from arXiv should always run. It is important that any errors related to the harvesting are fixed quickly and that, after any delay or blockage of the queue overnight, harvesting has #1 priority.
- The
dbdump
task may sometimes fail with mysqldump errors such as 728 or similar. The cause of this error is unclear, but it may be related to AFS. The frequency of this error seems to be decreasing, however.
--
JanLavik - 07-Aug-2013