Stefano's experience as Computing Run Coordinator
a small blog Oct 29 - Nov 4
beware: I am rearranging continuously as I get more experience
Organization, responsibility etc.
- it feels like we have a running experiment. Hey, this is great !!!
- need a procedure to pass control from one person to another at shift change (see below)
- CRC location should be defined. Ideally should show up in CmsCenter most of the day
- need a seating place in CmsCenter, ideally a small table next to the computing ops
- the left side of CmsCenter is offline+computing+DQM, needs a shift leader (not two: CRC + ORC)
- responsibility, if any, should be clarified: which actions can the CRC take ? which decisions ? at which level ?
- we are not organized yet for continuous T0 operations
- division of responsibility between DataOps and CSP, including communication (they do not talk)
- how to tell if the various disk buffers are at risk of being full
- if and which expert can be called in
- CRC function covers two areas now, better to separate them
- shift supervision (be in CmsCenter when taking data, one week shift or longer, cover all of the left side: computing, offline, DQM)
- need one person in CmsCenter (or close by) that anyone from CMS/WLCG etc. can talk to at any time for operational problems, sort of OperationManagerOnDuty
- Computing Ops Manager, Offline Ops Manager, i.e. deputy for L1's as discussed: this needs a longer time vision/memory, 3~6 months, located at CERN, only needed for policy making, current L1's could do it
- DataOps other than T0 is not coming to this thread at all, even the DataOps e-log only has T0 entries, why then a separate T0 e-log ?
- it makes no sense that problems at T1's, T2's are found, tracked, fixed without contact with DataOps, actually with the possibility that they contact the sites independently
Communication flow
- how do CRC and others contact CSP ? need AIM and phone (mobile and CMS center console)
- I created the AIM accounts and put a phone at the CSP console; L2's need to get a mobile phone (at least those who can make power calls)
- need https://twiki.cern.ch/twiki/bin/view/CMS/CompOpsContact like organization with AIM for shifters etc., I created https://twiki.cern.ch/twiki/bin/view/CMS/ComputingShiftContacts but a single page would be better
- in which e-log does CRC write ?
- too much time goes into copying/pasting messages to/from e-logs/savannah/ggus/HN
- whom/what is the e-log for ? if its use is limited, in the end no one reads it, and then why write ?
- CRC + CSP work in isolation, this is not good. Examples:
- CSP reports that CAF has a huge job load. Noted in e-log. Instructions say report. Whom do we report to ? Does anybody really need to be told ? Users and analysis groups should self-regulate by looking themselves at the length of batch queues. Are we looking for situations that may endanger the system, so that we need to report to IT and close the queue, or report to whom ?
- simulation harvesting from Site X and Y is failing. Whom do we tell ? And how ? And will they do anything ?
Instructions etc.
- Tx_CC_SSSS name must be put in Savannah ticket summary and site box in Savannah
- web pages could be much improved, see SuggestionsForBetterWebPagesForComputingShifts
- open issues should be in a different place from plan of the day
- need a compact page with this week's shifters and managers and the plan of the day, joint with DataOps, i.e. the Shift Entry page: person X comes on as CSP shifter, which page shall she/he open first ?
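On the Tx_CC_SSSS point above: a check like this could flag ticket summaries that lack a site name. It is only a sketch. The regular expression is my guess at what the convention allows (tier digit, two-letter country code, site label), the function names are made up for illustration, and nothing here talks to Savannah itself.

```python
import re

# Assumed reading of the Tx_CC_SSSS convention: "T" + tier digit,
# two uppercase letters for the country, then a site label.
# The allowed characters in the site label are a guess.
SITE_NAME = re.compile(r"\bT[0-3]_[A-Z]{2}_[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*\b")


def find_site_names(summary):
    """Return every substring of a ticket summary that looks like a site name."""
    return SITE_NAME.findall(summary)


def summary_has_site(summary):
    """True if the summary carries at least one Tx_CC_SSSS style name."""
    return bool(find_site_names(summary))


if __name__ == "__main__":
    # Illustrative summaries only, not real tickets.
    for text in ["T1_US_FNAL: transfers stuck", "transfers stuck at FNAL"]:
        print(text, "->", "ok" if summary_has_site(text) else "missing site name")
```

A shifter (or a small script watching new tickets) could run this over the summary before submitting, so that the site name is never forgotten.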
Shift(ers) management
- whom do they report to ? what do they report ? which actions can/must they take and how is that tracked ?
- is CSP simply a human replacement for an alert system ? But then whom do we alert ?
- too many e-logs (including a hidden one at FNAL), no cross-ref, data distribution topics entered in FNAL dataops e-log instead of CERN Distributed Data Transfers
- problem notices come via mail rather than e-log due to the need to look at too many e-logs
- getting one e-mail for every logbook entry is crazy
- but actively watching 8 e-logs is also impossible
- need a central e-log that CRC keeps always open where to watch for updates
- CMS Center room needs a local manager
- some shifters are very good, we need to work harder at making their effort and time more useful
- there are many instances where shifters do not know what to do, a local supervisor will help
- rules for when/whom to open/follow/close tickets should be defined, so that we get over the "3-shift experience"
- simply stated, it is difficult to fill in a shift summary or a day summary, e.g. CSP are supposed to report issues they found, but not if they are still ongoing. CRC should collect the summary from all shifters and flag important things to pass to experts, not create the list himself
--
StefanoBelforte - 29 Oct 2008