Survey sent to LHCb shifter crew

Dear all,

Some days ago we sent a survey to our contact persons and site managers at the T1s (and have already received a lot of input from it) to gather an external point of view on the monitoring and information available. Now, with the same aim of better organizing the available information and identifying whether something is still missing or could be better organized, we would like to propose to you, as shifters of our production crew, this list of questions targeted at you.

Your answers will be taken into account when compiling a requirements document that should drive the next steps of this activity, which was started primarily to limit the proliferation of monitoring tools and information that risk being pointless or even dangerous. We hope that, with a well-defined list of requirements, we can provide input not only for better-organized monitoring of our resources but also for improving existing procedures and the overall computing operations of our team. Just a few minutes of your time (during these vacations) could be an invaluable contribution to all of us.

- As a shifter, what is the first action you take when you start your working day? Which information do you check, and in which order?
- Where do you gather information and how do you use it? Do you think the information in the DIRAC portal is enough, or do you have the feeling that something is still missing? If so, what is missing? Is the organization of the information that you check regularly OK, or do you think there is still room for improvement in the way it is exposed (for example, to match the procedures or action lines that you follow)? Do you look at the quality of the activity of resources only via DIRAC, or do you also check external pages?
- Do you feel that an alarm system is important? Do you think there are enough alarms in LHCb operations?
- Do you check how services are running based on their criticality? Do you know which services are critical for LHCb operations? What would you think if, for critical services, we outsourced the monitoring of basic metrics to a 24x7 piquet service? Do you usually check fabric-level information (Ganglia/Lemon/SLS) for our voboxes or for our core services at CERN (CASTOR/WMS/LSF/AFS/Network/Oracle)? If not, when do you check this information?
- How do you monitor transfers? Do you feel confident that the information available lets you identify and debug a transfer problem?
- Do you already follow well-defined procedures? If yes, where do you look them up? Which procedures do you find useless or less important? Which ones are crucial but, you fear, still incomplete? Which ones, on the other hand, are clear and well exercised in your opinion?
- Do you feel that a correlation between activities and the status of resources is missing? Do you think there is already a view that collects all relevant information at a glance? For example, do you check this page? https://dashb-lhcb-ssb/dashboard/ It gives you, in one view and for all our sites, the quality of all the activities (and their intensity) and the status of sites in our mask and in the GOCDB. How would you improve it? How do you decide that a site (either a T1 or a T2) must be banned? How do you otherwise realize that a site is banned and has (or has not) to be reintegrated? How do you realize that a problem has been addressed? Solved?
- Out of the list of links below, which ones were you aware of? Which ones do you think are important to add to (and use more systematically in) our portal, i.e. which information can be exploited from them? What external information do you think is worth integrating better into the pages you consult daily (e.g. GGUS, SAMDB, GOCDB, Dashboards, SLS, Lemon, Nagios)?
- Any other questions or remarks?

Here is the (non-exhaustive) list of information that I check regularly (beyond the DIRAC portal):



http://gridview001/GVPC/Excel/ azione_Pilot_job_distribuiti_n

-- RobertoSantinel - 19-Jan-2010

Topic revision: r1 - 2010-01-19 - unknown