Feedback gathered from shifters


Jibo

Hi, Roberto,

- As a shifter, what is the first action you take when you start your working day? In which order, and which information, do you check?

In the past I would check whether there were active productions on the Dirac production monitoring page: https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/jobs/ProductionMonitor/display

Then I would go to the Dirac job plot page to check the total jobs, failed jobs, etc. at each site, to see whether there is something wrong with a particular site.

More recently I have been opening these two web pages to get an idea of the grid status more quickly: http://d0bar.physi.uni-heidelberg.de/reporter/24h/index.html and http://dashb-lhcb-ssb.cern.ch/dashboard/request.py/siteview?view=Job%20Activities

- Where do you gather information and how do you use it? Do you think that the information in the DIRAC portal is enough, or do you have the feeling that something is still missing?

Mostly in Dirac. In my opinion, the information in Dirac is sufficient; we just need a way to gather it more quickly.

If you feel so, what is missing? Is the organization of the information that you check regularly OK, or do you think there is still room for improvement in the way it is exposed (for example, in line with the procedures or action lines that you follow)?

I guess the shifters would not be able to check all the information in Dirac. Some summary pages like http://d0bar.physi.uni-heidelberg.de/reporter/24h/index.html and http://dashb-lhcb-ssb.cern.ch/dashboard/request.py/siteview?view=Job%20Activities are very helpful and efficient.

Do you just look at the quality of the activity of the resources via DIRAC, or do you also check external pages?

Usually just in Dirac.

- Do you feel that an alarm system is important? Do you think that there are enough alarms in the LHCb operations?

Yes. No. Maybe we could gather all the alarms on one page, and the shifter would contact the experts to clear an alarm if he cannot deal with it himself and it is urgent. Or would the Grid experts always check this page by themselves?

- Do you check how services are running based on their criticality? Do you know which are the critical services for LHCb operations? What would you think if, for critical services, we outsourced the monitoring of basic metrics to (and exploited) a 24x7 piquet service? Do you usually check fabric-level information (Ganglia/Lemon/SLS) for our voboxes or for our core services at CERN (CASTOR/WMS/LSF/AFS/Network/Oracle)? If not, when do you check this information?

It seems these questions are for the Grid experts? I have to say I don't understand them all.

- How do you monitor transfers? Do you feel confident with the information available for identifying and debugging a transfer problem?

I just know that Dirac also provides transfer monitoring, but so far I haven't used it very often. So maybe next year the shifters will also need to check this?

- Do you already follow well-defined procedures? If yes, where do you look them up? Which procedures do you feel are useless or less important? Which ones are crucial but, you are afraid, still incomplete? Which ones are, on the other hand, clear and well exercised in your opinion?

At the moment I just follow what is said on this page: https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionShifterGuide

But I guess when we start to collect real data, the shifters will need to do more?

- Do you feel that a correlation between the activities and the status of the resources is missing? Do you think there is already a view that collects all the relevant information at a glance? For example, do you check this page: https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities?

Yes, I also look at this page now.

This gives you - in one view and for all our sites - the quality of all the activities (and their intensity), and the status of the sites in our mask and in the GOCDB. How would you improve it?

Maybe it would be better if we could also set the period to look at, e.g. the last 12 hours and so on?

How do you decide that a site must be banned (either a T1 or a T2)? How do you otherwise realize that a site is banned and has (or has not) to be reintegrated? How do you realize that a problem has been addressed? Solved?

If there are O(10) jobs running at a site and more than a 40% failure rate (due to an unknown issue), I would ban the site.

For a Tier-1 site, I would submit a GGUS ticket and follow it to see whether the problem has been solved. For a Tier-2 site, it seems to me that we do not submit a GGUS ticket?
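A minimal sketch of this rule of thumb, for illustration only: the thresholds (O(10) jobs, 40% failures from an unknown cause) come from the answer above, while the function and argument names are hypothetical and not part of any DIRAC tool.

<verbatim>
# Illustrative sketch of the banning rule of thumb described above.
# The thresholds (10 jobs, 40% failure rate) come from the shifter's answer;
# the function and argument names are hypothetical.

def should_ban_site(total_jobs, failed_jobs, failure_understood=False):
    """Return True if the site matches the ban criterion."""
    if total_jobs < 10:          # too few jobs to judge (O(10) threshold)
        return False
    if failure_understood:       # known, already-tracked issue: keep the site in the mask
        return False
    failure_rate = failed_jobs / total_jobs
    return failure_rate > 0.40   # more than 40% failures from an unknown issue

# Example: 50 jobs at a site, 25 failed for an unknown reason -> ban
print(should_ban_site(total_jobs=50, failed_jobs=25))  # True
</verbatim>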

- Out of the list of links below, which ones were you aware of?

I only know the following two: http://lblogbook.cern.ch/Operations/ http://dashb-lhcb-ssb.cern.ch/dashboard/request.py/siteview?view=Job%20Activities

Which ones do you think should be added (and used more systematically) in our portal, i.e. which information can be exploited from them? What external information do you think would be worth integrating better into the pages you consume daily (e.g. GGUS, SAMDB, GOCDB, Dashboards, SLS, Lemon, Nagios)?

I don't know. If the production shifters need to check more, we will need some training so that we can understand the information listed on these web pages.

-- Main.RobertoSantinel - 19-Jan-2010
