Feedback from T1 contact persons

Target of the survey

LHCb contact persons at T1: JT (Jeff Templon, Nikhef), LA (Luisa Arrabito, IN2P3), EL (Elisa Lanciotti, PIC), JK (John Kelly, RAL), PF (Paolo Franchini, CNAF), RN (Raja Nandakumar, RAL), AZ (Alexey Zelezov, GridKa).

1. What are the first pages you look at as the first thing in your daily activity, in order to understand what is happening?

Feedback

JT:

  • Running jobs (Nikhef link): http://www.nikhef.nl/grid/stats/ndpf-prd/voview-short
  • Nagios at Nikhef (Nikhef link): http://spade.nikhef.nl/nagios/ (tests of the site and official SAM tests)
  • LHCb Dashboard site availability (FCR critical tests in the last 48 h), to see what the experiments think of the site:
http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=501&sites=LCG.CERN.ch&sites=LCG.CNAF.it&sites=LCG.GRIDKA.de&sites=LCG.IN2P3.fr&sites=LCG.NIKHEF.nl&sites=LCG.PIC.es&sites=LCG.RAL.uk&algoId=82&timeRange=last48

LA:

EL:

JK:

PF:

  • Reading the mail from the past night.
  • Checking all the SAM tests.
  • Checking the operations meeting daily report.

RN:

AZ:

  • GGUS, to get an idea of the already known problems with GridKa.
  • My 24h page, to check that GridKa is not failing more than the others for the same type of jobs.
  • If I see some failed SAM tests there: the SAM results, to know what the GridKa operators could see by themselves.

Outcome:

2. What is the most relevant information you look for?

Feedback

LA: From the DIRAC monitoring page:
  • number of running jobs, compared with the expected LHCb activity. The latter is normally announced in the Operations ELOG, but I also look at the production monitor in the same DIRAC portal to cross-check.
  • rate of failed jobs and the related error messages. From these messages I try to work out whether the failure is site-related or not. This point could be improved with more explicit messages.
  • DIRAC SAM jobs and the related messages

From the SAM dashboard:

  • results of SAM tests and the related outputs if there are failures. Both are very important pieces of information.

Some pages internal to my site, which give a complementary view of its general status (the most important being our dCache portal and the Data Transfer Portal).

EL: Number of running jobs, rate of failures, and the site status.

JK: Anomalies affecting the RAL Tier-1. Some things are easy to spot, e.g. a failed disk server; other things are more difficult to spot, e.g. failures of FTS transfers for a specific VO from a particular service class.

PF: Rate of running/failed jobs and the SAM tests.

RN: Whether RAL is within the LHCb mask. Number of running jobs. Rate of failures.

AZ: I just check that the GridKa team accepts the problem as concerning the site, in case LHCb thinks so, and processes it in time.

Outcome

  • number of running jobs and rate of failures.
  • whether the site is in the site mask
  • results of SAM tests
  • dCache monitoring from site-specific tools

3. Do you think that the information you already have available is enough? Is it enough to understand what a problem is about, and to debug it?

Feedback

JT: In most cases it is OK. One notable exception is the WMS; there it is very hard to know whether there is a problem or not.

LA: I think the information is there, but it is too spread out. Besides this, it is important to cross-check different pieces of information: for example, SAM tests can be OK while many user jobs fail.

EL: The information is not complete. SAM tests can be fine and the site banned at the same time, which looks inconsistent to me. I would like to see a complete evaluation of the site status.

JK: It is difficult to understand and debug problems on the LCG. It usually takes specialist knowledge. Monitoring tools only give an indication of what is happening.

PF: Something for monitoring the data transfers would be useful.

RN: More-or-less yes. At least I know mostly where to look for issues.

AZ: The relative number of failed jobs is enough to understand site-specific problems with jobs.

Outcome

More or less the information is enough, but it is spread over different sources. There is not enough information about data transfers and the WMS.

4. What is missing in your opinion?

Feedback

JT: LHCb is usually OK in this respect; your SAM tests are pretty representative of the real use of the grid.

LA: More centralized information. For example:
  • the possibility to select by job type in the DIRAC web portal (in the left menu you currently cannot select a job type);
  • the job type column, and the CE the job has been submitted to, added to the job monitor default view (now you have to add them manually);
  • the possibility to build dynamic plots using the different fields (for example a scatter plot of 'failed status vs Owner' or whatever) in order to find correlations and help the debugging.

EL: A complete view of the site status. On one side one has to look at the SAM tests (for example from the dashboard), and on the other side one has to execute the DIRAC command to see which sites are banned.

JK: It is interesting to compare different experiments' dashboards. We have most experience with the ATLAS dashboard (http://arda-dashboard.cern.ch/atlas/); the equivalent LHCb dashboard is http://dashboard.cern.ch/lhcb/index.html

PF: Something for monitoring the data transfers would be useful.

RN: The transfer matrix

AZ: Problems with other services (FTS, LFC, WMS, etc.) are normally checked by dedicated LHCb experts. I do not know of a page where I can see all site-related services in one place, so it would be nice to have one. The CMS "happy face" looks reasonably simple to check in 5 seconds.

Outcome

  • One complete definition of the site status
  • Transfer matrix
  • Job type and CE columns in the DIRAC job monitor

5. Do you have any special remarks on the content of the information?

Feedback

JT: It is difficult to figure out the data access test results; it was not clear which of the "failures" were normal.

LA: The content of the information is quite good, but I would like to have it more centralized.

AZ: For the dCache and tape system one has to browse the GridKa web site. This information is not published anywhere else, although it helps to understand nearly all SE problems.

Outcome

  • More verbose error messages in case of data access problems
  • More organized publication of the information about the dCache system (but I think this depends on the site, and we cannot do anything about it)

6. Do you personally think it is worth creating a single entry point for LHCb that would allow you to drill down into activities/problems?

Feedback

JT: It is OK now, except that drilling down from the dashboard can sometimes take a VERY long time. This should be fixed, but I think this is a dashboard problem, not an LHCb problem.

LA: Yes, I would like to have a single entry point and to be able to get the same level of information that I can now get by navigating different pages.

EL: Yes, I think it would save me time. I would like to have only one URL to check, which collects all the information about LHCb and sets the status of my site. If it is green, fine. If it is not green, then starting from there it should provide me with a way to drill down into the problem in the most efficient way.

JK: A single entry point is a good idea, as is the ability to drill down to 'per job' or 'per transfer' detail.

PF: Yes, for sure. There is a lot of information spread across many sites, and every week I discover something new that should be checked, or I ask experts about something that I should have found in some monitor already available on the web.

RN: Not really. In case of problems, I need to be able to look at the actual log files if possible.

AZ: For me, it would be nice to have one page with:

  • DIRAC job statistics for the site (also normalized by job type)
  • All current service failures, if any
  • All current SAM failures, if any
  • Planned downtimes for today
  • Open GGUS tickets

Outcome

In general yes (only 2 out of 7 said no), including (also taking into account the feedback from point 2):
  • Full definition of the site status
  • DIRAC job statistics for the site (also normalized by job type): running jobs, failed jobs
  • All current service failures, if any
  • SAM tests
  • Planned downtimes for today
  • Open GGUS tickets
  • Transfer matrix
  • DIRAC production mask

Useful Links

Number code:
  • 1-yes, often for normal operations
  • 2-only sometimes
  • 3-only when there is a particular problem I am investigating
  • 4-no

Feedback

  • LA: 1, EL: 1, JK: 2, PF: 1, AZ: 1. Very useful.
  • JT: 2, LA: 4, EL: 4, JK: 1, PF: 4. Only 2 persons found it useful.
  • JT: 4, LA: 2, EL: 2, JK: 1. Quite useful.
  • JT: 4, LA: 3, EL: 1, JK: 4, PF: 4. Sometimes useful.
  • JT: 4, LA: 2, EL: 4, JK: 4, PF: 1. More or less useful.
  • LA: 1, EL: 1, JK: 1. Very useful.
  • JT: 4 (other dashboard views), LA: 1, EL: 2, JK: 1. Useful.
  • JT: 1, LA: 2, EL: 2, JK: 1. Very useful.
  • LA: 3, EL: 3. Not very useful.
  • JT: 3, LA: 3, EL: 3, JK: 1, PF: 2. More or less useful, in case you have to check downtimes.
  • JT: 4, LA: 4, EL: 4, JK: 4. Not useful.
  • JK: 1, EL: 2. Can be useful.
  • JT: 4 (they have a local version at Nikhef), LA: 3, EL: 3, JK: 3. Not much used, only sometimes.
  • JT: 3, LA: 4, EL: 3, JK: 2. Can be useful sometimes.

-- ElisaLanciotti - 25-Jan-2010

