
Feedback gathered from contact persons and/or sysadmins


Paolo Franchini - CNAF

> 1.what are the first pages that you look at as first thing in your daily activity in order to understand whatʼs happening?

The very first task is reading the mail from the past night. Next, if there is no major problem, I check all the SAM tests and your daily report. Basically I make sure that all the problems are under the control of the experts if I cannot take part directly in solving them.

> 2.what are the most relevant information that you are looking at out of there? Could you give us your perception on their importance?

For me too, the rate of running/failed jobs is one of the most relevant hints, followed (at least in this period) by the SAM tests.

> 3. do you think that the information you have already available are enough to understand whether some problem is happening? Is it enough to
> understand what the problem is about? Is it enough to debug a problem?

Something for monitoring the data transfer would be useful, in order to know from another point of view the amount of activity that is running through 'my' tier1.

> 4. what is missing in your opinion? What are the information that you see in other VOs specific monitoring tools (that you check routinely)
> and you would like to see implemented in LHCb too?
>
> 5. have you any special remarks on the content of the information? On the way are organized?
>
> 6. Do you personally think worth to create a single entry point for LHCb that would allow you to drill down into activities/problems? Up to
> which level of details you would expect to browse information?

Yes, for sure. There is a lot of information spread across many sites, and every week I discover something new that should be checked, or I ask the experts something that I should have found in some monitor already present around the web.

> 7. Could you indicate me which link (out of the ones below) you
> check/have checked in the past. Which ones are checked regularly? How
> would you improve these ones? Why you do not use the rest of them? (just because not advertized anywhere)? Do you have any other link, bit of information that do not appear in my bookmark (may it well be I have simply omitted ;-))?

> https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities

does not work even for me.

> http://lhcbweb.pic.es/DIRAC/info/general/diracOverview

used daily.

> http://goc.grid.sinica.edu.tw/gstat

never used.

> http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens

never used that page, but I regularly check the same information.

> http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview

very useful.

> https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams

doesn't work

> https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability

doesn't work

> http://gridview001/GVPC/Excel/

doesn't work

> https://goc.gridops.org/

used to check the downtimes.

> http://sls.cern.ch/lrf-castor/index.php

I don't have an account.

Raja Nandakumar - RAL

1.what are the first pages that you look at as first thing in your daily activity in order to understand what's happening?

LHCb mask : http://lhcb-project-dirac.web.cern.ch/lhcb-project-dirac/lhcbProdnMask.html
Job monitoring : https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/jobs/JobMonitor/display
SAM tests : http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
SLS tests : https://sls.cern.ch/sls/service.php?id=T1VOBOX
WMS history : Dirac portal accounting

2.what are the most relevant information that you are looking at out of there? Could you give us your perception on their importance?

Whether RAL is within the LHCb mask. Number of running jobs. Rates of failures

3. do you think that the information you have already available are enough to understand whether some problem is happening? Is it enough to understand what the problem is about? Is it enough to debug a problem?

More-or-less yes. At least I know mostly where to look for issues.

4. what is missing in your opinion? What are the information that you see in other VOs specific monitoring tools (that you check routinely) and you would like to see implemented in LHCb too?

The transfer matrix

5. have you any special remarks on the content of the information? On the way are organized?

No

6. Do you personally think worth to create a single entry point for LHCb that would allow you to drill down into activities/problems? Up to which level of details you would expect to browse information?

Not really. In case of problems, I need to be able to look at the actual log files if possible.

7. Could you indicate me which link (out of the ones below) you check/have checked in the past. Which ones are checked regularly? How would you improve these ones? Why you do not use the rest of them? (just because not advertized anywhere)? Do you have any other link, bit of information that do not appear in my bookmark (may it well be I have simply omitted ;-))?

The main list of sites I use is given above. I do use all the links below at various times. One useful link I use a lot is the LHCb mask given above (http://lhcb-project-dirac.web.cern.ch/lhcb-project-dirac/lhcbProdnMask.html)

https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities
http://lhcbweb.pic.es/DIRAC/info/general/diracOverview
http://goc.grid.sinica.edu.tw/gstat
http://lbrunstatus.cern.ch/LHCb?destination=OFFLINE
http://santinel.home.cern.ch/santinel/cgi-bin/srm_test
http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens
http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview
https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports
http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
https://lcg-sam.cern.ch:8443/sam/sam.py
http://lemonweb.cern.ch/lemon-web/
http://sls.cern.ch/sls/service.php?id=LHCb-Storage
https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams
http://sls.cern.ch/sls/service.php?id=ServicesForLHCb
https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability
https://j2eeps.cern.ch/service-lsfweb/login
http://gridview001/GVPC/Excel/
http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
https://goc.gridops.org/
http://sls.cern.ch/lrf-castor/index.php
http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/
http://wmsmon.cern.ch/monitoring/monitoring.html
https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php
http://cic.in2p3.fr/index.php?id=home&js_status=2
https://lhcbweb.bo.infn.it/twiki/bin/view.cgi/LHCbBologna/MonitorCNAFtest#Frazione_Pilot_job_distribuiti_n
http://servicemap.cern.ch/ccrc08/servicemap.html?vo=LHCb

Alexey Zhelezov - GRIDKA

My answers are inline. Note that they are a bit biased (as usual :-) ). I had no possibility to discuss this at the GridKa Operations Meeting, so the following are my personal answers. I will try to get answers from the admins next year (if still relevant). (In short, I do not think they check/care about LHCb-specific info other than GGUS and SAM.)

> 1.what are the first pages that you look at as first thing in your
> daily activity in order to understand what's happening?

GGUS - to get an idea of the already known problems with GridKa.
My 24h page - to check that GridKa is not failing more than other sites for the same type of jobs.
If I see some failed SAMs here: SAM results - to know what the GridKa operators could see by themselves.

> 2.what are the most relevant information that you are looking at out
> of there? Could you give us your perception on their importance?

As contact person, I just check that the GridKa team accepts a problem as their own when LHCb thinks it is, and processes it in time.

> 3. do you think that the information you have already available are
> enough to understand whether some problem is happening? Is it enough to understand what the problem is about? Is it enough to debug a problem?

For me, the relative number of failed jobs is enough to understand site-specific problems with jobs.
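
A minimal sketch of such a comparison, assuming job records are available as plain dictionaries. The field names `site`, `job_type` and `status`, the status strings, the `margin` threshold and the sample records are all hypothetical; this is not a DIRAC API.

```python
from collections import defaultdict


def failure_fractions(jobs):
    """Return {(site, job_type): failed / total} for an iterable of job records."""
    totals = defaultdict(int)
    failed = defaultdict(int)
    for job in jobs:
        key = (job["site"], job["job_type"])
        totals[key] += 1
        if job["status"] == "Failed":
            failed[key] += 1
    return {key: failed[key] / totals[key] for key in totals}


def sites_worse_than_average(jobs, job_type, margin=0.1):
    """List sites whose failure fraction for job_type exceeds the
    all-site fraction for the same job type by more than margin."""
    selected = [job for job in jobs if job["job_type"] == job_type]
    if not selected:
        return []
    overall = sum(job["status"] == "Failed" for job in selected) / len(selected)
    fractions = failure_fractions(selected)
    return sorted(site for (site, _), frac in fractions.items()
                  if frac > overall + margin)


# Hypothetical records, for illustration only (not real job data).
sample_jobs = (
    [{"site": "LCG.GRIDKA.de", "job_type": "reconstruction", "status": "Failed"}] * 3
    + [{"site": "LCG.GRIDKA.de", "job_type": "reconstruction", "status": "Done"}]
    + [{"site": "LCG.CNAF.it", "job_type": "reconstruction", "status": "Failed"}]
    + [{"site": "LCG.CNAF.it", "job_type": "reconstruction", "status": "Done"}] * 3
)

print(sites_worse_than_average(sample_jobs, "reconstruction"))
```

Comparing within one job type matters here: a site running mostly a failure-prone job type would otherwise look worse than it is.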

>
> 4. what is missing in your opinion? What are the information that you
> see in other VOs specific monitoring tools (that you check routinely)
> and you would like to see implemented in LHCb too?

Problems with other services (FTS, LFC, WMS, etc.) are normally checked by dedicated LHCb experts. I do not know of a page where I can see all site-related services in one place, so it would be nice to have one. The CMS "happy face" looks reasonably simple: it can be checked in 5 seconds.

> 5. have you any special remarks on the content of the information? On
> the way are organized?
For the d-cache and tape situation I have to browse the GridKa web site. This information is not published anywhere else, although it helps to understand nearly all SE problems.

> 6. Do you personally think worth to create a single entry point for
> LHCb that would allow you to drill down into activities/problems? Up
> to which level of details you would expect to browse information?
For me, it would be nice to have one page with:
1) DIRAC job statistics for the site (also normalized by job type)
2) All current service failures, if any
3) All current SAM failures, if any
4) Planned downtimes for today
5) GGUS tickets not yet closed
6) Current situation with the SE (in the GridKa case, the number of used d-cache movers per LHCb pool and the tape transfer rates, LHCb and total).
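
The six items requested above could be gathered in one record per site. The following is only a sketch of such a structure: the class name, field names and the sample values are hypothetical placeholders, and no existing LHCb or DIRAC interface is implied.

```python
from dataclasses import dataclass, field


@dataclass
class SiteSummary:
    """One-page summary for a site; one attribute per requested item."""
    site: str
    job_stats_by_type: dict = field(default_factory=dict)  # job type -> {"done": n, "failed": n}
    service_failures: list = field(default_factory=list)
    sam_failures: list = field(default_factory=list)
    downtimes_today: list = field(default_factory=list)
    open_ggus_tickets: list = field(default_factory=list)
    se_status: dict = field(default_factory=dict)  # e.g. movers per pool, tape rates

    def render(self):
        """Return the summary as plain text, one line per entry."""
        lines = [f"Site: {self.site}"]
        for job_type, counts in sorted(self.job_stats_by_type.items()):
            total = counts["done"] + counts["failed"]
            fraction = counts["failed"] / total if total else 0.0
            lines.append(f"  jobs[{job_type}]: {total} total, {fraction:.0%} failed")
        sections = [
            ("Service failures", self.service_failures),
            ("SAM failures", self.sam_failures),
            ("Downtimes today", self.downtimes_today),
            ("Open GGUS tickets", self.open_ggus_tickets),
        ]
        for title, items in sections:
            lines.append(f"  {title}: {', '.join(items) if items else 'none'}")
        for key, value in sorted(self.se_status.items()):
            lines.append(f"  SE {key}: {value}")
        return "\n".join(lines)


# Placeholder values for illustration only.
example = SiteSummary(
    site="LCG.GRIDKA.de",
    job_stats_by_type={"reconstruction": {"done": 3, "failed": 1}},
    open_ggus_tickets=["placeholder-ticket"],
    se_status={"tape transfer rate": "placeholder"},
)
print(example.render())
```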

> 7. Could you indicate me which link (out of the ones below) you
> check/have checked in the past. Which ones are checked regularly? How
> would you improve these ones? Why you do not use the rest of them?
> (just because not advertized anywhere)? Do you have any other link,
> bit of information that do not appear in my bookmark (may it well be I
> have simply omitted ;-))?
I use various links from your page once I have some specific question. The only place I visit regularly is the DIRAC Web portal (since I am also a DIRAC developer :-) ).

Jeff Templon - NL-T1

Yo Roberto,

I am answering personally, I put the rest of the team in CC because they might want to see your bookmarks or add to what I said.

On 11 Dec 2009, at 15:42, Roberto Santinelli wrote:

> Here my (non exhaustive) list of questions:
> 1.what are the first pages that you look at as first thing in your
> daily activity in order to understand what's happening?

http://www.nikhef.nl/grid/stats/ndpf-prd/voview-short

http://spade.nikhef.nl/nagios/

http://dashb-alice-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=100&sites=NIKHEF-ELPROD&sites=SARA-MATRIX&algoId=6&timeRange=last48

http://dashb-atlas-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=400&sites=NIKHEF-ELPROD&sites=SARA-MATRIX&algoId=21&timeRange=last48

http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=501&sites=LCG.CERN.ch&sites=LCG.CNAF.it&sites=LCG.GRIDKA.de&sites=LCG.IN2P3.fr&sites=LCG.NIKHEF.nl&sites=LCG.PIC.es&sites=LCG.RAL.uk&algoId=82&timeRange=last48

the first one, to see how "busy" the site is (where are the LHCb jobs BTW?)
the second one, because there are many tests of the system listed there, and any problems with our own probes, or with SAM tests, can be seen here

also the three dashboard plots, to see what the experiments think of how we are doing.

There are other ones I look at if I see something hinky on one of these.

> 2.what are the most relevant information that you are looking at out
> of there? Could you give us your perception on their importance?

I think I already explained it above?

> 3. do you think that the information you have already available are
> enough to understand whether some problem is happening? Is it enough
> to understand what the problem is about? Is it enough to debug a
> problem?

In most cases it is ok. One notable exception is the WMS, there it is very hard to know if there is a problem or not.

> 4. what is missing in your opinion? What are the information that you
> see in other VOs specific monitoring tools (that you check
> routinely) and you would like to see implemented in LHCb too?

LHCb is usually ok in this respect, your SAM tests are pretty representative of the real use of the grid.

> 5. have you any special remarks on the content of the information?
> On the way are organized?

I haven't looked in a while, but I remember that in the past it was difficult to figure out the data access test results ... it was not clear which of the "failures" were normal.

> 6. Do you personally think worth to create a single entry point for
> LHCb that would allow you to drill down into activities/problems? Up
> to which level of details you would expect to browse information?

It is OK now, except that from the dashboard, drilling down sometimes can take a VERY long time. This should be fixed, but I think this is a dashboard problem, not an LHCb problem.

> http://goc.grid.sinica.edu.tw/gstat

this one we check sometimes, I think there are also some nagios probes that trigger on info from here but I am not sure.

> http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry

see above, I do look at the LHCb dashboard but not exactly this link.

> https://lcg-sam.cern.ch:8443/sam/sam.py

yes.

> http://gridview001/GVPC/Excel/

is this a CERN link?

> https://goc.gridops.org/

whenever needed we look here, but it is for me not often.

> http://wmsmon.cern.ch/monitoring/monitoring.html

we have our own version of this running here.

> http://cic.in2p3.fr/index.php?id=home&js_status=2

the CIC portal? Sometimes when necessary but it is not often.

ps : I cut many of the links, I will fwd the entire original mail to the rest of the admin crew, and to SARA.

Luisa Arrabito - IN2P3

Hi Roberto, Here my answers.

Roberto Santinelli wrote:
>
> Dear LHCb contact persons at T1's,
>
> this survey is aimed to collect as much information and requirements
> as possible in order to provide you, your site admins and our
> production crew with a coherent and consistent interface for both
> activities and services/resources monitoring.
>
> This initiative must be considered as the response to the
> proliferation of (too many) monitoring tools (in place since years
> now) that have to be better organized or entirely rethought.
>
> We would really appreciate your (and your site) administrators opinion
> about this subject by just answering the following questions and/or
> providing us with any suggestion that you feel important. At the end
> of this mail I have dumped my personal bookmark collecting all sources
> of information that I personally use to gather (or to communicate)
> information about LHCb.
>
> Here my (non exhaustive) list of questions:
>
> 1.what are the first pages that you look at as first thing in your
> daily activity in order to understand what's happening?
>
First of all I read my mail, since most of the time I'm contacted personally by the production crew in case of problems with my site. This is a good practice; the weak point is that if I'm absent nobody will answer. For important and urgent issues LHCb submits GGUS tickets, which is a good practice since there will always be somebody who takes care of them. The practice of announcing the submission of GGUS tickets to lhcb-grid is also good, so it's not necessary to monitor the GGUS page. Then I look at the Dirac Monitoring Page and at the SAM dashboard http://dashb-lhcb-sam.cern.ch/dashboard/dashboard/request.py/latestresultssmry

> 2.what are the most relevant information that you are looking at out
> of there? Could you give us your perception on their importance?
>
From the Dirac Monitoring page:
- number of running jobs, compared with the expected LHCb activity. The latter is normally announced in the Operations ELOG, but I look at the production monitor in the same Dirac portal to cross-check.
- rate of failed jobs and related error messages. From these messages I try to distinguish whether the failure is site related or not. This point could be improved with more explicit messages.
- results of Dirac SAM jobs and related messages

From the SAM dashboard:
- results of SAM tests and related outputs if there are failures.

Both are very important pieces of information.

I also check some pages internal to my site, which give a complementary view of the general status of my site (the most important ones being our dCache portal and the Data Transfer Portal).

> 3. do you think that the information you have already available are
> enough to understand whether some problem is happening?
>
I think that the information is there, but it's too spread out. Besides this, it is important to cross-check different sources of information: for example, SAM tests can be OK while many user jobs fail.

> Is it enough to understand what the problem is about?
>
Yes, but only after an investigation that can be quite long and may need the intervention of experts.
>
> Is it enough to debug a problem?
>
For debugging, monitoring tools are not enough, but they are a good starting point. Then it's necessary to write some specific scripts. For example, quite often I use a mixture of Dirac commands and our local batch system commands.

> 4. what is missing in your opinion? What are the information that you
> see in other VOs specific monitoring tools (that you check routinely)
> and you would like to see implemented in LHCb too?
>
I don't know much about other VOs' monitoring tools. What is missing is more centralized information. The example given by Elisa about the job type is a good one. If I look at the Dirac Portal, I can always work out whether a job is a reconstruction job or of another type, but only by gathering information from different entry points (for example by looking at the production monitor in the hope that the name of the production is explicit enough, or by checking whether the Brunel step is present in the logging info of the job). I think that some more columns could be added to the Job Monitoring Page of Dirac, such as the job type and the CE the job has been submitted to. The possibility to sort jobs by any of the fields associated with a job would also be nice. Finally, it could be very useful to be able to build dynamic plots using the different fields (for example a scatter plot 'failed status vs Owner' or whatever) in order to find correlations and help the debugging.
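
As an illustration of the two requests (sorting by any field, and relating failures to another field such as the owner), here is a small sketch over job records held as dictionaries. The field names, owner names, CE names and status strings are hypothetical and do not reflect the actual columns of the Dirac Job Monitoring Page.

```python
from collections import Counter
from operator import itemgetter


def sort_jobs(jobs, field_name, reverse=False):
    """Return the jobs sorted by any field present in every record."""
    return sorted(jobs, key=itemgetter(field_name), reverse=reverse)


def failures_by(jobs, field_name):
    """Count failed jobs per value of field_name (e.g. owner, job_type, ce)."""
    return Counter(job[field_name] for job in jobs if job["status"] == "Failed")


# Hypothetical records, for illustration only.
jobs = [
    {"job_id": 3, "owner": "user_a", "job_type": "reconstruction", "ce": "ce01.example.org", "status": "Failed"},
    {"job_id": 1, "owner": "user_b", "job_type": "user", "ce": "ce02.example.org", "status": "Done"},
    {"job_id": 2, "owner": "user_a", "job_type": "user", "ce": "ce01.example.org", "status": "Failed"},
    {"job_id": 4, "owner": "user_b", "job_type": "reconstruction", "ce": "ce01.example.org", "status": "Failed"},
]

print([job["job_id"] for job in sort_jobs(jobs, "job_id")])
print(failures_by(jobs, "owner").most_common())
print(failures_by(jobs, "ce").most_common())
```

The per-field counts are the raw numbers that a 'failed status vs Owner' plot would display; a plotting library could be layered on top without changing the grouping.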

> 5. have you any special remarks on the content of the information? On
> the way are organized?
>
The content of the information is quite good, but I would like to have it more centralized.
>
> 6. Do you personally think worth to create a single entry point for
> LHCb that would allow you to drill down into activities/problems? Up
> to which level of details you would expect to browse information?
>

Yes I'd like to have a single entry point and to be able to get the same level of information that I can get now, by navigating in different pages.

> 7. Could you indicate me which link (out of the ones below) you
> check/have checked in the past. Which ones are checked regularly? How
> would you improve these ones? Why you do not use the rest of them?
> (just because not advertized anywhere)? Do you have any other link,
> bit of information that do not appear in my bookmark (may it well be I
> have simply omitted ;-))?
>
> Please take your time to get these questions answered. It is important
> for us your opinion. It would be very good if you could forward it to
> your site admins too.
>
> With our best regards,
>
> R.
>
> https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities
>
doesn't work
>
> http://lhcbweb.pic.es/DIRAC/info/general/diracOverview
>
yes daily
>
> http://goc.grid.sinica.edu.tw/gstat
>
no. It seems to me redundant with the SAM dashboard, correct me if I'm wrong.
>
> http://lbrunstatus.cern.ch/LHCb?destination=OFFLINE
>
not aware of it
>
> http://santinel.home.cern.ch/santinel/cgi-bin/srm_test
>
sometimes.
>
> http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens
>
in case of problems.
>
> http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview
>
sometimes
>
> https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports
>
yes
>
> http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
>
yes often.
>
> https://lcg-sam.cern.ch:8443/sam/sam.py
>
not so often; redundant with http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry, which looks more complete to me (historical view etc.)
>
> http://lemonweb.cern.ch/lemon-web/
>
not aware of it
>
> http://sls.cern.ch/sls/service.php?id=LHCb-Storage
>
yes in case of problems
>
> https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams
>
doesn't work
>
> http://sls.cern.ch/sls/service.php?id=ServicesForLHCb
>
yes in case of problems
>
> https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability
>
doesn't work
>
> https://j2eeps.cern.ch/service-lsfweb/login
>
not aware of it
>
> http://gridview001/GVPC/Excel/
>
doesn't work
>
> http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
>
no
>
> https://goc.gridops.org/
>
when I look for specific information
>
> http://sls.cern.ch/lrf-castor/index.php
>
no
>
> http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/
>
no
>
> http://wmsmon.cern.ch/monitoring/monitoring.html
>
in case I look for some specific information
>
> https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php
>
no, but I'll bookmark it
>
> http://cic.in2p3.fr/index.php?id=home&js_status=2
>
when I look for specific information
>
> https://lhcbweb.bo.infn.it/twiki/bin/view.cgi/LHCbBologna/MonitorCNAFtest#Frazione_Pilot_job_distribuiti_n
>
no
>
> http://servicemap.cern.ch/ccrc08/servicemap.html?vo=LHCb
>
sometimes

Thanks a lot for this useful work. Cheers, Luisa

Elisa Lanciotti - PIC

Hi Roberto, my answers inline:

On Thu, Dec 17, 2009 at 2:10 PM, Roberto Santinelli <Roberto.Santinelli@cern.ch> wrote:

Dear all, some days ago we sent (and actually already received a lot of input from) a survey to our contact persons and site managers at T1's for gathering an external point of view about monitoring and information available. Now, with the same aim of better organizing the information available and trying to identify whether something is still missing or can be better organized, we would like to propose to you - as shifters of our production crew - this list of questions targeted at you. Your answers will then be taken into account for compiling a requirements document that should drive the next steps of the activity, primarily born for limiting the proliferation of monitoring tools and information that might risk being pointless or even dangerous. We hope that, on top of a well defined list of requirements, we can provide not only input for a better organized monitoring of our resources but also for improving existing procedures and the overall computing operations in our team. Just a few minutes of your time (during these vacations) could give an invaluable contribution to all of us.

- As shifter, what is the first action that you do when you start your working day? In which order and which information you check?

- Where do you gather information and how do you use it? Do you think that the information in the DIRAC portal is enough or do you have feeling that something is still missing? If you feel so, what is missing? Is the organization of the information that you check regularly OK or do you think that, whenever there is still room for improvement in the way they can be exposed (for example compatibly with procedures or action-lines that you follow) ? Do you just look at the quality of the activity of resources via DIRAC or also checking external pages?

I check the following pages. To check the ongoing productions, I basically look at these 2 views in DIRAC:

1- Production monitor: https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/jobs/ProductionMonitor/display
2- registered production requests: https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb/production/ProductionRequest/display

To monitor the status of the runs:
3- http://lbrunstatus.cern.ch/
and in particular the runs sent to offline:
4- http://lbrunstatus.cern.ch/LHCb/?destination=OFFLINE

Then, other very useful views in Dirac:
5- the run DB monitor: https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/data/RunDBMonitor/display
the same view as above of the ongoing runs, but here you can select the partition name 'LHCb' from the left menu (but it seems not to work today..) so you visualize only the interesting runs.
6- RAW integrity DB monitor: https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/data/RAWIntegrityMonitor/display
very useful to see the status of the transfers from the pit to Castor.

Finally, I also check that the files that have been transferred are registered in the Bookkeeping, through the Bk GUI: dirac-bookkeeping-gui

Comments on these tools: I wish that the list of active productions could be obtained from the command line. Is it possible? The shifter guide (https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionShifterGuide#Grid_Shifter_Guide) says the command dirac-production-list-active should do the job, but I think there is some problem with it, as it returns 1903 active productions.

- Do you feel that an alarming system is important? Do you think that in the LHCb operations there are alarms enough?

I think that an alarming system is very useful, but only if it is very well tuned. The problem is that often they are not well tuned and they send too many mails. Personally, I prefer to actively check some monitoring system which displays the status of the applications it monitors.

- Do you check how services are running based on their criticality? Do you know which the critical services for LHCb operations are?

As far as I know they are the services displayed in Dashboard selecting 'Service types'= 'WLCG availability (FCR critical tests)' which gives this: http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalserviceavailability?mode=serviceavl&siteSelect3=500&sites=LCG.CERN.ch&algoId=5&serviceTypeSelect3=5&services=CE&services=SRMv2&timeRange=last24

Basically it's CE and SRMv2.

What do you think if, for critical services, we outsource to (and exploit) a 24X7 piquet service the monitor of basic metrics?

I think it is a good idea.

Do you usually check fabric level information (Ganglia/Lemon/SLS) of our voboxes or of our core services at CERN (CASTOR/WMS/LSF/AFS/Network/Oracle)? If not when do you check these information?

I check this information only when some problem arises, but not on a regular basis.

- How do you monitor transfers? Do you feel confident with the information available to identify and debug a xfer problem?

When I am on shift I only monitor transfers from the pit to Castor@CERN, and I use the links I reported above. I do not monitor other transfers on a regular basis. But for sure it is a good idea to do it; which view should I look at?

- Do you follow already well defined procedures? If yes where do you look at them? Which procedures you feel useless/less important? Which ones are crucial but you are afraid are still incomplete? Which ones are on the other hands clear and well exercised in your opinion?

I think it would be very useful to define a procedure to write down the report at the end of the shift.

- Do you feel that is missing a correlation between activities and status of the resources? Do you think there is already a view that collects all relevant information in one sight? For example do you check this page https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities?

I check this page: http://dashb-siteview.cern.ch/ and this page: http://dashb-lhcb-ssb.cern.ch/dashboard/request.py/siteview?view=Job Activities
I think that they are very useful, but still incomplete. For example data transfers are not reported.

This gives you - in one view and for all our sites - the quality of all the activities (and their intensity) the status of sites in our mask and in the GOCDB. How would you improve it? How do you decide that a site must be banned (either a T1 or a T2)?

This is a good point! I think a clear procedure should be defined to ban/unban sites. I know that other experiments (ATLAS for example) are making an effort to elaborate a strict policy to ban/unban sites; maybe it could be useful to do something similar for LHCb.

How do you otherwise realize that a site is banned and has (has not) to be reintegrated? How do you realize that a problem has been addressed? Solved?

During my last shift there was a problem with the SARA SE, but it was not me (the shifter) who took the decision to ban it; it was rather a common decision taken in the computing control room.

- Out of the list of links below: which ones you were aware of? Which ones do you think important to be added (and used more systematically) in our portal; i.e. which information can be exploited out of there? What external information do you think worth to be better integrated in your daily consumed pages (ex. GGUS, SAMDB, GOCDB, Dashboards, SLS,Lemon,Nagios)?

- Any other question/remark?

Here the list not exhaustive of information that I am checking regularly (beyond DIRAC portal)

http://sls.cern.ch/sls/service.php?id=ServicesForLHCb no, but I see that many of the services listed in this page have also SAM tests defined, so I guess that if some of these services fail also the corresponding SAM test should fail.

https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities yes, but I think this is the correct link: http://dashb-lhcb-ssb.cern.ch/dashboard/request.py/siteview?view=Job Activities as said before, very useful but not complete.

https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=spacetokens I can't open this page, sorry

http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry yes, but more often I check the historical view: http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalsmryview

http://lbcomet.cern.ch/static/RunStatus/lhcb.display.htm?type=status&system=LHCb yes, this is on one of the big screens

http://lblogbook.cern.ch/Operations/ I receive a notification of any new entry. Of course very useful.

http://goc.grid.sinica.edu.tw/gstat no.

http://lbrunstatus.cern.ch/LHCb?destination=OFFLINE yes

http://santinel.home.cern.ch/santinel/cgi-bin/srm_test yes

http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens I check it only when I receive an alarm

http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview no

https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports yes

https://lcg-sam.cern.ch:8443/sam/sam.py yes

http://lemonweb.cern.ch/lemon-web/ not often, only if I need information about a particular system

http://sls.cern.ch/sls/service.php?id=LHCb-Storage yes, sometimes

https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams it gives error: Couldn't find service PhysicsStreams.

https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability ERROR: Couldn't find service GeneralNetwork.

https://j2eeps.cern.ch/service-lsfweb/login no

http://gridview001/GVPC/Excel/ address not found

http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php no

https://goc.gridops.org/ not often, only if I am looking for some specific information.

http://sls.cern.ch/lrf-castor/index.php no

http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/ no

http://wmsmon.cern.ch/monitoring/monitoring.html no, unless there is a problem with a wms

https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php no

http://cic.in2p3.fr/index.php?id=home&js_status=2 no, unless I am looking for some specific information

https://lhcbweb.bo.infn.it/twiki/bin/view.cgi/LHCbBologna/MonitorCNAFtest#Frazione_Pilot_job_distribuiti_n no

http://servicemap.cern.ch/ccrc08/servicemap.html?vo=LHCb sometimes.

Thank you for your effort to collect all this information, I hope this feedback will help. cheers Elisa

John Kelly - RAL

Hi Roberto, Your mail to Raja has come to me via Gareth. You are certainly correct when you say 'response to the proliferation of (too many) monitoring tools (in place since years now) that have to be better organized or entirely rethought.' There are lots of monitoring tools for LCG, many of them of varying usefulness.

To answer your questions:

1.what are the first pages that you look at as first thing in your daily activity in order to understand what's happening?

Internal monitoring tools from RAL, in particular:

https://lcgwww.gridpp.rl.ac.uk/status/

Our internal ticketing system

Our ganglia plots http://ganglia.gridpp.rl.ac.uk/ganglia/

And our mimic system (not publicly available)

2.what are the most relevant information that you are looking at out of there? Could you give us your perception on their importance?

Anomalies affecting the RAL tier1. Some things are easy to spot, e.g. a failed disk server; other things are more difficult to spot, e.g. failures of FTS transfers for a specific VO from a particular service class.

3. do you think that the information you have already available are enough to understand whether some problem is happening? Is it enough to understand what the problem is about? Is it enough to debug a problem?

It is difficult to understand and debug problems on the LCG. It usually takes specialist knowledge. Monitoring tools only give an indication of what is happening.

4. what is missing in your opinion? What are the information that you see in other VOs specific monitoring tools (that you check routinely) and you would like to see implemented in LHCb too?

It is interesting to compare different experiments' dashboards. We have most experience with the ATLAS dashboard (http://arda-dashboard.cern.ch/atlas/). The equivalent LHCb dashboard is http://dashboard.cern.ch/lhcb/index.html

5. have you any special remarks on the content of the information? On the way are organized?

No

6. Do you personally think worth to create a single entry point for LHCb that would allow you to drill down into activities/problems? Up to which level of details you would expect to browse information?

A single entry point is a good idea, and so is the ability to drill down to 'per job' or 'per transfer' detail.

7. Could you indicate me which link (out of the ones below) you check/have checked in the past. Which ones are checked regularly? How would you improve these ones? Why you do not use the rest of them? (just because not advertized anywhere)? Do you have any other link, bit of information that do not appear in my bookmark (may it well be I have simply omitted ;-))?

I have attached your list with comments on each entry.

regards,

John Kelly RAL tier1

https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities Broken link 'dashb-lhcb-ssb could not be found'

http://lhcbweb.pic.es/DIRAC/info/general/diracOverview Not particularly relevant to RAL tier1

http://goc.grid.sinica.edu.tw/gstat Useful and used by tier1 at RAL

http://lbrunstatus.cern.ch/LHCb?destination=OFFLINE Not particularly relevant to RAL tier1

http://santinel.home.cern.ch/santinel/cgi-bin/srm_test Already used by tier1

http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens Not particularly useful to RAL tier1.

http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview Not particularly useful to RAL tier1.

https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports Useful to RAL tier1

http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry Useful to RAL tier1

https://lcg-sam.cern.ch:8443/sam/sam.py Already used by tier1

http://lemonweb.cern.ch/lemon-web/ Need a CERN account - so no use to me.

http://sls.cern.ch/sls/service.php?id=LHCb-Storage Need a CERN account - so no use to me.

https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams Need a CERN account - so no use to me.

http://sls.cern.ch/sls/service.php?id=ServicesForLHCb Need a CERN account - so no use to me.

https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability Need a CERN account - so no use to me.

https://j2eeps.cern.ch/service-lsfweb/login Need a CERN account - so no use to me.

http://gridview001/GVPC/Excel/ BROKEN link 'Unable to determine IP address from host name for www.gridview001.com'

http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php Already used by RAL tier1 - though not this particular view

https://goc.gridops.org/ Already used by RAL tier1

http://sls.cern.ch/lrf-castor/index.php Need a CERN account - so no use to me.

http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/ Useful to RAL tier1.

http://wmsmon.cern.ch/monitoring/monitoring.html CERN wms monitoring - limited use to RAL tier1

https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php INFN wms monitoring - limited use to RAL tier1

http://cic.in2p3.fr/index.php?id=home&js_status=2 Already used by RAL tier1

https://lhcbweb.bo.infn.it/twiki/bin/view.cgi/LHCbBologna/MonitorCNAFtest#Frazione_Pilot_job_distribuiti_n Not particularly relevant to RAL tier1

http://servicemap.cern.ch/ccrc08/servicemap.html?vo=LHCb Different view used by the RAL tier1

-- Main.RobertoSantinel - 19-Jan-2010

Topic revision: r1 - 2010-01-19 - unknown