Feedback gathered from contact persons and/or sysadmins
Paolo Franchini - CNAF
>
1.what are the first pages that you look at as first thing in your daily activity in order to understand whatʼs happening?
The very first task is reading the mail of the past night. Next, if there is no major problem, I check all the SAM tests and your daily report.
Basically I make sure that all the problems are under the control of the experts if I cannot take part directly in the solving process.
>
2.what are the most relevant information that you are looking at out of there? Could you give us your perception on their importance?
Also for me the rate of running/failed jobs is one of the most relevant hints, followed (at least in this period) by the SAM tests.
>
3. do you think that the information you have already available are enough to understand whether some problem is happening? Is it enough to understand what the problem is about? Is it enough to debug a problem?
Something for monitoring the data transfer would be useful, in order to know from another point of view the amount of activity that is running through 'my' tier1.
>
4. what is missing in your opinion? What are the information that you see in other VOs specific monitoring tools (that you check routinely) and you would like to see implemented in LHCb too?
>
>
5. have you any special remarks on the content of the information? On the way are organized?
>
>
6. Do you personally think worth to create a single entry point for LHCb that would allow you to drill down into activities/problems? Up to which level of details you would expect to browse information?
Yes, for sure: there is a lot of information spread across many sites, and every week I discover something new that should be checked, or I ask the experts something that I should have found in some monitor already available on the web.
>
7. Could you indicate me which link (out of the ones below) you check/have checked in the past. Which ones are checked regularly? How would you improve these ones? Why you do not use the rest of them? (just because not advertized anywhere)? Do you have any other link, bit of information that do not appear in my bookmark (may it well be I have simply omitted ;-))?
>
https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities
does not work even for me.
>
http://lhcbweb.pic.es/DIRAC/info/general/diracOverview
used daily.
>
http://goc.grid.sinica.edu.tw/gstat
never used.
>
http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens
never used that page, but I regularly check the same information.
>
http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview
very useful.
>
https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams
doesn't work
>
https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability
doesn't work
>
http://gridview001/GVPC/Excel/
doesn't work
>
https://goc.gridops.org/
used to check the downtimes.
>
http://sls.cern.ch/lrf-castor/index.php
I don't have an account.
Raja Nandakumar - RAL
1.what are the first pages that you look at as first thing in your daily activity in order to understand what’s happening?
LHCb mask :
http://lhcb-project-dirac.web.cern.ch/lhcb-project-dirac/lhcbProdnMask.html
Job monitoring :
https://lhcbweb.pic.es/DIRAC/LHCb-Production/lhcb_prod/jobs/JobMonitor/display
SAM tests :
http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
SLS tests :
https://sls.cern.ch/sls/service.php?id=T1VOBOX
WMS history : Dirac portal accounting
2.what are the most relevant information that you are looking at out of there? Could you give us your perception on their importance?
Whether RAL is within the LHCb mask. Number of running jobs. Rates of failures
3. do you think that the information you have already available are enough to understand whether some problem is happening? Is it enough to understand what the problem is about? Is it enough to debug a problem?
More-or-less yes. At least I know mostly where to look for issues.
4. what is missing in your opinion? What are the information that you see in other VOs specific monitoring tools (that you check routinely) and you would like to see implemented in LHCb too?
The transfer matrix
5. have you any special remarks on the content of the information? On the way are organized?
No
6. Do you personally think worth to create a single entry point for LHCb that would allow you to drill down into activities/problems? Up to which level of details you would expect to browse information?
Not really. In case of problems, I need to be able to look at the actual log files if possible.
7. Could you indicate me which link (out of the ones below) you check/have checked in the past. Which ones are checked regularly? How would you improve these ones? Why you do not use the rest of them? (just because not advertized anywhere)? Do you have any other link, bit of information that do not appear in my bookmark (may it well be I have simply omitted ;-))?
The main list of sites I use is given above. I do use all the links below at various times. One useful link I use a lot is the LHCb mask given above (http://lhcb-project-dirac.web.cern.ch/lhcb-project-dirac/lhcbProdnMask.html).
https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities
http://lhcbweb.pic.es/DIRAC/info/general/diracOverview
http://goc.grid.sinica.edu.tw/gstat
http://lbrunstatus.cern.ch/LHCb?destination=OFFLINE
http://santinel.home.cern.ch/santinel/cgi-bin/srm_test
http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens
http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview
https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports
http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
https://lcg-sam.cern.ch:8443/sam/sam.py
http://lemonweb.cern.ch/lemon-web/
http://sls.cern.ch/sls/service.php?id=LHCb-Storage
https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams
http://sls.cern.ch/sls/service.php?id=ServicesForLHCb
https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability
https://j2eeps.cern.ch/service-lsfweb/login
http://gridview001/GVPC/Excel/
http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
https://goc.gridops.org/
http://sls.cern.ch/lrf-castor/index.php
http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/
http://wmsmon.cern.ch/monitoring/monitoring.html
https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php
http://cic.in2p3.fr/index.php?id=home&js_status=2
https://lhcbweb.bo.infn.it/twiki/bin/view.cgi/LHCbBologna/MonitorCNAFtest#Frazione_Pilot_job_distribuiti_n
http://servicemap.cern.ch/ccrc08/servicemap.html?vo=LHCb
Alexey Zhelezov - GRIDKA
My answers are inline. Note that they are a bit biased (as usual).

I had no opportunity to discuss this at the GridKa Operations Meeting, so the following are my personal answers. I will try to get answers from the admins next year (if still relevant). In short, I do not think they check or care about LHCb-specific information other than GGUS and SAM.
>
1.what are the first pages that you look at as first thing in your daily activity in order to understand what’s happening?
GGUS - to get an idea of the already known problems with GridKa.
My 24h page - to check that GridKa is not failing more than other sites for the same type of jobs.
If I see some failed SAMs there: SAM results - to know what GridKa operators could see by themselves.
>
2.what are the most relevant information that you are looking at out of there? Could you give us your perception on their importance?
As contact person, I just check that the GridKa team accepts a problem as their own when LHCb thinks it is, and processes it in time.
>
3. do you think that the information you have already available are enough to understand whether some problem is happening? Is it enough to understand what the problem is about? Is it enough to debug a problem?
For me, the relative number of failed jobs is enough to understand site-specific problems with jobs.
>
>
4. what is missing in your opinion? What are the information that you see in other VOs specific monitoring tools (that you check routinely) and you would like to see implemented in LHCb too?
Problems with other services (FTS, LFC, WMS, etc.) are normally checked by dedicated LHCb experts. I do not know of a page where I can see all site-related services in one place, so it would be nice to have one. The CMS "happy face" looks reasonably simple: it can be checked in 5 seconds.
>
5. have you any special remarks on the content of the information? On the way are organized?
For the d-cache and tape situation I have to browse the GridKa web site. This information is not published anywhere else, although it helps to understand nearly all SE problems.
>
6. Do you personally think worth to create a single entry point for LHCb that would allow you to drill down into activities/problems? Up to which level of details you would expect to browse information?
For me, it will be nice to have one page with:
1) DIRAC job statistics for the site (also normalized by job type)
2) All current service failures, if any
3) All current SAM failures, if any
4) Planned downtimes for today
5) GGUS tickets that are not closed
6) Current situation with the SE (in the GridKa case, the number of used d-cache movers per LHCb pool and the tape transfer rates, LHCb and total).
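The six items above can be pictured as one record per site. The following is only a minimal sketch under stated assumptions: the class `SiteSummary`, its field names and the `needs_attention` rule are hypothetical and do not correspond to any existing DIRAC, SAM or GGUS interface; how the fields would be filled is left open.

```python
from dataclasses import dataclass, field


@dataclass
class SiteSummary:
    """Hypothetical one-page summary for a site (names are illustrative only)."""
    site: str
    # 1) job statistics: job type -> (successful jobs, failed jobs)
    jobs_by_type: dict = field(default_factory=dict)
    # 2) current service failures (e.g. FTS, LFC, WMS)
    service_failures: list = field(default_factory=list)
    # 3) current SAM failures
    sam_failures: list = field(default_factory=list)
    # 4) downtimes planned for today
    downtimes_today: list = field(default_factory=list)
    # 5) GGUS tickets that are not closed
    open_ggus_tickets: list = field(default_factory=list)
    # 6) storage element figures (e.g. movers per pool, tape rates)
    se_status: dict = field(default_factory=dict)

    def failure_rate_by_type(self):
        """Return the fraction of failed jobs per job type (0.0 if no jobs)."""
        rates = {}
        for job_type, (done, failed) in self.jobs_by_type.items():
            total = done + failed
            rates[job_type] = failed / total if total else 0.0
        return rates

    def needs_attention(self):
        """True if any failure list or the open-ticket list is non-empty."""
        return bool(self.service_failures or self.sam_failures
                    or self.open_ggus_tickets)
```

Normalizing by job type, as in `failure_rate_by_type`, keeps a site running mostly one kind of job from looking worse or better than it is.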
>
7. Could you indicate me which link (out of the ones below) you check/have checked in the past. Which ones are checked regularly? How would you improve these ones? Why you do not use the rest of them? (just because not advertized anywhere)? Do you have any other link, bit of information that do not appear in my bookmark (may it well be I have simply omitted ;-))?
I use various links from your page when I have some specific question.
The only place I visit regularly is the DIRAC Web portal (since I am also a DIRAC developer).
Jeff Templon - NL-T1
Yo Roberto,
I am answering personally, I put the rest of the team in CC because they might want to see your bookmarks or add to what I said.
On 11 Dec 2009, at 15:42, Roberto Santinelli wrote:
>
Here my (non exhaustive) list of questions:
>
1.what are the first pages that you look at as first thing in your daily activity in order to understand what’s happening?
http://www.nikhef.nl/grid/stats/ndpf-prd/voview-short
http://spade.nikhef.nl/nagios/
http://dashb-alice-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=100&sites=NIKHEF-ELPROD&sites=SARA-MATRIX&algoId=6&timeRange=last48
http://dashb-atlas-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=400&sites=NIKHEF-ELPROD&sites=SARA-MATRIX&algoId=21&timeRange=last48
http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=501&sites=LCG.CERN.ch&sites=LCG.CNAF.it&sites=LCG.GRIDKA.de&sites=LCG.IN2P3.fr&sites=LCG.NIKHEF.nl&sites=LCG.PIC.es&sites=LCG.RAL.uk&algoId=82&timeRange=last48
the first one, to see how "busy" the site is (where are the LHCb jobs BTW?)
the second one, because there are many tests of the system listed there, and any problems with our own probes, or with SAM tests, can be seen here
also the three dashboard plots, to see what the experiments think of how we are doing.
There are other ones I look at if I see something hinky on one of these.
>
2.what are the most relevant information that you are looking at out of there? Could you give us your perception on their importance?
I think I already explained it above?
>
3. do you think that the information you have already available are enough to understand whether some problem is happening? Is it enough to understand what the problem is about? Is it enough to debug a problem?
In most cases it is ok. One notable exception is the WMS, there it is very hard to know if there is a problem or not.
>
4. what is missing in your opinion? What are the information that you see in other VOs specific monitoring tools (that you check routinely) and you would like to see implemented in LHCb too?
LHCb is usually ok in this respect, your SAM tests are pretty representative of the real use of the grid.
>
5. have you any special remarks on the content of the information? On the way are organized?
I haven't looked in a while, but I remember that in the past it was difficult to figure out the data access test results ... it was not clear which of the "failures" were normal.
>
6. Do you personally think worth to create a single entry point for LHCb that would allow you to drill down into activities/problems? Up to which level of details you would expect to browse information?
It is OK now, except that from the dashboard, drilling down sometimes can take a VERY long time. This should be fixed, but I think this is a dashboard problem not an lhcb problem.
>
http://goc.grid.sinica.edu.tw/gstat
this one we check sometimes; I think there are also some Nagios probes that trigger on info from here, but I am not sure.
>
http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
see above, I do look at the LHCb dashboard but not exactly this link.
>
https://lcg-sam.cern.ch:8443/sam/sam.py
yes.
>
http://gridview001/GVPC/Excel/
is this a CERN link?
>
https://goc.gridops.org/
whenever needed we look here, but it is for me not often.
>
http://wmsmon.cern.ch/monitoring/monitoring.html
we have our own version of this running here.
>
http://cic.in2p3.fr/index.php?id=home&js_status=2
the CIC portal? Sometimes when necessary but it is not often.
ps : I cut many of the links, I will fwd the entire original mail to the rest of the admin crew, and to SARA.
Luisa Arrabito - IN2P3
Hi Roberto,
Here my answers.
Roberto Santinelli wrote:
>
Dear LHCb contact persons at T1’s,
>
this survey is aimed to collect as much information and requirements as possible in order to provide you, your site admins and our production crew with a coherent and consistent interface for both activities and services/resources monitoring.
>
This initiative must be considered as the response to the proliferation of (too many) monitoring tools (in place since years now) that have to be better organized or entirely rethought.
>
We would really appreciate your (and your site) administrators opinion about this subject by just answering the following questions and/or providing us with any suggestion that you feel important. At the end of this mail I have dumped my personal bookmark collecting all sources of information that I personally use to gather (or to communicate) information about LHCb.
>
Here my (non exhaustive) list of questions:
>
>
1.what are the first pages that you look at as first thing in your daily activity in order to understand what’s happening?
>
First of all I read my mails, since most of the time I'm contacted personally by the production crew in case of problems with my site. This is a good practice; the weak point is that if I'm absent nobody will answer. For important and urgent issues LHCb submits GGUS tickets, which is a good practice since there will always be somebody who will take care of them. The practice of announcing the submission of GGUS tickets to lhcb-grid is also good, so it's not necessary to monitor the GGUS page.
Then I look at the Dirac Monitoring Page and at the SAM dashboard
http://dashb-lhcb-sam.cern.ch/dashboard/dashboard/request.py/latestresultssmry
>
2.what are the most relevant information that you are looking at out of there? Could you give us your perception on their importance?
>
From Dirac Monitoring page :
- number of running jobs, compared with the expected LHCb activity. The latter is normally announced in the Operations ELOG, but I look at the production monitor in the same Dirac portal to cross-check.
- rate of failed jobs and related error messages. From these messages I try to distinguish whether the failure is site related or not. This point could be improved with more explicit messages.
- results of Dirac SAM jobs and related messages
From the SAM dashboard :
- results of SAM tests and related outputs if there are failures.
Both are very important pieces of information.
I also check some pages internal to my site, which give a complementary view of the general status of my site (the most important ones being our dCache portal and the Data Transfer Portal).
>
3. do you think that the information you have already available are enough to understand whether some problem is happening?
>
I think that the information is there, but it's too spread out. Besides this, it is important to cross-check different sources of information: for example, SAM tests can be OK while many user jobs fail.
>
Is it enough to understand what the problem is about?
>
Yes, but only after an investigation that can be quite long and may need the intervention of experts.
>
>
Is it enough to debug a problem?
>
For debugging, monitoring tools are not enough, but they are a good starting point. Then it's necessary to write some specific scripts. For example, quite often I use a mixture of Dirac commands and our local batch system commands.
>
4. what is missing in your opinion? What are the information that you see in other VOs specific monitoring tools (that you check routinely) and you would like to see implemented in LHCb too?
>
I don't know much about other VO monitoring tools.
What is missing is more centralized information. The example given by Elisa about the job type is a good one. If I look at the Dirac Portal, I can always work out whether a job is a reconstruction job or of another type, but this is done by gathering information from different entry points (for example by looking at the production monitor in the hope that the name of the production is explicit enough, or checking whether the Brunel step is present in the logging info of the job).
I think that some more columns could be added to the Job Monitoring Page of Dirac, like the Job Type and the CE the job has been submitted to.
Then the possibility to sort jobs by any of the fields associated with a job would be nice.
Finally, it could be very useful to be able to build dynamic plots using the different fields (for example a scatter plot 'failed status vs Owner' or whatever) in order to find correlations and help the debugging.
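As an illustration of the kind of correlation such plots would expose, here is a minimal sketch that computes the fraction of failed jobs per value of an arbitrary job field. It assumes each job is a plain dictionary with a `status` key whose failed value is the string "Failed"; these record shapes and the function name `failure_fraction_by` are hypothetical and are not taken from the Dirac portal.

```python
from collections import Counter


def failure_fraction_by(jobs, field_name):
    """Return {field value: fraction of jobs with status 'Failed'}.

    jobs       -- iterable of dicts, each with 'status' and field_name keys
    field_name -- the job field to group by (e.g. 'owner', 'job_type', 'ce')
    """
    total = Counter()
    failed = Counter()
    for job in jobs:
        key = job[field_name]
        total[key] += 1
        if job["status"] == "Failed":
            failed[key] += 1
    # Every key in total has at least one job, so no division by zero.
    return {key: failed[key] / total[key] for key in total}


# Example with made-up records: group the same jobs by owner or by CE.
example_jobs = [
    {"owner": "alice", "ce": "ce01", "status": "Failed"},
    {"owner": "alice", "ce": "ce02", "status": "Done"},
    {"owner": "bob", "ce": "ce01", "status": "Done"},
]
by_owner = failure_fraction_by(example_jobs, "owner")
by_ce = failure_fraction_by(example_jobs, "ce")
```

Grouping the same job list by different fields in turn is a cheap way to see whether failures follow an owner, a job type or a CE.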
>
5. have you any special remarks on the content of the information? On the way are organized?
>
The content of the information is quite good, but I would like to have it more centralized.
>
>
6. Do you personally think worth to create a single entry point for LHCb that would allow you to drill down into activities/problems? Up to which level of details you would expect to browse information?
>
Yes I'd like to have a single entry point and to be able to get the same level of information that I can get now, by navigating in different pages.
>
7. Could you indicate me which link (out of the ones below) you check/have checked in the past. Which ones are checked regularly? How would you improve these ones? Why you do not use the rest of them? (just because not advertized anywhere)? Do you have any other link, bit of information that do not appear in my bookmark (may it well be I have simply omitted ;-))?
>
>
Please take your time to get these questions answered. It is important for us your opinion. It would be very good if you could forward it to your site admins too.
>
With our best regards,
>
R.
>
>
https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities
>
doesn't work
>
>
http://lhcbweb.pic.es/DIRAC/info/general/diracOverview
>
yes daily
>
>
http://goc.grid.sinica.edu.tw/gstat
>
no. It seems to me redundant with the SAM dashboard, correct me if I'm wrong.
>
>
http://lbrunstatus.cern.ch/LHCb?destination=OFFLINE
>
not aware of
>
>
http://santinel.home.cern.ch/santinel/cgi-bin/srm_test
>
sometimes.
>
>
http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens
>
in case of problems.
>
>
http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview
>
sometimes
>
>
https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports
>
yes
>
>
http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
>
yes often.
>
>
https://lcg-sam.cern.ch:8443/sam/sam.py
>
not so often; redundant with http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry, which looks more complete to me (historical view etc.)
>
>
http://lemonweb.cern.ch/lemon-web/
>
not aware of
>
>
http://sls.cern.ch/sls/service.php?id=LHCb-Storage
>
yes in case of problems
>
>
https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams
>
doesn't work
>
>
http://sls.cern.ch/sls/service.php?id=ServicesForLHCb
>
yes in case of problems
>
>
https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability
>
doesn't work
>
>
https://j2eeps.cern.ch/service-lsfweb/login
>
not aware of
>
>
http://gridview001/GVPC/Excel/
>
doesn't work
>
>
http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
>
no
>
>
https://goc.gridops.org/
>
when I look for specific information
>
>
http://sls.cern.ch/lrf-castor/index.php
>
no
>
>
http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/
>
no
>
>
http://wmsmon.cern.ch/monitoring/monitoring.html
>
when I look for some specific information
>
>
https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php
>
no, but I'll bookmark it
>
>
http://cic.in2p3.fr/index.php?id=home&js_status=2
>
when I look for specific information
>
>
https://lhcbweb.bo.infn.it/twiki/bin/view.cgi/LHCbBologna/MonitorCNAFtest#Frazione_Pilot_job_distribuiti_n
>
no
>
>
http://servicemap.cern.ch/ccrc08/servicemap.html?vo=LHCb
>
sometimes
Thanks a lot for this useful work.
Cheers,
Luisa
Elisa Lanciotti - PIC
1.what are the first pages that you look at as first thing in your daily activity in order to understand what’s happening?
the first page I visit to check the status of my site is Siteview:
http://dashb-siteview.cern.ch/#site=pic
In this page I can quickly see if there are problems with LHCb, and also other experiments.
If there are problems (some red status), the gridmap provides link to the source of the information, so you can investigate further.
Good point of Siteview: all the information available in one view.
Negative point: it is not complete. Why? Because the 'general site status', represented in the upper rectangle, is not really complete, as it only takes into account SAM tests defined as critical for LHCb (http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalserviceavailability?mode=serviceavl&algoId=82&sites=LCG.PIC.es&services=all&timeRange=last24)
this is necessary for LHCb to be happy with the site, but it is not enough! and sometimes it happens that the site status is declared 'green' in Siteview, but then the site is banned from the production mask..
What you propose seems to me like a 'virtual control room', which would be extremely useful for different users: site admins, shifters, and all people involved in the computing operations of LHCb. It would be the official and complete list of tests that a site has to pass in order to have LHCb jobs running happily. I like the idea. Then, this URL could be linked from Siteview, instead of SAM tests.
2.what are the most relevant information that you are looking at out of there? Could you give us your perception on their importance?
number of running jobs, and rate of failures. And the site status (even if, as said before, it is not a complete definition of the site status..).
3. do you think that the information you have already available are enough to understand whether some problem is happening? Is it enough to understand what the problem is about? Is it enough to debug a problem?
the information is not complete:
-the site status (as said before) has to be complemented with more bits of information
-the data transfers are missing in the Siteview gridmap.
-the job type also is missing. It just says the number of jobs, but I would like to know if they are reconstruction, user analysis, MC prod..
4. what is missing in your opinion? What are the information that you see in other VOs specific monitoring tools (that you check routinely) and you would like to see implemented in LHCb too?
what I said above.
5. have you any special remarks on the content of the information? On the way are organized?
6. Do you personally think worth to create a single entry point for LHCb that would allow you to drill down into activities/problems? Up to which level of details you would expect to browse information?
yes, as I said before, I think it would save me time. I would like to have only one URL to check, which collects all the information about LHCb and sets the status of my site. If it's green, fine. If it's not green, then starting from there it should give me a way to drill down into the problem in the most efficient way.
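A minimal sketch of such a roll-up, assuming each check is reported as a pass/fail flag plus a link to its source. The function name `site_status`, the check names and the URLs in the example are hypothetical; the point is only that the overall status is green when every check passes, and that a non-green status carries the links needed to drill down.

```python
def site_status(checks):
    """Combine individual checks into one site status.

    checks -- dict mapping a check name to (ok, source_url)
    Returns (colour, failing) where colour is 'green' or 'red' and
    failing is a list of (check name, source_url) to drill down into.
    """
    failing = [(name, url)
               for name, (ok, url) in sorted(checks.items())
               if not ok]
    colour = "green" if not failing else "red"
    return colour, failing


# Example with made-up checks: critical SAM tests pass, but the site
# is not in the production mask, so the overall status is not green.
example_checks = {
    "sam_critical_tests": (True, "https://example.org/sam"),
    "production_mask": (False, "https://example.org/mask"),
}
colour, failing = site_status(example_checks)
```

With this rule a site cannot show green while being banned from the production mask, as long as the mask is one of the checks fed in.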
https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities
sorry, can't open it..
http://lhcbweb.pic.es/DIRAC/info/general/diracOverview
of course I open Dirac pages very often. Especially job monitor.
http://goc.grid.sinica.edu.tw/gstat
not really..
http://lbrunstatus.cern.ch/LHCb?destination=OFFLINE
yes.
http://santinel.home.cern.ch/santinel/cgi-bin/srm_test
yes.
http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens
yes, and I think it is of crucial importance to keep it updated.
http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview
no, at least not so far. What does a rank = -1 mean?
https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports
yes, daily.
http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
not so far, but I will include it in my bookmarks.
https://lcg-sam.cern.ch:8443/sam/sam.py
when I get an alarm from Nagios, I always cross check here.
http://lemonweb.cern.ch/lemon-web/
no. I look at a similar tool we have here to monitor our systems at PIC.
http://sls.cern.ch/sls/service.php?id=LHCb-Storage
only if I get an alarm message about some token running out of space.
https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams
it gives ERROR: Couldn't find service PhysicsStreams
http://sls.cern.ch/sls/service.php?id=ServicesForLHCb
not so far, but I will bookmark this url. thanks.
https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability
says: ERROR: Couldn't find service GeneralNetwork.
https://j2eeps.cern.ch/service-lsfweb/login
no.
http://gridview001/GVPC/Excel/
says address not found.
http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
no.
https://goc.gridops.org/
not usually, only if I need some particular information.
http://sls.cern.ch/lrf-castor/index.php
no.
http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/
no.
http://wmsmon.cern.ch/monitoring/monitoring.html
not often, only if there is some problem which points to the WMS..
https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php
not so far, but I think I will bookmark it (one more!)
http://cic.in2p3.fr/index.php?id=home&js_status=2
not often
https://lhcbweb.bo.infn.it/twiki/bin/view.cgi/LHCbBologna/MonitorCNAFtest#Frazione_Pilot_job_distribuiti_n
no.
http://servicemap.cern.ch/ccrc08/servicemap.html?vo=LHCb
not often.
John Kelly - RAL
Hi Roberto,
Your mail to Raja has come to me via Gareth.
You are certainly correct when you say 'response to the proliferation of
(too many) monitoring tools (in place since years now) that have to be
better organized or entirely rethought.'
There are lots of monitoring tools for LCG, many of them of varying usefulness.
To answer your questions:
1.what are the first pages that you look at as first thing in your daily
activity in order to understand what's happening?
Internal monitoring tools from RAL.
in particular
https://lcgwww.gridpp.rl.ac.uk/status/
Our internal ticketing system
Our ganglia plots
http://ganglia.gridpp.rl.ac.uk/ganglia/
And our mimic system (not publicly available)
2.what are the most relevant information that you are looking at out of
there? Could you give us your perception on their importance?
Anomalies affecting the RAL tier1. Some things are easy to spot, eg a failed
disk server, other things are more difficult to spot eg failures of FTS
transfers for a specific VO from a particular service class.
3. do you think that the information you have already available are enough
to understand whether some problem is happening? Is it enough to understand
what the problem is about? Is it enough to debug a problem?
It is difficult to understand and debug problems on the LCG. It usually
takes specialist knowledge. Monitoring tools only give an indication of what
is happening.
4. what is missing in your opinion? What are the information that you see in
other VOs specific monitoring tools (that you check routinely) and you would
like to see implemented in LHCb too?
It is interesting to compare different experiments' dashboards. We have most
experience with the atlas dashboard at
http://arda-dashboard.cern.ch/atlas/
the equivalent LHCb dashboard is:
http://dashboard.cern.ch/lhcb/index.html
5. have you any special remarks on the content of the information? On the
way are organized?
No
6. Do you personally think worth to create a single entry point for LHCb
that would allow you to drill down into activities/problems? Up to which
level of details you would expect to browse information?
A single entry point is a good idea. and the ability to drill down to 'per
job' or 'per transfer' detail is a good idea.
7. Could you indicate me which link (out of the ones below) you check/have
checked in the past. Which ones are checked regularly? How would you
improve these ones? Why you do not use the rest of them? (just because not
advertized anywhere)? Do you have any other link, bit of information that do
not appear in my bookmark (may it well be I have simply omitted ;-))?
I have attached your list with comments on each entry.
regards,
John Kelly
RAL tier1
https://dashb-lhcb-ssb/dashboard/request.py/siteview?view=Job%20Activities
Broken link 'dashb-lhcb-ssb could not be found'
http://lhcbweb.pic.es/DIRAC/info/general/diracOverview
Not particularly relevant to RAL tier1
http://goc.grid.sinica.edu.tw/gstat
Useful and used by tier1 at RAL
http://lbrunstatus.cern.ch/LHCb?destination=OFFLINE
Not particularly relevant to RAL tier1
http://santinel.home.cern.ch/santinel/cgi-bin/srm_test
Already used by tier1
http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens
Not particularly useful to RAL tier1.
http://santinel.home.cern.ch/santinel/cgi-bin/lcg-voview
Not particularly useful to RAL tier1.
https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports
Useful to RAL tier1
http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
Useful to RAL tier1
https://lcg-sam.cern.ch:8443/sam/sam.py
Already used by tier1
http://lemonweb.cern.ch/lemon-web/
Need a CERN account - so no use to me.
http://sls.cern.ch/sls/service.php?id=LHCb-Storage
Need a CERN account - so no use to me.
https://lemonweb.cern.ch/sls/service.php?id=PhysicsStreams
Need a CERN account - so no use to me.
http://sls.cern.ch/sls/service.php?id=ServicesForLHCb
Need a CERN account - so no use to me.
https://lemonweb.cern.ch/sls/history.php?id=GeneralNetwork&more=availability
Need a CERN account - so no use to me.
https://j2eeps.cern.ch/service-lsfweb/login
Need a CERN account - so no use to me.
http://gridview001/GVPC/Excel/
BROKEN link 'Unable to determine IP address from host name for
www.gridview001.com'
http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
Already used by RAL tier1 - though not this particular view
https://goc.gridops.org/
Already used by RAL tier1
http://sls.cern.ch/lrf-castor/index.php
Need a CERN account - so no use to me.
http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/
Useful to RAL tier1.
http://wmsmon.cern.ch/monitoring/monitoring.html
CERN wms monitoring - limited use to RAL tier1
https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php
INFN wms monitoring - limited use to RAL tier1
http://cic.in2p3.fr/index.php?id=home&js_status=2
Already used by RAL tier1
https://lhcbweb.bo.infn.it/twiki/bin/view.cgi/LHCbBologna/MonitorCNAFtest#Frazione_Pilot_job_distribuiti_n
Not particularly relevant to RAL tier1
http://servicemap.cern.ch/ccrc08/servicemap.html?vo=LHCb
Different view used by the RAL tier1
-- Main.RobertoSantinel - 19-Jan-2010