LCG Management Board |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Date/Time |
Tuesday 24 November 2009 16:00-17:00 – Phone Meeting
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Agenda
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Members |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
(Version 1 – 28.11.2009) |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Participants |
A.Aimar (notes), D.Barberis, J.-Ph.Baud, I.Bird (chair), M.Bouwhuis, D.Britton,
T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, Qin Gang, J.Gordon,
A.Heiss, F.Hernandez, M.Kasemann, M.Lamanna, P.Mato, G.Merino, A.Pace,
M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Invited |
J.Andreeva, J.Casey, D.Kelsey, W.Salter |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Action List |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Mailing List Archive |
https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/ |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Next Meeting |
Tuesday 1 December 2009 16:00-18:00 – F2F Meeting |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1. Minutes and Matters arising (Minutes)
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1.1 Minutes of Previous Meeting
The minutes of the previous meeting were approved. Some comments were received about differences between the GridView report and the dashboard.
GridView and Dashboard
Ph.Charpentier suggested using the dashboard report, instead of GridView, also for the monthly report.
A.Aimar replied that the MB had decided to use the GridView report because it includes only tests of Site reliability. The dashboards, instead, may also contain tests of the VO frameworks or other software issues which are not the Site’s fault. If the MB decides to move to the Dashboard it can be done.
Ph.Charpentier added that in some cases the SAM tests are not updated and cause other problems.
J.Casey added that if results are incorrect there is a procedure to correct the data.
G.Merino noted that the VOs have many dashboards (FCR, Ganga, etc. for ATLAS) and asked which one is meant.
J.Andreeva replied that the VO chooses which of the dashboards to use every time.
I.Bird proposed not to change the reliability report for the moment and to stay with the GridView reports.
IN2P3 Email about Reliability Calculations
There seem to be problems with the distinction between sites that are down and sites that are in maintenance.
J.Casey replied that he will look into the issue and report to the MB.
Usage of Fair Algorithm
J.Casey added that the fair availability figures will be the same in the report and on the web pages by the end of the month. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2. Action List Review (List of actions)
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics
Done. All Sites are now providing metrics, in some cases incomplete. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3. LCG Operations Weekly Report (Slides) – J.-Ph.Baud
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
3.1 Summary
This summary covers the period from 9th November to 22nd November. There were a few incidents leading to SIR reports:
- Kernel upgrade at CNAF took too long and the CE and SE were not available for 4 days.
- CASTOR SRM problems at CERN
- GGUS: impossible to send ALARM tickets last Friday
Meeting Attendance
GGUS Summary
The number of tickets is increasing.
3.2 SIRs and VO Availability
SARA Space Tokens
There was only one alarm ticket, from LHCb to SARA, because of lack of disk space.
M.Bouwhuis added that SARA had to increase the disk token space.
D.Barberis added that an alarm from ATLAS was submitted on the same issue but was not counted.
Ph.Charpentier added that LHCb had posted an alarm ticket on AFS, but after the 22nd, i.e. outside the period covered by this report.
GGUS Problem
With the LHC startup it is now urgent to make sure that the GGUS infrastructure ensures 24/7 availability. See https://savannah.cern.ch/support/?101122
SAM Availability (slide 7)
In slide 7 one can notice that:
- ASGC for ATLAS
Qin Gang added that the SAM CE tests are failing because the Site is overloaded by ATLAS with one million jobs. The same happened for CMS in STEP09.
Ph.Charpentier noted that this should not be a problem for a Site if it is properly configured. Sites should not use the large number of jobs as an explanation for failures.
- CNAF for CMS
L.Dell’Agnello agreed that the intervention was not well planned and announced.
J.-Ph.Baud reminded that Sites must announce interventions following the right procedure via GOCDB and well in advance.
SRM Problems at CERN
Sometimes the wrong status is returned by CASTOR and the FTS transfers fail. There is also a problem of thread exhaustion, where operations take more than 30 minutes, and there are core dumps. For example on 20th November there were problems exporting from CASTORATLAS to Tier1s. The post-mortem is here: https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20Nov09
Miscellaneous Problems
There was a series of problems in this period:
- CMS MC production issues with WMS at CNAF due to a Condor bug (fix successfully tested)
- SRM instabilities at BNL (problem now understood)
- dCache golden release deployed at SARA, IN2P3 and KIT. Some configurations have changed.
M.Ernst added that there is a race condition in the code; the golden release was affected and had to be fixed. One Java library was not thread safe and all Sites in the same situation should upgrade.
- Dashboard DB migration at CERN. Working well for ATLAS. Low performance after migration for CMS (now understood and fixed; it was related to the Oracle DBs)
- AFS problems at CERN: is the service critical for experiments and what is the coverage?
I.Bird replied that T.Cass will present a proposal later.
- Fibre cut to PIC (between Barcelona and Madrid)
- Incorrectly labelled cartridges at ASGC
- CREAM CEs for ALICE at CERN have been unstable for more than a week.
- NIKHEF: compute capacity increased, network infrastructure upgraded to 160 Gbps
- PPS has been replaced by staged rollouts for middleware
M.Schulz added that they need the support of the Experiments to find Sites to try out the new versions. If Sites and Experiments are interested in having new versions, they should have an interest in deploying and testing them. This model works very well for FTS, for instance.
Kernel Upgrades
J.Gordon asked whether the kernel upgrade concerns only the Tier-1 or also the Tier-2 Sites. The Tier-2 Sites in EGEE are followed by the EGEE Security and the EGEE ROC managers.
D.Kelsey added that a review at the next GDB would be useful. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4. Pilot Jobs Policy (GDB Document; Slides) – I.Bird
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I.Bird summarized the situation on pilot jobs and policies.
4.1 Policies
Two years ago the WLCG MB agreed this: “WLCG sites must allow job submission by the LHC VOs using pilot jobs that submit work on behalf of other users. It is mandatory to change the job identity to that of the real user to avoid the security exposure of a job submitted by one user running under the credentials of the pilot job user.”
But there were pre-requisites for the policy to come into force:
- Security review of glExec: done
- Review of pilot frameworks: done
- glExec tested with all needed batch systems: done
- SCAS completed, certified, deployed: not deployed.
NB. One item was forgotten in this list: glExec deployed: not done
There is a JSPG policy on multi-user pilot jobs (MUPJ): https://edms.cern.ch/file/855383/2/PilotJobsPolicy-v1.0.pdf
It was approved in Aug and Sep 2008 by WLCG and EGEE respectively. It states that the VO must obtain Site approval before submitting MUPJ (i.e. optional from the site point of view). It also states that the pilot framework must meet the requirements of fine-grained traceability, e.g. via identity switching.
4.2 Deployment (on SL5)
This is taking too long. glExec has been available on SL4 for some time, but we need SL5. Several glexec-SL5 patches failed in certification and had to be sent back to the developers, but now glExec/SL5 is certified. SCAS on SL5 has been a long time coming; it has now passed certification and roll-out started very recently.
J.Gordon
noted that this was never a real show stopper as only the WN must be on SL5.
I.Bird replied that this is true but it has often been used as an excuse. While sites have been encouraged to deploy glExec/SCAS they have also been urged to move to SL5, and so the lack of deployment of glExec/SCAS is understandable.
Ph.Charpentier noted that Experiments cannot commission the Sites one by one; it is too late now. Sites are not configured in the same way and need to be verified one by one.
I.Fisk added that also for CMS it is too complex to discover which Site has glExec installed, and therefore it is switched off in CMS.
D.Barberis confirmed that the situation is the same for ATLAS. They tested the current version but are not using it. A new version will have to be tested but must be installed everywhere by early 2010. A large majority of Sites must have it installed.
M.Schulz added that Sites should decide whether they want to install it.
4.3 Current Situation
Sites have not been able to deploy glExec/SCAS until now to support the implementation of the policy. It is not their fault, but is now the time to add such critical services?
Experiments need to run MUPJ to support their analysis, but adaptations to glExec/SCAS have not yet been fully implemented and tested. Again, is now the time to make such changes in essential software?
In summary:
- The experiments need to process and analyse data in the ways that they have been preparing.
- The sites need to be able to manage their security.
4.4 Proposal from I.Bird
In the absence of any deployable fine-grained authentication mechanism that the experiments can make use of now, I.Bird proposed to establish a policy that:
- Permits the Experiments to run pilot jobs that run workloads of other members of the VO
- Agrees that the owner of the pilot job – or the VO itself – is responsible for all work run by that pilot job, i.e. in case of problems the entire VO may be banned at a site.
- The VO framework should provide the fine-grained traceability
- This policy must be reviewed in “xx” (a few) months, or earlier if there are operational issues
- During this time the existing JSPG policy would need to be suspended (?)
WLCG must continue to push the deployment of glExec/SCAS as rapidly as possible to be in a situation to implement the agreed policy at the earliest opportunity.
D.Kelsey
commented that the MB maintains the trust between VOs and Sites. Suspending the agreement can be a problem in the future and should be done carefully. In principle nobody should run MUPJ if Sites have not agreed.
D.Barberis replied that Sites know very well that MUPJ are run by the VOs.
I.Bird added that Sites had agreed to accept MUPJ.
D.Kelsey replied that not all requirements were respected.
J.Gordon added that Sites have not given permission explicitly.
I.Bird asked how Sites are going to be told that they should allow MUPJ for some months.
Ph.Charpentier added that if “role=pilot” is accepted then the Site accepts the pilot jobs. And if the policy is not implemented after 3 years it should be re-discussed. VOs can provide the traceability of the jobs.
D.Kelsey noted that if payload and framework run under the same user, the payload can compromise the framework.
Ph.Charpentier noted that this possible danger has never been shown with an example. And as of now glExec is useless.
I.Fisk noted that the problem is that the policy is not enforced and that Sites and VOs cannot check each other about glExec configuration and MUPJ execution on the Site. A solution should not be imposed on Sites; Sites should have a way to trace and block specific individuals. This does not necessarily have to be done with identity switching and with the current implementation.
Note: Some discussions concerning storage security are not minuted, for obvious reasons.
Proposal: The proposal is that
1. Sites accept the proposal for the next 3 months and meanwhile
2. The Tech Forum will review traceability and blocking.
3. Sites must continue to press on with the deployment so that when needed this deployment is done.
Ph.Charpentier noted that only 3 months is too short for the VOs. The Sites should do some testing on their own in the 3 months. VOs always find too many issues that could be discovered by the Sites. The SAM OPS tests should test glExec at the Sites.
M.Kasemann proposed to present the plan at the next GDB so that the Sites are informed next week and asked to deploy glExec and SCAS.
I.Bird will send a summary of the discussion. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5. Changes to OPN Mandate (Proposal at CB) – W.Salter
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The GDB discussed the expansion of the OPN Mandate. W.Salter summarised the issue with the questions in the slides:
- Should Tier1-Tier1 links (not foreseen for Tier0-Tier1 traffic) be part of LHCOPN?
- Are there any new networking requirements, e.g. for improving Tier1-Tier2 connectivity?
- Should we define SLA for links?
If the Tier-1 and Tier-2 point is accepted, the Experiments and some Tier-1 Sites should provide experts for working with the OPN Community.
L.Dell’Agnello noted that the Tier-1 to Tier-2 links can be very many and complex for some VOs (for example CMS), and they are followed by national networks or across NRENs.
D.Barberis added that he is very much in favour of the proposal.
M.Bouwhuis asked whether all connections will be evaluated and whether actions will be proposed for each of them.
W.Salter replied that the intention is not to change anything unless it is needed for specific links. Changes will not be made for the sake of it and not without all the agreements which are needed.
I.Bird supported the proposal so that the OPN can follow the whole infrastructure even if it will not be able to follow all details of each link. The SLAs will be discussed by the OPN group and reported.
Action: Experiments and some Sites should provide names for working with the OPN on the needs and actions needed on Tier-1 to Tier-2 links. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6. AOB
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6.1 24x7 support (Slides) – T.Cass
The services have 24x7 support but the response is not always guaranteed outside office hours. Where it is not guaranteed, this is justified by the likelihood of incidents and the redundancy of the service.
Two levels of support for critical services:
- Guaranteed Support: Piquet Service. Where level and/or risk of service incidents is high and where the support team is large enough.
- Best Efforts Support. All other cases.
Services often offer mechanisms to protect users from hardware or software failure, but these must actually be used: e.g. connect to the database identifier (cluster TNS entry), not to a specific host name.
Operators will call support following documented procedures in case of alarms in response to
- a GGUS alarm ticket
- a mail to the <exp>-operator-alarm email list. GGUS alarm tickets are routed here directly as well as to GGUS for tracking.
- a phone call to 75011 from a member of the <exp>-operator-alarm list. It should be a follow-up to email, however, not the sole contact method.
Emails should be phrased to help the operator to identify the problematic service quickly and easily.
D.Barberis noted that the VO shifters do not know which service to call; some explanation of the services and a list would be useful.
AFS and ORACLE are 24x7 on best effort, without piquet service.
Ph.Charpentier noted that when the services were defined as critical, it was not agreed like that.
T.Cass replied that critical services can be addressed by hardware and software protection instead of people on piquet. Actually it is much better to put additional protections in place than to rely on piquet, which is purely reactive. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7. Summary of New Actions |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
No new actions for the MB. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||