LCG Management Board

Date/Time

Tuesday 24 November 2009 16:00-17:00 – Phone Meeting 

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=71052

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 28.11.2009)

Participants

A.Aimar (notes), D.Barberis, J.-Ph.Baud, I.Bird (chair), M.Bouwhuis, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, M.Lamanna, P.Mato, G.Merino, A.Pace, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout

Invited

J.Andreeva, J.Casey, D.Kelsey, W.Salter

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 1 December 2009 16:00-18:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous meeting were approved. Some comments were received about differences between the GridView report and the dashboards.

 

GridView and Dashboard

Ph.Charpentier suggested using the dashboard report also for the monthly report instead of GridView.

A.Aimar replied that the MB had decided to use the GridView report because it includes only tests of Site reliability, whereas the dashboards may also contain tests of the VO frameworks or other software issues that are not the Site’s fault. If the MB decides to move to the dashboards it can be done.

 

Ph.Charpentier added that in some cases the SAM tests are not updated and cause other problems.

J.Casey added that if results are incorrect there is a procedure to correct the data.

 

G.Merino noted that each VO has several dashboards (FCR, Ganga, etc. for ATLAS) and asked which one is used.

J.Andreeva replied that the VO chooses which of the dashboards to use in each case.

 

I.Bird proposed not to change the reliability report for the moment and to stay with the GridView reports.

 

IN2P3 Email about Reliability Calculations

There seem to be problems in how the calculations distinguish between sites that are down and sites in scheduled maintenance.

J.Casey replied that he will look into the issue and report to the MB.

 

Usage of Fair Algorithm

J.Casey added that by the end of the month the fair availability algorithm will be used consistently in the report and on the web pages.

 

2.   Action List Review (List of actions)

 

  • Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar.

See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics

Done. All Sites are now providing metrics, although in some cases they are incomplete.
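Completeness of the published metrics can also be checked mechanically. Below is a minimal sketch, not part of the original report, assuming each Site publishes a plain XML file whose elements carry numeric tape metrics; the metric names and the example URL are purely illustrative.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Illustrative metric names only; the real per-site XML layout is defined
# by SLS and by each Site's publisher.
EXPECTED_METRICS = {"total_space_tb", "used_space_tb", "data_written_tb"}

def missing_metrics(url):
    """Return the expected metric names absent from a Site's XML file."""
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    published = {element.tag for element in tree.iter()}
    return EXPECTED_METRICS - published

# Example with a hypothetical Site URL:
# print(missing_metrics("http://tier1.example.org/tape_metrics.xml"))
```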

3.   LCG Operations Weekly Report (Slides) – J.-Ph.Baud 
 

 

Summary of the status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

This summary covers the period from 9th November to 22nd November.

 

There were a few incidents leading to SIR reports:

-       Kernel upgrade at CNAF took too long and the CE and SE were not available for 4 days.

-       CASTOR SRM problems at CERN

-       GGUS: impossible to send ALARM tickets last Friday

 

Meeting Attendance

 

Attendance at the daily operations meetings (days present, Monday to Friday):

-       CERN: 5 of 5
-       ASGC: 5 of 5
-       BNL: 4 of 5
-       CNAF: 4 of 5
-       FNAL: 0 of 5
-       FZK: 5 of 5
-       IN2P3: 3 of 5
-       NDGF: 0 of 5
-       NL-T1: 5 of 5
-       PIC: 3 of 5
-       RAL: 4 of 5
-       TRIUMF: n/a

 

GGUS Summary

The number of tickets is increasing.

 


3.2      SIRs and VO Availability

SARA Space Tokens

There was only one alarm ticket, from LHCb to SARA, because of lack of disk space.

 

M.Bouwhuis added that SARA had to increase the disk token space.

D.Barberis added that ATLAS had submitted an alarm on the same issue but it was not counted.

Ph.Charpentier added that LHCb had posted an alarm ticket on AFS, but after the 22nd, i.e. outside the period covered by this report.

 

GGUS Problem

With the LHC startup it is now urgent to make sure that the GGUS infrastructure ensures 24/7 availability.

See https://savannah.cern.ch/support/?101122

 

SAM Availability (slide 7)

In slide 7 one can notice that:

-       ASGC for ATLAS
The ASGC SRM tests were failing for ATLAS because they were pointing to a wrong endpoint.

 

Qin Gang added that the SAM CE tests are failing because the Site is overloaded by ATLAS with one million jobs. The same happened for CMS in STEP09.

Ph.Charpentier noted that this should not be a problem for a Site if it is properly configured. Sites should not use the large number of jobs as an explanation for failures.

 

-       CNAF for CMS
This is because CNAF performed other interventions while down for the kernel upgrade, such as changing the hardware of the CMS software area, where the server had problems.

 

L.Dell’Agnello agreed that the intervention was not well planned and announced.

J.-Ph.Baud reminded the MB that Sites must announce interventions following the proper procedure, via GOCDB and well in advance.

 

SRM Problems at CERN

Sometimes the wrong status is returned by CASTOR and the FTS transfers fail.

There is also a problem of thread exhaustion when requests take more than 30 minutes, and there are core dumps.

 

For example on 20th November there were problems exporting from CASTORATLAS to Tier1s. The Post mortem is here: https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20Nov09

 

Miscellaneous Problems

There was a series of problems in this period:

-       CMS MC production issues with WMS at CNAF due to Condor bug (fix successfully tested)

 

-       SRM instabilities at BNL (problem now understood)

 

-       dCache golden release deployed at SARA, IN2P3 and KIT. Some configurations have changed.

M.Ernst added that there was a race condition in the code affecting the golden release, which had to be fixed. One Java library was not thread safe, and all Sites in the same situation should upgrade.

 

-       Dashboard DB migration at CERN. Working well for ATLAS. Low performance after migration for CMS (now understood and fixed; it was related to the Oracle DBs).

 

-       AFS problems at CERN: is the service critical for experiments and what is the coverage?

I.Bird replied that T.Cass will present a proposal later.

 

-       Fibre cut to PIC (between Barcelona and Madrid)

 

-       Incorrectly labelled cartridges at ASGC

 

-       CREAM CEs for ALICE at CERN have been unstable for more than a week.

 

-       NIKHEF: compute capacity increased, network infrastructure upgraded to 160 Gbps

 

-       PPS has been replaced by staged rollouts for middleware

M.Schulz added that they need the support of the Experiments to find Sites to try out the new versions. If Sites and Experiments are interested in having new versions, they should have an interest in deploying and testing them. This model works very well for FTS, for instance.

 

Kernel Upgrades

J.Gordon asked whether the kernel upgrade concerns only the Tier-1 Sites or also the Tier-2 Sites.

The Tier-2 Sites in EGEE are followed by the EGEE Security team and the EGEE ROC managers.

 

D.Kelsey added that a review at next GDB would be useful.

 

 

 

4.   Pilot Jobs Policy (GDB Document; Slides) – I.Bird

 

 

I.Bird summarized the situation on pilot jobs and policies.

4.1      Policies

Two years ago the WLCG MB agreed this:

“WLCG sites must allow job submission by the LHC VOs using pilot jobs that submit work on behalf of other users. It is mandatory to change the job identity to that of the real user to avoid the security exposure of a job submitted by one user running under the credentials of the pilot job user.”

 

But there were pre-requisites for the policy to come into force:

-       Security review of glExec: done

-       Review of pilot frameworks: done

-       GlExec tested with all needed batch systems: done

-       SCAS completed, certified, deployed: not deployed.

NB: glExec deployment was omitted from the list: not done.

 

There is a JSPG policy on multi-user pilot jobs (MUPJ): https://edms.cern.ch/file/855383/2/PilotJobsPolicy-v1.0.pdf

 

It was approved in August and September 2008 by WLCG and EGEE respectively. It states that the VO must obtain Site approval before submitting MUPJ (i.e. they are optional from the Site’s point of view). It also states that the pilot framework must meet the requirements of fine-grained traceability, e.g. via identity switching.
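As an illustration of the identity switching mentioned above, a pilot framework typically invokes glExec on the worker node so that the payload runs under the credentials of the real user rather than those of the pilot owner. The following is a minimal sketch, not taken from any experiment framework; the glexec path and the use of the GLEXEC_CLIENT_CERT environment variable reflect the usual convention but should be treated as assumptions here.

```python
import os
import subprocess

def run_payload_via_glexec(payload_cmd, user_proxy_path,
                           glexec_path="/opt/glite/sbin/glexec"):
    """Sketch: run a payload command under the payload owner's identity."""
    env = os.environ.copy()
    # glExec reads the payload owner's proxy from this variable, maps it to a
    # local account and performs the identity switch before running the command.
    env["GLEXEC_CLIENT_CERT"] = user_proxy_path
    result = subprocess.run([glexec_path] + payload_cmd, env=env)
    return result.returncode

# Example: the pilot wraps a user's analysis job (paths are hypothetical)
# run_payload_via_glexec(["/bin/sh", "run_analysis.sh"], "/tmp/x509up_payload")
```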

4.2      Deployment (on SL5)

This is taking too long.

 

GlExec has been available in SL4 for some time, but we need SL5. Several glExec-SL5 patches failed in certification and had to be sent back to the developers. But now glExec/SL5 is certified.

 

SCAS on SL5 has been a long time coming; it has now passed certification and roll-out started very recently.

 

J.Gordon noted that this was never a real show stopper as only the WN must be on SL5.

I.Bird replied that this is true, but it has often been used as an excuse.

 

While sites have been encouraged to deploy glExec/SCAS they have also been urged to move to SL5 and so lack of deployment of glExec/SCAS is understandable.

 

Ph.Charpentier noted that the Experiments cannot commission the Sites one by one; it is too late now. Sites are not configured in the same way and would need to be verified one by one.

I.Fisk added that also for CMS it is too complex to discover which Sites have glExec installed, and therefore it is switched off in CMS.

D.Barberis confirmed that the situation is the same for ATLAS. They tested the current version but are not using it. A new version will have to be tested, but it must be installed everywhere by early 2010; the large majority of Sites must have it installed.

M.Schulz added that Sites should decide whether they want to install it.

4.3      Current Situation

Sites have not been able to deploy glExec/SCAS until now to support the implementation of the policy. It is not their fault, but is now the time to add such critical services?

 

Experiments need to run MUPJ to support their analysis, but the adaptations to glExec/SCAS have not yet been fully implemented and tested. Again, is now the time to make such changes in essential software?

 

In summary:

-       The experiments need to process and analyse data according to the ways that they have been preparing.

-       The sites need to be able to manage their security.

4.4      Proposal from I.Bird

In the absence of any deployable fine-grained authentication mechanism that the experiments can make use of now I.Bird proposed to establish a policy that:

-       Permits the Experiments to run pilot jobs that run workloads of other members of the VO

-       Agrees that the owner of the pilot job – or the VO itself – is responsible for all work run by that pilot job. i.e. in case of problems the entire VO may be banned at a site.

-       The VO framework should provide the fine-grained traceability

-       This policy must be reviewed in “xx” (a few) months, or earlier if there are operational issues

-       During this time the existing JSPG policy would need to be suspended (?)

 

WLCG must continue to push the deployment of glExec/SCAS as rapidly as possible to be in a situation to implement the agreed policy at the earliest opportunity.

 

D.Kelsey commented that the MB maintains the trust between VOs and Sites. Suspending the agreement could be a problem in the future and should be done carefully. In principle nobody should run MUPJ at Sites that have not agreed to them.

 

D.Barberis replied that Sites know very well that MUPJ are run by the VOs.

 

I.Bird added that Sites had agreed to accept MUPJ.

D.Kelsey replied that not all requirements were respected.

 

J.Gordon added that Sites have not given permission explicitly.

I.Bird asked how Sites are going to be told that they should allow MUPJ for some months.

 

Ph.Charpentier added that if “role=pilot” is accepted then the Site accepts pilot jobs, and that if the policy has not been implemented after 3 years it should be re-discussed. The VOs can provide the traceability of the jobs.

D.Kelsey noted that if the payload and the framework run under the same user, the payload can compromise the framework.

Ph.Charpentier noted that this possible danger has never been demonstrated with an example, and that as of now glExec is useless.

 

I.Fisk noted that the problem is that the policy is not enforced, and that Sites and VOs cannot check each other regarding glExec configuration and MUPJ execution at the Site.

Sites should not have a solution imposed on them; Sites should have a way to trace and block specific individuals. This does not necessarily have to be done with identity switching and the current implementation.

 

Note: Some discussions concerning storage security are not minuted, for obvious reasons.

 

Proposal:

The proposal is that

1. Sites accept the proposal for the next 3 months and meanwhile

2. The Tech Forum will review traceability and blocking.

3. Sites must continue to press ahead with the deployment so that it is in place when needed.

 

Ph.Charpentier noted that only 3 months is too short for the VOs. The Sites should do some testing on their own during the 3 months; the VOs always find too many issues that could have been discovered by the Sites. The SAM OPS tests should test glExec at the Sites.

 

M.Kasemann proposed to present the plan at the next GDB so that the Sites are informed next week and asked to deploy glExec and SCAS.

 

I.Bird will send a summary of the discussion.

 

 

5.   Changes to OPN Mandate (Proposal at CB) – W.Salter

 

The GDB discussed the expansion of the OPN Mandate.

 

W.Salter summarised the issue with the questions in the slides:

 

-       Should Tier1-Tier1 links (not foreseen for Tier0-Tier1 traffic) be part of LHCOPN?
In order to have a consistent operational model
Fuller overview of Tier0/Tier1 networking for WLCG
Known Tier1-Tier1 links not in LHCOPN: NLT1-TRIUMF, NLT1-ASGC, FNAL-KIT, NLT1-FNAL

-       Are there any new networking requirements, e.g. for improving Tier1-Tier2 connectivity?
If so, should LHCOPN community investigate this?

-       Should we define SLAs for links?
As links are shared among Sites it would be good to have the correct bandwidth specified (e.g. the shared link between BNL and FNAL)

 

If the Tier-1 to Tier-2 point is accepted, the Experiments and some Tier-1 Sites should provide experts to work with the OPN Community.

 

L.Dell’Agnello noted that the Tier-1 to Tier-2 links can be very numerous and complex for some VOs (for example CMS), and that they are followed by national networks or run across NRENs.

D.Barberis added that he is very much in favour of the proposal.

M.Bouwhuis asked whether all connections will be evaluated and whether actions will be proposed for each of them.

 

W.Salter replied that the intention is not to change anything unless it is needed for specific links. Changes will not be made for their own sake, nor without all the necessary agreements.

 

I.Bird supported the proposal, so that the OPN can follow the whole infrastructure even if it will not be able to follow all the details of each link. The SLAs will be discussed by the OPN group and reported back.

 

Action:

Experiments and some Sites should provide names of people to work with the OPN on the needs and actions required for Tier-1 to Tier-2 links.

 

 

6.    AOB

 

 

6.1      24x7 support (Slides) – T.Cass

The services have 24x7 support but the response is not always guaranteed outside office hours. Where it is not, this is due to the low likelihood of incidents and to the redundancy of the service.

 

Two levels of support for critical services:

-       Guaranteed Support — Piquet Service. Where level and/or risk of service incidents is high and where the support team is large enough.

-       Best Efforts Support. All other cases. Services often offer mechanisms to protect users from hardware or software failure, but these must actually be used: e.g. connect to the database service identifier (cluster TNS entry), not to a specific host name (see the sketch below).
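The following minimal sketch illustrates the point about connecting through the cluster service identifier rather than a fixed host; the service alias, account and choice of client library are illustrative assumptions, not taken from the presentation.

```python
import cx_Oracle

# Connect via the cluster (TNS) service alias, resolved e.g. through
# tnsnames.ora, so the listener can hand the session to any available node.
connection = cx_Oracle.connect(user="app_reader",
                               password="app_password",
                               dsn="lcg_db_cluster")

# By contrast, a connection pinned to one physical host
# (e.g. "dbnode42.cern.ch") fails whenever that machine is down,
# even though the database service as a whole is still available.
```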

 

Operators will call support, following documented procedures, in response to:

-       a GGUS alarm ticket

-       a mail to the <exp>-operator-alarm email list. GGUS alarm tickets are routed directly to this list, as well as being recorded in GGUS for tracking.

-       a phone call to 75011 from a member of the <exp>-operator-alarm list. This should be a follow-up to an email, however, not the sole contact method.

Emails should be phrased to help the operator to identify the problematic service quickly and easily.

 

D.Barberis noted that the VO shifters do not know which service to call; some explanation of the services and a list would be useful.

 

AFS and ORACLE are 24x7 on best effort, without piquet service.

Ph.Charpentier noted that this was not what was agreed when the services were defined as critical.

T.Cass replied that critical services can be addressed by hardware and software protection instead of people on piquet. It is actually much better to put additional protections in place than a piquet, which is purely reactive.

 

 

 

7.    Summary of New Actions

 

 

 

No new actions for the MB.