LCG Management Board

Date/Time

Tuesday 15 December 2009 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=71056

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 20.12.2009)

Participants

A.Aimar (notes), D.Barberis, I.Bird (chair), K.Bos, M.Bouwhuis, D.Britton, Ph.Charpentier, I.Fisk, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, M.Litmaath, P.Mato, G.Merino, A.Pace, M.Schulz, J.Shiers

Invited

M.Girone

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 12 January 2010 16:00-18:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received. The minutes of the previous meeting were approved by the WLCG MB.

1.2      Contacts for OPN Requirements needed

W.Salter is still waiting for names from the Experiments and some Tier-2 Sites; the Tier-1 Sites are already represented. No names have been received so far.

 

For ATLAS the contact will be K.Bos, for the time being.

1.3      LHC Schedule (more information)

It is confirmed that the accelerator will restart between 14 and 16 February 2010.

 

2.   Action List Review (List of actions)

 

·         I.Bird will prepare the answer to the Scrutiny group about their request to have the 2011 Experiments' Requirements by early 2010.

Done. Discussed later in the meeting.

·         OPN Mandate: Experiments and some Sites should provide names for working with the OPN on the needs and actions needed on Tier-1 to Tier-2 links.

To be done. Already discussed as a matter arising earlier in this meeting.

·         Gstat data: Tier-1 Sites should explain the differences among pledges, monthly accounts and the Gstat values (see presentation at the MB 1.12.2009). They should also check their Tier-2 Sites.

The table should be generated every month and distributed to the Sites. Action removed.

 

3.   LCG Operations Weekly Report (SIRs - When, Where and Why; Slides; WLCG workshop, Prague) – J.Shiers
 

 

Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Overview

One year ago J.Shiers proposed that the weekly report should be a review of the KPIs only.

Site availability measured by the VO SAM tests was a good KPI and it improved during 2009. The number of SIRs, on the other hand, is increasing.

 

First data also brings new issues, including the realization that interventions need to be scheduled more agilely, e.g. having them "ready to go" for a convenient slot in accelerator operation rather than at a fixed time, which may no longer be convenient when it comes around. Interventions must adapt to suit the LHC slots.

 

The plans of the Experiments for the Christmas period and January are available here: Link.

 

Slide 3 shows that in Q4 there were about twice as many incidents as in the previous quarters. Is this an issue to follow up?

 

The current WLCG “service model” has been built up and proven over the past 5+ years from the run-up to the Experiment-oriented Service Challenges starting in 2005, followed by CCRC’08 and STEP’09.

 

This model works – no doubt it can be improved and optimized – but it needs to survive the multiple transitions now being faced: new group structures, new EU projects, new SCODs.

 

It is important to maintain attendance at the WLCG daily meeting; reports from ATLAS and ALICE (as have long been generated by CMS and, later, LHCb) would really help.

Is the level of incidents leading to SIRs acceptable (is it the new baseline)? Is there some way to improve here?

Should "SIR follow-up" be included in the quarterly GDB operations review? The "official" information flow from the machine out to the Sites should be improved, particularly during these early days.

3.2      Summary and Conclusions

This summary covers the two weeks from 30 November to 13 December. It includes the period of LHC operation that delivered 1 million collisions at 450 GeV per beam and 50,000 collisions at 1.18 TeV per beam.

 

There was a mixture of problems, and 2 alarm tickets from ATLAS:

-       Test alarm to FZK on 9 December, after the GGUS release.

-       Alarm to CERN-PROD on 12 December: REQUEST_TIMEOUT for ATLASDATADISK.

 

Incidents leading to (eventual) Service Incident Reports:

-       RAL, 30 November: LHCb data loss.

-       CERN, 2 December: site-wide power cut for just over 2 hours.

-       IN2P3, 8 November: DNS load-balancing failure affected grid services for 1.5 hours.

 

Note: The usual information is in the slides but was not presented at the meeting.

 

Most Experiments report good performance for data export and event reconstruction at the Tier-0 and over the grid.

Xmas Experiment and site plans are available at https://twiki.cern.ch/twiki/bin/view/LCG/WLCGExperimentandSiteplansfortheendof2009holidays

 

Nothing special is required for ALICE, CMS and LHCb, but ATLAS plans to reconstruct between 30 and 150 million events over the Tier-0 and Tier-1 Sites. This should start once validation is completed by 22 December and be finished by 1 January.

 

After many (many) years of preparation we finally see the first real data taking from (accelerated) pp collisions. We need to remain agile and attentive – now the fun begins. IMHO the grid needs to be part of the solution (and not part of the problem).

 

Surely there are ways that we can combine our efforts, knowledge and experience grid-wide to provide a better service at lower manpower cost? Often the same problem is solved by all Sites individually. Something to drive through the HEP VRC / HUC in 2010?

 

F.Hernandez commented that the internal reports at IN2P3 are written in French and it is difficult to reproduce the same level of detail in the WLCG reports; producing an English version as well is a burden, so they try to report a summary. The daily Experiment reports are very useful for the Sites and access to them should not be restricted; for instance, he cannot access the CMS reports.

 

M.Kasemann replied that the whole LCG e-group should be enabled to read these reports.

 

 

4.   GDB Summary (Slides) – J.Gordon

 

 

4.1      Future Meetings

The March and April GDB meetings on the second Wednesdays are replaced by a single meeting:

-       GDB on Wednesday 24 March at NIKHEF.

-       F2F MB on Tuesday 23 March.

-       JSPG on Thursday 25 March.

4.2      SL5

The three Experiments present said that they had no long-term requirement for SL4 on the WNs:

-       CMS: Happy with SL5; the migration of CMS sites is largely done. SL4 is still required, but not on the WNs. The SL5-only release is only a few weeks away and SL5 will then be required.

-       ATLAS: Will not need SL4 much longer.

-       LHCb: DIRAC ships SLC4-compatible libraries, so there is no issue. No ongoing requirement for SL4.

 

ALICE have said elsewhere that they will ban sites still running SL4 in January.

4.3      Virtualization

T.Cass is gathering the working group together. The first (virtual) meeting is being planned for January.

4.4      Passing Job Parameters to Batch

D.McNab (Glasgow) and D.van Dok (NIKHEF) presented their experiences with passing job parameters from the users, through the WMS and CREAM, to the batch system running the job.

http://indico.cern.ch/getFile.py/access?contribId=2&sessionId=6&resId=0&materialId=slides&confId=64669

 

Passing the requested memory and CPU alone will already make a big difference, allowing the batch system to schedule more efficiently.

EGEE/SA3 will propose a default set of parameters to go into the default release.
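As an illustration of the kind of parameter passing discussed above, the hedged sketch below builds a minimal JDL fragment with a CERequirements expression carrying the requested memory and CPU time. The attribute names, units and values are assumptions for illustration only; they are not the default set that EGEE/SA3 will propose.

    def make_jdl(executable, memory_mb=2048, cpu_time_min=1440):
        """Build a minimal JDL snippet that forwards memory and CPU requests
        through the WMS/CREAM CE towards the batch system.

        The CERequirements expression and the Glue attribute names are
        illustrative assumptions; exact names and units are middleware- and
        site-dependent.
        """
        lines = [
            "[",
            '  Executable = "%s";' % executable,
            '  StdOutput  = "std.out";',
            '  StdError   = "std.err";',
            # The requirements string is what the CE can translate into
            # batch-system directives (requested memory and CPU time).
            '  CERequirements = "other.GlueHostMainMemoryRAMSize >= %d'
            ' && other.GlueCEPolicyMaxCPUTime >= %d";'
            % (memory_mb, cpu_time_min),
            "]",
        ]
        return "\n".join(lines)

    if __name__ == "__main__":
        print(make_jdl("run_analysis.sh", memory_mb=2000, cpu_time_min=1200))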

4.5      Patching Your System

R.Wartel reported that the second recent root vulnerability was patched more quickly. Many sites patch immediately, or even before they are alerted by R.Wartel. Other sites wait until they are under threat of suspension.

 

To reduce the exposure of the whole Grid, the time until all sites are patched should be cut. The proposal is to give sites a 7-day banning threat after 7 days.

4.6      Pilot Jobs

The GDB publicised the MB statement on suspending the multi-user pilot job policy for three months.

Note that SCAS is available for SL5 now.

 

glExec for SL5 is available as part of the staged roll-out:

-       Anyone can download it, but installing it is a conscious decision.

-       Sites should give feedback on deployment.

-       It is not straightforward to monitor who has installed glExec; M.Litmaath ran jobs to check.

 

There was slight confusion over the lcg-CE using SCAS: it is neither required nor certified.

 

A SAM test for the correct deployment of glExec with SCAS needs to be developed. It is not straightforward, as it needs to handle two proxies.

 

The recommendation to the Experiments is to check for the existence of glExec before using it (as LHCb does). This means one software version can run at all sites and eases the introduction. Publishing "role=pilot" in the BDII allows the Experiments to tell to which sites they should submit pilot jobs.
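For illustration, the sketch below shows one way a pilot could perform such an existence check on a worker node before deciding whether to switch identities with glExec. The GLEXEC_LOCATION environment variable is the usual middleware convention, but the fallback paths and messages are assumptions, not a prescribed interface.

    import os

    def find_glexec():
        """Return the path of the glExec binary if it is deployed on this
        worker node, otherwise None.

        GLEXEC_LOCATION is the environment variable conventionally set by the
        middleware; the fallback paths are illustrative and vary per site.
        """
        candidates = []
        location = os.environ.get("GLEXEC_LOCATION")
        if location:
            candidates.append(os.path.join(location, "sbin", "glexec"))
        candidates += ["/opt/glite/sbin/glexec", "/usr/sbin/glexec"]
        for path in candidates:
            if os.path.isfile(path) and os.access(path, os.X_OK):
                return path
        return None

    if __name__ == "__main__":
        glexec = find_glexec()
        if glexec:
            print("glExec found at %s: run each payload under its own identity" % glexec)
        else:
            print("glExec not deployed: run payloads under the pilot credential")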

 

LHCb stated that their system would detect when glExec was deployed and start using it automatically.

 

ATLAS and CMS admitted the possibility of testing glExec, but there is no evidence of a plan yet. Sites sense this reluctance from the Experiments and it dampens their enthusiasm to deploy.

 

As a first step, each site deploying just one queue that supports SCAS/glExec is still useful. The aim is to have the maximum number of sites deploying something, even if it is not a widespread deployment right now.

 

D.Barberis noted that a small-scale test is a functional test, and that is not what is missing; large-scale tests are missing.

J.Gordon replied that a first step would be to move to 20-30 Sites and then have more WNs available.

 

I.Bird added that Sites should proceed with the installations as already agreed.

 

 

5.   Update from the Pilot Jobs working group – M.Litmaath

 

A working group was created to discuss multi-user pilot jobs (MUPJ) and security; it already has 55 members, of whom about 20% actually discuss the issues.

There is a summary of the discussions and arguments; it is now being reviewed and will be distributed and presented later.

 

Every site administrator in WLCG will be asked to respond to a questionnaire on the deployment modes acceptable for the Site (setuid or log-only). The Experiments will be asked what kind of usage they intend to make in their frameworks.

 

Some Sites were never involved in the discussion, and as a result some of the past discussions are being repeated.

 

I.Bird asked how it can be that Sites see this as a new topic; some communication channel must be missing or not working.

 

M.Schulz noted that the contributors to the discussion are users and developers, with few Sites participating. How can the attention of the Sites be improved? D.Barberis noted that the Experiments talk to their contacts at the Sites, not to the Site administrators.

 

I.Bird added that the motivations for the reluctance should be clarified and answered in writing, so that they are not re-discussed every few months. Sites should write down their reasons and receive answers. An action will be added to the Action List when Sites need to proceed.

 

In conclusion: M.Litmaath will write and distribute the questionnaire, and all Tier-1 Sites should make sure that their Tier-2 Sites reply to it.

 

 

6.   Summary from the 3D Workshop (Slides) – M.Girone

 

 

6.1      Overview

In total 45 people participated in the workshop, including:

-       ATLAS, CMS and LHCb coordinators and developers.

-       8 sites: ASGC, BNL, IN2P3, NDGF, KIT, PIC, SARA, RAL.

-       CNAF and TRIUMF could not participate.

 

It was a very lively and interactive workshop, reflecting the well-established community: WLCG and beyond (e.g. ESA, GSI).

Below is a summary of the main conclusions from the workshop, as well as the achievements of "3D" in the last 2 years.

6.2      Goal

There were many goals for the workshop.

 

Review the Experiments and Database Operations (day 1)

-       Experiments DB Strategies and Requests

-       FroNTier and Coral Server Status

-       Sites Status and Plans

 

DB Services Readiness (day 2)

-       Alarm and Problem Escalation and Handling

-       Storage and DB parameter configuration Review

-       DB and Streams Monitoring

-       Review the Backup and Recovery Strategy/Policies

-       Demonstrate the capability to recover production DBs with a recovery validation exercise

 

Beyond the 2010 run, preparation for Oracle 11gR2 will be needed.

6.3      DB Milestones in the WLCG MB

Below is the input that came from the MB.

 

Databases

A number of sites, including ASGC and RAL, have been unable to recover production databases from backups / recovery areas with major downtimes occurring as a result. A coordinated DB recovery validation exercise that is regularly tested should be considered to avoid such problems.

 

J.Gordon commented that RAL was able to restore the files, but in the meantime they had been removed.

 

The recovery exercise was performed and the results analyzed:

-       One setup at RAL, three at CERN.

-       The recent incidents at ASGC and RAL were addressed.

-       The extensive review recommendations on ASM configurations and backup policies were well received.

 

All sites managed to perform a point-in-time recovery; there was strong agreement that sites need to repeat the exercise regularly.
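As an illustration of what such a recovery exercise involves, the hedged sketch below drives an Oracle RMAN point-in-time restore and recovery from Python. The target time and the local "rman target /" connection are placeholders, and no catalog or channel settings are shown; a real exercise would follow the site's own validated procedure.

    import subprocess

    # Illustrative recovery target; a real exercise would use the agreed
    # point in time and the site's own catalog and channel configuration.
    UNTIL_TIME = "TO_DATE('2009-12-10 08:00:00','YYYY-MM-DD HH24:MI:SS')"

    RMAN_SCRIPT = """
    STARTUP MOUNT;
    RUN {
      SET UNTIL TIME "%s";
      RESTORE DATABASE;
      RECOVER DATABASE;
    }
    ALTER DATABASE OPEN RESETLOGS;
    """ % UNTIL_TIME

    def run_recovery_exercise():
        """Feed the RMAN commands above to 'rman target /' and raise an
        error if the restore or recovery fails."""
        subprocess.run(["rman", "target", "/"],
                       input=RMAN_SCRIPT, text=True, check=True)

    if __name__ == "__main__":
        run_recovery_exercise()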

6.4      Experiments and Tier-0 Activities

The Experiments are very satisfied with the level of service and coordination for the online and offline DBs.

 

Standby DBs, introduced prior to the 2008 run, have provided additional redundancy and have been extremely useful for recovery (human error, failover during maintenance, etc.).

There is strong interest in Coral Server, currently used by ATLAS online.

Archive DB Services were introduced prior to 2009 for ATLAS and CMS for read-only applications (TAGs, conditions snapshots).

6.5      Tier-1 Status

Below is a summary of the status of the Tier-1 Sites.

 

The main conclusion is that all Sites use RAC or ASM, which is one of the achievements of the project.

 

| FTS | ASGC | CNAF | GridKa | IN2P3 | SARA | BNL | RAL | PIC | TRIUMF | NDGF | CERN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RAC | Y | Y | Y | Y | Y | Y | Y | Y | Y | n/a | Y |
| ASM | N | Y | Y | Y | Y | Y | Y | Y | Y | n/a | Y |
| ASM: #Disk Arrays | | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | n/a | 8 |
| ASM: #Failgroups | | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | n/a | 8 |
| ASM: Redundancy | | Ext | Ext | Ext | Ext | Ext | Normal | Ext | Normal | n/a | Normal |
| RAID | 6 | 10 | 5 | 5 | 5 | 10 | 6 | 6 | None | n/a | None |
| Flash copy | N | N | N | N | N | Y | N | Y | N | n/a | Y |
| Backup to tape | N | Q2/10 | N | Y | Y | Y | Y | N | N | n/a | Y |
| Backup to disk | | Y | | N | Y | Y | Y | Y | Y | n/a | N |
| Data Guard | N | N | N | N | N | N | N | N | N | n/a | Y |


| LFC | ASGC | CNAF | GridKa | IN2P3 | SARA | BNL | RAL | PIC | TRIUMF | NDGF | CERN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RAC | Y | Y | Y | Y | Y | Y | Y | Y | n/a | n/a | Y |
| ASM | N | Y | Y | Y | Y | Y | Y | Y | n/a | n/a | Y |
| ASM: #Disk Arrays | | 2 | 2 | 1 | 1 | 2 | 2 | 1 | n/a | n/a | 8 |
| ASM: #Failgroups | | 1 | 2 | 1 | 1 | 1 | 2 | 1 | n/a | n/a | 8 |
| ASM: Redundancy | | Ext | Ext | Ext | Ext | Ext | Normal | Ext | n/a | n/a | Normal |
| RAID | 6 | 10 | 5 | 5 | 5 | 10 | 6 | 6 | n/a | n/a | None |
| Flash copy | N | N | N | N | N | Y | N | Y | n/a | n/a | Y |
| Backup to tape | N | Q2/10 | N | Y | Y | Y | Y | N | n/a | n/a | Y |
| Backup to disk | | Y | | N | Y | Y | Y | Y | n/a | n/a | N |
| Data Guard | N | N | N | N | N | N | N | N | n/a | n/a | Y |

 

 

| CASTOR | ASGC | CNAF | GridKa | IN2P3 | SARA | BNL | RAL | PIC | TRIUMF | NDGF | CERN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RAC | Y | Y | n/a | n/a | n/a | n/a | Y | n/a | n/a | n/a | Y |
| ASM | Y | Y | n/a | n/a | n/a | n/a | Y | n/a | n/a | n/a | N |
| ASM: #Disk Arrays | 1 | 2 | n/a | n/a | n/a | n/a | 2 | n/a | n/a | n/a | NAS |
| ASM: #Failgroups | 1 | 1 | n/a | n/a | n/a | n/a | 2 | n/a | n/a | n/a | n/a |
| ASM: Redundancy | Ext | Ext | n/a | n/a | n/a | n/a | Normal | n/a | n/a | n/a | n/a |
| RAID | 6 | 10 | n/a | n/a | n/a | n/a | None | n/a | n/a | n/a | 6 |
| Flash copy | N | N | n/a | n/a | n/a | n/a | N | n/a | n/a | n/a | N |
| Backup to tape | N | Q2/10 | n/a | n/a | n/a | n/a | Y | n/a | n/a | n/a | Y |
| Backup to disk | | Y | n/a | n/a | n/a | n/a | Y | n/a | n/a | n/a | N |
| Data Guard | N | N | n/a | n/a | n/a | n/a | N | n/a | n/a | n/a | N |

 

 

| 3D ATLAS | ASGC | CNAF | GridKa | IN2P3 | SARA | BNL | RAL | PIC | TRIUMF | NDGF | CERN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RAC | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| ASM | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| ASM: #Disk Arrays | 1 | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 11 |
| ASM: #Failgroups | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 11 |
| ASM: Redundancy | Ext | Ext | Ext | Ext | Ext | Ext | Normal | Ext | Normal | Ext | Normal |
| RAID | 6 | 10 | 5 | 5 | 5 | 10 | 6 | 6 | None | 10 | None |
| Flash copy | N | N | N | N | N | Y | N | Y | N | N | Y |
| Backup to tape | N | Q2/10 | N | Y | Y | N | Y | N | N | Y | Y |
| Backup to disk | N | Y | Y | N | Y | Y | Y | Y | Y | | N |
| Data Guard | N | N | N | N | N | N | N | N | N | N | N |


| 3D LHCb | ASGC | CNAF | GridKa | IN2P3 | SARA | BNL | RAL | PIC | TRIUMF | NDGF | CERN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RAC | n/a | Y | Y | Y | Y | n/a | Y | Y | n/a | n/a | Y |
| ASM | n/a | Y | Y | Y | Y | n/a | Y | Y | n/a | n/a | Y |
| ASM: #Disk Arrays | n/a | 2 | 2 | 1 | 1 | n/a | 2 | 1 | n/a | n/a | 7 |
| ASM: #Failgroups | n/a | 1 | 2 | 1 | 1 | n/a | 2 | 1 | n/a | n/a | 7 |
| ASM: Redundancy | n/a | Ext | Ext | Ext | Ext | n/a | Normal | Ext | n/a | n/a | Normal |
| RAID | n/a | 10 | 5 | 5 | 5 | n/a | 6 | 6 | n/a | n/a | None |
| Flash copy | n/a | N | N | N | N | n/a | N | Y | n/a | n/a | Y |
| Backup to tape | n/a | Q2/10 | Y | Y | Y | n/a | Y | N | n/a | n/a | Y |
| Backup to disk | n/a | Y | | N | Y | n/a | Y | Y | n/a | n/a | N |
| Data Guard | n/a | N | N | N | N | n/a | N | N | n/a | n/a | Y |

6.6      Project Achievements

The main achievements of the 3D project were:

-       Built community across CERN online and offline and with WLCG Tier1 sites: proven during CCRC’08, STEP’09 and recent data taking

-       Fully integrated into overall WLCG operations, including WLCG workshops and daily phone calls

-       Sharing of architecture, knowledge and procedures – important for minimizing manpower costs and demonstrated benefits in helping sites to recover

-       Project extended to cover coordination of CASTOR and SRM DBs at Tier1s: ASGC DB now configured following recommendations from 3D + IT-DM

-       Status of the project: data taking!

 

A.Heiss noted that DE-KIT has backups on both tape and disk, unlike what is stated in some of the tables above.

M.Girone replied that some values in the tables above may need to be verified.

 

J.Gordon asked whether CNAF and TRIUMF should now repeat the tests.

M.Girone replied that the tests will be repeated regularly.

 

I.Bird suggested that a wiki page be set up to describe the issues and how Sites address them.

M.Bouwhuis suggested having a pre-GDB on best practices.  I.Bird agreed.

J.Shiers suggested that a best practices wiki would be useful to all Sites.

 

D.Barberis agreed that the level of support must be reviewed, as the databases are a critical point for many activities. He is also worried that there is data without backups; having backups should be the norm.

 

Ph.Charpentier and D.Barberis thanked M.Girone for her work, and the MB acknowledged her excellent work for the 3D project.

I.Bird added that, for now, her role will be taken over by T.Cass, who is leading the DB group.

 

 

7.   Experiments' Activities over Christmas (and until February) (WLCG OPS Twiki) – Experiments’ Roundtable

 

 

The topic was discussed during J.Shiers' talk (see Section 3).

 

 

8.   Next LHCC and C-RSG Review Preparation (Slides) – I.Bird

 

 

The review will take place on the 16 February 2010 at CERN.

 

LHCC Reviewers:

-       Amber Boehnlein

-       Chris Hawkes

-       Jean-Francois Grivaz

-       + other LHCC members

 

C-RSG:

-       Domenec Espriu

-       + other RSG members

 

The proposed agenda is below; feedback is still awaited.

 

-       10:00  Project overview and status report – Ian. Includes planning, milestones, resource status (installation, pledges, use), status of planning for the new Tier-0, etc.

-       10:30  Report from service delivery/operations – Jamie. Includes a summary of Tier-0, Tier-1 and Tier-2 experiences from the operations and site viewpoints.

-       11:00  Applications Area status – Pere.

-       11:30  Middleware and MSS status and summary – XXX.

-       12:00  Brief summary of the situation with the EGI/EMI etc. proposals (depends on the level of feedback from the EC; there may not be much until March) – XXX.

-       14:00  Experiment reports (45 minutes per Experiment?) – to include experience with first data, issues arising, and resource planning for 2011. It must be considered that the accelerator will be running in 2011, so a realistic scenario needs to be constructed; is the fallback no 2011 running with no increase in resources? The outcome of this meeting must be a scenario to present to the RRB in April; something has to be said.

-       17:00  Summary and conclusions.

 

After the meeting, information must be sent to the RRB. By the C-RRB in April a clear report and recommendations must be presented to the funding agencies, which are currently completely in the dark regarding the projections for the computing needs in 2011 and 2012.

 

Slide 4 contains the message from D.Espriu and slide 5 the answer by I.Bird.

No answers have been received from the LHCC Reviewers yet.

 

Ph.Charpentier noted that the running scenario must be proposed by the chairman of the RRB; only then can the Experiments define their resource requirements, as these depend on the number of days of running.

 

 

9.    AOB

 

 

 

The next meeting, on 12 January, will be the F2F meeting.

 

I.Fisk reported that the SL5-only release was done the night before. In January CMS will not have any SL4 analysis applications running; the reconstruction software will still run on SL4 but will be migrated soon.

 

 

10.    Summary of New Actions