WLCG Management Board

Date/Time

Tuesday 17 March 2009 – Phone Meeting – 16:00-17:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=49395

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 22.3.2009)

Participants

A.Aimar (notes), I.Bird, T.Cass, L.Dell’Agnello, Qin Gang, J.Gordon (chair), A.Heiss, F.Hernandez, S.Foffano, M.Kasemann, M.Lamanna, P.Mato, G.Merino, S.Newhouse, A.Pace, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Invited

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 31 March 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters Arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous meeting were approved without comments.

1.2      Update on LHC Schedule 2009-10

I.Bird reported that S.Bertolucci confirmed that there is no shutdown in January and February 2010. The previously distributed data (i.e. a run of 44 weeks, ~6.1 x 10**6 seconds) are to be considered correct. There could be some heavy-ion running at the end of 2009 (maybe 2 weeks over Christmas).
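As a quick cross-check of the quoted figures (a sketch only; reading the ~6.1 x 10**6 seconds as effective physics time, rather than wall-clock time of the run, is an assumption):

```python
# Relation between a 44-week run and ~6.1e6 s of physics time
# (assumption: the 6.1e6 s figure is effective physics/live time).
weeks = 44
wall_clock_s = weeks * 7 * 24 * 3600      # ~2.66e7 s of wall clock in 44 weeks
physics_s = 6.1e6                          # figure quoted above
print(f"wall clock: {wall_clock_s:.3g} s")
print(f"implied live fraction: {physics_s / wall_clock_s:.0%}")   # ~23%
```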

 

I.Bird asked for confirmation from the Experiments that their spokespersons will be present on the 7th of April.

 

On the 31st of March the Experiments will present their requirements, and on the 7th of April there will be a meeting with the spokespersons.

1.3      Updates from F.Donno (emails)

F.Donno had sent some updates via email regarding Busy Storage Services, SRM V2.2 as default, and asynchronous srmLs in dCache.

 

J.Shiers remarked that F.Donno is on leave and therefore her tasks are now assigned to A.Sciabá (SRM) and to S.Traylen (Installed capacity), who should be reporting to the MB instead.

 

The two points she was clarifying are:

-       Moving to SRM V2.2: FTS can be switched by a Site configuration. Lcg-utils needs the next version, which is being released. Comments were received only from NL-T1.

-       Asynchronous srmLs in dCache: the new version of dCache will have asynchronous srmLs, and the Experiments should comment and take this into account.

M.Lamanna agreed with the proposal but noted that sometimes the wanted changes in dCache are bundled with several other changes that need to be clarified and discussed (before they are deployed).

 

I.Bird noted that there is a need for a discussion forum on technical issues like these. The MB after CHEP should define some kind of new forum or working group.

 

1.4      Tier-1 Reliability Reports February 2009 (Comments.pdf; Summary.pdf; T1_All_Reports.zip)

All VOs and NL-T1 have commented on the values.

 

J.Templon noted that the “unknown” availability is counted as “unavailable”, which is incorrect because it is not the Site’s fault when the tests do not report a clear success or failure.

 

J.Gordon noted that the CMS availability does not take into account the scheduled downtime and that there were problems with GridFTP in CASTOR (for LHCb).

 

J.Templon noted that the availability plots consider unknown as unavailable (for instance in the NL-T1 LHCb plot).
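To make the point about “unknown” concrete, here is a minimal sketch (not the actual SAM/GridView algorithm) contrasting the current treatment, where UNKNOWN counts against the Site, with one where UNKNOWN periods are excluded from the calculation:

```python
def availability(slot_states, unknown_counts_as_down=True):
    """Fraction of time slots in which a site counts as available.

    slot_states: per-slot test outcomes, e.g. "UP", "DOWN" or "UNKNOWN".
    unknown_counts_as_down=True reproduces the behaviour criticised above;
    False excludes UNKNOWN slots from the denominator instead.
    """
    if unknown_counts_as_down:
        return slot_states.count("UP") / len(slot_states)
    known = [s for s in slot_states if s != "UNKNOWN"]
    return slot_states.count("UP") / len(known) if known else None

# Hypothetical day: 20 good slots, 2 bad ones, 8 slots with no clear result.
day = ["UP"] * 20 + ["DOWN"] * 2 + ["UNKNOWN"] * 8
print(availability(day, True))    # ~0.67 -- the site is penalised for UNKNOWN
print(availability(day, False))   # ~0.91 -- UNKNOWN left out of the metric
```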

 

2.   Action List Review (List of actions)

 

 

  • SCAS Testing and Certification

 

There is a report at the GDB the following day (slides from the GDB).

 

M.Schulz reported that after days of stress testing there was only one error. All return codes need to be checked and then it will be released to PPS.

Action Completed.

 

  • VOBoxes SLAs:
    • Experiments should respond to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).
    • NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

 

No progress since last week.

CMS: Several SLAs still to approve.

ALICE: Still to approve the SLA with NDGF. Comments sent to NL-T1.

 

  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. The best would be to have the information in the form of the data-flow description already provided by LHCb (Dataflow from LHCb).

 

The dataflow and rates will be discussed at the WLCG Workshop before CHEP.

 

The recommendations on how to specify the data rates and flows are attached to the Workshop agenda.

 

  • 4 March 2009 - M.Schulz to present the list of priorities for the Analysis working group.

 

It will be presented at the WLCG Workshop and then discussed.

 

  • 17 Mar 2009 - Sites should report whether GFAL and lcg-utils can start using SRM V2 by default without impacting VOs outside WLCG.

 

Not Done.

 

  • 17 Mar 2009 - Experiments should confirm whether the schedule for the changes regarding “Busy” Storage Services is acceptable.

 

Only CMS replied.

 

 

3.   LCG Operations Weekly Report (Slides; Weekly minutes) - J.Shiers

 

Summary of status and progress of the LCG Operations since last MB meeting. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Main Issues

The major issue was the outage of CASTOR ATLAS. It lasted about 12h and was due to a corrupted DB. Details (Link) and a post-mortem (Link) are available. The full timeline is on slide 3.

The bug was known to Oracle and a patch was available, but it was not flagged as a critical patch to apply. Oracle has been asked to highlight which patches are critical and must be applied.

 

A series of (largely) “transparent” interventions on the CERN CASTOR DBs has been scheduled for this week.

3.2      GGUS Summary

VO concerned | USER | TEAM | ALARM | TOTAL
ALICE        |    3 |    0 |     0 |     3
ATLAS        |   35 |   21 |     1 |    57
CMS          |    2 |    0 |     0 |     2
LHCb         |    7 |    1 |     8 |    16
Totals       |   47 |   22 |     9 |    78

 

Just one left-over alarm ticket remained from the week before last’s scheduled test (ATLAS to ASGC): it was not opened during the agreed period, due to understood problems at ASGC, but was responded to well within targets.

 

LHCb alarms: we should also be testing the ability of the VO’s alarm team to open tickets – i.e. ALL VOs should have people able to raise alarms at any time.

 

Reminder: we will repeat such an alarm test the week after CHEP; the goal is to have completed the analysis prior to the following week’s F2F meetings (7-8 April 2009).

3.3      Service Summary

Slide 5 shows the VO SAM test plots. Most are SRM V2 tests but there are also many CE tests.

 

Slide 5: Notes

 

LHCb

IN2P3 - Both CEs were failing the CE-sft-job tests at the beginning of the week. For SRM, the SRMv2-lhcb-DiracUnitTestRAW test often failed over the week.

RAL - In the second half of the week the SRMv2-lhcb-DiracUnitTestRAW test always failed.

CNAF - The problem is also on the SRM side.

 

CMS

IN2P3 - Problem with CEs (test did not run or was failing).

CNAF - Problem with CEs as well (CE-cms-analysis test failing all the time).

FZK - CEs again (CE-cms-analysis test failing all the time).

RAL - SRMv2 critical tests were failing from time to time.

CERN - SRMv2 tests were failing in the middle of the week.

 

ATLAS

FZK - SRMv2.

IN2P3 - CEs were failing critical tests (at the beginning of the week).

CNAF - CE-sft-job tests failing at the beginning of the week.

RAL - SRMv2 tests were failing at the beginning of the week.

 

 

F.Hernandez reported that IN2P3 had problems declaring the CE available again, and this may have caused the CE tests to fail.

 

L.Dell’Agnello agreed that the CE-cms-analysis test fails for CMS when CASTOR is under load. The file requests are queued and executed after the timeout. A solution has (probably) been found.

 

A.Heiss reported that a dcap door for dCache was not working. CMS had hardcoded one door for the tests, and that one was failing. If the tests had also tried the alternative doors they would have succeeded.
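A minimal sketch of the kind of failover A.Heiss is suggesting: instead of hardcoding one dcap door, a test could try each advertised door and fail only if none responds. The door names and the simple TCP probe are illustrative assumptions, not the real CMS test.

```python
import socket

# Hypothetical list of dcap doors; a real test would take these from the
# information system rather than hardcode a single one.
DOORS = ["dcap-door1.example.org:22125",
         "dcap-door2.example.org:22125",
         "dcap-door3.example.org:22125"]

def door_responds(door, timeout=5):
    """Crude liveness probe: can a TCP connection to the door be opened?"""
    host, port = door.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

def pick_working_door(doors=DOORS):
    """Return the first responding door; fail only when all doors are down."""
    for door in doors:
        if door_responds(door):
            return door
    raise RuntimeError("all dcap doors unreachable")
```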

3.4      WMS Issues

There is growing concern about WMS stability: several sites (CERN, RAL, and GRIF) that have installed the mega patch report stability problems, not always load related.

 

M.Litmaath has done some investigations – the results of which are included on slides 6 and 7.

 

From M.Litmaath:

 

There are at least 2 new problems besides the known bug + workaround (Link):

-       A cron job was implemented to automate the workaround, but it changed the wrong parameter in the configuration and therefore failed; this should be fixed today.

-       On wms216 (LHCb) I temporarily disabled the cron job (it is enabled now),   fixed the correct parameter and restarted the WM.  It went fine until it crashed with a different segfault, details here: https://savannah.cern.ch/bugs/?47040

I had to move one unprocessed job out of the way to allow the recovery to proceed, after which yet a different segfault occurred, this time for a cancellation request; the good news here is that a simple restart dealt with that, so we probably can live with it.

-       Prior to investigating the WM troubles I looked into why the WMProxy had become unresponsive, details here: https://savannah.cern.ch/bugs/?48176

          Whatever caused those processes to hang, at least a restart must get rid of them: https://savannah.cern.ch/bugs/?48172

 

Those 3 bugs are all ranked major at the moment, but we may want to bump some or all of them to critical.

We now need to check what is happening on both WMS nodes for ATLAS and on one WMS for SAM; probably more of the same.

 

 

M.Schulz reported that the mega patch was certified over many days on several servers at CNAF and later at CERN. There were no problems even after rather heavy testing. Only once it was deployed did all these problems start.

 

The rest of the slides (from 9 to 14) were not discussed at the meeting but are available for information

-       Network intervention at CERN 19 March.

-       CNAF downtime 30/03 to 03 (or 06)/04

-       Workshop’s Agenda.

 

J.Gordon noted that the network intervention at CERN can be used to see what keeps functioning while the Tier-0 is totally unavailable. Other Sites could see how their services behave.

 

4.   Priorities for the User Analysis WG (email) - Schulz

 

M.Schulz presented the email he sent before the meeting to the Working Group members.

 

The email included the presentation given at the previous MB meeting and the proposed requirements, in particular on jobs:

-       Support for multi user pilot jobs

-       The ability to assign shares and priorities to different analysis groups, taking into account the locality of a user.

-       Fair share allocations have to balance within predefined time windows.

-       Prioritization for individual users based on recent usage (a minimal sketch of such a fair-share calculation follows the storage list below).

 

And on storage:

-       SRM APIs and tools, especially to handle bulk operations and frequently executed commands such as "ls”.

-       ACLs (VOMS based) on spaces and files.

-       Quotas (but this depends on how the implementations will provide them)

-       Accounting
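As referenced in the job requirements above, a minimal sketch of how recent usage inside a predefined time window could drive a per-user fair-share priority; the exponential decay, the half-life and the exact formula are illustrative assumptions, not something specified by the working group.

```python
import math
import time

def fairshare_priority(share, usage_events, window_days=30,
                       half_life_days=7, now=None):
    """Toy fair-share priority: the further a user is below their target
    share of recent usage, the higher the returned priority (0..1).

    share:        target fraction of the resource for this user or group
    usage_events: list of (timestamp, cpu_seconds) tuples for this user
    """
    now = time.time() if now is None else now
    window = window_days * 86400
    decay = math.log(2) / (half_life_days * 86400)
    # Exponentially decayed CPU usage inside the accounting window.
    used = sum(cpu * math.exp(-decay * (now - t))
               for t, cpu in usage_events if now - t <= window)
    entitlement = max(share, 1e-9) * window   # what the share would allow
    return max(0.0, 1.0 - used / entitlement)
```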

 

S.Traylen will look at how to configure the batch systems at the Tier-2 Sites so that they publish correctly in the information system.

 

M.Kasemann noted that the requirements need to be discussed in the WG first.

M.Schulz replied that this is just the proposal for the discussion at the WLCG workshop.

 

5.   Plans to Switch on the User Info in Accounting (Slides) – J.Gordon

 

J.Gordon presented the proposal about collecting user information in the accounting.

5.1      Background Information and Current Status

The APEL client has had the ability for some time to collect and publish UserDN and FQAN information. The UserDN information is encrypted when published to the central APEL repository, but the publishing is switched off by default.
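As an illustration only (not APEL’s actual scheme, whose details are in the linked documentation), publishing a keyed hash of the UserDN instead of the clear-text DN could look like this; the secret and the record fields are placeholders:

```python
import hmac
import hashlib

SITE_SECRET = b"replace-with-a-site-managed-secret"   # placeholder key

def pseudonymise_dn(user_dn: str) -> str:
    """Return a stable, non-reversible token for an X.509 user DN."""
    return hmac.new(SITE_SECRET, user_dn.encode(), hashlib.sha256).hexdigest()

# Hypothetical accounting record: the DN never leaves the site in clear text.
record = {
    "Site": "EXAMPLE-T2",
    "VO": "cms",
    "FQAN": "/cms/Role=production",
    "UserDN": pseudonymise_dn("/DC=ch/DC=cern/OU=Users/CN=jdoe"),
    "CpuDuration": 12345,
}
```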

 

Both OSG and DGAS (and probably NDGF) collect FQAN and UserDN, and technically they could publish summaries to the central APEL repository.

This has been tested with DGAS when they were using RGMA to publish.

 

The documentation on configuring the publication is available (Link) on the wiki.

 

The CESGA Portal allows people to see information according to their status: a user sees their own jobs, a site admin sees the site’s jobs, a VO member sees the roles/groups of the VO, and a VO Resource Manager sees the UserDN (Link).
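A minimal sketch of those access rules (the record fields and role names are assumptions for illustration, not the CESGA Portal’s actual implementation):

```python
def visible_records(records, viewer):
    """Filter accounting records according to the portal's access rules.

    viewer: dict with "dn", "site", "vo" and "role", where role is one of
    "user", "site_admin", "vo_member" or "vo_resource_manager".
    """
    role = viewer["role"]
    if role == "user":                                # only the viewer's own jobs
        rows = [r for r in records if r["UserDN"] == viewer["dn"]]
    elif role == "site_admin":                        # all jobs run at the site
        rows = [r for r in records if r["Site"] == viewer["site"]]
    elif role in ("vo_member", "vo_resource_manager"):  # the VO's jobs
        rows = [r for r in records if r["VO"] == viewer["vo"]]
    else:
        rows = []
    if role not in ("user", "vo_resource_manager"):
        # Only the VO Resource Manager (and users, for their own jobs) see the DN.
        rows = [{k: v for k, v in r.items() if k != "UserDN"} for r in rows]
    return rows
```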

 

Slide 4 is an example of what can be retrieved from the CESGA Portal (this may still be the development portal) for the CMS roles.

 

5.2      Policy Status

The updated draft was distributed on 12 Feb and presented to the GDB on 11 March by D.Kelsey, with a request for comments from the GDB. It will be tidied up and presented to the MB for approval soon; it is believed to be ready for approval, and some updates will be done in future releases. As EGEE requested, while maintaining privacy one would like to see useful information, for instance from which countries the users of each Site come.

 

Grid Policy on the Handling of User-Level Job Accounting Data. V0.7, 23 Jan 2009

http://www.jspg.org/wiki/Grid_Policy_on_the_Handling_of_User-Level_Job_Accounting_Data

5.3      Deployment Status

In total there are 234 sites publishing data into the APEL Central Repository (some could be “obsolete” Sites):

-       197 (84%) publishing FQAN

-       47 (20%) publishing UserDN

 

Slide 7 shows the names of the 47 Sites currently publishing the UserDN information in APEL.

5.4      Next Steps

The next steps are:

-       The Policy is mature enough. Countries should speak up if they are not happy.

-       The Sites supporting WLCG should be ‘encouraged’ to switch on publishing of UserDN

-       The CESGA Portal should be evaluated to see if people are happy with the functionality
(e.g. can one extract information useful for Sites and for VOs?)

 

I.Bird proposed to agree on the 3 actions below.

 

New Actions

          Tier-1 Sites should start publishing the UserDN information.

          Countries should comment on the policy document on user information accounting.

          The CESGA production Portal should be verified. J.Gordon sent information to the MB about which portal to use (prod or pre-prod).

 

R.Pordes asked that a requirements document be written so that OSG can use it to define what to report.

 

New Action:

J.Gordon and R.Pordes will write a requirements document on user information accounting.

 

F.Hernandez asked whether it is possible to hide the user name and information in APEL and only publish the role. The French Sites cannot decide whether or not to publish this information until they have the approval (or refusal) of the authority verifying these procedures.

J.Gordon replied that this is already the case: user information is not published in APEL for the moment.

 

M.Schulz asked whether APEL will abandon RGMA and J.Gordon replied that it will, adding that a prototype is already available. The detailed timeline is still being discussed.

 

6.   GDB Summary (Agenda; Slides) – J.Gordon

 

J.Gordon presented a summary of the March GDB Meeting (Link).

 

The main topics and the outcomes of last GDB were:

-       Accounting Policies:
Discussed today

-       ASGC Incident:
A reminder that disaster recovery plans are not just hypothetical.

-       Reporting Installed Capacity:
To be deployed soon. Lots of documentation, tests and GGUS group to advise sites. Timeline being defined.

-       Middleware Update:
Drawing up a plan for SL5 for the rest of the middleware. The MB can still influence this.

-       WN SL5/64 bit:
Release due on 23/3; Experiments expect to test at each site before doing the full switch.
The UI and DPM are other services that could move to SL5.

-       Multiuser Pilot Job Frameworks:
Good progress with development and certification of SCAS. ATLAS and LHCb are planning to test it.

-       CREAM:
Lots of positive progress. Encourage more sites to deploy the next (due soon) release. It should be installed to run in parallel with the LCG-CE.
CREAM still does not pass user requirements to the local batch system, and more batch systems need to be supported.

-       WMS Performance:
Updating to the mega-patch was recommended (not a success).
It was questioned whether the WMS is meeting the requirements.
It will not meet them without using bulk job-submission.

 

The Sites’ answers about their readiness to support SL5 and a 64-bit OS (missing information in bold):

-       ASGC – already 64-bit on SL4

-       BNL – N/A

-       CERN – 10% already, 2 CEs; new nodes installed as SLC5

-       CNAF – N/A

-       FNAL – N/A

-       FZK – end of April

-       NL-T1 – Requires the UI? If that is ready by mid-April, then end of summer

-       NDGF – ready

-       PIC – test ASAP, hoping for April

-       IN2P3 – start in June with new WNs

-       RAL – end of July

-       TRIUMF – N/A

 

F.Hernandez noted that an SL5 64-bit installation must include the 32-bit compatibility libraries, and that this should be explicitly reminded to the Tier-1 and Tier-2 Sites.
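A small sketch of the kind of check a site could run on an SL5 64-bit node to confirm that 32-bit experiment binaries still resolve all their shared libraries; the binary path is a placeholder, and the exact compatibility package names (e.g. compat-libstdc++) should be confirmed by the site:

```python
import subprocess
import sys

def missing_libs(binary_path):
    """Run ldd on a binary and return any 'not found' dependencies."""
    out = subprocess.run(["ldd", binary_path], capture_output=True, text=True)
    return [line.strip() for line in out.stdout.splitlines()
            if "not found" in line]

if __name__ == "__main__":
    # Placeholder: point this at a 32-bit experiment application.
    target = sys.argv[1] if len(sys.argv) > 1 else "/opt/exp_sw/some_32bit_binary"
    problems = missing_libs(target)
    print("\n".join(problems) or "all shared libraries resolved")
```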

 

J.Templon noted that this should also be written in the VO ID Card.

 

New Action:

J.Gordon will send to the GDB the message about the need for 32-bit compatibility libraries on SL5 64-bit installations.

6.1      Future Meetings

Pre-GDB

Two proposals so far: Tier-2 storage and Virtualisation.

 

Tier-2 Storage (DPM, StoRM)

-       Both CASTOR and dCache have held their meetings.

-       Discuss at the Collaboration Board.

 

Virtualisation

-       Lots of work happening

-       Share experience

 

GDB in April

-       Installed Capacity: Progress in deployment

-       Distributed Monitoring: Hopefully with demonstration

-       Pilot Jobs: VO tests

-       CREAM

-       WMS Performance

-       Operational Security

-       Accounting Policies

-       64 bit installations. Should some target be defined?

 

R.Pordes added that if OSG talks are needed they are glad to present at the next GDB Meetings. When an EGEE person reports on some general item for WLCG, OSG is also happy to present on their implementation.

 

7.   High Level Milestones (HLM_20090310.pdf) – A.Aimar

 

The MB discussed the major milestones for 2009, as requested by the LHCC Reviewers.

 

ANALYSIS - Metrics and milestones should come from the WLCG workshop. Can one do analysis and data recall while collecting Raw Data?

 

STEP09 (i.e.”CCRC09” before the WLCG workshop)

STEP09 should verify storage, reprocessing and tape recall, at least with ATLAS and CMS running simultaneously.

These activities must be finished by the end of June because CMS goes into data-taking mode in July. Plans must be agreed at the Workshop.

 

STEP 2009 - Tier-1 Validation
WLCG-09-23 | Jun 2009 | Tier-1 Validation by the Experiments | ALICE / ATLAS / CMS / LHCb (per-Tier-1 status not yet filled in)

SL5 WN Deployment

WLCG-09-21: SLC5 gcc 4.3 (WN 4.1 binaries) tested by the Experiments: DONE

WLCG-09-22: SLC5 Deployed: By the summer.

SLC5 Milestones
WLCG-09-21 | DONE | SLC5 gcc 4.3 (WN 4.1 binaries) tested by the Experiments - Experiments should test whether the MW on SL5 supports their grid applications | ALICE / ATLAS / CMS / LHCb
WLCG-09-22 | Jul 2009 | SLC5 deployed by the Sites (64-bit nodes) - Assuming the tests by the Experiments were successful; otherwise a real gcc 4.3 porting of the WN software is needed | (per-site status not yet filled in)

T.Cass noted that the Experiments tested that their SL4 binaries run on SL5 nodes with 32-bit compatibility, not native SL5 binaries.

R.Tafirout noted that there could be milestones requiring that the Experiments port their applications to SL5.

 

SCAS Deployment

WLCG-09-17: SCAS Solutions Available for Deployment: DONE in March 2009

WLCG-09-18: SCAS Verified by the Experiments: April 2009 (WLCG-09-17 + 1 month)

 

SCAS/glExec Milestones
WLCG-09-17 | Jan 2009 | SCAS Solutions Available for Deployment - Certification successful and SCAS packaged for deployment | Done in Mar 2009
WLCG-09-18 | Apr 2009 | SCAS Verified by the Experiments - Experiments verify that the SCAS implementation is working (available at CNAF and NL-T1) | ALICE: n/a, ATLAS: -, CMS: n/a?, LHCb: -
WLCG-09-19 | 09-18 + 1 month | SCAS + glExec Deployed and Configured at the Tier-1 Sites - SCAS and glExec ready for the Experiments | (per-site status not yet filled in)
WLCG-09-20 | 09-18 + 3 months | SCAS + glExec Deployed and Configured at the Tier-2 Sites - SCAS and glExec ready for the Experiments | (per-site status not yet filled in)

 

 

ACCOUNTING Milestones

J.Gordon will propose some milestones.

The CESGA Portal will have an option to download all values in CSV format.

 

Accounting Milestones
WLCG-09-02 | Apr 2009 | Wall-Clock Time Included in the Tier-2 Accounting Reports - The APEL Report should include CPU and wall-clock accounting | APEL
WLCG-09-03 | TBD | Tier-2 Sites Report Installed Capacity in the Info System - Both CPU and Disk capacity reported in the agreed GLUE 1.3 format | % of T2 Sites reporting
WLCG-09-04 | TBD | User Level Accounting (verify with the Experiments) | -

 

 

TIER-1 Procurement

It was already agreed to have this completed by the end of September.

 

Tier-1 Sites Procurement – 2009
WLCG-09-01 | Sept 2009 | MoU 2009 Pledges Installed - To fulfil the agreement that all sites procure their MoU pledges by April of every year | (per-site status not yet filled in)

SRM Milestones

Need to understand the status of the different implementations. Was discussed in the GDB.

 

SRM Milestones
WLCG-09-05 | Dec 2008 | SRM Short-Term Solutions Available for Deployment | CASTOR / dCache / DPM / StoRM / BestMan
WLCG-09-06 | TBD | SRM Short-Term Solutions Deployed at the Tier-1 Sites - Installation at the Tier-1 Sites | (per-site status not yet filled in)

MSS Metrics

A web page should collect these metrics instead of tracking them as milestones.

 

FTS Deployment on SL4

Needs to be verified with the Sites.

 

New Action:

Sites should report whether (or when) FTS is deployed on SL4.

 

FTS Milestones
WLCG-09-07 | TBD | FTS Deployed on SL4 at the Tier-1 Sites - FTS is ready to be installed on SL4 at the Tier-1 Sites | (per-site status not yet filled in)

HEPSPEC-06 Milestones

The Experiments need to specify their requirements in the new unit and the Sites need to benchmark their installations.
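For reference, a minimal sketch of converting existing kSI2K numbers into the new unit; the factor of roughly 4 HEP-SPEC06 per kSI2K is the commonly quoted conversion and is stated here as an assumption to be checked against the working group’s final value:

```python
# Assumption: ~4 HEP-SPEC06 per kSI2K (commonly quoted conversion factor);
# the official value from the benchmarking working group should be used.
HEPSPEC06_PER_KSI2K = 4.0

def ksi2k_to_hepspec06(ksi2k):
    return ksi2k * HEPSPEC06_PER_KSI2K

# Hypothetical pledges in kSI2K converted to HEP-SPEC06.
pledges_ksi2k = {"SITE-A": 3000, "SITE-B": 5500}
print({site: ksi2k_to_hepspec06(v) for site, v in pledges_ksi2k.items()})
# {'SITE-A': 12000.0, 'SITE-B': 22000.0}
```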

 

CPU Benchmarks/Units Milestones
WLCG-09-14 | Dec 2008 | CPU New Unit Working Group Completed - Agreement on benchmarking methods, conversion proposal and test machines | CPU New Benchmarking Unit Working Group
WLCG-09-15 | Feb 2009 | Sites Pledges in HEPSPEC-06 - Pledges from the Sites should be converted to the new unit | LCG Office
WLCG-09-16 | Apr 2009 | New Experiments Requirements in HEPSPEC-06 - Experiments should convert their requirements to the new unit (or the LCG Office does it) | ALICE / ATLAS / CMS / LHCb
WLCG-09-24 | May 2009 | Sites Report Capacity in HEPSPEC-06 - Pledges from the Sites should be converted to the new unit | (per-site status not yet filled in)

Pilot Jobs Frameworks

 

J.Templon asked that the milestone should be considered done when the Experiments implement what is requested by the Reviewers.

 

New Action

A.Aimar will verify with M.Litmaath the situation for the pilot jobs frameworks of ALICE, ATLAS and CMS.

 

Pilot Jobs Frameworks
WLCG-08-14 | May 2008 | Pilot Jobs Frameworks studied and accepted by the Review working group - Working group proposal complete and accepted by the Experiments | ALICE: -, ATLAS: -, CMS: -, LHCb: Nov 2007

 

CREAM CE Milestones

M.Schulz will send Milestones for the CREAM CE.

 

Received after the Meeting from M.Schulz

 

In the GDB some weeks ago we agreed to have all T1s run at least 1 CREAM-CE. Currently we have 4 T1s and the T0 that have followed the GDB decision. In addition we have a few T2s sporting CREAM-CEs.
Now we are close to moving, on the 6th of April, an improved CREAM-CE from PPS to production.
We can take the release of this version as the start of the rollout.


Assuming the release of the next CREAM-CE is on the 6th of April, I suggest the following milestones:

June 1st:          All European T1s + TRIUMF and CERN with at least 1 CE. 5 T2s supporting ALICE with 1 CE each.

August 1st:      2 T2s for each experiment provide 1 CREAM-CE each.

October 1st:     50 sites in addition to the ones above.


At the beginning of September we should re-consider the roadmap and decide whether the October milestone can be more aggressive.

With a total of about 100 CREAM-CEs in production we would probably gather enough experience during the first LHC run period to phase out the LCG-CE after the run.
This means that the LCG-CE will remain on the infrastructure until late 2010. This raises the question of a port to SL5, something we tried to avoid.

 

 

 

8.   AOB

 

 

QR Reports: Material for the QR will be needed during the month of April.

 

Experiments Requirements: No meeting next week. Experiment requirements on the 31st March.

 

9.   Summary of New Actions

 

 

New Actions

          Tier-1 Sites should start publishing the UserDN information.

          Countries should comment on the policy document on user information accounting.

          The CESGA production Portal should be verified. J.Gordon sent information to the MB about which portal to use (prod or pre-prod).

 

New Action:

J.Gordon and R.Pordes will write a requirements document on user information accounting.

 

 

New Action:

J.Gordon will send to the GDB the message about the need for 32-bit compatibility libraries on SL5 64-bit installations.

 

 

New Action:

Sites should report whether (or when) FTS is deployed on SL4.

 

 

New Action

A.Aimar will verify with M.Litmaath the situation for the pilot jobs frameworks of ALICE, ATLAS and CMS.