LCG Management Board

Date/Time

Tuesday 8 September 2009 16:00-18:00 – F2F Meeting 

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=62557

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 13.9.2009)

Participants

A.Aimar (notes), D.Barberis, O.Barring, I.Bird (chair), K.Bos, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, Qin Gang, M.Girone, J.Gordon, D.Groep, A.Heiss, F.Hernandez, M.Kasemann, M.Lamanna, M.Litmaath, P.Mato, G.Merino, A.Pace, B.Panzer, R.Pordes, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout

Invited

A.Sciabá, T.Bell

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 15 September 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes.

1.2      SAM Availability Calculations

I.Bird reported that the SAM Availability calculations are going to be modified: downtimes that are scheduled in GOCDB but not recorded sufficiently in advance will no longer be counted as scheduled, as they have been until now.
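
As an illustration only, the rule could be expressed as in the following sketch; the advance-notice threshold used below is an assumption, since the actual value is defined by the WLCG availability algorithm and not by these minutes.

```python
from datetime import datetime, timedelta

# Assumed advance-notice threshold; the real value comes from the WLCG
# availability rules, not from these minutes.
MIN_ADVANCE_NOTICE = timedelta(hours=24)

def classify_downtime(declared_at: datetime, starts_at: datetime) -> str:
    """Count a GOCDB downtime as 'scheduled' only if it was declared
    sufficiently in advance; otherwise treat it as 'unscheduled'."""
    if starts_at - declared_at >= MIN_ADVANCE_NOTICE:
        return "scheduled"
    return "unscheduled"

# Example: declared only 2 hours before it starts -> counted as unscheduled.
print(classify_downtime(datetime(2009, 9, 8, 10, 0), datetime(2009, 9, 8, 12, 0)))
```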

1.3      2009/2010 Resources and Requirements (More Information) – I.Bird

There is now updated information at the Resources page on the WLCG Web (Link).

 

There are two tables: one after the meeting of the Scrutiny Group and a second one after ALICE’s corrections:

-       (Sep 3, 2009) Resource Requirements for 2009/2010 after LHCC/C-RSG review July 2009

-       (Sep 4, 2009) Requirements for 2009/10 as below but with ALICE updated figures

 

The main difference is the ALICE Tier-1 CPU request, which is 50% smaller than before.

 

S.Foffano added that each Site will be contacted to confirm (or change) its 2009/10 pledges and data. This information is needed for the RRB and should be sent by the end of September.

1.4      Referees Meeting (Agenda) – I.Bird

On 21 September there is a Referees meeting. The agenda includes:

-       Global status report

-       Follow-ups to the STEP09 actions.

-       Preparation to EGEE to EGI Transition

-       15’ per Experiment

 

Ph.Charpentier reminded the MB that LHCb will not be able to be represented but will post some slides.

1.5      Move to SLC5 at CERN (on lxplus) – I.Bird

The proposed switch of the “lxplus” alias from SLC4 to SLC5 received several objections from the Experiments (via emails to the MB mailing list).

The SLC5 resources at CERN are mostly unused and one should encourage the users to migrate.

 

I.Bird proposed that each Experiment presents their status about SLC5.

 

D.Barberis noted that the alias is not the reason for the low usage of the SLC5 resources. The reason is that software built on SL4 will run on SL5, but not vice versa; SL4 will therefore still be used for building. Even if the alias is changed, the SL4 resources should not diminish.

 

Ph.Charpentier added that changing the alias will cause two weeks of help requests. The change should happen when SL5 is supported at most of the Sites; Sites should migrate to SL5 urgently.

 

I.Bird asked whether the alias should be left unchanged, with users explicitly choosing SL5 or SL4 resources.

 

P.Mato added that the switch should happen when most resources are on SL5. Changing the alias will affect the scripts using lxplus. Users should already have been told exactly how to get onto SL4 or SL5. Physicists do not know that all these SL5 resources exist or how to address them; IT should distribute this information more widely and repeatedly.

 

O.Barring added that the instructions are available on the lxbatch pages and are also explained to users at login.

 

D.Barberis noted that software built on SL5 cannot run on SL4, so one cannot use as many resources. Only when the SL4 share becomes negligible will users switch. The alias can be changed as long as the SL4 and SL5 resources can still be addressed specifically.

 

Ph.Charpentier noted that, as many Sites have not yet migrated, users only trust builds done on SL4; otherwise they risk not being able to run at some Sites.

 

I.Bird concluded that:

-       The alias switch should not be done now

-       Enough resources on SL4 should be kept

-       Experiments should present their steps towards SL5

 

A new High Level milestone should be added for SL5 for interactive users.

 

Ph.Charpentier added that the lxplus alias should not be changed, but aliases for SL4 and SL5 should be advertised, so that physicists know exactly what they are doing.

 

2.   Action List Review (List of actions)

 

  • 5 May 2009 - CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

R.Wartel told A.Aimar that the action is still incomplete and the information is insufficient.

L.Dell’Agnello replied that he will send additional information.

  •  Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar.

Not done by: ES-PIC, FR-CCIN2P3, NDGF, NL-T1.
Sites can provide what they have at the moment.

See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics

Sites should send URLs pointing to whatever information already exists until they can provide the required metrics.
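
As a rough sketch only of what such a URL could point to: the XML layout, metric names and site URL below are hypothetical, since the required schema is not described in these minutes.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical URL and element names; the real SLS schema and the exact
# tape metrics requested by WLCG are not specified in these minutes.
URL = "https://tier1.example.org/monitoring/tape_metrics.xml"

def read_tape_metrics(url: str) -> dict:
    """Fetch a site's tape-metrics XML file and return {metric name: value}."""
    with urllib.request.urlopen(url) as response:
        root = ET.fromstring(response.read())
    # Assumed layout: <metrics><metric name="...">value</metric>...</metrics>
    return {m.get("name"): m.text for m in root.findall("metric")}

print(read_tape_metrics(URL))
```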

 

3.   LCG Operations Weekly Report (Slides) – M.Girone
 

 

Summary of the status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

 

Smooth week of operations, good attendance (~15 people) and no major problems. More systematic use of the Site Status Board (SSB) for Tier-0 scheduled interventions. There will be quite a few interventions in the coming days.

 

Usual level of on-going problems that are picked up and solved quickly (see daily minutes).

 

Tier-1: There was significant site instability at SARA for ALICE due to a VObox issue: the proxy renewal service stopped. It seemed to be solved on 1st Sept but the problem came back on 2nd Sept and is still under investigation.

 

Tier-0: There was a CASTOR outage for LHCb on Monday 07.09 from 6.10am to 7.45am due to clusterware problems. A SIR was requested.

 

No alarms this week.

VO        User   Team   Alarm   Total
ALICE        2      0       0       2
ATLAS        8     33       0      41
CMS          3      0       0       3
LHCb         0     14       0      14
Totals      13     47       0      60

 

 

 

4.   Real Time MSS Metrics – Roundtable

 

 

The Experiments asked for more real-time metrics during STEP09.

 

CERN

T.Bell reported that CERN was asked for changes during STEP09:

-       Showing data rates in quasi-real time, with 5-minute intervals. The Experiments asked for immediate feedback rather than waiting for information collected over 4 hours.

-       Moving to a 24-hour sliding window instead of the 4-hour sliding window. These changes are done for CERN.

 

The real-time data rates will be available in October with the CASTOR upgrade. For now there are only early pictures of the proposal.
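
A minimal sketch of the two requested views (5-minute rates and a 24-hour sliding window), assuming transfer records are available as (timestamp, bytes) pairs; the actual CASTOR monitoring implementation is not described in these minutes and the sample numbers are invented.

```python
from datetime import datetime, timedelta

# Hypothetical sample data: (transfer end time, bytes transferred).
transfers = [
    (datetime(2009, 9, 8, 16, 1), 2 * 1024**3),
    (datetime(2009, 9, 8, 16, 4), 3 * 1024**3),
    (datetime(2009, 9, 8, 16, 9), 1 * 1024**3),
]

def rate_in_window(records, now, window):
    """Average data rate (bytes/s) over a sliding window ending at 'now'."""
    start = now - window
    total = sum(nbytes for t, nbytes in records if start <= t <= now)
    return total / window.total_seconds()

now = datetime(2009, 9, 8, 16, 10)
print("5-minute rate:", rate_in_window(transfers, now, timedelta(minutes=5)))
print("24-hour rate :", rate_in_window(transfers, now, timedelta(hours=24)))
```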

 

J.Gordon asked whether Tier-1 Sites can produce these or other real-time metrics. Sites do not have hardware or systems dedicated to specific Experiments, or even to HEP; therefore this information can be very difficult to obtain.

 

M.Kasemann clarified that for the moment the urgent priority is CERN: the metrics are needed to correlate the Experiments’ performance with the data storage. Tier-1 metrics will also be useful but are a second-order problem.

 

K.Bos added that this is also sufficient for ATLAS. Some Sites already give summaries after one hour, and MSS instances shared with other communities are difficult to monitor.

 

 

5.   Update on the SRM MoU Addendum features (Slides; More Information) – A.Sciabá 

 

 

The Experiments were asked to prioritize the requirements. The scale is as follows:

-       0 = useless

-       1 = if available, it could allow for more functionality in data management, or better performance, or easier operations, but it is not critical

-       2 = critical: it should be implemented as soon as possible and its lack causes a significant degradation of functionality / performance / operations

-       Fractional values will also be used.

 

All information is now collected in a wiki page with all details on the features and the ranking of each Experiment.

https://twiki.cern.ch/twiki/bin/view/LCG/SrmMoUStatus

5.1      Feature by Feature Priorities

The slides presented summarize the wiki page content as of today.

 

Protection of Spaces

Implementation

-       CASTOR: yes, but not by DN/FQAN, only by the local user identification.

-       dCache: only tape protection (by DN/FQAN). Staging of data can be controlled.

-       DPM: only Write-To-Space protection

Priorities

-       ATLAS: 2: extremely important

-       CMS: 1: protect spaces dedicated to special activities (e.g. T1D0 at T1)

-       LHCb: 1: needed for better data protection

 

Full VOMS awareness

Implementation

-       CASTOR: not available, no estimate of availability

-       Present in the other implementations

Priorities

-       ATLAS: 1.5: very important…

-       CMS: 1: easier management of access privileges

-       LHCb: 1: needed for better data protection

 

Select Spaces for Read Operations

Implementation

-       CASTOR: available

-       dCache: no, not foreseen at this time

-       StoRM: no, not foreseen (but ok for T2’s)

Priorities

-       ATLAS: 1

-       CMS: 1: to enable use of space tokens at all T1s

-       LHCb: 1: needed to understand data movement and used disk space

 

“ls” Returns all Space Tokens with a Copy of the File

Implementation

-       CASTOR: not yet

-       dCache/StoRM: no, but files can be in one space only

Priority

-       ATLAS: 0.5

-       CMS: 1: to understand where a file is

-       LHCb: 0.5

 

Note: On the client side all above features are accessible via GFAL/lcg-utils

 

File Pinning

Implementation

-       CASTOR: only (very) soft pinning, if any.

-       The other implementations have it

Priority

-       ATLAS: 2: essential; soft pinning acceptable in view of the upcoming Pre-stage Service

-       CMS: 2: essential

-       LHCb: 2: essential. Soft pinning can be accepted but must have a real effect on the probability of a file being garbage collected

 

Ph.Charpentier noted that real pinning is really needed; otherwise files may not be available on disk when needed.

5.2      Summary of the Priorities

Extremely important

-       Space protection; but at least tape protection is available everywhere

-       File pinning: on CASTOR it is almost non-existent

 

Rather important

-       VOMS awareness: Missing from CASTOR

 

Useful

-       Target space selection: Missing in dCache and StoRM

 

Nice to have

-       “ls” returns all spaces with a copy of the file: Missing in CASTOR (where it makes sense)

 

Everything considered, file pinning and VOMS awareness on CASTOR have the highest weight and require significant development. Everything else from the MoU addendum is more or less acceptable, or the Experiments can survive without it.

 

Ph.Charpentier recalled the importance of protecting the data from intentional or unintentional deletion.

 

K.Bos asked how these priorities are now going to be implemented and installed at the Sites.

Sites should move to dCache 1.9.4, which has ACLs, but Sites have not moved yet. ATLAS would insist on having ACLs (i.e. 1.9.4) before the run.

 

A.Sciabá added that they will not require the move to Chimera.

 

K.Bos repeated that protecting the tapes from reading will be essential for ATLAS.

 

A.Heiss asked whether tape protection is sufficient or real ACLs are needed.

J.Gordon replied that even protecting the tapes from reading (to avoid overloading the MSS) requires ACLs.

A.Sciabá added that dCache also has a configuration file with the list of DNs that can access the tapes. In CASTOR the list contains local users, not grid users; one must know the local mapping for data management.

 

Note: Other discussions describing (in)security details and data access are not reported in the minutes.

 

 

6.   Update on Data Access (Slides) – M.Litmaath

 

 

M.Litmaath presented the latest news on data access, the issues encountered during STEP09 and the ongoing work.

6.1      Summary

Data access efficiencies need improvements at many sites; one must reach lower failure rates, better CPU-wall clock ratios.

This depends on the supported experiments: a site may work well for some Experiments and not for others. It can also depend on the site layout and network infrastructure, which usually cannot be adapted in the short term.

 

Efficiency can also depend on the access protocols and their usage. For instance, read-ahead is good for sequential access but bad for random access. Moreover, the RFIO read-ahead buffer size cannot (yet) be set by client code, only at the Sites.

There is one value per WN, set by the admin, but different experiments need different buffer sizes due to their different event and processing models.
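
A back-of-the-envelope illustration of why a single per-WN read-ahead value cannot suit every workload; the buffer and event sizes below are invented for the example and are not the actual site settings.

```python
def bytes_transferred(events_read, bytes_per_event, read_ahead):
    """Crude model: each sparse read of one event pulls at least 'read_ahead'
    bytes over the network, so far more data is moved than is actually used."""
    per_read = max(bytes_per_event, read_ahead)
    return events_read * per_read

READ_AHEAD = 1 * 1024 * 1024   # assumed 1 MB per-WN read-ahead setting
EVENT_SIZE = 20 * 1024         # assumed 20 kB of useful payload per event

used = 1000 * EVENT_SIZE
moved = bytes_transferred(1000, EVENT_SIZE, READ_AHEAD)
print(f"overhead factor: {moved / used:.0f}x")  # ~50x, of the order reported below
```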

 

Should we aim at phasing out some protocols in favour of others? Probably not feasible in the short term.

6.2      Short Term Plans

ATLAS

ATLAS are determining the best access method per site. The activity is mostly driven by the clouds that want to improve, working with central steering through the HammerCloud team.

Some global tests will possibly take place in early October 2009.

The best method found for each site is stored in the central configuration.

 

CMS

CMS intend to investigate improvements in early October 2009 and will take note of the ATLAS results in order to avoid duplication of work. Each CMS site already configures the protocol to be used by jobs.

 

ALICE and LHCb

ALICE and LHCb have no plans so far, as they had no problems during STEP09.

A small LHCb problem (dCache client vs. ROOT plugin) is now solved by shipping the dCache client.

 

Ph.Charpentier noted that LHCb does have activities in this area, but it is not an issue for the moment.

 

6.3      Preliminary Recommendations (Max Baak, ATLAS)

Here is what M.Baak has found analysing the performance at the CERN T3 Site:

 

-       Xrootd & rfio read-ahead buffering: very inefficient.
Lots of unnecessary data transfer (sometimes >50x data processed!)
1 job completely blocks up 1Gbit Ethernet card of lxbatch machines
Large spread in job times, i.e. unreliable

 

-       Xrootd: frustrating dependency on PoolFileCatalog.xml

 

-       Don’t use rfio protocol to loop over files on CASTOR!
>5x slower & takes up too much network bandwidth

 

Different recommendations are needed for single or multiple jobs:

 

-       Single jobs: FileStager does very well.

 

-       Multiple, production-style jobs (at CERN T3, but maybe not everywhere)
Xrootd (no buffer) works extremely stably & fast on files in the disk pool cache. A factor ~2 slow-down when the read-ahead buffer is turned on.

Two recommendations:
- Xrootd, no buffer, for cached files
- FileStager or Xrootd for uncached files
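
A hedged sketch of the two recommended access patterns only: a direct xrootd read versus staging the file to local disk first (the idea behind FileStager, not its actual API). The redirector, path and tree name below are hypothetical.

```python
import subprocess
import ROOT  # PyROOT; assumes a ROOT installation with xrootd support

# Hypothetical CASTOR path; real redirectors and paths differ per site.
remote = "root://castor.example.cern.ch//castor/cern.ch/user/x/xyz/data.root"

# Pattern 1: direct xrootd read, recommended for files already in the disk pool cache.
f = ROOT.TFile.Open(remote)
tree = f.Get("events")  # "events" is an assumed tree name

# Pattern 2: stage the file to local scratch first, then read it locally
# (the FileStager idea, for files that are not cached).
subprocess.run(["xrdcp", remote, "/tmp/data.root"], check=True)
f_local = ROOT.TFile.Open("/tmp/data.root")
```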

 

D.Barberis warned that these results should be taken with care because they come only from some measurements at CERN; they are not a general statement.

 

Ph.Charpentier noted that these tools are needed because the MSS systems do not have a usable, efficient caching system accessible to the users. In addition, the MSS should return proper TURLs so that the Experiments do not need to manage them themselves.

 

 

7.    AOB

 

 

7.1      Next ATLAS Jamboree

ATLAS will organize an ATLAS Jamboree on 13 October, the day before the next GDB.

7.2      Frontier in ATLAS

ATLAS started using Frontier at some Sites for accessing their data. This will be in addition to 3D, which will continue to work as now. The goal is to ease data access for analysis.

 

 

8.    Summary of New Actions

 

 

 

No new actions.