LCG Management Board

Date/Time: Tuesday 8 September 2009 16:00-18:00 – F2F Meeting

Agenda

Members

(Version 1 – 13.9.2009)

Participants: A.Aimar (notes), D.Barberis, O.Barring, I.Bird (chair), K.Bos, Ph.Charpentier, L.Dell'Agnello, M.Ernst, I.Fisk, S.Foffano, Qin Gang, M.Girone, J.Gordon, D.Groep, A.Heiss, F.Hernandez, M.Kasemann, M.Lamanna, M.Litmaath, P.Mato, G.Merino, A.Pace, B.Panzer, R.Pordes, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout

Invited: A.Sciabá, T.Bell

Action List

Mailing List Archive: https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting: Tuesday 15 September 2009 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments were received about the minutes.

1.2 SAM Availability Calculations

I.Bird reported that the SAM Availability calculations are going to be modified: downtimes that are scheduled in GOCDB but not recorded sufficiently in advance will no longer be counted as scheduled, as they have been until now.
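To make the new accounting rule concrete, the small Python sketch below classifies a downtime according to how far in advance it was declared. The 24-hour advance-notice threshold is an assumption for illustration only, not the actual GOCDB/SAM parameter.

from datetime import datetime, timedelta

# Assumed advance-notice threshold for illustration; the real GOCDB/SAM rule may differ.
MIN_ADVANCE_NOTICE = timedelta(hours=24)

def classify_downtime(declared_at, starts_at):
    """Return how a downtime declared in GOCDB would be counted.

    Only downtimes declared sufficiently in advance of their start count as
    'scheduled'; late declarations are treated as unscheduled and therefore
    reduce the site availability figure.
    """
    if starts_at - declared_at >= MIN_ADVANCE_NOTICE:
        return "scheduled"
    return "unscheduled"

# A downtime declared only two hours before it starts counts as unscheduled.
print(classify_downtime(datetime(2009, 9, 8, 10, 0), datetime(2009, 9, 8, 12, 0)))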
1.3 2009/2010 Resources and Requirements (More Information) – I.Bird
There is now updated information on the Resources page of the WLCG Web (Link). There are two tables, one after the meeting of the Scrutiny Group and the second after ALICE's corrections:
- (Sep 3, 2009) Resource Requirements for 2009/2010 after LHCC/C-RSG review July 2009
- (Sep 4, 2009) Requirements for 2009/10 as below but with ALICE updated figures

The main difference is the ALICE Tier-1 CPU request, which is 50% smaller than before.

S.Foffano added that each Site will be contacted to confirm (or change) its 2009/10 pledges and data. This is needed for the RRB and should be sent by the end of September.

1.4 Referees Meeting (Agenda) – I.Bird

On 21 September there is a Referees meeting. The agenda includes:
- Global status report
- Follow-up of the STEP09 actions
- Preparation of the EGEE to EGI transition
- 15' per Experiment

Ph.Charpentier reminded the MB that LHCb will not be able to be represented but will post some slides.
1.5 Move to SLC5 at CERN (on lxplus) – I.Bird

The proposed switch of the "lxplus" alias from SLC4 to SLC5 received several objections from the Experiments (via emails to the MB mailing list). The SLC5 resources at CERN are mostly unused and users should be encouraged to migrate. I.Bird proposed that each Experiment present its status regarding SLC5.

D.Barberis noted that the alias is not the reason for the low usage of the SLC5 resources. The reason is that binaries built on SL4 will run on SL5, but not vice versa; SL4 will therefore still be used for building. Even if the alias is changed, the SL4 resources should not diminish.

Ph.Charpentier added that changing the alias will cause two weeks of help requests. The change should happen once SL5 is supported at most of the Sites; Sites should migrate to SL5 urgently.

I.Bird asked whether the alias could be left unchanged, with users explicitly choosing SL5 or SL4 resources.

P.Mato added that the switch should happen when most resources are on SL5. Changing the alias will affect the scripts using lxplus. Users should already have been told exactly how to get onto SL4 or SL5; physicists do not know that all these SL5 resources exist or how to address them. IT should distribute this information more widely and repeatedly.

O.Barring added that the instructions are available on the lxbatch pages and are also shown to the users at login.
D.Barberis noted that binaries built on SL5 cannot run on SL4, so one cannot use as many resources; users will switch only when SL4 has become negligible. The alias can be changed as long as the SL4 and SL5 resources can be addressed specifically.

Ph.Charpentier noted that, as many Sites have not migrated, users only trust building on SL4; otherwise they risk not being able to run at some Sites.

I.Bird concluded that:
- The alias switch should not be done now
- Enough resources on SL4 should be kept
- The Experiments should present their steps towards SL5

A new High Level Milestone should be added for SL5 for interactive users.

Ph.Charpentier added that the lxplus alias should not be changed, but aliases for SL4 and SL5 should be advertised, so that physicists know exactly what they are doing.
2. Action List Review (List of actions)
R.Wartel told A.Aimar that the action is still incomplete and that the information provided is insufficient. L.Dell'Agnello replied that he will send additional information.

Not done by: ES-PIC, FR-CCIN2P3, NDGF, NL-T1.
3. LCG Operations Weekly Report (Slides) – M.Girone
Summary of the status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

It was a smooth week of operations, with good attendance (~15 people) and no major problems. There is more systematic use of the Site Status Board (SSB) for Tier-0 scheduled interventions, and there will be quite a few interventions in the coming days. The usual level of ongoing problems was picked up and solved quickly (see the daily minutes).

Tier-1: There was significant site instability at SARA for ALICE due to a VObox issue: the proxy renewal service stopped. It seemed solved on 1 September but the problem came back on 2 September and is still under investigation.

Tier-0: There was a CASTOR outage for LHCb on Monday 07.09 from 6.10 am to 7.45 am due to clusterware problems. A SIR has been requested.

No alarms were raised this week.
4. Real Time MSS Metrics – Roundtable
The Experiments asked for more real-time metrics during STEP09.

CERN

T.Bell reported that CERN was asked for two changes during STEP09:
- Show data rates in quasi-real time, with 5-minute granularity; the Experiments asked for immediate feedback rather than waiting for information collected over 4 hours
- Move to a 24-hour sliding window instead of the 4-hour sliding window

These changes are done for CERN. The real-time data rates will be available in October, at the time of the CASTOR upgrade; for now there are only early pictures of the proposal.
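As an illustration of the requested monitoring change, the short Python sketch below computes a transfer rate over a sliding time window, published at a 5-minute step. The class name, the window handling and the numbers are purely illustrative assumptions, not the actual CASTOR monitoring implementation.

from collections import deque

class SlidingRate:
    """Toy sketch of a sliding-window transfer-rate monitor.

    Transfer records (timestamp in seconds, bytes moved) are kept for the
    length of the window; the rate is the total volume divided by the
    window length.
    """
    def __init__(self, window_seconds=24 * 3600):
        self.window = window_seconds
        self.records = deque()          # (timestamp, bytes)
        self.total_bytes = 0

    def add(self, timestamp, nbytes):
        self.records.append((timestamp, nbytes))
        self.total_bytes += nbytes

    def rate(self, now):
        # Drop records that have fallen out of the window.
        while self.records and self.records[0][0] < now - self.window:
            _, old = self.records.popleft()
            self.total_bytes -= old
        return self.total_bytes / self.window   # bytes per second

monitor = SlidingRate()
monitor.add(0, 4 * 1024**3)             # 4 GiB moved at t=0
print(monitor.rate(300))                # rate as published 5 minutes later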
J.Gordon asked whether Tier-1 Sites can produce these or other real-time metrics. Sites do not have hardware or systems dedicated to specific Experiments, or even to HEP; therefore this information can be very difficult to obtain.

M.Kasemann clarified that for the moment the urgent priority is CERN: the metrics are needed to correlate the Experiments' performance with the data storage. Tier-1 metrics will also be useful, but that is a second-order problem.

K.Bos added that this is also sufficient for ATLAS. Some Sites already provide summaries after one hour, and MSS shared with other communities are difficult to monitor.
5. Update on the SRM MoU Addendum Features (Slides; More Information) – A.Sciabá
The Experiments were asked to prioritize the requirements. The scale is as follows:
- 0 = useless
- 1 = if available, it could allow for more functionality in data management, or better performance, or easier operations, but it is not critical
- 2 = critical: it should be implemented as soon as possible and its lack causes a significant degradation of functionality / performance / operations
Fractional values will also be used.

All information is now collected in a wiki page with all details on the features and the ranking of each Experiment: https://twiki.cern.ch/twiki/bin/view/LCG/SrmMoUStatus
5.1 Feature by Feature Priorities

The slides presented summarize the wiki page content as of today.

Protection of Spaces
Implementation:
- CASTOR: yes, but not by DN/FQAN, only by the local user identification
- dCache: only tape protection (by DN/FQAN); staging of data can be controlled
- DPM: only write-to-space protection
Priorities:
- ATLAS: 2: extremely important
- CMS: 1: protect spaces dedicated to special activities (e.g. T1D0 at T1)
- LHCb: 1: needed for better data protection

Full VOMS Awareness
Implementation:
- CASTOR: not available, no estimate of availability
- Present in the other implementations
Priorities:
- ATLAS: 1.5: very important…
- CMS: 1: easier management of access privileges
- LHCb: 1: needed for better data protection

Select Spaces for Read Operations
Implementation:
- CASTOR: available
- dCache: no, not foreseen at this time
- StoRM: no, not foreseen (but OK for T2s)
Priorities:
- ATLAS: 1
- CMS: 1: to enable the use of space tokens at all T1s
- LHCb: 1: needed to understand data movement and used disk space

"ls" Returns All Space Tokens with a Copy of the File
Implementation:
- CASTOR: not yet
- dCache/StoRM: no, but files can be in one space only
Priorities:
- ATLAS: 0.5
- CMS: 1: to understand where a file is
- LHCb: 0.5

Note: on the client side all the above features are accessible via GFAL/lcg-utils.

File Pinning
Implementation:
- CASTOR: only (very) soft pinning, if any
- The other implementations have it
Priorities:
- ATLAS: 2: essential; soft pinning acceptable in view of the upcoming pre-stage service
- CMS: 2: essential
- LHCb: 2: essential; soft pinning can be accepted but must have a real effect on the probability of a file being garbage collected

Ph.Charpentier noted that real pinning is really needed, otherwise files may not be available on disk when needed.
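To illustrate the distinction between hard and soft pinning discussed above, the purely conceptual Python sketch below models a cache cleaner in which a soft pin only lowers the probability that a file is selected for garbage collection, rather than excluding it outright. The weight value is an arbitrary illustration, not a CASTOR or dCache parameter.

import random

def select_for_eviction(files, n, soft_pin_weight=0.1):
    """Illustrative sketch of 'soft' pinning in a disk cache.

    files: list of dicts like {"name": ..., "pinned": bool}
    With hard pinning, pinned files would be excluded from eviction outright;
    with soft pinning they merely get a much lower selection weight, so a pin
    lowers (but does not eliminate) the chance of garbage collection.
    """
    weights = [soft_pin_weight if f["pinned"] else 1.0 for f in files]
    return random.choices(files, weights=weights, k=n)

cache = [{"name": "raw_001.root", "pinned": True},
         {"name": "raw_002.root", "pinned": False},
         {"name": "raw_003.root", "pinned": False}]
print(select_for_eviction(cache, n=1))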
5.2 Summary of the Priorities

Extremely important:
- Space protection, but at least tape protection is available everywhere
- File pinning: on CASTOR it is almost non-existent

Rather important:
- VOMS awareness: missing from CASTOR

Useful:
- Target space selection: missing in dCache and StoRM

Nice to have:
- "ls" returns all spaces with a copy of the file: missing in CASTOR (where it makes sense)

Everything considered, file pinning and VOMS awareness on CASTOR have the highest weight and require significant development. Everything else from the MoU addendum is more or less acceptable, or the Experiments can survive without it.
Ph.Charpentier reminded the MB of the importance of protecting the data from intentional or unintentional deletion.

K.Bos asked how these priorities are now going to be implemented and installed at the Sites. Sites should move to dCache 1.9.4, which has ACLs, but Sites have not moved; ATLAS would insist on having ACLs (i.e. 1.9.4) before the run.

A.Sciabá added that they will not require the move to Chimera.

K.Bos repeated that protecting the tapes from reading will be essential for ATLAS.

A.Heiss asked whether tape protection is sufficient or whether real ACLs are needed.

J.Gordon replied that even protecting from reading (to avoid overloading of the MSS) requires ACLs.

A.Sciabá added that the MSS systems also have a dCache configuration file with the list of DNs that can access the tapes. In CASTOR the list is of local users, not of grid users, so one must know the local mapping for data management.

Note: other discussions describing (in)security details and data access are not reported in the minutes.
6. Update on Data Access (Slides) – M.Litmaath
M.Litmaath presented the latest news on data access, the issues encountered during STEP09 and the ongoing work.

6.1 Summary

Data access efficiencies need improvement at many sites; one must reach lower failure rates and better CPU/wall-clock ratios. The situation depends on the supported experiments: a site may work well for some Experiments and not for others. It can also depend on the site layout and network infrastructure, which usually cannot be adapted in the short term.

Performance can also depend on the access protocols and their usage. For instance, read-ahead is good for sequential access and bad for random access. Moreover, the RFIO read-ahead buffer size cannot (yet) be set by client code, only at the Sites: one value per worker node is set by the admin, but different experiments need different buffer sizes because of their different event and processing models.

Should we aim at phasing out some protocols in favour of others? Probably not feasible in the short term.
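The read-ahead trade-off mentioned above can be illustrated with a small Python sketch: a fixed read-ahead window cuts the number of storage requests for sequential access, but for sparse access it mostly transfers data that is never used. The sizes and the simple access model are illustrative assumptions, not measurements.

def bytes_transferred(read_offsets, event_size, readahead):
    """Rough model of why read-ahead helps sequential but hurts sparse access.

    Each application read of `event_size` bytes makes the storage ship at
    least `readahead` bytes; reads falling inside an already-fetched window
    cost nothing extra.
    """
    fetched_up_to = -1
    total = 0
    for off in read_offsets:
        if off + event_size > fetched_up_to:
            total += max(readahead, event_size)
            fetched_up_to = off + max(readahead, event_size)
    return total

event = 100 * 1024                                   # 100 kB per event
sequential = [i * event for i in range(1000)]        # read every event in order
sparse = [i * event for i in range(0, 1000, 50)]     # read 1 event in 50

for offsets, name in [(sequential, "sequential"), (sparse, "sparse")]:
    print(name, bytes_transferred(offsets, event, readahead=1024**2), "bytes")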
6.2 Short Term Plans

ATLAS
ATLAS are determining the best access method per site. The activity is mostly driven by the clouds that want to improve, working with central steering through the HammerCloud team. Some global tests will possibly take place early in October 2009. The best method found for each site is stored in the central configuration.

CMS
CMS intend to investigate improvements early in October 2009 and will take note of the ATLAS results in order to avoid duplication of work. Each CMS site already configures the protocol to be used by jobs.

ALICE and LHCb
No plans so far, as there were no problems in STEP09. A small LHCb problem (dCache client vs. ROOT plugin) is now solved by shipping the dCache client.
Ph.Charpentier noted that LHCb does have such activities, but this is not an issue for the moment.

6.3 Preliminary Recommendations (Max Baak, ATLAS)

Here is what M.Baak found when analysing the performance at the CERN T3 Site:
- Xrootd and RFIO read-ahead buffering: very inefficient
- Xrootd: frustrating dependency on PoolFileCatalog.xml
- Do not use the RFIO protocol to loop over files on CASTOR!

Different recommendations are needed for single or multiple jobs:
- Single jobs: FileStager does very well
- Multiple, production-style jobs (at the CERN T3, but maybe not everywhere)

D.Barberis warned that the results should be taken with care because they come only from some measurements at CERN; they are not a general statement.

Ph.Charpentier noted that these tools are needed because the MSS systems do not have a usable, efficient caching system accessible to the users. In addition, the MSS should return proper TURLs so that the Experiments do not need to manage them by themselves.
7. AOB
7.1 Next ATLAS Jamboree

ATLAS will organize an ATLAS Jamboree on 13 October, the day before the next GDB.

7.2 Frontier in ATLAS

ATLAS started using Frontier at some Sites for accessing their data. This is in addition to 3D, which will continue to work as now. The goal is to help with access for analysis.
8. Summary of New Actions
No new actions.