LCG Management Board

Date/Time: Tuesday 21 November 2006 at 16:00

Agenda:

Members:

(Version 1 - 24.11.2006)

Participants: A.Aimar (notes), D.Barberis, S.Belforte, I.Bird, N.Brook, F.Carminati, Ph.Charpentier, Di Qing, C.Eck, I.Fisk, B.Gibbard, J.Gordon, F.Hernandez, M.Kasemann, E.Laure, H.Marten, P.Mato, M.Mazzucato, G.Merino, B.Panzer, R.Pordes, L.Robertson (chair), O.Smirnova, J.Templon

Action List

Next Meeting: Tuesday 5 December 2006, 16:00-18:00, face-to-face at CERN
1. Minutes and Matters Arising (minutes)

1.1 Minutes of Previous Meeting

The minutes of last week were approved. A discussion took place about the CE strategy of the EGEE TCG that was presented at the last GDB and discussed during the MB item on the GDB. M.Mazzucato considered that the TCG should first report decisions and planning to the EGEE PMB, where the resources are allocated, and only later present them to the LCG. This was not the opinion of other EGEE members. The MB agreed that this is an EGEE issue and the discussion was not continued further.

1.2 “CPU Time Escaping Jobs” Issue - J.Templon

J.Templon reported that there are two classes of applications that are “escaping” CPU time accounting and causing invalid results for the sites:
- ATLAS is using the Condor glide-in technology, which starts a new session for every job. This causes the CPU time accounted for ATLAS to be incorrect.
- LHCb has a new version of its software, compiled with special flags, that will not run on all grid nodes at SARA. If a job reaches a node that cannot run it, it waits 10 minutes and then exits. This means that the job blocks the scheduler slot, causing a loss of up to 10% of the available CPU time (a fail-fast check of the kind sketched at the end of this item would avoid the 10-minute wait).

Ph.Charpentier explained that on new machines, compiling with this new flag allows some 20% performance improvement, but the resulting binaries will not run on legacy machines (P3, Athlon, etc.).

J.Gordon remarked that some sites are not homogeneous and no implementation of sub-clusters is available that could allow the use of different binaries depending on the node. This will also become an issue when 32-bit and 64-bit machines are installed at the same site.

I.Fisk noted that “wall clock” accounting is useful because it shows how many resources were made available to a VO. Whether the VO actually used all the CPUs is the VO’s problem.

Ph.Charpentier raised the issue that at some sites part of the nodes have less memory (e.g. 512 MB vs. 1 GB). The sites should declare the configuration of their nodes correctly; otherwise the batch system and the “blah” component (with the gLite CE) cannot send the jobs to the right nodes.

S.Belforte noted that the issue is that the sites should provide the details of their configurations and correct node descriptions, and this is not the case at most sites.

D.Barberis mentioned that the ATLAS pilot jobs check the configuration and run small jobs where the nodes have less memory. This is possible because ATLAS discussed with the sites the machines that it uses, while LHCb is trying to use all resources available at a site.

New Action: In 2 weeks D.Barberis and I.Bird will report on the issues with accounting vs. Condor glide-in job submission.
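To illustrate the LHCb failure mode, here is a minimal, hypothetical sketch (not LHCb’s or SARA’s actual code) of a pre-flight check that inspects the worker node’s CPU feature flags and exits immediately on an incompatible node, instead of blocking the scheduler slot for 10 minutes:

    # Illustrative sketch only: verify that the worker node supports the
    # instruction-set extensions required by the optimized binaries, and
    # fail fast (freeing the slot) if it does not.
    REQUIRED_FLAGS = {"sse2"}  # hypothetical requirement of the optimized build

    def cpu_flags():
        """Return the CPU feature flags reported by the Linux kernel."""
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()

    missing = REQUIRED_FLAGS - cpu_flags()
    if missing:
        raise SystemExit(f"Incompatible worker node, missing CPU flags: {missing}")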
2. Action List Review (list of actions)

Actions that are late are highlighted in RED.

- Done. H.Renshall presented the situation at the MB and will distribute the requirements.
- Ongoing. A proposal was distributed; it is discussed later in the meeting (see AOB).
- Started, but not completed yet. The mandate is being discussed with J.Casey and I.Neilson, who are proposed as joint chairs. It was decided to spend a few more days to carefully define the scope and the mandate of the working group on Monitoring Tools. For the System Analysis group, I.Bird will discuss with J.Andreeva, the proposed chair, during next week.
- Not done. Several sites reiterated their request to “first receive the summary of the requirements from H.Renshall and then provide the procurement plans”.

New Action: 28 November 2006 - H.Renshall will send the updated table of the experiments’ requirements. These requirements are urgently needed by the sites in order to define their procurement plans.
3. Update on the Megatable - C.Eck

The Megatable, by C.Eck, was distributed to the GDB and MB mailing lists. It is also available on the LCG web site (under Planning - Technical Design Reports - Draft: Storage and Network Requirements for Tier-1s). Future changes will be incorporated into new versions of the document above.
4. First Ideas for 2007 Targets and Milestones (Slides) - A.Aimar

A.Aimar presented a few initial ideas about which Level-1 targets and milestones could be defined for 2007.

4.1 General Remarks

Slide 3 shows the general ideas behind the 2007 targets:
- Define a ramp-up from the current values to the targets to be reached.
- Define realistic targets needed for the end of 2007 and for the 2008 run.
- Have a standard ramp-up profile starting in Jan 2007, reaching the target in Feb 2008, with a pause in Sept-Dec 2008 for data taking.
- Peak rates can be defined as 2 x the average rates.

J.Gordon asked whether the progress really has to be steady, as indicated in the slide, and be checked monthly. A.Aimar answered that this is the proposal for the MB. L.Robertson noted that it is more important to show steady improvements than to fix targets that have to be met earlier than really required (which is difficult and expensive) just to ensure that there is time for corrective actions.

D.Barberis noted that the ramp-up curves may change depending on the kind of target and on the resource. For instance, disks arrive in batches; therefore the curve will not be a steady slope but a series of steps, different at each site.

In slide 4 the proposal is that the aggregation of the values from all sites takes into account 70% of the sites, namely CERN + 8 Tier-1 sites. The target values will be taken from the resources and network requirements, as specified by the experiments in H.Renshall’s and C.Eck’s tables.

Question for the MB: Should the targets be measured for each VO? Or should they instead measure the aggregate for the site (to show that it can support all of its VOs simultaneously)?

J.Templon suggested that separate values are better, but the sites will have to provide them. L.Robertson noted that not all rates can be separated by VO. For instance, I/O rates may use equipment common to several VOs, so measuring for one VO says nothing about the capability when all VOs are active.

4.2 Targets for 2007

Slides 5 to 9 show examples of possible measurements and their targets. The values proposed are there to stimulate the discussion. MB members should send feedback to A.Aimar before 29 Nov 2006.
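To make the ramp-up idea concrete, here is a minimal sketch with invented numbers (a simple linear profile; the actual per-target checkpoints proposed below are step-wise) that interpolates a monthly target from Jan 2007 to Feb 2008 and derives the peak rate as 2 x the average, as proposed in slide 3:

    # Illustrative sketch: linear monthly ramp-up from a starting value in
    # Jan 2007 (month 0) to the full target in Feb 2008 (month 13).
    def ramp_target(month_index, start, goal, n_months=14):
        fraction = month_index / (n_months - 1)
        return start + fraction * (goal - start)

    average_goal = 300.0  # hypothetical sustained rate for one site, MB/s
    for m in (0, 6, 13):
        avg = ramp_target(m, 0.3 * average_goal, average_goal)
        print(f"month {m:2d}: average {avg:6.1f} MB/s, peak {2 * avg:6.1f} MB/s")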
Targets for “Data Rate to Tape” (slide 5)

The target I/O rates to tape storage should be defined at each site. The peak rate should be demonstrated:
- for 3 days before Jun 07
- for 7 days before Sep 07

Average rates, driven by the experiments (DAQ at the Tier-0, FTS at the Tier-1s) and background dteam load, should be sustained at the Tier-0 and at each Tier-1:
- 3 days at 50% - Feb 07
- 7 days at 70% - Jul 07
- 10 days at 100% - Feb 08

Data is taken from the Megatable, and the results should be monitored and reported by each site. Should these tests be done locally or with data streamed from CERN? Should the peak rates be tested by all sites simultaneously and be coordinated from CERN?

Targets for FTS Transfers (slide 6)

The peak 2008 rates should be sustained for 3 days before July 07:
- Tier-1 to/from Tier-0
- Tier-1 to/from other Tier-1s
- Tier-1 to/from all Tier-2s

The average rates should be sustained permanently from the Tier-0 to each Tier-1, driven by the experiments and background dteam load:
- 50% - Feb 07
- 70% - Jun 07
- 90% - Sep 07
- 100% - Oct 07

Targets are taken from the Megatable; results are extracted from GridView.

J.Gordon noted that the transfers to the Tier-1 and Tier-2 sites should be verified regularly because sometimes they stop working and are not maintained. Therefore, also after July 07, a small but sustained transfer rate should be required every month, in order to make sure that the Tier-1 and Tier-2 channels continue to work properly.

J.Gordon and H.Marten suggested that the experiments should generate the network load (replacing the dteam load) because they are the ones who know when the load is appropriate. In addition, if dteam uses tapes to run these tests, those tapes will no longer be available to the experiments.

D.Barberis agreed that the experiments could have their own load generator and switch it on and off to test the transfers when they need it. S.Belforte would prefer a single solution rather than four load generators, one per experiment. M.Mazzucato noted that dteam is not assigned the same site resources (servers, etc.) as the experiments; therefore tests and load from the experiments would be more realistic.
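The “sustained rate” targets above imply a concrete check on the monitoring data. A minimal sketch of such a check over daily average rates (invented numbers; GridView is the stated data source, but this is not its API):

    # Illustrative sketch: was a target average rate sustained over some
    # window of consecutive days?
    def sustained(daily_rates_mb_s, target_mb_s, days):
        for start in range(len(daily_rates_mb_s) - days + 1):
            window = daily_rates_mb_s[start:start + days]
            if sum(window) / days >= target_mb_s:
                return True
        return False

    rates = [40, 55, 60, 58, 62, 45, 70]   # hypothetical daily averages, MB/s
    print(sustained(rates, 50, 3))          # held 50 MB/s for 3 days? -> True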
Targets for data from DAQ to Tier-0 and to Tier-1 sites (slide 7)

This needs to reach 100% by November 2007 and obviously be driven by the experiments. The metrics start low because these transfers (DAQ to Tier-0 to Tier-1) have not been tried before:
- 3 days at 20% before Mar 07
- 5 days at 50% before Jun 07
- 7 days at 100% before Sep 07

Target values will be taken from the experiments and the Megatable. Results will be extracted from the Lemon monitoring system (CERN) and GridView.

Target for Job Success Rate (slide 8)

In order to measure job success quality one needs to:
- set targets for job success rates, by site and by VO
- perform job log analysis (ARDA dashboard work)

The measurement should include both the overall success rate (grid, network and site problems) and the site success rate (excluding non-site-related errors). The ramp-up is based on these targets (as a percentage of the target, not in absolute %):
- 60% - Jan 07
- 70% - Mar 07
- 80% - May 07
- 90% - Sep 07
- 100% - Feb 08

The discussion that followed suggested that the targets may depend on the VO and can be measured by the VOs. Probably this does not need to be a Level-1 global target.
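As a minimal illustration of the difference between the two rates (the error categories here are hypothetical, not the ARDA dashboard’s actual classification):

    # Illustrative sketch: overall vs. site-only job success rates.
    jobs = [
        {"ok": True,  "error": None},
        {"ok": False, "error": "site"},     # e.g. local disk full
        {"ok": False, "error": "grid"},     # e.g. middleware failure
        {"ok": False, "error": "network"},
    ]

    total = len(jobs)
    succeeded = sum(1 for j in jobs if j["ok"])
    overall_rate = succeeded / total  # every failure counts against the target

    # Site success rate: exclude failures not attributable to the site itself.
    non_site = sum(1 for j in jobs if not j["ok"] and j["error"] != "site")
    site_rate = succeeded / (total - non_site)

    print(f"overall: {overall_rate:.0%}, site-only: {site_rate:.0%}")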
Target for SAM Reliability Tests

Reach 95% by Feb 2008:
- 80% - Feb 07
- 85% - Jun 07
- 90% - Sep 07
- 95% - Feb 08

Always taking CERN + the 8 best sites into account and using the monthly averages.

L.Robertson noted that 95% could be too high. Having seen the results of the SAM tests until now (even taking into account the problems of the SAM system itself), about 90% could be the reachable limit.
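A minimal sketch of the proposed aggregation, with invented reliability figures (the real ones come from the monthly SAM results):

    # Illustrative sketch: monthly aggregate reliability over CERN + the 8
    # best Tier-1 sites, as proposed for the Level-1 target.
    monthly_reliability = {
        "CERN": 0.96, "ASGC": 0.82, "BNL": 0.90, "CNAF": 0.85, "FNAL": 0.93,
        "FZK": 0.88, "IN2P3": 0.91, "NDGF": 0.80, "PIC": 0.87, "RAL": 0.92,
        "SARA-NIKHEF": 0.89, "TRIUMF": 0.94,
    }

    cern = monthly_reliability.pop("CERN")
    best_eight = sorted(monthly_reliability.values(), reverse=True)[:8]
    aggregate = (cern + sum(best_eight)) / 9

    print(f"aggregate reliability (CERN + 8 best): {aggregate:.1%}")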
Milestones for 2007

Slides 10 and 11 describe the proposed milestones, and the MB should comment on them. Here is the full list from the slides:

24x7 milestones
- Definition of the levels of support and rules to follow
- Testing the support and operation scenarios - April

3D Project milestones
- Phase 1 sites in production, used by the experiments - February
- Tests of replication of experiments data among sites - March
- Experiment conditions DB in operation - April

SAM tests
- SAM tests complete, including VO tests - March

SRM 2.2 and MSS implementation
- Tier-1 sites upgraded to SRM 2.2 - March
- Storage classes implemented at sites - April
- Experiment software using storage classes - May

gLite CE
- Installed and available at the sites - February
- Usage by the experiments - April

VOBoxes
- Service level, backup and restore defined - February
- VOBoxes service implemented at the sites - April

Job priorities
- Mapping of the job priorities on the batch software - April
- Configuration and maintenance of the job priorities - June

CAF
- Experiments define their needs for the CAF - January
- CERN implements the CAF for the experiments - June

General milestones
- Targets for procurements
- Define “operation challenges” to check tools - April & September
- Targets on number of nodes available at each site - March

New Action: 29 Nov 06 - MB members send feedback to A.Aimar on the Targets and Milestones for 2007.
5. AOB

5.1 Proposal for a DM Working Group - F.Carminati

F.Carminati distributed a proposal to the MB mailing list (see proposal). This was made available only after the start of the meeting, and the discussion was not conclusive. Some members considered that the proposed mandate put too much emphasis on the re-assessment of requirements, while at this stage the focus should rather be on how to use the functionality implemented in SRM 2.2, adapting the computing models if necessary.

Ph.Charpentier noted that there was already another GDB group on implementing SRM 2.2 storage classes, and the MB should rather use this one. L.Robertson added that this work could be done by the people now in the Storage Classes group, as an evolution, with these additional tasks and with representatives from the experiments added to the group. The discussion that followed confirmed the preference for expanding the mandate of the current Storage Classes group instead of launching a new GDB working group.

New Action: L.Robertson and F.Carminati will discuss with K.Bos about changes to the mandate of the Storage Classes working group.
6. Summary of New Actions

In 2 weeks - D.Barberis and I.Bird will report on the issues with accounting vs. Condor glide-in job submission.

28 November 2006 - H.Renshall will send the updated table of the experiments’ requirements. These requirements are urgently needed by the sites in order to define their procurement plans.

29 Nov 2006 - L.Robertson and F.Carminati will discuss with K.Bos about changes to the mandate of the Storage Classes working group.

29 Nov 2006 - MB members send feedback to A.Aimar on the Targets and Milestones for 2007.

The full Action List, current and past items, will be on this wiki page before the next MB meeting.