LCG Management Board

Date/Time: Tuesday 26 September 2006 at 16:00

Agenda:

Members:

(Version 1 - 2.10.2006)

Participants: A.Aimar (notes), D.Barberis, L.Bauerdick, S.Belforte, I.Bird, K.Bos, N.Brook, D.Duellmann, C.Eck, J.Gordon, I.Fisk, D.Foster, F.Hernandez, J.Knobloch, P.Mato, L.Robertson (chair), Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout

Action List

Next Meeting: Tuesday 3 October from 16:00 to 18:00, CERN time – face-to-face meeting
1. Minutes and Matters Arising (minutes)

1.1 Minutes of Previous Meeting

No comments or changes. Minutes approved.

1.2 Matters Arising

1.2.1 2006Q3 Quarterly Reports Review

Dates:
- Oct 2 - QR reports, with the milestones to comment on, sent to the QR authors.
- Oct 9 - QR reports filled in and sent to A.Aimar.
- Oct 16 - Review takes place and a Review Document is distributed.
- Oct 23 - Responses to the review and QRs updated, where appropriate.
- Oct 30 - Updated QRs and Exec. Summary sent to the Overview Board.

Reviewers:
- [To be confirmed: B.Gibbard (BNL)]
- G.Merino (PIC)
- U.Marconi (LHCb)

1.2.2 ATLAS Position on Grid Interoperability (document) - D.Barberis
Distributed last week for review. Experiments should comment before the next MB.

1.2.3 Proceeding with Tier-1/Tier-2 Table – L.Robertson and C.Eck

L.Robertson continued the discussion from the previous meeting on how to proceed in order to complete the ("mega") table maintained by C.Eck. In the previous MB meeting, concerning the new column on "disk cache size" that is to be added to the table, it was said that RAL and SARA could distribute some "rule of thumb" on how to estimate the size of the disk cache for the disk0tape1 storage classes. Since then A.Sansum reported that his calculations are very RAL-specific and would not be useful for others. Last week SARA (not present at this meeting) had also agreed to distribute some information on how to estimate the cache size for the disk0tape1 case.

Action: J.Templon distributes information, as used at SARA, on how to calculate the disk cache size for the disk0tape1 case.

C.Eck then reported on the meeting he had with the LHC experiments in order to complete the table. The column on "disk cache size" was discussed with the experiments but, for now, there is not enough knowledge on how to calculate the amount of such cache. It was agreed that B.Panzer will prepare and distribute a document on "where in a Tier-1 there is need of disk buffer space", along with considerations that may help in estimating these buffers and caches. There may be several areas not yet taken into account, but they must be included in the purchase plans.

L.Bauerdick mentioned that, for CMS, these values are already considered in their TDR, and wondered whether these additional disk caches are not just a small percentage, in which case there is no need to invest effort in such detail. L.Robertson agreed, but added that this activity is precisely mandated to make sure that these values are indeed only a small percentage and not 10-15% of the total disk, and also to check how, depending on their computing models, different experiments may need different cache sizes. This will be an iterative process in order to converge on the correct values. The requirements depend on the experiment model, and the sites will decide how to organize their caches to cover these requirements.

F.Hernandez expressed his worry about estimating and committing resources in this early phase. J.Gordon replied that this is a "best guess" and a necessary initial estimate, very useful in order to understand the kind of percentages we are dealing with.

Action: 6 Oct 2006 - B.Panzer distributes to the MB a document on "where disk caches are needed in a Tier-1 site", with everything included (buffers for tapes, network transfers, etc.).

F.Hernandez noted that the network rates between Tier-1 sites can also influence the size of the buffers, so separating inbound and outbound bandwidth would be useful for correctly sizing the storage infrastructure for data import/export. C.Eck noted that in the table the value is the maximum of the input and output network rates. ATLAS and ALICE stated that inbound and outbound bandwidth numbers are available for each Tier-1 (a rough illustration of how such figures could feed a cache-size estimate is sketched at the end of this section).

L.Robertson had also proposed that 2008 be taken as the reference year for the table. For ATLAS, CMS and LHCb it was agreed that this is acceptable, but for …
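For illustration only (this was not presented at the meeting): a minimal back-of-envelope sketch of a disk0tape1 cache estimate, assuming the cache mainly has to absorb incoming data while it waits to be migrated to tape. The rate, latency and safety factor used are hypothetical placeholders, not values from the table maintained by C.Eck.

```python
# Hypothetical back-of-envelope estimate of a disk0tape1 cache size.
# All numbers are illustrative placeholders, not agreed WLCG figures.

def cache_size_tb(ingest_rate_mb_s: float,
                  migration_latency_h: float,
                  safety_factor: float = 2.0) -> float:
    """Disk cache needed to hold data arriving at `ingest_rate_mb_s`
    (MB/s) for `migration_latency_h` hours before it is written to tape,
    multiplied by a safety factor to allow backlogs to drain."""
    buffered_mb = ingest_rate_mb_s * migration_latency_h * 3600
    return buffered_mb * safety_factor / 1e6  # MB -> TB

# Example: 150 MB/s sustained import, data kept ~24 h before migration.
if __name__ == "__main__":
    size = cache_size_tb(ingest_rate_mb_s=150, migration_latency_h=24)
    print(f"Rough disk0tape1 cache estimate: {size:.1f} TB")
```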
2. Action List Review (list of actions)

Actions that are late are highlighted in RED.

- Not done in September as planned. J.Gordon will present some use cases at the GDB in October.
- The only contribution is from D.Barberis for ATLAS. It is here. If no other comments arrive, the action will be closed at the next MB.
3. Status of the 3D Project (transparencies) - D.Duellmann

Update after the 3D workshop and on the 3D Phase 2 status.

D.Duellmann focused on database availability at the sites, as discussed at the 3D workshop the week before. First he provided a summary of the Phase 1 sites (slide 2).

3.1 Phase 1 Status

FroNTier/SQUID:
- The setup for the Tier-0, all Tier-1 and (almost) all Tier-2 sites is done.

Databases - Tier-0 and all Phase 1 sites:
- The databases are all running at ASGC, BNL, CERN, CNAF, GridKa, IN2P3 and RAL, and all sites have been involved in tests with one or two experiments.
- Consolidation plans are ongoing at some sites, moving RAC systems for the databases used by the grid services.

Remaining issues:
- Completion of monitoring by all sites (ASGC missing).
- Backup setup at some sites, which is required but not implemented everywhere.

3.2 Phase 2 Sites

For the Phase 2 sites the situation is not good at all (slide 3).

TRIUMF is the most active Phase 2 site:
- a DBA has been hired
- the hardware is being installed
- they actively follow all 3D meetings and workshops.

The other three sites (NDGF, PIC, SARA-NIKHEF) are not participating actively and really need to increase their participation in the 3D meetings.

NIKHEF/SARA:
- hardware is available
- waiting for the SAN connection in order to set up a cluster

NDGF:
- waiting for the hardware to arrive
- DBA nominated

PIC:
- looking at an external company for the RAC setup
- still needs to allocate a DBA

The three sites above did not even show up at the workshop, which was organised also to focus on the Phase 2 problems (!!).
As a result of this situation, the experiments should not expect these sites to meet the October deadline.

The experiments' requests were reviewed and should be covered: for the next 6 months they do not require any modification to the current setup. ATLAS will test with a higher amount of data and see whether they will store their tags in the databases (which could reach 10 TB/year in total).

D.Duellmann proposed that:
- the 3D installations and services at the sites be reviewed every 6 months, and
- new requests and changes from the experiments go via the GDB to the MB.

F.Hernandez asked whether the 10 TB/year for ATLAS are per site or in total. D.Duellmann replied that this is not clear yet and could cause problems in the replication if the rate were much higher than it is currently. Avoiding this would be better also because the price of DB disks is much higher than that of "normal" disk space. R.Tafirout noted that the 10 TB are in total, including the mirroring and other space needed for RAID systems, etc.

K.Bos said that the problem at SARA was the use of the SAN environment, but it should be installed by October. He will check this at SARA.

Additional explanations on the ATLAS tag database and the rates to/from tape were received from D.Barberis (email).

Action: The 3D Phase 2 sites should provide, in the next Quarterly Report, the 3D status and the time schedules for the installations and tests of their 3D databases.
4. Recommendations from SC4 Service Report (more information) - J.Shiers

J.Shiers described some recommendations that he had already presented to the LHC Comprehensive Review but not yet to the MB.

4.1 Additional Recommendations
Monitoring Tools - Several groups are working on tools for site monitoring (slide 2), and therefore there are now several solutions being developed to solve the same problem. The proposal is that these implementations work together whenever possible. I.Bird noted that in the framework of HEPiX there is ongoing work on system monitoring tools, but this does not cover the "grid services monitoring tools" at the sites.

Service Dashboard - There is the need of a Service Dashboard in order to see the status of critical components (CASTOR at CERN, FTS, network transfers, etc.) more easily. Currently this is all done by scanning log files and making manual checks, which is clearly too complicated and laborious. Slide 3 (in animation mode) also shows some work being done by the support team to simplify their work (a table with the channel statuses, by P.Badino, is shown). This dashboard will be discussed at the FTS admin workshop (18 October at SARA); R.Tafirout asked for VRVS to be set up. I.Fisk stated that the MB must make sure that this does not become the "N+1 monitoring tool". J.Shiers agreed that this had to be avoided: tools should have a clear interface to collect and provide data so that the interfaces can be combined (a la Konfabulator widgets, for instance); a minimal sketch of such a common interface is given at the end of this section. The work on a Service Dashboard should be coordinated so that we do not end up with several displays and applications to start, but it has not been assigned to anyone for the moment.

WLCG Service Coordination Meeting - A regular (3-4 times a year?) WLCG Service Coordination meeting, where the Tier-0, all Tier-1 sites, the Tier-2 federations and the experiments are represented, should be established. This coordination meeting should review the services delivered by sites and federations, the main issues encountered and the plans to resolve them. It should also cover the experiments' plans for the coming quarter in more detail than can be achieved at the weekly joint operations meetings (which nevertheless could cover any updates). The model used by GridPP for their collaboration meetings is a good one to follow.

Service Coordinator On Duty (SCOD) - A "Service Coordinator" - a rotating, full-time activity for the length of an LHC run (but almost certainly required also outside data taking) - should be established as soon as possible. It is a full-time activity:
- attend the daily and weekly operations meetings, the relevant experiment planning and operations meetings, and the CASTOR deployment meetings;
- liaise with the site and experiment contacts (MOD, SMOD, GMOD, DBMOD, …);
- maintain a daily log of ongoing events, problems and their resolution;
- act as a single point of contact for all immediate WLCG service issues;
- escalate problems as appropriate to sites, experiments and/or management;
- write a daily log and prepare and present a detailed "run report" at the end of the period on duty.

It is proposed that this rotation be staffed by the Tier-0 and Tier-1 sites, each site manning about two 2-week periods per year (or four 1-week periods). This should start soon. The experiments should also have a coordinator on duty to work with the SCOD. R.Tafirout noted that the experiments will have their operations monitored, and this can help in many cases. The MB agreed that this proposal needs to be discussed further in the near future.
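For illustration only (not an agreed design, and not based on any existing WLCG tool): a minimal sketch of how individual monitoring tools could publish their status through one small common interface, so that a dashboard can aggregate them instead of becoming the "N+1 monitoring tool". All class and function names below are hypothetical.

```python
# Hypothetical sketch of a common status interface for monitoring tools,
# so a dashboard can aggregate them without N+1 bespoke displays.
from dataclasses import dataclass
from time import time
from typing import Callable, List

@dataclass
class ServiceStatus:
    service: str      # e.g. "CASTOR@CERN", "FTS channel CERN-RAL"
    state: str        # "OK", "DEGRADED" or "DOWN"
    timestamp: float  # seconds since the epoch
    detail: str = ""

# Each tool registers a probe callable that returns a ServiceStatus.
ProbeFunc = Callable[[], ServiceStatus]

def collect(probes: List[ProbeFunc]) -> List[ServiceStatus]:
    """Run all registered probes and return their statuses for display."""
    return [probe() for probe in probes]

# Example probe (purely illustrative).
def fts_channel_probe() -> ServiceStatus:
    return ServiceStatus("FTS channel CERN-RAL", "OK", time(), "queue empty")

if __name__ == "__main__":
    for status in collect([fts_channel_probe]):
        print(f"{status.service}: {status.state} ({status.detail})")
```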
4.2 Support for Distribution of Data
During data export to the Tier-1 sites (slides 11 and 12), corresponding on-call services are required at the Tier-1s. In addition, inter-site contacts and escalation procedures should be discussed, written down and then followed.

J.Shiers proposed a support scheme that lies between the current 12x5 support and the future 24x7 support; a sort of "16x5 + 8x2" support:

Working hours support - The standard procedures are used. GGUS and COD currently provide a service during office hours (of the site in question) only, but can provide the primary problem-reporting route during such periods. This requires that realistic VO-specific transfer tests are provided in the SAM (or equivalent) framework, together with the appropriate documentation and procedures; a sketch of such a test is given at the end of this section.

Out-of-hours support - The list of contacts and the procedures for handling out-of-hours problems will be prepared by the WLCG Service Coordination Team and presented to the Management Board for approval. On the assumption that recovery from backlogs is demonstrated, expert coverage can probably be limited to:
- ~12-16 hours per work day;
- ~8 hours per weekend day.

Although inter-site problems typically require a dialogue between experts on both sides, this should be feasible: more than 2/3 of the data is sent to European sites, where the maximum time difference is 1 hour. These procedures will be constructed to facilitate their eventual adoption by standard operations teams, should extended cover be provided. Experts should be called only when really necessary; the usual infrastructure for alarms will be used, but the procedures must be exactly defined for each situation and kind of issue or interruption.

J.Gordon noted that this should be studied and defined in coordination with the CIC On Duty dashboard, and that issues should also be evaluated considering their immediate "recoverability" (e.g. if an issue, even if urgent, needs weeks to fix, there is no point in starting to debug it in the night when it happens). He also proposed that this be introduced progressively, starting during office hours and extending the coverage once there is some experience with the process and some of the automated tools have been developed. It was agreed that this is the correct approach.

F.Hernandez noted that this must be a temporary solution and that automation should be used to solve as many issues as possible. The MB agreed.
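For illustration only: a sketch of the kind of VO-specific transfer test mentioned under "working hours support" above. This is not the SAM implementation (whose interfaces are not described in these minutes); the transfer callable and the rate threshold are hypothetical. The idea is simply to time a small test transfer and map the outcome onto the usual OK/WARNING/ERROR states that a monitoring framework could display.

```python
# Hypothetical transfer probe: times a test transfer supplied by the VO
# and maps the result onto OK / WARNING / ERROR.  Not a real SAM test.
import time
from typing import Callable

def transfer_probe(do_transfer: Callable[[], int],
                   test_size_mb: float,
                   min_rate_mb_s: float = 10.0) -> str:
    """`do_transfer` copies a test file of `test_size_mb` MB and returns
    the number of bytes transferred (0 on failure)."""
    start = time.time()
    try:
        transferred = do_transfer()
    except Exception:
        return "ERROR: transfer raised an exception"
    elapsed = time.time() - start
    if transferred <= 0:
        return "ERROR: no data transferred"
    rate = test_size_mb / elapsed if elapsed > 0 else float("inf")
    if rate < min_rate_mb_s:
        return f"WARNING: transfer rate {rate:.1f} MB/s below threshold"
    return f"OK: {rate:.1f} MB/s"

if __name__ == "__main__":
    # Dummy stand-in for a real VO transfer (e.g. an FTS submission).
    print(transfer_probe(lambda: 100 * 1024 * 1024, test_size_mb=100))
```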
5. AOB

None.

6. Summary of New Actions
- 6 Oct 2006 - J.Templon distributes information, as used at SARA, on how to calculate the disk cache size for the disk0tape1 case.
- 18 Oct 2006 - The 3D Phase 2 sites should provide, in the next Quarterly Report, the 3D status and the time schedules for the installations and tests of their 3D databases.
- 6 Oct 2006 - B.Panzer distributes to the MB a document on "where disk caches are needed in a Tier-1 site", with everything included (buffers for tapes, network transfers, etc.).

The full Action List, current and past items, will be on this wiki page before the next MB meeting.