LCG Management Board

Date/Time:

Tuesday 26 September 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a063267

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 2.10.2006)

Participants:

A.Aimar (notes), D.Barberis, L.Bauerdick, S.Belforte, I.Bird, K.Bos, N.Brook, D.Duellmann, C.Eck, J.Gordon, I.Fisk, D.Foster, F.Hernandez, J.Knobloch, P.Mato, L.Robertson (chair), Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 3 October from 16:00 to 18:00, CERN time – face-to-face meeting

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

No comments or changes. Minutes approved.

1.2         Matters Arising

1.2.1          2006Q3 Quarterly Reports Review

Dates:

-          Oct 2 - QR reports, with the milestones to be commented on, sent to the QR authors.

-          Oct 9 - QR reports filled, sent to A.Aimar

-          Oct 16 - Review takes place and a Review Document is distributed

-          Oct 23 - Responses to the review and QRs updated, when appropriate.

-          Oct 30 - Updated QRs and Exec. Summary sent to Overview Board.

 

Reviewers:

-          [To Be Confirmed B.Gibbard (BNL)]

-          G.Merino (PIC)

-          U.Marconi (LHCb)

1.2.2          ATLAS position on Grid Interoperability (document) - D.Barberis

Distributed last week for review. Experiments should comment before the next MB.

1.2.3          Proceeding with Tier-1/Tier-2 table – L.Robertson and C.Eck

L.Robertson continued the discussion from the previous meeting on how to proceed in order to complete the (“mega”) table maintained by C.Eck. In the previous MB meeting, concerning the new column on “disk cache size” to be added to the table, it was said that RAL and SARA could distribute some “rule of thumb” on how to estimate the size of the disk cache for the disk0tape1 storage class. Since then A.Sansum has reported that his calculations are very RAL-specific and would not be useful for others.

 

Last week SARA (not present at this meeting) had also agreed to distribute some information on how to estimate the cache size for the disk0tape1 case.

 

Action:

J.Templon distributes information, as used at SARA, on how to calculate the disk cache size for the case disk0tape1.

 

Then C.Eck reported about the meeting he had with the LHC experiments in order to complete the table.

 

This column on “disk cache size” was discussed with the experiments but, for now, there is not enough knowledge of how to calculate the size of such a cache.

 

It was agreed that B.Panzer will prepare and distribute a document on “where in a Tier-1 there is need of disk buffer space”, along with considerations that may help in estimating these buffers and caches. Several areas may not have been taken into account yet, but they must be included in the purchase plans.

 

L.Bauerdick mentioned that, for CMS, these values are already considered in their TDR, and wondered whether these additional disk caches are not just a small percentage, in which case there is no need to invest effort in such detail. L.Robertson agreed but added that this activity is mandated precisely to make sure that these values are indeed only a small percentage and not 10-15% of the total disk, and also to check how, depending on their models, different experiments may need different cache sizes.

 

This will be an iterative process in order to converge on the correct values. The requirements depend on the experiment model; and the sites will decide how to organize their caches to cover these requirements.

 

F.Hernandez expressed his worry about estimating and committing resources in this early phase. J.Gordon replied that this is a “best guess” and a needed initial estimate, very useful in order to understand the kind of percentages involved.
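
As a purely illustrative sketch of the kind of “rule of thumb” being discussed (the figures and the safety factor below are assumptions for illustration, not values from SARA, RAL or the table), the disk0tape1 buffer at a site can be roughly estimated from the incoming data rate and the time files must stay on disk around tape migration:

inbound_rate_mb_s = 150        # assumed average import rate into the Tier-1, MB/s
residency_hours = 48           # assumed time data stays on disk before/after tape migration
safety_factor = 2.0            # assumed margin for tape-drive outages and backlog recovery

cache_tb = inbound_rate_mb_s * residency_hours * 3600 * safety_factor / 1e6
print(f"Rough disk0tape1 cache estimate: {cache_tb:.1f} TB")   # ~51.8 TB with these assumed figures

The point of circulating such a recipe would only be to let each Tier-1 plug in its own rates and migration policy.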

 

Action:

6 October - B.Panzer distributes to the MB a document on “where disk caches are needed in a Tier-1 site”, with everything included (buffers for tapes, network transfers, etc.).

 

F.Hernandez noted that the network rates between Tier-1 sites also influence the size of the buffers, so separating inbound and outbound bandwidth would be useful for correctly sizing the storage infrastructure for data import/export.

 

C.Eck noted that in the table the value is the maximum of input and output network rates.

ATLAS and ALICE expressed that inbound and outbound bandwidth numbers are available for each Tier-1. ATLAS also added that Tier-1 to Tier-1 transfers are symmetric.

 

L.Robertson had also proposed that 2008 be taken as the reference year for the table. For ATLAS, CMS and LHCb it was agreed that this is acceptable, but for ALICE it is more reasonable to use 2009, the first year in which they expect to have a full ion run. It was agreed that ALICE should provide their values for both 2008 and 2009, so that sites can see the full picture for 2008.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 30 Jun 06 - J.Gordon reports on the defined use cases and policies for user-level accounting, in agreement with the security policy working group, independently of the tools and technology used to implement it.

Not done in September as planned. J.Gordon will present some use cases at the GDB in October.

  • 31 Jul 06 - Experiments should express what they really need in terms of interoperability between EGEE and OSG. Experiments agreed to send information to J.Shiers.

The only contribution is from D.Barberis for ATLAS. It is here. If no other comments arrive, the action will be closed at the next MB.

 

3.      Status of the 3D Project (transparencies) - D.Duellmann

Update after the 3D workshop and 3D Phase 2 status.

D.Duellmann focused on database availability at the sites, as discussed at the 3D workshop the week before.

 

First he provided a summary of the Phase 1 sites (slide 2).

3.1         Phase 1 Status

FroNTier/SQUID

-          The setup for Tier 0, all Tier 1 and (almost) all Tier 2 sites is done.
He mentioned that most of the work was done and tested by CMS.

 

Databases - T0 and all Phase 1 sites:

-          The databases are all running at ASGC, BNL, CERN, CNAF, GridKa, IN2P3 and RAL, and all sites have been involved in tests with one or two experiments.

-          Consolidation plans are ongoing at some sites, moving the databases used by the grid services onto RAC systems.

 

Remaining issues:

-          Completion of monitoring by all sites (ASGC missing)

-          Backup setup at some sites (required but not implemented everywhere).

3.2         Phase 2 Sites

For the Phase 2 sites the situation is not good at all (slide 3).

 

TRIUMF is the most active Phase 2 site:

-          have hired a DBA

-          the hardware is being installed

-          they actively follow all 3D meetings and workshops.

 

The other three sites (NDGF, PIC, SARA-NIKHEF) are not participating actively and really need to increase their participation in the 3D meetings.

 

NIKHEF/SARA

-          hardware is available

-          waiting for SAN connection to set up a cluster

NDGF

-          waiting for hardware arrival

-          DBA nominated

PIC

-          Looking at external company for RAC setup

-          need to allocate DBA

 

The three sites above did not even show up at the workshop, which was organised also to focus on the Phase 2 problems (!!).

 

As a result of this situation, the experiments should not expect that these sites will meet the October deadline.

 

The experiments' requests were reviewed and should be covered: for the next 6 months they do not require any modification of the current setup. ATLAS will test with a larger amount of data and decide whether to store their tags in the databases (which could reach 10 TB/year in total).

 

D.Duellmann proposed that:

-          the 3D installations and services at the sites be reviewed every 6 months, and

-          new requests and changes from the experiments go via the GDB to the MB.

 

F.Hernandez asked whether the 10 TB/year for ATLAS are per site or in total. D.Duellmann replied that this is not clear yet and that it could cause problems in the replication if the rate were much higher than the current one. Avoiding this would also be preferable because the price of database disk is much higher than that of “normal” disk space.

 

R.Tafirout noted that the 10 TB are in total, including the mirroring and other space needed for RAID systems, etc.

 

K.Bos said that the problem at SARA was the use of the SAN environment, but it should be installed by October. He will check it at SARA.

 

Additional explanations were received from D.Barberis on the ATLAS tag database and the rates to/from tape (email).

 

Action:

The 3D Phase 2 sites should provide, in the next Quarterly Report, the 3D status and the time schedules for installations and tests of their 3D databases.

 

4.      Recommendations from SC4 Service Report (more information) - J.Shiers

 

 

J.Shiers described some recommendations that he had already presented to the LHC Comprehensive Review but not to the MB yet.

4.1         Additional Recommendations

 

Monitoring Tools - Several groups are working on tools for site monitoring (slide 2); therefore there are now several solutions being developed to solve the same problem.

 

The proposal is that these implementations work together whenever possible. I.Bird noted that within the framework of HEPIX there is ongoing work on system monitoring tools, but this does not cover the “grid services monitoring tools” at the sites.

 

Service Dashboard - There is a need for a Service Dashboard in order to see the status of critical components (CASTOR at CERN, FTS, network transfers, etc.) more easily. Currently this is all done by scanning log files and making manual checks, which is clearly too complicated and laborious. Slide 3 (in animation mode) also shows some work being done by the support team to simplify their work (a table with the channel status, by P.Badino, is shown).

 

This dashboard will be discussed at the FTS admin workshop (18 October at SARA). R.Tafirout asked for VRVS to be set up.

I.Fisk stated that the MB must make sure that this does not become the “N+1 monitoring tool”. J.Shiers agreed that this had to be avoided and that tools should have a clear interface to collect and provide data, so that interfaces could be combined (à la Konfabulator widgets, for instance).

 

This work on a Service Dashboard should be coordinated so that we do not end up with several displays and applications to start. For the moment it has not been assigned to anyone.
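
A minimal sketch of the kind of combinable interface mentioned above (all names, fields and status values here are invented for illustration; this is not an agreed design):

# Hypothetical sketch: each monitoring tool exposes its data through the same
# small record format, so any dashboard can aggregate them like widgets.
from datetime import datetime, timezone

def castor_status():
    # A real tool would query CASTOR here; this returns a fixed example record.
    return {"service": "CASTOR@CERN", "status": "degraded",
            "detail": "2 disk servers down", "time": datetime.now(timezone.utc).isoformat()}

def fts_status():
    return {"service": "FTS channels", "status": "ok",
            "detail": "all channels active", "time": datetime.now(timezone.utc).isoformat()}

def dashboard(providers):
    # The dashboard depends only on the common record format, not on the individual tools.
    for get_status in providers:
        r = get_status()
        print(f'{r["time"]}  {r["service"]:15}  {r["status"]:9}  {r["detail"]}')

dashboard([castor_status, fts_status])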

 

WLCG Service Coordination Meeting - A regular (3-4 times per year?) WLCG Service Coordination meeting, where Tier0 + all Tier1 sites + Tier2 federations + the experiments are represented, should be established. This coordination meeting should review the services delivered by sites and federations, the main issues encountered and the plans to resolve them.

It should also cover the experiments’ plans for the coming quarter in more detail than can be achieved at the weekly joint operations meetings (which nevertheless could cover any updates).

The model used by GridPP for their collaboration meetings is a good model to follow.

 

Service Coordinator On Duty (SCOD) - A “Service Coordinator” - a rotating, full-time activity for the length of an LHC run (but almost certainly required also outside data taking) - should be established as soon as possible.

 

It is a full time activity:

-          Attend the daily and weekly operations meetings, relevant experiment planning and operations meetings, CASTOR deployment meetings;

-          Liaise with site and experiment contacts (MOD, SMOD, GMOD, DBMOD, …);

-          Maintain a daily log of on-going events, problems and their resolution;

-          Act as a single point of contact for all immediate WLCG service issues;

-          Escalate problems as appropriate to sites, experiments and/or management;

-          Write a daily log and prepare and present a detailed ‘run report’ at the end of the period on duty.

 

It is proposed that this rotation be staffed by the Tier0 and Tier1 sites, each site covering ~2 two-week periods per year (or 4 one-week periods). This should start soon.

 

Experiments should also have a coordinator on duty to work with the SCOD.

 

R.Tafirout noted that the experiments will have their operations monitored, and in many cases this can help.

 

The MB agreed that this proposal needs to be discussed further in the near future.

 

4.2         Support for Distribution of Data

 

During data export to the Tier1 sites (slides 11 and 12), corresponding on-call services are required at the Tier1s. In addition, inter-site contacts and escalation procedures should be discussed, written down and then followed.

 

J.Shiers proposed a support scheme that is in between the current 12x5 support and the future 24x7 support; a sort of “16x5 + 8x2 support”:

 

Working hours support - the standard procedures are used. GGUS and COD currently provide a service during office hours (of the site in question) only, but can provide the primary problem reporting route during such periods.

 

This requires that realistic VO-specific transfer tests are provided in the SAM (or equivalent) framework, together with the appropriate documentation and procedures;

 

Out-of-hours support - The list of contacts and the procedures for handling out-of-hours problems will be prepared by the WLCG Service Coordination Team and presented to the Management Board for approval.

 

On the assumption that recovery from backlogs is demonstrated, expert coverage can probably be limited to:

-          ~12-16 hours per work day;

-          ~8 hours per weekend day.

Although inter-site problems typically require dialogue between experts on both sides, this should be feasible: more than 2/3 of the data is sent to European sites, where the maximum time difference is 1 hour.
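
For illustration (an assumed reading of the figures above, not part of the proposal), the weekly expert-coverage hours implied by this scheme compare to the current and full-coverage models as follows:

# Assumed interpretation: 16 h on each of 5 work days, 8 h on each of 2 weekend days.
current_12x5 = 12 * 5                  # 60 h/week, current office-hours support
proposed_16x5_8x2 = 16 * 5 + 8 * 2     # 96 h/week, proposed intermediate scheme
full_24x7 = 24 * 7                     # 168 h/week

print(proposed_16x5_8x2, "h/week,", round(100 * proposed_16x5_8x2 / full_24x7), "% of full 24x7 coverage")
# -> 96 h/week, 57 % of full 24x7 coverage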

 

These procedures will be constructed to facilitate their eventual adoption by standard operations teams, should extended cover be provided. Experts should be called only when really necessary; the usual alarm infrastructure will be used, but the procedures must be precisely defined for each situation and each kind of issue and interruption.

 

J.Gordon noted that this should be studied and defined in coordination with the CIC On Duty dashboard, and that issues should also be evaluated considering their immediate “recoverability” (e.g. if an issue, even if urgent, needs weeks to fix, there is no point in starting to debug it during the night when it happens). He also proposed that this be introduced progressively, starting during office hours and extending the coverage once there is some experience with the process and some of the automated tools have been developed. It was agreed that this is the correct approach.

 

F.Hernandez noted that this must be a temporary solution and automation should be used to solve as many issues as possible. The MB agreed.

 

5.      AOB

 

 

None.

 

6.      Summary of New Actions

 

 

6 Oct 2006 - J.Templon distributes information, as used at SARA, on how to calculate the disk cache size for the case disk0tape1.

 

18 Oct 2006 - The 3D Phase 2 sites should provide, in the next Quarterly Report, the 3D status and the time schedules for installations and tests of their 3D databases.

 

6 Oct 2006 - B.Panzer distributes to the MB a document on “where disk caches are needed in a Tier-1 site”, with everything included (buffers for tapes, network transfers, etc.).

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.