LCG Management Board

Date/Time:

Tuesday 13 June 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a061504

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1)

Participants:

L.Bauerdick, S.Belforte, N.Brook, A.Cass, Ph.Charpentier, A.Doyle, Dirk Duellmann, Ian Fisk, B.Gibbard, J.Knobloch, E.Laure, M.Lamanna, D.Qing, H.Marten, P.Mato, G.Merino, B.Panzer, G.Poulard, L.Robertson (chair, notes), O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList 

Next Meeting:

Tuesday 20 June 2006 at 16:00

1.      Minutes and Matters arising (minutes)

 

Minutes of the Previous Meeting

The minutes had not been distributed (apologies from L.Robertson).

 

2.      Action List Review (list of actions)

 

 

Note: In RED the milestones that are still due.

  • 23 May 06 – Tier-1 sites should confirm via email to J.Shiers that they have set-up and tested their FTS channels configuration for transfers to Tier-1 and Tier-2 sites.

Status not known – J.Shiers not present at the meeting (concurrent Tier-2 workshop).

J.Templon noted that there was some confusion about how the information should be reported.

  • 30 May 06 - J.Shiers and N.Brook will add a note in the document as a reminder of the need to announce in advance the draining of jobs at the sites.

Not done. N.Brook stated that a suitable sentence would be sent to J.Shiers after the meeting.

  • 31 May 06 – C.Grandi presents to the MB the EGEE middleware priorities and development of the features needed by the LHC (in Flavia's list).

This will be done during June.

  • 31 May 06 – K.Bos should start a discussion forum to share experience and tools for monitoring rates and capacities and to provide information as needed by the VOs.

Not done.

  • 31 May 06 – J.Shiers proposes a plan for demonstrating capability to recover from short and long interventions on the Tier-0 and Tier-1 sites.

Not done yet.

 

  • 13 June 06 – D.Liko to distribute the Job Priority WG report to the MB.

Not done yet.

 

  • 16 June 06 – CERN+Tier-1 sites - The sites should send the accounting data for May before 16 June 2006.

 

  • 13 Jun 2006 - I.Bird to add the “discussion on the SAM tests and results” to the Operations Workshop agenda.

 

 

3.      Preparation for the meeting with the LHCC referees – 26 June

L.Robertson said that the referees had requested a Tier-1 readiness report. The internal review last week included the Tier-1 status, with a rather thorough examination of the status of each centre. However, the reviewers are not scheduled to report until the next face-to-face MB meeting on the 4th of July. L.Bauerdick said that he thought it was important that the report of the reviewers should first be heard by the Management Board. L.Robertson reported that V.Gülzow, the chair of the review panel, was of a similar opinion, and considered that the Tier-1 centres would have to hear the report and give any corrective feedback before it went beyond the project. It was agreed that L.Robertson should explain this to the referees.

 

The other points for the agenda of the meeting do not raise any particular issues:

  • gLite 3.0 deployment
  • progress with job failure analysis
  • 3D status
  • Results of the SRM v2 workshop and revised planning

 

4.      SRM planning following the FNAL Workshop – Maarten Litmaath present

L.Robertson introduced this item. The transparencies from the talk by M.Litmaath at the GDB describing the conclusions and agreements of the workshop are attached for information. The target of the present meeting is to decide whether the proposed development and implementation schedule is acceptable, agree on any management actions needed to ensure that the programme is given sufficient priority by the various institutes involved, and define the way in which the MB will monitor progress.

 

L.Robertson noted that the schedule agreed in the workshop foresees completion of the development programme at the beginning of November, which means that the new system could only be brought into production during the first quarter of 2007, a good three months later than had been expected in Mumbai. It was agreed that this change in the schedule should be accepted and the planning modified accordingly.

 

Ph.Charpentier asked what would happen during SC4, to provide the basic storage classes that had been agreed in Mumbai. M.Litmaath explained that different endpoints or storage paths have been or are being defined to distinguish between permanent and durable storage. It was thought that it would be reasonable to modify the basic clients to discover these endpoints automatically. However it has been realised that the clients are generally given pre-cooked SURLs. The question now is whether this approach could be continued, which would obviate the need to modify the tools. This will be discussed in the next coordination meeting (on Friday 16 June).

 

The workshop has produced a good outline plan for the development and deployment of SRM v2.2, but we now need to get the formal agreement of the management of the sites responsible for implementing the plan, and ensure that the work is treated with adequate priority and sufficient resources are made available. L.Robertson proposed that we form a group composed of management representatives of the sites concerned to meet regularly (say monthly) with Maarten Litmaath to review progress, take corrective action when necessary, and keep the MB informed. The sites concerned would be FNAL, DESY, RAL and CERN. It was agreed to go ahead with this proposal.

 

Another point that was raised in the GDB concerned the implications for the sites (Tier-1s and Tier-2s) of SRM v2.2 and the agreed storage classes. It was agreed in the GDB that a group of site storage experts be formed to consider these implications and make appropriate proposals on how to implement the models and resolve operational difficulties. This group should also propose the processes through which the experiments interface with the sites to allocate and manage the space. K.Bos will send out a specific proposal on this group within the next few days.

 

5.      Status of the LCG 3D database project  - D.Duellmann

D.Duellmann reported on the current state of the 3D project. For details see the transparencies. Two technologies are involved: Oracle Streams, which is used for ATLAS and LHCb, and Frontier/Squid, used for CMS. ALICE requires only a central database.

 

The plan agreed at the end of last year (foil 3) foresaw two phases in setting up the production database service. During the first phase beginning in March a service would be developed at seven sites (ASGC, BNL, CNAF, CERN, GridKA, IN2P3, RAL) with the goal of achieving full production status by October. At that point another four Tier-1 sites (PIC, NIKHEF, NDGF, TRIUMF) would be brought into the database distribution service. FNAL is only taking part in the Frontier/Squid service.

 

Negotiations with Oracle (Foil 4) to obtain a sufficient number of licences for the Oracle-based part of the 3D service have been completed by CERN and accepted by all of the sites that required them. An agreement has also been reached (Foil 5) that allows the Oracle client software to be bundled with the experiment and LCG software.

 

Foil 7 summarises the current status at the Tier-1 sites. All of the sites involved in the first phase are operational and taking part in performance tests except GridKA. GridKA has recently installed the necessary hardware and is expected to join the service soon. One of the second phase sites, TRIUMF, is now regularly attending the 3D planning meetings. Contacts have been established with Database Administrators (DBAs) at two other sites (PIC and NIKHEF/SARA). Only early contacts have been made with the final site, NDGF. The activity at these sites must ramp up very soon if the October target for going into operation is to be reached.

 

J.Templon asked what type of support is required at the Tier-1s. D.Duellmann said that database expertise (at DBA level) is needed to set up and maintain the database, and organise backup and recovery procedures. He estimated that about one FTE may be required, since this is not a pure software installation task. More than one person would be needed to cover holiday periods, etc. J.Templon asked if NIKHEF/SARA is the only site that did not anticipate providing this level of support. G.Merino said that at present there is no Oracle experience at PIC, which will make it difficult for the site to get started.

 

Foil 8 gives the status of the throughput tests. These started at the beginning of May as scheduled, but various installation and configuration problems were encountered that have held up the test activity. The test period will be extended by one month to the end of June. Performance issues have been uncovered that require support from Oracle experts. It is becoming clear that the support available at sites is in some cases rather thin. It will be important that backup support is available at all sites to cover absences.

 

A monitoring framework has been installed as a collaboration of CERN and RAL (Foil 10) that provides a central view of the status of all of the 3D database installations. This will be used to feed other monitoring systems (e.g. the experiment dashboards, GridView). As higher level services are deployed (e.g. COOL) it is planned to develop tests that can be integrated into the grid Site Availability Monitor.

 

Preliminary test data (Foil 11) shows that the performance (across a LAN) is sufficient for all of the experiment scenarios defined at present, without doing any optimisation. Nevertheless a series of technical meetings has been organised with Oracle Corporation to improve performance.

 

The status of the Frontier/Squid tests is summarised on Foil 12. Stress tests have been carried out at CERN to validate the configuration for a production service at Tier-1s for CMS.  A test setup has also been configured for exploratory work by ATLAS. Meanwhile Frontier tests are taking place at FNAL, concentrating on resilience and recovery.

 

Foil 13 gives the request from CMS for the second half of 2006. Servers have already been configured at CERN, 7 Tier-1s, and 7 Tier-2s. A further 19 Tier-2s are scheduled to complete installation within four weeks.

 

(Foil 14) It is very important that all of the sites taking part in Phase 1 consistently attend the coordination meetings, and that they link in to the central monitoring. Time is now very short to complete the throughput tests by the end of June, before starting experiment tests. For the latter it is essential that all participating sites ensure that there is DBA coverage during the full period (July-October). The phase 2 sites are reminded that the deadline is to be in production in October. Installation must therefore start soon. Sites have been requested to inform the project of their plans. L.Robertson asked what the responses from the sites are to these points. D.Düllmann said that his intention was to bring these issues to the attention of the management, but that he did not want to go into details in the MB. L.Robertson will discuss this in more detail with D.Düllmann outside the meeting to identify the most critical cases.

 

On the experiment side (Foil 15) confirmation or updates of their requirements for the October production phase are required for both the database and Frontier services. The need for significant effort from experiments for the test period is underlined.

 

The last slide (Foil 16) proposes adjusted milestones:

  • End of June - Throughput phase closed; experiment application and throughput tests start
  • Early July - 3D DBA day to plan database setup options with new Tier-1 sites
  • Late August - 3D workshop (experiments and sites) defining October setup and service
  • September - Experiment ramp-up tests
  • October - Full service open at all Tier-1 sites

 

J.Templon asked what the buy-in from the experiments is: is this the only database service requested or are other databases required for other applications? D.Düllmann said that his understanding is that this is the only database service for the experiments. G.Poulard replied for ATLAS that it is not excluded that services using MySQL may be required at Tier-2s. He also asked about the ALICE situation. D.Düllmann said that ALICE had not requested any database services outside the Tier-0.

 

G.Merino asked for confirmation that only Squid is required for CMS and only Oracle Streams is required for LHCb and ATLAS. D.Düllmann confirmed this.

 

M.Litmaath said that he understood that all Tier-1s must also provide an Oracle database for the FTS service. L.Bauerdick said that at FNAL the plan was to use MySQL. M.Litmaath said that there are known performance issues with FTS/MySQL that make it unsuitable for Tier-1s, and that these are not classified as priority issues for the developers. L.Bauerdick said that this should be clarified. L.Robertson agreed that this must be done. {After the meeting E.Laure took an action to come up with a clear statement on this for next week. The point had in fact been raised during the previous meeting, on 6 June, during the item on site readiness for SC4. At that time it had been said that the MySQL performance problem would not be fixed in the near future.}

 

G.Poulard asked if additional Oracle licences were foreseen for possible future applications at Tier-2 sites. D.Düllmann replied that a small number of additional licences could be provided for very special cases but that no general use of Oracle at Tier-2s was envisaged.

 

D.Qing asked if catalogs would use the 3D database. D.Düllmann replied that streams replication for LFC was being investigated for LHCb.

 

6.      AOB

 

L.Robertson said that he will send a formal proposal to start the meeting in future at 15:55 in order that business can begin at 16:00.

 

7.      Summary of New Actions

 

 

20 June - E.Laure – prepare a clear statement on the status of support for MySQL in FTS

 

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.