LCG Management Board

Date/Time:

Tuesday 4 April 2006 at 17:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a061511

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 2 - 18.4.2006)

Participants:

A.Aimar (notes), N.Andersson, L.Bauerdick, S.Belforte, I.Bird, D.Boutigny, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, I.Fisk, B.Gibbard, J.Gordon, M.Lamanna, P.Mato, H.Marten, G.Merino, B.Panzer, L.Perini, Di Qing, H.Renshall, L.Robertson (chair), Y.Schutz, J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 18 April 2006 at 16:00

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of the Previous Meeting

Minutes approved.

1.2         Matters Arising (transparencies )

 

Xrootd workshop at CERN (see agenda). The goal was to agree on how to interface xrootd with the various MSS systems. A plan for the next two months was drafted, and it was decided to use the ALICE DAQ-T0 tests to verify the implementations.

 

Meeting on PROOF at CERN on 28 March. The discussion was about (1) the configuration and operation of a PROOF farm for ALICE and (2) the plans of the PROOF project. ALICE plans to use PROOF on the CAF for 20% of the capacity. Two or three users for each sub-detector will be allowed to use the farm, for a total of 50-100 people; access will be organized by ALICE, and the PROOF team will install the software on the assigned machines.

The actions agreed were:

-          PH/SFT will organize a presentation in the Application Area Meetings;

-          The plan for PROOF will be completed by the end of April, focusing on robustness, scheduling and software configuration/installation;

-          Installation and monitoring will follow the standards defined by the IT/FIO group. The PROOF master will be modified in order to provide more detailed information on the usage and users of the farm resources.

 

The usage of PROOF on the CAF will be as defined in the ALICE computing model.

 

The 2006Q1 Quarterly Reports to be filled in are available, and the link to the wiki page was distributed after the MB meeting. The deadline for sending back the reports is 11 April 2006.

The reviewers identified are H.Marten and D.Salomoni. A reviewer from ATLAS will be proposed before the review starts next week.

 

Update after the MB meeting: the reviewer proposed by ATLAS is L.Perini.

 

2.      Action List Review (list of actions)

 

 

No due actions.

 

3.      Status of DAQ-T0-T1 planning (more information) - B.Panzer

 

 

The milestones for the DAQ-T0-T1 plans (Slide 2) are:

-          Plan the detailed architecture, implementation and testing, in order to reach the nominal data rates by the end of 2006 and be fully operational by April 2007.

-          Plan for end-to-end DAQ-T0-T1 testing, with a full-path demonstration of DAQ-to-T0 transfer, tape recording, reconstruction and distribution, including the conditions database, for at least two experiments by the end of July.

 

Several activities will take place at the same time, and the different usage modes of the experiments (slides 3 to 5) need to be tested. The tests done so far emulate several use cases (slide 6), including DAQ and Castor2, as well as internal IT tests. These emulations do NOT use the real data chain and data sources.

 

The tests for 2006 will be performed with ALICE, ATLAS and CMS; the milestones of the tests need to be synchronized with the experiments’ plans. LHCb’s tests will be done during 2007.

 

The facilities available (slides 7 and 8) for these tests are:

-          a large dedicated data-challenge installation;

-          a dedicated installation for the SC4 throughput tests;

-          new large experiment-dedicated Castor2 installations, currently being set up and sized.

 

The tests involve a dozen streams and data buffers, and there are several setups and different configurations that need to be tested (slide 9). As examples, slides 10 and 11 show the CMS and ATLAS data flows that are going to be tried and studied.

 

The initial plans and schedules are shown in slide 14. Most of the activity involving DAQ, Tier-0 and Tier-1 will be in the period June-September. A detailed compilation of the current status and plans is under way (to be completed within the next two weeks).

 

Les Robertson expressed concern that there is as yet no detailed planning, and questioned whether it will be possible to achieve the targets of the LCG milestones this year.

 

Action:

26 Apr 2006 B.Panzer - Detailed plan for the DAQ-T0-T1 activities in 2006, matched with the experiments’ milestones.

 

4.      Site reliability update (more information) - H.Renshall

 

 

Availability metrics should be available from the start of SC4; therefore the SFT and GridView teams are working on:

-          Service availability monitoring tools (SAME)

-          GOCDB replication and synchronization

-          GridFtp file transfer monitoring

-          Proposed availability definitions and measurement periods

 

The Service Availability Monitoring Environment (SAME) is the evolution of the SFT test environment; it provides data transport and archiving based on web services, storing all test results in an Oracle database.

 

The visualization of the monitoring data is being developed; the MB should decide which kinds of views are most relevant and should therefore be developed first.

 

The GOCDB work (slide 3) consists of maintaining and updating a local replica of the required GOCDB tables in the GridView database. Scheduled downtime and levels of availability are also being monitored. How scheduled downtime should be treated and accounted for (vs. unscheduled downtime) will be discussed in the future; what is important now is to collect all the information about the sites.

 

Several aspects of GridFtp monitoring have been improved (slides 4 to 6), and the views provided are very useful for monitoring global and site-to-site transfer rates.

 

Currently the SFT tests are submitted at regular intervals. “Availability” is defined as the percentage of successful tests vs. the total number of tests submitted in a given time interval. The full set of tests is submitted every three hours, and each hour the previously failing tests are resubmitted.
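As an illustration of this definition, the following is a minimal sketch (in Python) of the availability calculation; the record layout, example numbers and function names are assumptions for the example, not the actual GridView/SAME implementation.

from datetime import datetime, timedelta

# Hypothetical per-test records: (submission timestamp, passed?) for one site and VO.
def availability(results, start, end):
    """Fraction of successful tests among all tests submitted in [start, end)."""
    in_window = [passed for ts, passed in results if start <= ts < end]
    if not in_window:
        return None  # no tests submitted in this interval
    return sum(in_window) / len(in_window)

# Example: four submissions during one calendar day, three of them successful.
day = datetime(2006, 4, 4)
results = [(day + timedelta(hours=3 * i), ok)
           for i, ok in enumerate([True, True, False, True])]
print(availability(results, day, day + timedelta(days=1)))  # 0.75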

 

The intention is to start submitting more frequently: execute the full tests hourly and quickly retry each failing test. These tests should be treated as high priority at all sites, so that they are not queued for a long time.

 

Concerning “service degradation” (slide 9): if a load-balanced service provides half of its capacity, the tests are counted as “half” succeeded. But if the service provided is still above the “expected” rate/capacity, then the service is still considered fully operational. After some discussion it was agreed that this will not be applied in the short term: replicated services will be assumed to be 100% available if any part of them is available. Later, more sophisticated tests may be used to measure performance degradation.
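To make the two counting rules concrete, here is a small illustrative sketch (the per-instance outcomes and function names are assumptions, not part of the minutes): the proportional rule counts a partially degraded load-balanced service as partially available, while the rule agreed for the short term treats it as fully available as long as any part of it passes.

# Per-instance outcomes of one test round against a load-balanced (replicated) service.
instances = [True, False, True, False]  # hypothetical: 2 of 4 instances pass

def availability_proportional(outcomes):
    """Degradation-aware rule discussed: 2 of 4 instances passing counts as 0.5 ("half" succeeded)."""
    return sum(outcomes) / len(outcomes)

def availability_short_term(outcomes):
    """Rule agreed for the short term: fully available if any part of the service is available."""
    return 1.0 if any(outcomes) else 0.0

print(availability_proportional(instances))  # 0.5
print(availability_short_term(instances))    # 1.0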

 

Four time periods are calculated and displayed for each site and for each VO:

-          Hourly, for the 24 hours preceding the last submission of the tests (this will be used by the CIC-on-Duty operations)

-          Per calendar day (as per the MB requirement)

-          Per calendar week

-          Per calendar month

 

Plots can be produced quite easily by extracting the data from the Oracle database, as shown in slide 12.

 

5.      Proposal for organization of work on job reliability (transparencies) - M.Lamanna

 

 

The goal of this proposal is to improve the LCG service by quantifying the reliability of job execution. It was triggered by the need to measure LCG efficiency clearly, with performance/error measurements, and to quantify the impact on the work of each LHC experiment.

 

In order to do this, more information needs to be extracted both from the grid infrastructure and from the experiments. This information is available in the logging and book-keeping databases, in the production services and in the experiment systems (CRAB, MonALISA, etc.). But such information should be aggregated, and correlations among the different sources of information should be made available, in order to understand the reasons for the job failures. Grid sites also have a lot of error and logging information and should therefore be involved; their log files contain information (e.g. disk space, CPU load) that is often useful for understanding the context of the job execution.

 

For now there is no need to develop new sources of data or to change their format. The work should rather focus on combining the existing information from services, sites and experiments (e.g. the CMS Dashboard). Analysis of this data will, for now, only be possible off-line (post-mortem), because it is very time-consuming and no distributed debugging can be performed online. In the light of this experience, priorities should be established for providing additional information or for re-organising existing log data.

 

The proposed activity should last 6-9 months and involve people in the IT/PSS section, in particular those who already have experience working with middleware and grid services for the experiments. The deliverables should (1) assess the limiting factors and major problems, (2) involve the developers of the services, deployment and middleware teams in defining priorities and solutions, and (3) follow up the development of the solutions and steer the priorities of the implementation.

 

M.Lamanna will coordinate this group. The MB recommends that the JRA1 and deployment people should also be involved in the problem analysis and should agree on the priorities.

 

The MB agreed that the focus should be on finding and solving the main issues still open, involving sites, experiments, middleware developers, deployment and all other parties concerned. The best approach is to concentrate on only one or two experiments (ATLAS and CMS for now) and solve the main issues found. In this way it will be much simpler to get started with the activity, and most probably the solutions found will also be useful to the other experiments.

 

6.      AOB

 

 

No AOB.

 

7.      Summary of New Actions

 

 

Action:

26 Apr 2006 B.Panzer - Detailed plan for the DAQ-T0-T1 activities in 2006, matched with the experiments’ milestones.

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.