LCG Management Board

Date/Time:

Tuesday 26 February 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=27471

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 29.2.2008)

Participants:

A.Aimar (notes), I.Bird (chair), Ph.Charpentier, T.Cass, L.Dell’Agnello, T.Doyle, I.Fisk, S.Foffano, D.Foster, J.Gordon, C.Grandi, A.Heiss, U.Marconi, P.Mato, G.Merino, A.Pace, Di Qing, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 4 March 2008 16:00-18:00 – F2F Meeting at CERN

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      Availability vs. Reliability (Example) – A.Aimar

A.Aimar described how the Availability and Reliability calculations are performed in SAM and GridView, for the monthly reports.

 

Below, just as an example, are the values for SARA for January 2008 that were discussed during the previous MB meeting.

 

Monthly Availability = 40%  is the percentage of:

-       Up-time (dark blue) vs.

-       The whole time: unscheduled down-time (white) + scheduled down-time (light blue) + up-time (dark blue)

 

 

 

Average Daily Reliability = 57% is the percentage of:

-       Up-time (dark blue) vs.

-       All days not scheduled down (dark blue + red)

 

Reliability is calculated for each day and then averaged over the month; therefore days like 16 to 22 are counted as down, because there is only scheduled down-time during part of the day and the site was down in the period when it should have been up.

 

The MB suggested also computing the Monthly Reliability directly, instead of only using the average of the daily reliabilities:

The monthly reliability, looking at the first graph, would be:

-       Up-time (dark blue) vs.

-       Up-time (dark blue) + Unscheduled down-time (white)

 

Days that are partially scheduled down should not be counted as completely down in the reliability calculation.

 

The graphs above are correct and calculate the reliability as agreed at the MB.
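
To make the definitions above concrete, the following is a minimal illustrative sketch (not the actual SAM/GridView implementation) that applies the three formulas to hypothetical per-day figures:

    # Illustrative sketch only, using hypothetical per-day hours:
    # each entry is (up, scheduled_down, unscheduled_down) and sums to 24 h.
    days = [
        (24.0,  0.0,  0.0),   # fully up
        (12.0,  6.0,  6.0),   # partly scheduled down, partly failed
        ( 0.0, 24.0,  0.0),   # fully scheduled down
        ( 6.0,  0.0, 18.0),   # mostly unscheduled down
    ]

    up      = sum(d[0] for d in days)
    sched   = sum(d[1] for d in days)
    unsched = sum(d[2] for d in days)

    # Availability: up-time over the whole period.
    availability = up / (up + sched + unsched)

    # Average daily reliability: per-day up-time over (up-time + unscheduled
    # down-time), averaged over the days; fully scheduled-down days are
    # skipped here (one possible convention).
    daily = [d[0] / (d[0] + d[2]) for d in days if d[0] + d[2] > 0]
    avg_daily_reliability = sum(daily) / len(daily)

    # Monthly reliability as suggested by the MB: computed over the whole
    # month, so partially scheduled-down days are not counted as fully down.
    monthly_reliability = up / (up + unsched)

    print(f"Availability              = {availability:.0%}")
    print(f"Average daily reliability = {avg_daily_reliability:.0%}")
    print(f"Monthly reliability       = {monthly_reliability:.0%}")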

 

New Action:

5 March 2008 - A.Aimar will add the monthly reliability to the monthly reports of the sites.

1.3      QR Reports (QR_2008Q3_TOC)  

 

The table of contents of the QR report agreed at the MB meeting is the following:

 

 

WLCG

-       WLCG High Level Milestones: Dashboard updated 1 March 2008

-       Sites Reliability: Values of the last months (commented)

-       Comments on the High Level Milestones: From the sites (A.Aimar will collect them)

 

Services, Sites, Applications

-       LCG Services: J.Shiers, will also include ARDA, SRM and DB Services

-       Grid Deployment Board: J.Gordon

-       Applications Area: P.Mato

 

Experiments

-       ALICE: A.Aimar for the report at the MB

-       ATLAS

-       CMS

-       LHCb

 

 

The quarterly report must be finished by mid-March in order to submit it in advance of the Overview Board meeting on 31 March 2008.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

  • 22 Feb 2008 - Experiments should provide feedback on the priorities of the current storage issues. See this link.

Done. Input from ATLAS, CMS and LHCb received.

  • 26 Feb 2008 - Discuss/agree on a milestone on providing tape efficiency metrics for the Tier-1 sites, to be introduced before CCRC08 in May.

On the way. Input from the sites is being collected (also later in this meeting).

  • 26 Feb 2008 - A.Aimar verifies the values of the GridView reliability calculations for NL-T1.

Done.

 

3.   CCRC08 Update (Slides) - J.Shiers

 

 

Data exports from CERN are now running at 1-2 GB/s. The peak is about 1 GB/s greater than the average achieved during SC4 as dteam (1.3 GB/s). CMS alone has managed more than 1 GB/s and ATLAS is starting to run at similar rates.

 

The number of problems reported in the e-logbook is generally rather low, a few per VO per day. Quite a few are either known issues or configuration problems.

 

The daily / weekly (OPS) meetings continue to work well. We will continue these during March and beyond. Site participation in the daily calls could be improved.

 

Discussions on SRM v2.2 bugs / problems and longer term issues are progressing.

 

Some improvements in announcing interventions, problems and their resolution could still be made.

A few concrete actions are needed; e.g. the on-call rota for FIO services and the procedure for expert call-out by operators were not publicized prior to CCRC’08.

 

The existing support levels are the following:

-       ALL services: WLCG / “Grid” standards

-       KEY PRODUCTION SERVICES: + Expert call-out by operator

-       CASTOR / Physics DBs / Grid Data Management: + 24 x 7 on-call

 

-       On-call service established beginning February 2008 for CASTOR/FTS/LFC (not yet backend DBs)

-       Grid/operator alarm mailing lists exist – need to be reviewed & procedures documented / broadcast

 

Slides 5 to 7 show the data rates reached during several days (watch the slides with animation, link) and the SRM endpoints installed (mostly DPM).

 

The next F2F Meeting is on Tuesday, 4th March in IT amphitheatre / EVO:

-       m/w review / roadmap

-       Storage-ware ditto

-       Site review (view from external site)

-       Experiment reviews

-       Service review

-       CCRC’08 Calendar

 

Next F2F meetings will be:

-       Tuesday 1st April

-       21st – 25th April in CERN main auditorium

-       12th – 13th June IT amphitheater / Council Chamber TBD

 

Only a small number of problems have triggered the “MoU Post-Mortems”, e.g. RAL down and the ATLAS exports from Feb 15 onwards. Both cases highlight the need for some ‘homework’:

-       Revisit emergency site contact phone numbers & procedures

-       Clarify & publicize (& exercise) on-call services & expert call-out

-       Review the gridmaps in order to have a clear view of where and when problems arise.

 

May is just around the corner: there is very little time for fixes and too little for any ‘developments’.

 

All experiments plan to continue at least into March – and preferably beyond.

 

4.   Tape Efficiency Metrics (Tape Metrics) - Sites Roundtable

The LHCC Referees asked that all Tier-1 sites also collect tape efficiency metrics in the near future, during CCRC08.

 

I.Bird verified the status of the sites present at the meeting that are currently not publishing tape efficiency data:

-       FZK: some metrics will be available very soon.

-       NDGF: Started this week for ATLAS. Nothing for ALICE yet.

-       CNAF: There are problems getting the scripts to work on the CASTOR log files. Some data on reading is available and will be added.

-       ASGC: Will ask CERN about the scripts used for CASTOR.

-       FNAL: Data is already published.

 

T.Cass noted that BNL mentions that the metrics are not relevant for their storage system.

 

5.   CMS QR Report (Slides) – M.Kasemann

 

M.Kasemann presented the CMS grid activities since October 2007:

-       CSA07 performance & summary

-       PADA Taskforce

-       CCRC08-February tests

5.1      CSA07

Below are the workflows of the CSA07.

 

 

CSA07 produced 160M Monte Carlo events since October 2007, working on requests from the Physics, DPG and HLT groups.

 

But in total the CSA07 event counts were: 

                        80M                 GEN-SIM

                        80M                 DIGI-RAW

                        80M                 HLT

                        330M               RECO

                        250M               AOD

                        100M               skims (mixed RECO/AOD)

 

This gives a total of 920M events. The MC events were processed and reconstructed in several steps, several times (see slide 3).

 

The CSA07 signal samples have grown over time: they started at 50M and are now up to 85M (not a real problem). The total data volume of the CSA07 samples is currently 1.9 PB, not counting the repetitions.

 

There is a full list of lessons for Offline and Computing on the Twiki: Link

 

In CSA07 a lot was learned and a lot was achieved. The production infrastructure is in full operation, and the CSA07 analysis identified tasks to be addressed.

 

Two strategies were derived for Computing:

-       A new Task Force: Integrating development, deployment and commissioning
Processing And Data Access (PADA), coordinated by I.Fisk and J.Hernandez

-       Testing the computing infrastructure in CCRC08/CSA08 in February and preparing the scope for May ‘08

5.2      PADA Task Force Activities

The Processing and Data Access Task Force is a series of tasks and programs of work, designed to bring the Computing Program into stable and scalable operations. See Link.

 

Here are the sub-tasks of the PADA task force already launched (more to come):

-       Distributed production commissioning (J.Hernandez)

        -       Integration, commissioning and scale testing of the organized production workflows at Tier-1 (reprocessing and skimming) and Tier-2 (MC production) sites.

        -       Improve the level of automation, reliability, efficiency of resource use and scale of the production system, reducing at the same time the number of operators required to run the system.

        -       Commissioning of new components of the production system.

        -       Perform functionality, reliability and scale tests.

-       Monitoring activities (S.Belforte, A.Trunov)

        -       Integration of monitoring tools.

        -       Gather needs and input from users.

        -       Provide feedback to developers; testing and evaluation.

        -       Help in defining user/site monitoring views.

-       Site commissioning (F.Matorras, S.de Weirdt)

        -       Demonstrate that CMS can access the resources that are pledged to CMS.

        -       Test scalability of CEs and storage for CMS-style workflows.

        -       Site commissioning is a step before demonstrating that the CMS workflow tools can be scaled.

        -       Verify that the workflows don’t interfere.

        -       Verify that analysis and production jobs are shared on Tier-2s.

        -       Find the stable operating points of skimming and reconstruction for the Tier-1 sites.

-       Analysis activities (A.Fanfani)

        -       User feedback: collect inputs from the user community and provide feedback to developers.

        -       Organize integration and testing of new functionalities of the analysis tools.

        -       Deployment of the CRAB server.

-       Data transfer commissioning (DDT) (J.Letts, N.Magini)

        -       Demonstrate that the Tier-1 and Tier-2 sites are capable of utilizing the networking as specified in the Computing TDR.

        -       Demonstrate that data management tools, networking and storage configuration at sites are adequate for data transfers at the required scale.

        -       Perform link commissioning and testing following the new DDT metrics.

 

The new DDT metric, including regular exercising of the links, has been in place since 11 February.

 

There are currently 311 commissioned links:

-       52/56               T[01]-T1

-       162/362           T1-T2

-       90/352             T2-T1

5.3      CMS CCRC’08

Here are the CMS goals for CCRC08 in February and May 2008.

 

Phase 1 - February 2008: blocks of functional and performance tests

-       Verify (not simultaneously) solutions to CSA07 issues and lessons

-       Attempt to reach ‘08 scale on individual tests at T0, T1 and T2

-       Cosmic run and MC production have priority if possible

-       Tests are independent from each other

-       Tests are done in parallel

Phase 2 - May 2008: Full workflows at all centers executed simultaneously by all 4 LHC experiments

-       Duration of challenge: 1 week setup, 4 weeks challenge

-       CMS scope is being defined these days

5.4      Phase 1 Tests - Data recording at CERN

Readout from P5, use HLT with stream definition, use the Storage Manager, transfer to T0, perform repacking, write to CASTOR (D.Hufnagel). The goal is to verify the dataflow for CMS and to commission the new 10 Gb fibre; the 1 Gb fibre has been used for Global Runs for a long time.

 

Status:

13.2.08: First successful transfer on new 10 Gb fibre at 100MB/s (limited by transfer node)

 

Next steps:

-       integrate into the transfer system

-       run in parallel to normal data transfers

5.5      Phase 1 Tests - CASTOR data archiving test

The goal is to verify CASTOR performance at full CMS and ATLAS rate (M.Miller / DataOps team).

Status: very successfully completed; a rate of 1.5 GB/s was reached

-       Good coordination with CERN-IT, quick response

-       Test at all-VO rate, other VO’s didn’t stress the system

 

 

The graph above shows that the 1 GB/s rate is easily reached and also that it was doubled in order to recover from a temporary lack of tapes. CASTOR could easily catch up by running at the higher rate.

5.6      Phase 1 Test - High Rate Processing at Tier-0

The goal is to do “high-rate” processing of CPU/RAM-limited jobs. Originally the aim was to measure the interaction with other VOs on the same WN, but CMS does not currently share WNs with other VOs at CERN.

 

The setup is as in regular operations (physics requests): ReReco with 0 pb-1 conditions of Stew and Gumbo.

 

The current status is:

-       Started with 41k jobs of the 80 TB Stew AllEvents

-       Finished in the expected time

-       Not much activity from other VOs, no sign of WN problems

-       Again turning into a CASTOR I/O test

 

The graph below shows the input and output of the transfers, with averages of about 1 GB/s.

 

 

5.7      CCRC08 Transfer Tests

CMS also successfully transferred large amounts of data between the Tier-0 and Tier-1 sites.

 


 

The goal is to use SRMv2 data transfers where possible

The target rates to reach are:

-       T0-T1: 25/40/50% of full 2008

-       T1-T1: 50% in+outbound

-       T1-regional-T2: full/high rate

-       T2-regional-T1: full/high rate

 

A detailed plan was worked out on how to cycle through different parts of all link combinations each week.

The tests are progressing well:

-       T0-T1 metric goal reached by all T1s

-       5 out of 7 T1s reached the T1-T1 goal

-       Individual problems are being addressed and result in delayed testing

-       A more detailed analysis will be available at the end of February

 

The graph below shows the deployment of SRMv2 at the CMS sites. By week 3 it was basically completed.

 


5.8      CCRC08 Reconstruction Tests

Measure the performance of:

-       Migration from Tape to Buffer: pre-stage test.

-       Reprocessing exercise: use all available CMS CPU-slots at T1s

 

The plan for this test is:

-       Select one (or more) dataset(s) of ~10TB size existing at T1.

-       Remove all the files from disk (aka, T1_Buffer).

-       Fire the staging from Tape to Buffer of all files.

-       Monitor the process and provide some measurements/plots

-       Run Re-reconstruction over CSA07 data present at all T1s and measure performance

 

Currently the tape-to-buffer pre-staging has been successfully completed at all sites:

-       Results: total staging time 8-44 h, rates of ~80-250 MB/s observed

-       Except at IN2P3, where performance was poor; the test will be reconfigured and redone
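
As a rough consistency check (an illustration, not from the minutes: it assumes the ~10 TB dataset size quoted in the test plan above, while real per-site volumes may differ), the quoted staging times translate into average rates as follows:

    # Rough sanity check (hypothetical figures): convert the assumed dataset
    # size and the quoted staging times into average transfer rates.
    dataset_mb = 10 * 1_000_000          # ~10 TB expressed in MB (decimal units)
    for hours in (8, 44):                # fastest and slowest quoted staging times
        rate = dataset_mb / (hours * 3600)
        print(f"{hours} h -> ~{rate:.0f} MB/s")
    # Prints ~347 MB/s for 8 h and ~63 MB/s for 44 h, which brackets the
    # observed 80-250 MB/s range.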

 

High-performance processing without overlap with ATLAS:

-       Finished at FNAL (1200 slots), CNAF (1000-1300 slots), FZK (600 slots), ASGC (300 slots)

-       IN2P3, PIC and RAL: normal processing, no problem foreseen

 

Processing test together with ATLAS planned at two Tier-1’s:

-       a special queue for ATLAS and CMS is set up at IN2P3 and PIC

 

Below are the number of files and the total volume of the data streamed to/from the different sites.

IN2P3 and CNAF need to be verified.

 


5.9      CCRC08 Monte Carlo Tests

The goals of these tests are:

-       Production tests of FastSim Monte Carlo

-       Physics groups want to use 50M of the CSA07 samples (100pb-1 calibration), reading AOD’s.

 

The current status: the Fast Simulation production based on CMSSW_1_6_9 successfully completed 50M events on Monday.

5.10    CCRC08 CAF Tests

CMS intends to ramp-up the CAF resources and verify the basic CMS use cases at a realistic scale.

Good progress was made. The resources are configured according to plan, and regular CAF meetings are held with user representatives (Global Run, ALCA and Physics).

5.11    CMS QR Summary

The Computing infrastructure is fully utilized for ongoing production. The original CSA07 production (and much more) was finished, and a detailed analysis of CSA07 performance was performed.

 

Direct result for Computing: defined PADA tasks and CCRC08 functional tests. The PADA taskforce addresses deployment, integration, commissioning and scale testing. It will bring the elements of the Computing Program into stable and scalable operations.

 

The CCRC08 functional tests in February complemented the CSA07 activities and tested additional functionality.

 

The detailed planning of CCRC08-May, i-CSA08 and f-CSA08 is progressing and CMS expects to agree on initial scope and goals during CMS week.

 

I.Bird commented that the status seems very positive.

M.Kasemann agreed with the statement but noted that the current tests do not cover as much as is needed.

He also clarified that, even if not stated in the presentation, the system requires a considerable amount of human follow-up and some major development is still needed. The memory usage of the CMS applications must still be considerably reduced and will only be ready for production after May (the work will be completed only at the end of April and only tested in May).

 

 

6.   AOB

 

 

LHCb QR Report (Slides) POSTPONED to the following week.

 

No AOB.

 

7.   Summary of New Actions