LCG Management Board

Date/Time:

Tuesday 18 March 16:00-17:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=27474  

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 19.3.2008)

Participants:

A.Aimar (notes), I.Bird (chair), K.Bos, T.Cass, L.Dell’Agnello, M.Ernst, I.Fisk, J.Gordon, F.Hernandez, M.Kasemann, M.Lamanna, U.Marconi, H.Marten, P.Mato, G.Merino, A.Pace, Di Qing, M.Schulz, Y.Schutz, O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 25 March 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

-       26 Feb 2008 - The Sites and Experiments should confirm to A.Aimar that they have updated the list of their contacts (correct emails, grid operators’ phones, etc). Here is the current contact information: https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails

 

Information confirmed only by:
Sites:                           ASGC, CERN, CNAF, FZK, NDGF, PIC, SARA
Experiments:               ALICE

 

-       29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.

Not done. The GridView team has been asked, but the recalculation still needs to be implemented.

-       29 Feb 2008 - A.Aimar will verify why the reliability values for the Tier-2 sites seem incorrect (being lower than availability).

 

In progress. Will be fixed for March.

 

-       18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.

 

Will be verified next week.

 

I.Bird noted that measuring the read and write rates is important for the Experiments. The other metrics requested are useful indicators of the performance of the MSS system.

 

K.Bos added that the Experiments need to know whether the data is stored on tape and at what rate; otherwise alarms must be raised.

 

F.Hernandez asked what the actual targets for each Experiment are; these should be the parameters used to judge the performance of the Sites.

 

K.Bos replied that the Experiments probably expect to re-process the data 3 times, and he will distribute the expected parameters for ATLAS. The other Experiments did not state their targets at the meeting.

 

J.Templon added that SARA will be able to provide the metrics requested.

 

New Action:

The Experiments should provide the read and write rates that they expect to reach, expressed as clear values (MB/sec, files/sec, etc.) and covering all phases of processing and re-processing.
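
As an illustration of the kind of figures requested, here is a minimal sketch of how such rates might be derived from a transfer log (the log format and field names below are hypothetical, not an agreed tool):

    # Derive MB/sec and files/sec for tape reads and writes from log lines
    # of the (hypothetical) form "<unix_timestamp> <read|write> <bytes>".
    from collections import defaultdict

    def tape_rates(lines):
        totals = defaultdict(lambda: [0, 0])    # direction -> [bytes, files]
        times = []
        for line in lines:
            ts, direction, nbytes = line.split()
            times.append(int(ts))
            totals[direction][0] += int(nbytes)
            totals[direction][1] += 1
        elapsed = max(times) - min(times) or 1  # avoid division by zero
        for direction, (nbytes, nfiles) in sorted(totals.items()):
            print("%-5s %6.1f MB/sec  %5.3f files/sec" %
                  (direction, nbytes / 1e6 / elapsed, nfiles / float(elapsed)))

    tape_rates(["1205856000 write 2000000000",
                "1205856060 write 2000000000",
                "1205856120 read  3000000000"])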

 

3.   CCRC08 Update (Slides) – J.Shiers
  

J.Shiers presented an update of the CCRC08 activities.

 

CCRC'08 is now in phase 1.5 (i.e. between the February phase 1 run and the phase 2 run in May). There are no formally-coordinated activities or metrics in this phase as yet. Currently it involves individual Experiments doing functionality, throughput and stress testing of their computing model components and sites.

 

The main news items of the week are:

-       The daily meetings continue and are a good place for check-pointing on current activities and following up on problems and issues. See the minutes of the daily meetings for detailed updates (https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08DailyMeetingsWeek080310).

-       The last Asia-Pacific con-call, preparing for the Tier2 workshop prior to ISGC 2008, was held this morning (focusing on ensuring that A-P sites are fully engaged in CCRC’08 activities).

-       Following on from the discussions at the last CCRC’08 F2F, several sessions have been set up at the April F2F to focus on communications issues, as well as service readiness for May.

-       A test of the LHCOPN backup links is foreseen for one day in the week of 7-11 April (unless there are strong objections). A 2-3 hour slot is needed during a time frame suitable for the North American Tier-1s, i.e. late in the Geneva afternoon.

-       The re-organization of the table space of the WLCG production RAC was performed last Monday to free about 400GB of space.

-       The table space re-organization was performed using Oracle's online DBMS_REDEFINITION package, but some Oracle bugs were hit, causing downtime of the SAM and GridView applications (from Monday at 17:00 to Tuesday at 15:00).

-       A meeting was held yesterday with the application developers to improve space management and discuss procedures for future interventions; requirements will be followed up and reviewed monthly.

-       ATLAS COOL and LHCb LFC Streams setups have been split in order to separate the replication to CNAF and PIC and avoid any interference in the replication to the other Tier1 sites during interventions at these sites.

-       DB “house-keeping” operations have been raised before, even during the Service Challenge era: very good communication and coordination with the overall WLCG service continues to be essential.

-       FTS exports (dteam) will start in order to test various patches, initially at a low rate; any stress tests will be coordinated through the daily meeting.

 

Here are the ATLAS activities of the previous week.

 

-       Mon: M6 completed – 40TB of data with calibration streams to 4 Tier-2 sites (Rome, Naples, Munich, Michigan).

-       Tue: Upgrade to DQ2 0.6 went well.

-       Wed: T1-T1 transfers with PIC as first target; initially functional block tests.

-       Thu: Debugging of the above – rates not as high as expected and many transfers stuck. The PIC configuration needs tuning but is shared with other VOs, so negotiation is needed.

-       Fri: At least 3 problems encountered in the T1 tests to PIC. The site is about to enter scheduled power maintenance, so the tests will switch to IN2P3. Debugging will continue next week; the T0-T1 throughput tests are postponed until after Easter.

 

J.Shiers proposed maintaining such a day-by-day view for all Experiments’ activities.

 

These are the upcoming meetings:

-       Next CCRC’08 Face-to-Face Tuesday 1st April (http://indico.cern.ch/conferenceDisplay.py?confId=30246).
Site-focused sessions in the morning, then Experiment- and Service-focused sessions in the afternoon.

-       21st – 25th April WLCG Collaboration Workshop (Tier0/1/2) in CERN main auditorium (58 people registered at 12:00 UTC) (http://indico.cern.ch/conferenceDisplay.py?confId=6552)
Possible themes:

WLCG Service Reliability: focus on Tier2s and progress since November 2007 workshop

CCRC'08 & Full Dress Rehearsals - status and plans (2 days)

Operations track (2 days, parallel) – including post-EGEE III operations!

Analysis track (2 days, parallel)?

-       12th – 13th June CCRC’08 Post Mortem (http://indico.cern.ch/conferenceDisplay.py?confId=26921).
Still in preparation. “Globe event” foreseen for week of 23 June.

 

F.Hernandez asked whether the test of the LHCOPN is organized by the OPN group or whether actions from the Sites are needed.

J.Shiers replied that it is all organized by the OPN group, but the MB ought to be informed in case there were objections to the planned tests.

 

K.Bos said that the WLCG Workshop will be very important for ATLAS: important meetings of the ATLAS Software Workshop, scheduled for the following week, were cancelled, and these meetings will now be organised during the Workshop’s week.

Regarding current activities, ATLAS is testing Tier-1 to Tier-1 transfers in order to tune its setup. ATLAS has started an e-log, hosted where the other CCRC e-logs are stored; it is a very useful place to post and to find daily information.

 

4.   Tier 0 power issues (Slides) – T.Cass
 

T.Cass presented a summary of the electrical power situation of the Computer Centre at CERN.

 

I.Bird explained that the Experiments and the WLCG should be aware of the power issues at the Tier-0. This issue will be discussed at the Overview Board and the Experiments should be well aware of the situation in view of that discussion.

4.1      Background

The current overall capacity of 2.5MW is split between critical and non-critical equipment:

-       250kW to 350kW with diesel backup for “critical” loads (email, EDH, database servers)

-       2.25MW down to 2.15MW for “Physics” (decreasing as the critical load grows)

 

This capacity was planned in ~2000, when PC power consumption looked to be flat at ~100W/box. It has been obvious since ~2005 that 2.5MW is insufficient for the long term: PC power is now understood to scale with CPU capacity in spite of the technological improvements.

4.2      Options

A formal project to plan for additional capacity was requested at end-2006, but not approved.

 

Some informal planning during 2007 concluded that:

-       Construction of a new building at CERN is the most cost-efficient option.
Cost estimates for a building providing 2.5MW capacity initially and growing to 5MW range from 25 to 55MCHF.
Time estimates range from 27 months (IBM: 18 months of construction plus ~9 months needed to select a contractor) to 43 months if the work is overseen by the CERN facilities department.

-       Hosting in the Geneva area is an option to cover short-term needs, but is expensive: 3.6MCHF/year/MW, excluding the cost of electricity.

 

Assuming a cost of 35MCHF for a new building, the costs can be covered within the foreseen IT budget out to 2020, provided that CPU capacity growth is restricted to 30% per year (cf. ~100% annual growth since 1990).
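
As a rough illustration of why this restriction matters, the compounding alone (an illustrative calculation, not from the slides) gives:

    # Compound growth of installed CPU capacity over the 12 years 2008-2020,
    # comparing the restricted 30%/year growth with the historical ~100%/year.
    for label, rate in [("30%/year", 0.30), ("100%/year", 1.00)]:
        factor = (1 + rate) ** 12
        print("%s for 12 years -> capacity x%.0f" % (label, factor))
    # 30%/year  -> roughly a 23x increase
    # 100%/year -> a 4096x increase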

4.3      Current Status

The IT requests to initiate selection of a design and construction company and a hosting company have not been approved.

 

There will be no additional power for computing at CERN before Autumn 2010 at the earliest, possibly not before end-2011.

The current load is 1.7MW; a further ~400kW is expected in the coming months, bringing the load to 2.1MW, essentially at the limit already.

The aggressive removal of older equipment (only 3 years old!) may enable the installation of the required additional CPU and disk capacity for 2009, provided critical loads remain at 350kW; critical demand may in fact be up to 500kW.

 

The installation of the required CPU and disk capacity for 2010 is NOT possible with the current constraints.

The support of the Experiments’ spokespersons is needed at the next Overview Board (31 March 2008) in order to raise the issue.

Otherwise one has to reduce what can be provided to the Experiments in the coming years.

 

L.Dell’Agnello asked whether this calculation includes all CERN activities (Tier-0, Tier-1, CAF, etc) or only the Tier-0 equipment.

T.Cass replied that these are the overall power needs for the coming years.

 

M.Schulz asked whether there is any intermediate space where the equipment could be installed instead of either a new building or hosting outside.

T.Cass replied that the plans were not approved because the available power is considered sufficient, so there is currently no support for any activity on the CERN site. If the management supports the upgrade, all solutions will be studied, including upgrading existing buildings, etc.

 

A.Pace added that the aggressive removal is not really an option: it would only gain a few months, because the critical power needs will also increase (database servers, etc.).

 

H.Marten asked whether the problem is the power for cooling and the lack of space, and whether “black-box” container solutions were evaluated.

T.Cass replied that black-box containers can be temporary solutions but are not cost-effective in the long term. All options would be evaluated if the project is launched; black-boxes (or other solutions) could serve as temporary measures while a long-term solution is implemented.

 

F.Hernandez added that IN2P3 has had problems finding hosting companies able to provide support for 1MW.

T.Cass replied that a facility hosting up to 2 MW in the region could be available in 2010.

 

5.   GDB Summary – J.Gordon

 

J.Gordon summarized the last GDB meeting in March.

 

The main topics discussed were:

-       Monitoring of the LCG services (by J.Casey). A plan was presented for how the monitoring will be distributed. For now it is only an EGEE plan, but the work is done in contact with OSG. RAL and NIKHEF volunteered to be test sites for the solutions proposed.

-       CPU Efficiency. There was a request for more instrumentation, and work needs to be done in order to better measure CPU usage at the sites.

-       Some VOs are worried about tape efficiency and about what controls and changes can be made.

-       Job Priorities work is ongoing.

-       Operational Call-out is an issue and the VOs want to have direct contact with the Sites (an action is in the MB Action List above).

 

6.   CPU/Wall Time Limits – J.Templon

 

There are two conflicting points of view between Sites and VOs.

 

Sites would like to have a maximum wall time for the execution of a job:

-       To preserve scheduling priorities, some cores must be freed quite often rather than being occupied all the time by long jobs.

-       Interventions are bound by the maximum wall time: the longer the wall time, the earlier the site must close the job queues and wait for running jobs to finish before an intervention.

 

Experiments monitor the ratio of CPU time to wall time as the efficiency of their applications (a short worked example follows this list):

-       The agreement was 24h CPU time and 36h Wall time.

-       LHCb asked to move to 48 or 72 hours as they want to process bigger files.

-       A GGUS ticket is raised when the maximum time is exceeded, and this is highly annoying for the VOs because it happens quite often.
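
For reference, the efficiency being monitored is simply CPU time divided by wall time; a minimal sketch using the agreed limits (illustrative code, not an agreed tool):

    # Job efficiency as monitored by the Experiments: CPU time / wall time.
    # The agreed limits (24h CPU, 36h wall) imply that a job using its full
    # CPU allocation must run at 24/36 ~ 67% efficiency to finish in time.
    def efficiency(cpu_hours, wall_hours):
        return cpu_hours / wall_hours

    print("%.0f%%" % (100 * efficiency(24.0, 36.0)))  # prints "67%"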

 

Unfortunately LHCb was not present when the discussion took place, and their view will need to be collected the next time the issue is discussed.

 

O.Smirnova noted that the file size and the job time can be separated: a file can be used by several jobs.

J.Templon replied that merging and splitting files is a time-consuming and dangerous activity that the VOs want to avoid.

 

I.Bird asked whether this is an LHCb-only issue or whether other Experiments have the same problem; perhaps the Sites could set a different limit for LHCb only.

K.Bos replied that ATLAS splits files when needed, but this issue should be studied in ATLAS too.

 

M.Lamanna added that MC processing and re-processing will have similar issues: some raw data files will be large and cannot be processed in 24h. He finds the 24h time limit at SARA too small and asked what other sites are doing in this matter.

J.Templon replied that it is not clear to him whether the issue is limited to SARA or also affects other sites.

 

H.Marten said that FZK has long queues and extra-long queues for the Experiments. The extra-long queue can last up to 96h of wall time.

Some cores are left free for short jobs.

J.Templon noted that FZK has 5 times more cores than SARA and therefore has free cores that can be kept for short jobs. When SARA has more resources (end of 2008), the issue will not be as important.

 

H.Marten noted that if the queue is 96h long, the site will need to close the queue 4 days before a shutdown.

I.Bird noted that this should actually never happen: a site should never plan a complete shutdown of the job queues; upgrades should still allow partial usage of the site.

 

F.Hernandez said that IN2P3 does a complete shutdown every quarter, because it is worse to do interventions while the site is running; it is the most efficient model for IN2P3. The full shutdown is needed because IN2P3 has only one instance each of HPSS and Oracle, and there is no way to keep the site running when the intervention is on such crucial elements.

 

The MB agreed that this is a potential issue to follow, but for the moment it is only a problem for SARA and LHCb, and SARA should solve the problem with LHCb separately. Future general issues on this topic should be followed up and clarified at the GDB.

 

7.   Tape Efficiency Metrics – Sites Roundtable
 

 

Already discussed under the Action List review (section 2 above).

 

8.   AOB

 

No AOB.

 

9.   Summary of New Actions

 

The full Action List, with current and past items, will be available on the Action List wiki page (linked above) before the next MB meeting.