LCG Management Board

Date/Time

Tuesday 3 June 2008, 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=33697

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 7.6.2008)

Participants  

A.Aimar (notes), I.Bird (chair), D.Britton, G.Cancio Melia, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, X.Espinal, I.Fisk, S.Foffano, J.Gordon, F.Hernandez, M.Kasemann, H.Marten, P.Mato, B.Panzer, R.Pordes, G.Poulard, Di Qing, H.Renshall, M.Schulz, Y.Schutz, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 10 June 2008 16:00-18:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      Matters Arising on Accounting (Accounting Summaries; From ALICE; From Ian) - I.Bird

There were several discussions and emails about the accounting reported by ALICE the week before.

 

Y.Schutz explained the values of the previous week.

 

 

The slide above explains each column.

 

I.Bird commented that:

-        The first column is incorrect because it includes three months of 2007 pledges and April with the 2008 pledges. The numbers cannot be derived from the published documents.

-       In the second column “Delivered to ALICE (wall)” the values are incorrect for CERN, SARA-NIKHEF and RAL and differ from the Accounting Summaries (Link). The values should be divided by 30, and not by 120.

 

Y.Schutz agreed that he had made a typo and was dividing by 130 and not by 30.
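
As an illustration of why the divisor matters, the sketch below averages a period total over different numbers of days. The total is a hypothetical figure, not a value from the ALICE slide or the Accounting Summaries, and reading the divisor as a number of days is an assumption; only the divisors 30, 120 and 130 come from the discussion above.

```python
def average_per_day(period_total: float, days: int) -> float:
    """Return the average daily value for a total accumulated over `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return period_total / days


# Hypothetical total for one month (illustrative only, not real accounting data).
period_total = 3900.0

# Divisors mentioned in the minutes: 30 (stated as correct), 120 and 130 (stated as wrong).
averages = {days: average_per_day(period_total, days) for days in (30, 120, 130)}

# Factor by which each wrong divisor understates the value obtained with 30.
understatement = {days: averages[30] / averages[days] for days in (120, 130)}

for days, value in averages.items():
    print(f"divisor {days:>3}: {value:.1f}")
```

Under these assumptions, a divisor of 120 or 130 instead of 30 makes the derived value too low by a factor of about 4 or 4.3 respectively.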

 

F.Hernandez added that:

-       The accounting for IN2P3 is incorrect. IN2P3 is reporting only the Tier-1 part, not the whole capacity for ALICE.

 

 

The values highlighted in green show that the capacity delivered to ALICE, as measured by ALICE, gives a different value.

Ph.Charpentier noted that this is actually what is “used” by the Experiments.

 

T.Cass noted that the other columns do not use similar units and are therefore not comparable.

J.Gordon asked that the number of jobs be compared to see whether there are jobs accounted to ALICE that are not taken into account by the ALICE tools.

 

I.Bird added that the inefficiency of the pilot jobs should be taken into account, as well as how the different normalizations influence the result.

 

New Action:

H.Marten and Y.Schutz agreed to verify and re-derive the correct data and the normalization factors applied (at FZK for instance), and to compare the local accounting, the APEL WLCG accounting and the ALICE MonALISA accounting.
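
One possible way to carry out such a comparison is sketched below: each source's raw CPU time is multiplied by its normalization factor and the results are expressed relative to a chosen reference source. The source names follow the action above, but all numbers, the factor values and the function names are hypothetical.

```python
from typing import Dict


def normalize(raw_hours: float, factor: float) -> float:
    """Apply a site normalization factor to raw CPU hours."""
    return raw_hours * factor


def relative_to(reference: str, values: Dict[str, float]) -> Dict[str, float]:
    """Return (value - reference) / reference for every other source."""
    ref = values[reference]
    if ref == 0:
        raise ValueError("reference value must be non-zero")
    return {name: (v - ref) / ref for name, v in values.items() if name != reference}


# Hypothetical raw CPU hours and normalization factors (illustrative only).
raw_hours = {"local": 1000.0, "APEL": 1000.0, "MonALISA": 800.0}
factors = {"local": 1.5, "APEL": 1.5, "MonALISA": 1.5}

normalized = {name: normalize(raw_hours[name], factors[name]) for name in raw_hours}
differences = relative_to("local", normalized)

for name, diff in differences.items():
    print(f"{name}: {diff:+.1%} relative to local accounting")
```

A non-zero difference between sources that use the same normalization factor would point to jobs or CPU time missing from one of them, whereas differing factors would show up as a constant scale offset.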

 

Change of ALICE data rate

I.Bird added that the 5x increase in the data rate is a change made by ALICE. The agreed pledges are the ones from before this change. ALICE cannot change the pledges and present them to the MB. This has to be dealt with outside the MB (at the RRB or other bodies).

The up-to-date Requirements and Pledges are at: http://lcg.web.cern.ch/LCG/planning/phase2_resources/P2PRCcaps260508.pdf

 

PROOF support

I.Bird asked whether they referred to Xrootd support. He noted that support for longer than the current contracts of these individuals cannot be guaranteed. Priorities for posts are continually reviewed.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 30 Apr 2008 – Sites send to H.Renshall plans for the 2008 installations and what will be installed for May and when the full capacity for 2008 will be in place.

 

H.Renshall is back and A.Aimar will ask him for an update to the MB in the coming weeks.

 

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

 

It will be installed on the pre-production test bed PPS at CERN and LHCb will test it.

Other sites that want to install it should confirm.

Update: Missing information from INFN, ATLAS, CMS and LHCb.

The MB agreed that Experiments should send signed emails and that the email addresses should be created. Sites will decide how to handle the alarm.

 

MB Decision:

The agreed policy is that users must sign their emails in order to send alarm emails. The Sites will handle the emails as they prefer (manually or automatically).

 

 

3.   CCRC08 Update (Draft Agenda of June Post-mortem Workshop; Draft LCG Service QR; Draft LCG Service QR (updated); Slides) - H.Renshall

 

3.1      Status

CCRC’08 is now formally over – leaving behind a good service base on which to build steadily but non-disruptively.

We will have an extended post-mortem next week (Link) – today we will (briefly) summarize the key events of the last week of the May run together with some initial lessons.

Some minor service “fixes” can be expected in June – these MUST be minor integrated over the whole service. They must be well motivated and (highly preferably) tested with a scheduled load test. Further upgrades can be expected later in the year – we are now in full data taking / production mode!

3.2      Weekly Highlights

Power cut on last day of run affected CERN services heavily. Details at post-mortem. This and other power / cooling problems are (yet another) reminder of our exposure in these areas. Both Tier0 (ATLAS) and site (TRIUMF) emergency contacts procedure activated. Both showed weaknesses – reminder that the procedures must be very simple, as well as exercised if they are to work when needed.

 

ATLAS SRM (CERN) service interruption (Thu) to reduce exposure to problems seen (various) with b/e DB

Minutes for the week are here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08DailyMeetingsWeek080526

3.3      Sites

TRIUMF – CA problem solved. See comments above on emergency contacts phone procedure (also discussed at last week’s MB).

NIKHEF – New Thumpers in production last week. Still gaining experience. Bug in growing partition size: xfs bug when growing a filesystem beyond 2TB (integer overflow -> corrupt superblock). 2TB of ATLAS data (ATLASDATADISK) currently not accessible: keep until the end of the week and then dump, don't try to recover them.

SARA - still troubles with new dCache release (p4) - gsidcap doors less stable - should other sites upgrade?

 

J.Templon commented that the cause of the problems at NIKHEF is not clearly identified.

Ph.Charpentier added that P2, P3 and P4 are faulty releases and sites should not upgrade to them.

3.4      Network

(Thu) One LCG backbone router (l513-c-rftec-4.cern.ch) will be configured to provide some special debug output.

-       This maintenance was completed successfully without any impact on the traffic.

 

(Fri) LCG OPN: Following the power incident this morning we have detected a problem on the router module that connects to IN2P3. We need to immediately intervene on it.

-       DATE AND TIME: Friday 30th of May 14:20 to 15:00 CEST

-       IMPACT: The link LHCOPN CERN-IN2P3 will flap several times. Traffic to IN2P3 will be rerouted to the backup path.

3.5      DB Services

(Mon) CMS dashboard high-load: tuning / optimization of application being scheduled

 

(Thu) Intervention 17:00 - 19:00 on ATLAS RAC of TRIUMF - transparent - to increase memory from 4GB to 10GB per node.

 

(Thu) CASTOR SRM (ATLAS) move to new DB h/w

 

(Fri) Affected rather heavily by power cut. Details in slide notes. Mixed configuration with ½ servers on critical power and ½ not seems sub-optimal (let’s say…)

 

Upgrades to Oracle 10.2.0.4 scheduled for the coming weeks

-       LHCb June 5, LCG June 16 + power intervention June 25(?)

3.6      Experiments

ATLAS, CMS and LHCb continued activities as foreseen during this last week. ALICE began a rapid ramp-up in activities in preparation for the 3rd commissioning exercise.

Summaries from the experiments:

-       ATLAS:

-       CMS: https://twiki.cern.ch/twiki/bin/view/CMS/CCRC08-Phase2-OpsElog

-       LHCb: http://lblogbook.cern.ch/CCRC08/514,

 

https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/106

[Defer discussion on value and feasibility of central reporting on experiments’ activities to next week’s F2Fs…]

3.7      WLCG Services

The service runs smoothly – most of the time. Problems are typically handled rather rapidly, with a decreasing number that require escalation.

Most importantly, we have a well-proved “Service Model” that allows us to handle anything from “Steady State” to “Crisis” situations.

 

We have repeatedly proven that we can – typically rather rapidly – work through even the most challenging “Crisis Situation”. Typically, this involves short-term work-arounds with longer term solutions.

 

It is essential that we all follow the “rules” (rather soft…) of this service model which has proven so effective.

3.8      CCRC08 Procedures

We review on a weekly basis if problems were not spotted by the above -> fix [ + MB report ]

With time, increase automation.

-       Experiment “shifters” use Dashboards, experiment-specific SAM-tests (+ other monitoring, e.g. PhEDEx) to monitor the various production activities

-       Problems spotted are reported through the agreed channels (ticket + elog entry)

-       Response is usually rather rapid – many problems are fixed in (<) < 1 hour!

-       A small number of problems are raised at the daily (15:00) WLCG operations meeting

3.9      CCRC09 Outlook

The proposal is that there will be a CCRC09.

 

There will be new and updated software and services: SL(C) 5, CREAM, Oracle 11g, SRM v2.2++, SCAS, etc. New hardware will be installed in 2009 and there will be the transition to EGI from EGEE III.

 

Ph.Charpentier asked whether the MB agreed about a CCRC09 as the Experiments will be doing reprocessing and new MC simulations in that period. There will not be time for a challenge: are there resources for this?

I.Bird replied that the upgrades will be scheduled progressively, but there will be a few weeks in which the system is commissioned and accepted after a full verification. Not all updates will be done at the same time; they will be scheduled beforehand and verified at CCRC09, which will try to check all bottlenecks when the system is pushed to its limit.

 

CCRC09 should be agreed in more detail at the MB in the future.

 

4.   ATLAS QR Presentation - POSTPONED TO NEXT WEEK

 

 

 

 

5.   AOB
 

5.1      dCache Problems

J.Templon stated that the situation with the dCache releases and upgrades is worrying:

-       Patch levels from dCache included new functionality that caused new problems and even deleted existing information (all space tokens deleted).

-       dCache must provide patch levels that contain ONLY fixes and nothing else.

 

Ph.Charpentier and F.Hernandez added that LHCb continue to experience random (and very annoying) problems accessing data managed by dCache at CCIN2P3. The causes of those problems are still not understood.

Ph.Charpentier added that PIC and GridKa have no problems with the same version. The difficulty is that no two sites run exactly the same version of dCache, so it is impossible to compare the installations.

 

I.Bird asked that the information on these dCache issues be mailed to him. He will meet with dCache after the GDB meeting next week.

5.2      WLCG Press Releases

CERN will organize a full day about the WLCG at the end of September, and a press release will also be issued then. However, it is useful to state now that CCRC was a success and that the WLCG is waiting for data.

 

I.Bird reported that there should be one release now and another in September. The statement is that the computing is ready for data; it should be agreed by all countries, with a common part and specific parts for the different Sites.

 

M.Kasemann noted that if the press release is not done now there will be no impact as if it is done at the start of data-taking. The Workshop next week could be the trigger for the press release.

 

I.Bird concluded that the MB should check at the Workshop whether the press release will be done now, and that Sites should check with their press offices.

 

 

6.   Summary of New Actions

 

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.

 

New Action:

H.Marten and Y.Schutz agreed to verify and re-derive the correct data and the normalization factors applied (at FZK for instance), and to compare the local accounting, the APEL WLCG accounting and the ALICE MonALISA accounting.