LCG Management Board

Date/Time

Tuesday 16 June 2009 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=55743

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 18.6.2009)

Participants

A.Aimar (notes), D.Barberis, I.Bird(chair), D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, I.Fisk, S.Foffano, Qin Gang, J.Gordon, F.Hernandez, M.Kasemann, H.Marten, G.Merino, A.Pace, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, R.Trompert

Invited

A.Sciabà

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 23 June 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes. The minutes of the previous MB meeting were approved.

1.2      Agenda for the LHCC Referees Meeting (Agenda)

The next Referees meeting is on 6 July 2009. Between 12:00 and 14:00 there will be the usual status reports (WLCG, Tier-0, Tier-1 and Tier-2), followed by remarks from the WLCG, a review of the computing needs of the Experiments and the CRSG report. See the proposed speakers in the Agenda.

 

D.Barberis and M.Kasemann noted that ATLAS and CMS will meet the CRSG reviewers just a few days before or after that date.

 

Ph.Charpentier noted that the reviews on 2 July could lead to last-minute changes for the 6th. It is very difficult to prepare such meetings when they happen so close together in time.

1.3      Quarterly Report Preparation

A.Aimar asked that the Experiments give the usual short presentation for the coming quarterly reports:

 

The proposed schedule is:

-       30 June: ALICE, ATLAS

-       6 July: CMS, LHCb

 

2.   Action List Review (List of actions)

  • VOBoxes SLAs:
    • CMS: Several SLAs still to approve (ASGC, IN2P3, CERN and PIC).
    • ALICE: Still to approve the SLA with NDGF. Comments exchanged with NL-T1.

CMS:
No progress for CMS.

ALICE:
NL-T1 reported that ALICE replied positively with a few comments, and NL-T1 only has to implement some minor changes.
NDGF is waiting for approval by ALICE.

  • 5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

Started, but not yet completed. L.Dell’Agnello added that CNAF is running its internal Security Challenge with the Italian ROC Security. The document sent to R.Wartel was a bit short and needs to be updated.

  • Tier-1 Site should start publishing the UserDN information.

J.Gordon will regularly report the list of Sites that publish the information. He has asked the portal developers to make an anonymized list available on the web.

 

·         9 Jun 2009 - Sites should report to the MB whether, after the GDB presentations, the situation of the data rates is now clear.

 

ALICE has not yet presented its data rates; this will be scheduled at the MB next week.

F.Hernandez and J.Gordon added that FR-IN2P3 and UK-T1-RAL are satisfied with the information received, even if the Sites may not yet be able to satisfy all the requests.

 

·         Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar

 

Not done by: DE-KIT, FR-CCIN2P3, NDGF, NL-T1, US-FNAL-CMS

Sites can provide what they have at the moment. See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics

Sites should send URLs pointing to the information they already have until they are able to provide the full required metrics.
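
As an illustration only (the URL and the XML element names below are placeholders, not an agreed schema), a minimal sketch of how such a tape-metrics XML file could be fetched and inspected:

import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical URL: each site sends its own real URL to A.Aimar.
URL = "http://tier1.example.org/tape-metrics.xml"

def dump_tape_metrics(url):
    """Fetch a site's tape-metrics XML and print every element that has a value."""
    with urllib.request.urlopen(url, timeout=30) as response:
        root = ET.fromstring(response.read())
    for element in root.iter():
        text = (element.text or "").strip()
        if text:
            print(element.tag, "=", text)

dump_tape_metrics(URL)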

 

·         Experiments should send to J.Gordon the DNs of the people who can read the user details in the CESGA Portal.

Done. J.Gordon added that the VO managers will be added by default.

 

3.   LCG Operations Weekly Report (Minutes; Post-mortem workshop agenda; Slides) – J.Shiers
 

 

Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.

All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      STEP09 Summary

Positive Aspects:

A lot of activity, in particular from ATLAS and CMS, at the Tier-0, Tier-1s and Tier-2s. Results are still to be analyzed (see the post-mortem workshop in July), but indications are encouraging. All this happened despite significant problems with many sites and services leading up to and during the last two weeks. A “step-up” in WLCG operations is essential if we are to sustain a similar service level over a longer period of time.

 

Negative Aspects

Despite encouraging results, many sites and services suffered serious outages or degradations during or leading up to the main two weeks of STEP’09. Almost everything that could go wrong did: OPN fibre cuts, major data and storage management issues, data loss, release back-out etc. Many of the rather soft WLCG operations procedures are repeatedly ignored: this is a major risk for the overall WLCG service if not addressed with priority now.

 

Sites ignore even simple procedures. For instance, at the Operations meeting FZK announced a major intervention just the day before it took place. This is not what has been agreed for years.

 

More than half of ATLAS Tier1s passed the ambitious reprocessing metric: of those that did not, biggest concerns are (high to low):

-       ASGC [ high concern ]

-       FZK

-       NL-T1 (some DMF-level tuning)

-       IN2P3: “extremely impressive” was comment from last Thursday’s minutes [ low concern ]

 

These results must be matched against expectations: we know that problems will occur, and the important thing is how we plan for them (where possible) and recover from them, including unexpected events, e.g. the CNAF tape cleaning cartridge issue, multiple OPN fibre cuts, etc.

 

Despite the very real problems affecting sites and services, all in all STEP’09 represents a real step forward in what has been achieved and a real step up in service level. We must analyse the problems transparently and thoroughly: issues of the level that continue to be seen deserve and require this.

And we must seek ways to further reduce effort – could we have continued at this level for many weeks / months on end?

 

Attendance at the operations meeting dropped off sharply after the GDB and the ATLAS announcement of their run-down.

 

Site / Service | Issue
ASGC – many issues | Relocation away from the temporary location is foreseen in the coming weeks. Need to re-test once this is done; an analysis of the key problems and their resolution would be valuable.
FZK tape access | See the STEP’09 site metric.
LHCb & conditions DB | CORAL/LFC – major performance problems. A “SIR” has been requested.
IN2P3 GridFTP transfers | Transfers aborting 03:00 – 10:30 on Wednesday. A “SIR” has been received. [ Alarm ticket ]
OPN fibre cuts | CERN IT status board contained stale information (24h?) after repairs had been made. No broadcast was made.
WLCG operations procedures | Re-discussed just days prior to the start of STEP’09; they do not participate in the GDB meetings even when requested.

 

IN2P3 GridFTP Outage

The problem occurred when the GridFTP servers were overloaded. A bug in the monitoring script disabled all GridFTP doors.

The script has been fixed so that it does not disable all doors; mid-term improvements to the monitoring are foreseen. Ticket about the GOCDB bug: 49383.
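
The actual IN2P3 script is not part of the minutes; the sketch below only illustrates the kind of safeguard described, i.e. a monitoring pass that refuses to disable the last working door. The functions check_door() and disable_door() are placeholders for the site-specific logic.

def check_door(door):
    """Placeholder: return True if the GridFTP door answers its health probe."""
    raise NotImplementedError

def disable_door(door):
    """Placeholder: remove the door from production."""
    raise NotImplementedError

def monitoring_pass(doors):
    failing = [d for d in doors if not check_door(d)]
    healthy = [d for d in doors if d not in failing]
    # Safeguard: if every door fails the probe, suspect the probe (or a global
    # overload) rather than the doors, and keep them all in production.
    if not healthy:
        print("WARNING: all GridFTP doors failed the probe; disabling nothing")
        return
    for door in failing:
        disable_door(door)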

 

 

ASGC Briefing

Some “STEP-style” re-validation of ASGC once relocated back from IDC is clearly required.

July is perhaps a little soon for this, but it should be done prior to any other multi-site/multi-VO Scale Tests…

 

It is difficult to understand the issues behind these problems and the progress being made. See the verbatim report below:

Problematic LTO4 drive fixed this Wed., after re-calibrating from console of the library. all new procured tape drives are fully functional now. majority of the activity refer to read access and from cms, try recalling data from cmsCSAtp pool.

around 500+ job pending in queue since yesterday, both for atlas and cms. the problem is clear earlier as two of the misconfigured wn add to the pool after resetting the client config yesterday evening. should now all fixed and cluster showing 67% usage.

25% reduction observed in atlas production transfer and also less than 40% MC archive found in CMS, as already brief to wlcg-operation.

congestion of backup route through CHI showing high packet loss rate earlier this afternoon, specific for inbound activity.

still we need to sort it out urgently for the t1/2 fair share issue as atlas submission to the T2 have four times the number than at T1 CEs and we merge T1/2 using the same batch pool (after relocating to IDC).

splitting the group config in scheduler and also having different group mapping in different CEs from T1/2 able to balancing the usage.

draft planning for the facility relocation 11 days later. the IDC contract end at Jun 23, and we'll have one day downtime for computing/storage facility and two days for tape library. as re-installation of IBM TS3500 linear tape system need calibrating the horizontal between all frame sets.

 

Clearer reports and summaries are required.

 

OPN Fibre Cuts

The recent outage, some 36 hours on a major fibre trunk, can be seen clearly in the weekly statistics here:

http://network-statistics.web.cern.ch/network-statistics/ext/LHCOPN-Total/?p=sc&q=LHCOPN%20Total%20Traffic

 

Note that when traffic is re-routed it is attributed to the traffic of another Tier-1. So although colours disappear from the plot, it does not mean that the Tier-1 has no traffic. The overall impact can be seen from the total aggregate traffic volume.

 

This was one of the “worst case” scenarios we had long since identified: it cut 5 GÉANT circuits, most of which were the primary connectivity paths to their respective Tier-1s. However, the use of the cross-border fibre and the associated backup paths enabled most of the traffic to find new routes and the overall traffic to continue with little impact.

So congratulations to everyone for ensuring that we have this level of resiliency. If anyone has specific comments on this event and its impact on our OPN, they are welcome to send me a mail.

 

L.Dell’Agnello noted that in spite of the OPN cuts INFN was still able to communicate with Sites it should, in principle, not have been able to reach. They could reach CERN via IN2P3 even though the backup path via FZK was also down. The network routing seems not to be well designed or implemented.

I.Bird suggested that the OPN team should also come to the Workshop for a post-mortem.

F.Hernandez noted that the unexpected alternative path should be considered a positive outcome.

 

GridPP

Report received:

We were very busy trying to learn from STEP, but in many ways it was a very calm 2 weeks. There was no out-of-hours page out until Saturday just gone (formally after the end of STEP09) and although there WERE problems that gave us a few long days we would hope these kind of issues would shake out pretty fast. On the whole, things look quite comfortable, though perhaps we were just lucky. We did, for example, have a Tier-1 away-day in the second week with just a skeleton crew left on site, so we were certainly not in "hero-mode".

3.2      GGUS Tickets and SAM Results

Only 1 real alarm ticket (ATLAS to IN2P3) sent during this week! (Plus one test from LHCb to FZK)

TEAM tickets dominate for ATLAS and LHCb

ALICE and CMS have about 5 tickets / week

ATLAS (and LHCb) close to 1 order of magnitude higher!

CMS prefers to investigate more before sending tickets to GGUS.

 

F.Hernandez stated that the CMS reports are very well done and this must be acknowledged.

 

VO | User | Team | Alarm | Total
ALICE | 6 | 0 | 0 | 6
ATLAS | 5 | 32 | 1 | 38
CMS | 4 | 0 | 0 | 4
LHCb | 0 | 26 | 1 | 27
Totals | 33 | 32 | 0 | 75

 

Slide 11 shows the SAM tests for the 4 Experiments. There are very few red days, but this does not really match the reality of the week.

Major problems at 3 Tier-1 sites are not shown. It is hard to match this week’s results with the WLCG OPS minutes: there is no clear correlation with the site problems described in slide 11.

3.3      Summary

A lot of numbers to study – some positive results but some less positive.

 

Apart from the technical lessons of STEP’09, it is clear that we could not continue (experiments, sites, services) at this level of operational stress for prolonged periods. We need to address this issue along with the technical problems: “hero mode” will not be sufficient.

 

On the other hand, this is not likely to be a 9-to-5 job for anyone in a key role anytime soon: are 12h weekdays and 8h (or less?) weekends achievable in the coming months?

 

I.Bird added that running in hero-mode can be done only for short periods and that more people should be trained. People cannot be on shift for 2 weeks.

3.4      Post-Mortem Workshop

About 40 people have registered; it would be good to have better site representation. EVO will be available.

 

I.Bird asked whether some Sites should provide more detailed post-mortems in order to explain some issues.

 

J.Gordon added that the problems of LHCb at RAL were acknowledged as an LHCb problem.

 

 

4.   CMS VO SAM Tests Review (Slides) – A.Sciabà

 

 

A.Sciabà presented the SAM results for May and explained how SAM and the dashboard are used in CMS.

4.1      May Availability

 

 

Site | Date(s) in May | Up/Down | Comments
CERN | 1 | - | No significant issues.
DE-KIT | 1-5 | up | SRM tests failed because the disk-only area, also used by the SAM tests, was filled by the CMS “backfill” jobs (fake production jobs).
ES-PIC | 19 | down | A network intervention required switching off the site services. Some SAM CE test jobs failed due to the unavailability of the CEs.
FR-CCIN2P3 | 3-5, 24 | down | Failure of the cooling system.
IT-INFN-CNAF | - | - | No significant issues.
TW-ASGC | ~all | up | Frequent timeouts in SRM tests and SRM errors, probably all related to the ‘big ID’ Oracle bug which affected CASTOR.
UK-T1-RAL | 1 | up | A SAM test job failed to run before proxy expiration. Probably related to a problem with the (shared) home directory of the cmsprd and cmssgm accounts.
UK-T1-RAL | 6-7 | up | After moving to pool accounts for the production role, the SRM tests failed with “permission denied”, as the new local user had no permission to write.
UK-T1-RAL | 18-20 | down | Hardware upgrades for the CASTOR database nodes, with an unscheduled extension. Some SRM failures shortly before and after the downtime.
UK-T1-RAL | 26-28 | down | Network intervention.
US-FNAL-CMS | 23 | up | Timeouts of SRM tests and SRM unresponsive. Reason unknown.

4.2      Downtimes and Reliability

It is evident from a comparison of the GridView plots and GOCDB that both scheduled and unscheduled downtimes appear in yellow, i.e. both are treated as scheduled in the reliability calculation. The effect is an overestimation of the reliability. The problem has been reported to the GridView developers.
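
For reference, and assuming the standard WLCG/GridView definitions (availability is computed over the full period, while scheduled downtime is excluded from the reliability denominator):

\[
\text{Availability} = \frac{T_{\text{up}}}{T_{\text{total}}},
\qquad
\text{Reliability} = \frac{T_{\text{up}}}{T_{\text{total}} - T_{\text{sched.\,down}}}
\]

Under these definitions, counting an unscheduled downtime as scheduled removes it from the reliability denominator, which is why the reported reliability comes out too high.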

 

 

 

In GridView, FNAL does not yet include the “CE service” information (missing from the GridView plot).

-       Availability and reliability overestimated

-       But the individual CE instances are shown

 

 

4.3      Criticality in the SAM Tests and in the Dashboard

The CMS SAM tests have two criticalities:

-       in FCR, affecting the “WLCG” CMS availability as shown by GridView

-       in the Dashboard, affecting the “CMS” CMS availability as shown in CMS internal meetings

 

Why two different criticalities?

-       For WLCG only tests whose failure is correlated to the site itself (e.g. SRM problems, CE problems)

-       For CMS also tests whose failure is correlated to CMS (e.g. local CMS, configuration files, lack of needed CMSSW releases, etc.)

 

CMS cannot “blame” sites in a WLCG context for CMS mistakes.

For instance, for SAM only one CE test is critical, while all of them are critical for the CMS Dashboard.
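
As an illustration only (a hedged sketch, not the actual FCR or Dashboard code; the test results are invented), the same SAM results give two different availability verdicts depending on which tests are treated as critical, as in the tables that follow:

# Hypothetical SAM results for one CE: True means the test passed.
results = {
    "CE-sft-job": True,
    "CE-cms-prod": True,
    "CE-cms-swinst": False,   # e.g. a missing CMSSW release: a CMS-side issue
    "CE-cms-squid": True,
}

# Subsets of critical tests, following the tables below.
CRITICAL_FOR_WLCG = {"CE-sft-job"}
CRITICAL_FOR_CMS = {"CE-sft-job", "CE-cms-prod", "CE-cms-swinst", "CE-cms-squid"}

def available(results, critical_tests):
    """A service is counted as available only if every critical test passed."""
    return all(results.get(test, False) for test in critical_tests)

print("WLCG view:", available(results, CRITICAL_FOR_WLCG))  # True: the site looks fine
print("CMS view: ", available(results, CRITICAL_FOR_CMS))   # False: a CMS-side problem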

 

Test name | Function | Critical for WLCG | Critical for CMS
CE-sft-job | Tests job submission via WMS with the lcgadmin role | Y | Y
CE-cms-prod | Tests job submission via WMS with the production role | N | Y
CE-cms-basic | Tests the CMS local site configuration in the software area | N | Y
CE-cms-swinst | Tests that all needed CMSSW releases are correctly installed and published | N | Y
CE-cms-mc | Tests that the local file stage-out mechanism works | N | Y
CE-cms-squid | Tests that the local Squid server works | N | Y
CE-cms-frontier | Tests that calibration data can be downloaded from Frontier | N | Y
CE-cms-analysis | Tests that local data can be read by jobs | N | Y

 

The same holds for the SRMv2 tests: only 2 tests are critical for SAM.

 

Test name | Function | Critical for WLCG | Critical for CMS
SRMv2-get-pfn-from-tfc | Tests the LFN-to-PFN resolution mechanism for the endpoint | N | Y
SRMv2-lcg-cp | Tests that a file can be copied back and forth (and deleted) between the UI and the endpoint | Y | Y
SRMv2-lcg-ls | Tests that lcg-ls gives the correct output for a remote file | N | N
SRMv2-lcg-ls-dir | Tests that lcg-ls gives the correct output for a remote directory | N | N
SRMv2-lcg-gt | Tests that a good gsiftp TURL can be retrieved for a remote file | N | N
SRMv2-lcg-gt-rm-gt | As the former, but additionally checks that a TURL cannot be retrieved after the file is removed | N | N
SRMv2-user | Tests that a user can write in the local /store/user area (in the CMS namespace) | N | N

 

All details are available at https://twiki.cern.ch/twiki/bin/view/CMS/SAMForCMS

For questions, contact Andrea Sciabà or Nicolò Magini.

 

I.Bird asked whether the CMS tests are also useful for the Sites and whether they should be made available to them (via Nagios, for instance).

A.Sciabà replied that this could be implemented.

 

G.Merino asked that the Experiments provide standard Nagios messages without special Experiment add-ons, and that Sites not be classified with “site readiness” values when the problem does not really depend on the Site.

 

 

5.   June GDB Issues (SL5, etc) (Slides) – J.Gordon

 

 

5.1      SL5

The proposal is to define a single metapackage requiring all the compatibility libraries in use (see the sketch after this list):

-       It will be produced both for SL4 and SL5

-       It will automatically install what is required on the nodes

-       Compliance will be published using a software tag
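
As an illustration only (the real library list and software tag are defined by the agreed metapackage, not here), a site or experiment could sanity-check a worker node before publishing the corresponding tag along these lines:

import ctypes.util

# Placeholder names: the actual compatibility libraries come from the metapackage.
REQUIRED_LIBS = ["ssl", "crypto", "xml2", "z"]
SOFTWARE_TAG = "VO-ops-sl5-compat"   # hypothetical tag name

missing = [lib for lib in REQUIRED_LIBS if ctypes.util.find_library(lib) is None]

if missing:
    print("Missing libraries, tag not published:", ", ".join(missing))
else:
    print("All libraries present; the site can publish the tag", SOFTWARE_TAG)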

 

It was also agreed that:

-       the metapackage should be tested outside the grid as soon as possible by the Experiments

-       the metapackage should be made available to the Sites for testing

Both should happen before the next GDB.

 

Sites should provide some CEs that point to SL5 WNs so that the Experiments can start sending jobs to both SL4 and SL5 and compare the behaviour at all Sites.

 

The Experiments decided to ship the standard GNU GCC 4.3 compiler in the SW area.

There were no objections to disabling ‘allow_execheap’ in SELinux.

 

 

For the July GDB it was agreed that:

-       Experiments test the metapackage and the GNU 4.3 compiler

-       Sites test the metapackage

-       Experiments agree a date for the LXPlus migration

-       Everyone agrees a timetable for the site migration

-       Otherwise this will not happen before data-taking.

5.2      SCAS/gLExec

SCAS testing has been extended to several sites and two experiments (LHCb, ATLAS).

Pilot job environment: a series of scripts is available to initialise the environment for each payload.

gLExec installation on the WN: the proposal is to install it as an infrequently changing part of the OS. Any installation is unsatisfactory to CC-IN2P3. OSG has a working shared installation.

 

 

6.   High Level Milestones (PDF) – A.Aimar

 

 

 

Postponed to next meeting.

 

 

7.   AOB

 

 

7.1      STEP09 Press Release

J.Shiers announced that a coordinated press release with the Tier-1 Sites could be done.

A draft of the CERN release is available. Sites should contact J.Shiers or I.Bird and provide the contact persons to whom the draft should be sent.

 

J.Gordon noted that the names should already be known from the past.

I.Bird asked that Sites send the names again in order to make sure that they are still valid.

 

 

8.    Summary of New Actions