LCG Management Board

Date/Time

Tuesday 26 May 2009 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=55740

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 31.5.2009)

Participants

A.Aimar (notes), D.Barberis, T.Bell, I.Bird(chair), K.Bos, F.Carminati, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, Qin Gang, J.Gordon, F.Hernandez, M.Lamanna, P.McBride, G.Merino, A.Pace, H.Renshall, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 9 June 2009 16:00-17:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes. The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions) 
 

  • VOBoxes SLAs:
    • CMS: Several SLAs still to approve (ASGC, IN2P3, CERN and PIC).
    • ALICE: Still to approve the SLA with NDGF. Comments exchanged with NL-T1.

CMS:
No progress for CMS.

ALICE:
J.Templon added that ALICE replied positively with a few comments and NL-T1 has only to implement some minor changes.
NDGF is waiting for feedback from ALICE. F.Carminati added that the SLA still needs to be approved by ALICE.

  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. The information is best provided in the form of the Experiments' data flow documents (Dataflow from LHCb).

Done. The Experiments presented their dataflow and rates at May’s GDB.

New Action (asked by J.Gordon):

9 Jun 2009 - Sites should report to the MB whether now, after the GDB presentations, the situation of the data rates is clear.

  • M.Schulz will summarize the situation of the User Analysis WG in an email to the WLCG MB.

Not done.

  • 5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

Started, but not yet completed.
L.Dell’Agnello added that the discussion is still on-going between CNAF and the Security team.

  • 26 May 2009 - ASGC should distribute a detailed plan on setting up the Tier-1 systems

Done. ASGC sent the necessary information to I.Bird.

  • 26 May 2009 - Sites reply which MSS information they can already provide in XML, and whether it can be reported by Experiment.

Done. Later in the agenda.

  • 26 May 2009 - A.Aimar will see how to extract the information about scheduled downtimes from GOCDB and report it to the MB every month.

Done. Scheduled downtimes are registered in GOCDB and A.Aimar will extract the downtime information from GOCDB.

  • Experiments will have to explain their VO SAM tests and the tests that are reported in their dashboards.

 

The difference arises because the VO SAM critical tests only detect failures due to the Sites and the Services they provide, whereas the dashboard tests include all errors, including those caused by the applications.
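As a purely hypothetical illustration of why the two availability numbers differ (the data layout, field names and test names below are made up, not the actual SAM or dashboard schema), the sketch shows the same set of test results evaluated once counting only Site-caused failures and once counting every failure:

```python
#!/usr/bin/env python
# Hypothetical illustration: the SAM critical view only counts failures
# attributed to the Site or its Services, while the dashboard view counts
# every failure, including application errors.
results = [
    {"test": "srm-put",        "ok": False, "cause": "site"},
    {"test": "job-submission", "ok": True,  "cause": None},
    {"test": "analysis-app",   "ok": False, "cause": "application"},
    {"test": "lfc-ls",         "ok": True,  "cause": None},
]

def availability(tests, counts_as_failure):
    # A test lowers availability only if it failed AND the failure counts.
    failed = sum(1 for t in tests if not t["ok"] and counts_as_failure(t))
    return 100.0 * (len(tests) - failed) / len(tests)

# SAM critical view: only Site/Service failures count -> 75%
print("SAM critical availability: %.0f%%" % availability(results, lambda t: t["cause"] == "site"))
# Dashboard view: every failure counts -> 50%
print("Dashboard availability:    %.0f%%" % availability(results, lambda t: True))
```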

 

New Action:
A.Aimar will schedule a presentation at the F2F meeting in June, where the Experiments will explain their VO SAM tests and the tests reported in their dashboards.

On User Accounting:

  • Tier-1 Sites should start publishing the UserDN information.

J.Gordon will regularly report the list of Sites that publish the information. He has asked the portal developers to provide an anonymous list.

 

J.Gordon noted that the VO manager is often taken from the VO Cards but is incorrect for this purpose; for now a new role is needed in the CESGA portal.

 

New Action:
Experiments should send to J.Gordon the DNs of the people that can read the details of the users in the CESGA Portal.

  • Countries should comment on the policy document on user information accounting

Removed. This is more a GDB issue.

  • The CESGA production Portal should be verified. J.Gordon will check and resend to the MB the information on which portal to use (production or pre-production).
    • Tier-1 Sites and main Tier-2 Sites buy the license for the benchmark.

Removed. The MB will follow only the Tier-1 Sites.

New Action:
Each Tier-1 Site should report the status of HEPSPEC benchmarking.

    • A web site at CERN should be set up to store the values from WLCG Sites.

To be done. 

 

3.   LCG Operations Weekly Report (Slides) – H.Renshall

Summary of status and progress of the LCG Operations. It covers the activities since the last MB meeting.

The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

This report covers the service for the two-week period 10 to 23 May.

 

The GGUS ticket rate was normal and one alarm ticket was issued. The underlying incident affected more than CMS, but CMS was affected the most.

-       19 May: There was a GPN network connectivity failure to CERN from just before 10.00 to just after 11.00, due to a GEANT reconfiguration which cut off CERN addresses. CMS started losing production jobs and raised a CERN alarm ticket which, as per procedure, was not escalated to the data operations piquet because it did not concern the CASTOR, FTS, LFC or SRM services. In this case the ticket only arrived when the GPN came back. How to handle such cases is to be followed up.

 

Three Central Service incidents occurred (Slides 13 and 14 provide the report on the Central Services outages):

-       Failure to publish SAM test results for several hours after the above WAN incident

-       Lost batch jobs after Linux upgrade

-       SRM upgrade rolled back after causing multi-file ‘get’ requests to fail.

 

One incident led to a SIR being submitted to WLCG:

-       PIC, 14 May: a cooling failure caused 5 hours of downtime. Slides 9 and 10 provide the report on the PIC cooling incident.

 

Below is the GGUS summary for the last 2 weeks.

 

VO concerned    USER    TEAM    ALARM    TOTAL
ALICE              1       0        0        1
ATLAS             46      64        0      110
CMS               18       0        1       19
LHCb               5      46        0       51
Totals            70     110        1      181

3.2      Experiments Site Availability Results

The availability plots on slide 4 show that RAL’s scheduled downtime (upgrade of Oracle backend) was seen by all Experiments.

 

J.Gordon clarified that RAL had forgotten to schedule the draining of the batch cluster, and this caused the unscheduled downtime seen in the VO results; the draining was in fact planned, but took longer than expected.

 

Note: the CNAF downtime for LHCb differs when extracted alone or together with the other VOs; this may be a bug in the reporting.

 

Ph.Charpentier asked that the name of the VO appear on the matrices on slide 4.

 

ALICE:

Smooth production running.

 

ATLAS:

ASGC: ASGC is now considered fully functional after the local router hardware problem was resolved. ASGC is still suffering from the Oracle BigID bug; a workaround is in preparation.

 

A.Pace asked whether the BigID problem had manifested itself in the period.

H.Renshall replied that the problem showed up in ASGC.

 

PIC: The LFC server at PIC became overloaded during a centrally run file deletion operation, although this rate of deletion had been reached before. The cause may be some strange LFC entries; PIC will perform a manual deletion.

 

G.Merino clarified that the problem was not an overload but a corrupted entry in the catalogue. The issue is understood and fixed: it was initially thought to be a DoS, but was actually a problem with the LFC data.

 

Experiments Contacts: This happened at the weekend and it was not clear at PIC whom to contact in ATLAS to exchange information and plan actions. ATLAS are suggesting a new “sites to experiment” mailing list that would have restricted posting (20 names) and end up sending an SMS to the ATLAS experts on duty.

 

CMS:

The GPN-to-CERN failure on 19 May stopped PhEDEx transfers and caused the loss of some batch jobs.

 

LHCb:

FZK: The issue of SRM not returning TURLs at FZK is now understood: DIRAC does not always set the ‘done’ state after a get request, causing this instability.

 

CNAF: A host certificate expired on an LHCb StoRM disk server. The ticket routing was not optimal and the issue will have taken 5 days to fix.

3.3      Experiments Activities

ATLAS:

Preparing for STEP’09 by distributing cosmics raw data to Tier-1 Sites (for reprocessing; all finished except ASGC), AOD (for T1-T1 tests and analysis at T2) and DPD (for analysis at T2 sites). All clouds are involved. Ramp-up of STEP’09 starts from 2 June.

 

LHCb

Week of 090511 was a FEST week.

 

CMS

Good rate of ticket resolution in recent weeks; many Tier-2 Sites are now able to reach 80% availability.

STEP’09 planning meeting was held with pre-staging as the main subject. Small-scale pre-staging tests at Tier-1 Sites are to be run this week.

CRUZET (Cosmics at Zero Tesla) run at CERN next week.

3.4      SIR Reports

PIC Cooling Incident

Slides 9 and 10 provide the report on the PIC cooling incident.

 

dCache at NL-T1

NL-T1 reported this week that their 20 May upgrade of dCache to version 1.9.2-5 has been causing problems, first with Java virtual machine memory issues and then with hung servers. They are in touch with the dCache developers but are preparing to downgrade to their previous 1.9.0-10 release in order to participate in STEP’09.

 

J.Templon added that the problems should be fixed in 1.9.2-6.

 

CASTOR at CERN

Last major CASTOR upgrade to 2.1.8-7 for LHCb this week (Wednesday).

Minor (should be transparent) upgrades to 2.1.8-8 for ALICE, ATLAS and c2cernt3 (CMS and ATLAS analysis stager) also Wednesday.

 

Central Services Outages

Slides 13 and 14 provide the report on the Central Services outages.

 

J.Gordon asked why the major problems that occurred with Indico were not reported in this weekly report.

I.Bird agreed that problems like these must be fixed, but he also remarked that Indico cannot be considered a service “crucial for data taking” and therefore covered by WLCG operations. Nevertheless, Indico should clearly be a fault-tolerant service and should not be unavailable for more than one day.

Ph.Charpentier noted that such a long delay, during the LHCb Week, was a “disaster”.

 

4.   Progress/Status on using SLS for Tapes Metrics (Sites Answers; Instructions.pdf) – A.Aimar

 

A.Aimar, following the previous discussion at the F2F MB meeting, proposed to start from the Tape Metrics collected at CERN.

 

DATA READ: data volume read in GB
DATA READ: number of files transferred
DATA READ: average file size in MB
DATA WRITE: data volume written in GB
DATA WRITE: number of files transferred
DATA WRITE: average file size in MB

RATE READ: read transfer rate incl. drive overhead in MB/sec
RATE READ: drive read transfer rate in MB/sec
RATE WRITE: write transfer rate incl. drive overhead in MB/sec
RATE WRITE: drive write transfer rate in MB/sec

TAPE MOUNT: successful mounts in last 4hrs
TAPE MOUNT: failed mounts in last 4hrs
TAPE MOUNT: read mounts in last 4hrs
TAPE MOUNT: write mounts in last 4hrs
TAPE MOUNT: average tape mount time in secs

TAPE QUEUES: average wait for read in secs
TAPE QUEUES: average wait for write in secs

FILES PER MOUNT: read average
FILES PER MOUNT: write average

TAPE VOLUME: for <VO> in TB
TAPE REPEAT MOUNT: read average in last 24hrs
TAPE REPEAT MOUNT: write average in last 24hrs
TAPE FRAGMENTATION: percentage full level for <VO> tapes

 

During the week he asked the Sites to reply:

-       whether they can generate such metrics,

-       at which frequency, and

-       whether as global or “by VO” values.

 

The answers received are summarized in the table below.

 

SITE                               Tape Metrics
CERN, CA-TRIUMF, DE-KIT, ES-PIC    OK
FR-CCIN2P3                         -
IT-INFN-CNAF                       Yes, except: TAPE QUEUES (average wait for read/write in secs) and
                                   TAPE FRAGMENTATION (percentage full level for <VO> tapes)
NDGF                               -
NL-T1                              -
TW-ASGC                            ?
UK-T1_RAL                          Yes, but not now. Work is needed before publishing can start; this is
                                   unlikely to be complete before STEP09 but should be done over the summer.
US-FNAL-CMS                        -
US-T1-BNL                          Yes, except: TAPE MOUNT failed mounts in last 4hrs (the information
                                   exists but needs to be manually extracted for now)

 

J.Gordon added that RAL agrees with the metrics but cannot generate them in time for STEP09; they should be available over the summer.

 

Each Site commented on the metrics at the meeting:

-       CERN, TRIUMF, FZK and PIC confirmed that they can generate the values.

-       FR-CCIN2P3: The experts at IN2P3 will need time to generate the information. In their opinion too many metrics are requested; the main question should be limited to whether the Sites can deliver the agreed rates.

 

I.Bird commented that, since it had been impossible to obtain from the Sites which metrics to collect, there is now a proposal and Sites should try to collect as much of it as possible.

K.Bos added that all metrics are useful to understand an issue. The most important metrics are the first 10 in the list above.

 

G.Merino pointed out that, at the previous meeting, the Sites had asked which metrics were more important for the Experiments, in order to start only from a reduced set.

A.Aimar replied that the conclusion of the discussion was that the Experiments asked for the CERN list as the starting set, and Sites should produce as many of those metrics as possible. Selecting which metrics are more important would only have meant spending one more week before starting.

 

-       IT-INFN-CNAF: Can provide all metrics except fragmentation.

-       NDGF: The person who is going to work on this is at HEPiX, but it should not be a problem to produce the metrics by next week.

-       NL-T1: They will be able to generate all values except fragmentation.

-       TW-ASGC: They will be able to produce most of the values.

-       UK-T1-RAL: Will generate the metrics, but cannot provide them in this period.

 

I.Bird asked why Sites using CASTOR cannot share their scripts collecting metrics. This should be verified outside the meeting.

 

-       US-FNAL-CMS: They will provide the metrics needed.

-       US-T1-BNL: They will provide the metrics, except the failed tape mounts (to be extracted manually for now).

 

A.Aimar also provided the instructions and the explanation of the XML files that each Site should generate. See Instructions.pdf
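As a purely illustrative sketch (not the official format), the Python fragment below shows how a Site might assemble such an XML file from locally collected values. The element and attribute names (serviceupdate, numericvalue, etc.), the metric names and the file name are assumptions made for illustration only; the authoritative layout remains the one described in Instructions.pdf.

```python
#!/usr/bin/env python
# Illustrative sketch only: write tape metrics into an SLS-style XML file.
# Element/attribute names and metric names are assumptions; the authoritative
# format is the one given in Instructions.pdf.
import time
import xml.etree.ElementTree as ET

def write_tape_metrics(vo, metrics, filename):
    """metrics: dict mapping a metric name to a (value, unit) pair,
    e.g. as produced by the Site's local MSS monitoring scripts."""
    root = ET.Element("serviceupdate")
    ET.SubElement(root, "id").text = "Tape_Metrics_%s" % vo
    ET.SubElement(root, "timestamp").text = time.strftime("%Y-%m-%dT%H:%M:%S")
    data = ET.SubElement(root, "data")
    for name, (value, unit) in sorted(metrics.items()):
        # One numeric value per metric; the unit (and collection period)
        # is recorded in the description.
        item = ET.SubElement(data, "numericvalue", name=name,
                             desc="unit: %s" % unit)
        item.text = str(value)
    ET.ElementTree(root).write(filename)

if __name__ == "__main__":
    sample = {
        "data_read_volume":        (1234.5, "GB"),
        "rate_read_incl_overhead": (87.0,   "MB/sec"),
        "tape_mounts_successful":  (42,     "mounts in last 4hrs"),
    }
    write_tape_metrics("ATLAS", sample, "tape_metrics_atlas.xml")
```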

 

F.Hernandez asked that, for each metric, the unit and the time period over which it was collected should be specified.

 

New Action:

Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar.

 

5.   Summary of the SL5 Discussions at the GDB (Slides) – J.Gordon

 

J.Gordon summarized the SL5 discussions that took place at the GDB, where there was not complete agreement across Sites and Experiments.

 

At the May GDB, the issues around a migration to SL5 on Worker Nodes at all Sites were discussed.

5.1      Experiments and SL5

From the GDB summary by P.Mato, the main highlights are:

-       ALICE, CMS, and LHCb expect to run SL4 binaries smoothly in 32-bit mode on SL5/64-bit systems.

-       ALICE and CMS are ready to move now.

-       LHCb have some work to do, i.e. sorting out the libraries required but this shouldn't take long.

-       ATLAS requires the SELinux executable-heap protection (allow_exec heap) to be disabled in order to run in compatibility mode until September. They proposed using SL4 for the duration of the data run.

-       Experiments would like the changeover to happen in as short a time as possible, once the move is agreed.

5.2      Sites and SL5

Sites need a clear definition of what base OS install is required, including the SL4 compatibility libraries that are needed.

Any packages required which are not present in the operating system will need to be identified and supplied through an additional repository.

 

They will also need a location for the approved version of the compiler and associated toolsets. gcc 4.3 is now in SL5 but there are questions over compatibility with other distributions of gcc 4.3.

 

Sites were asked for feedback on whether they are prepared to install SL5 with the SELinux feature disabled (as above) for ATLAS.

 

Sites were requested to make SL5 WNs available behind a separate CE so that VOs could consciously choose to use them. This will also enable the Sites to move WNs easily between SL4 and SL5 to meet the changing requirements of the Experiments.

5.3      Memory Footprint

ATLAS showed evidence that the memory footprint of a job is much bigger in 64-bit than in 32-bit mode. Each process has a VMEM footprint of 50 MB in 64-bit versus 5 MB in 32-bit, so a ‘Hello World’ job is now 450 MB compared to 60 MB in 32-bit.

 

This has particular impact at Sites which limit the virtual memory (VMEM) of jobs. Such sites should be identified and approached.

Killing jobs based on RSS is a problem too as the kernel strives to keep pages in memory. Does anyone know how to change this?

 

M.Schulz added that there are Sites investigating how to avoid such an increase of the memory footprint in 64-bit mode.

5.4      Timescale

One now needs to agree on a timescale. It was not felt feasible to expect everyone to change before STEP09 because the SL5 preparation is starting now.

 

The September 2009 timeline of ATLAS was felt to be too late. J.Gordon proposed July as a possible transition time. This timescale should be confirmed by the MB.

 

I.Fisk noted that FNAL, after consulting with CMS, moved all their WNs to SL5 two weeks ago. They run in SL4-compatible mode.

J.Gordon noted that ATLAS instead has some problems at the test Site that was set up in Oxford.

 

Who will define and publicise the SL5 plans? Which base install is going to be taken and which gcc 4.3 compiler?

O.Keeble proposed at the GDB that SA3 make a meta-package, but the Architects Forum must define its exact contents, including the exact compiler.

 

M.Schulz noted that the gcc 4.3 coming with SL5 (i.e. from Red Hat) is not working properly; the correct gcc 4.3, used by the Experiments, comes instead from GNU. Several options need to be evaluated before launching the SL5 migration. In addition, some Sites are buying hardware that cannot run SL4 and have installed software that is not yet ported (e.g. the Site BDII). This should be avoided by the Sites: they must remember that SL5 is still an unsupported platform for several WLCG services.

 

Ph.Charpentier added that many user desktops are used for compiling and therefore users should have the right compiler on their desktop. A complete solution must be provided.

 

I.Bird stated that the current discussions should be limited to SL5 for the WNs, not for the Services.

 

D.Barberis added that the main capacity at the Sites must stay on SL4 until all problems on SL5 are solved. Separate queues should clearly allow the Experiments to choose the right OS. ATLAS really needs the SELinux feature disabled, as discussed above.

 

I.Fisk noted that if Sites also ship the compiler this is not an issue anymore, unless there are other settings that can cause problems (the SELinux option and kernel parameters).

 

Ph.Charpentier noted that the Experiments could bring their own gcc compiler if the default is not adequate.

 

I.Bird suggested that the AF come up with a full proposal on which compiler to use.

J.Gordon added that he will ask that a clear proposal from the AF be presented at the next GDB.

 

6.   GDB Topics: gLExec, data rates (Slides) – J.Gordon

 

Only the gLExec issue was discussed at the MB.

 

Now that SCAS has been deployed at a few sites, the testing of the gLExec/SCAS combination has moved forward. It has revealed the following issues:

 

Shared File System Issue

gLExec is required to be installed on every Worker Node. Many Sites make the middleware available via a shared file system and are reluctant to make an exception for gLExec. It seems there is a solution for this problem.

 

Environment and Security Exposure:

The process created under the new identity for a new payload does not inherit the environment. While there are security exposures in inheriting too much, too little makes the user jump through hoops to recreate both the site and experiment parts of the environment.

 

Ph.Charpentier asked what the security exposure is in this case.

J.Gordon answered that, if the environment were inherited, a job could use values left by previous users; on the other hand a “setuid” will not inherit the path, for instance, so each job will have to set up its environment completely.
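As a minimal illustration of this point (not gLExec’s actual implementation), the sketch below shows a wrapper starting a payload with a scrubbed environment and explicitly re-exporting the pieces the site and the experiment expect; the variable names and paths are hypothetical.

```python
#!/usr/bin/env python
# Illustrative sketch, not gLExec itself: after an identity switch the payload
# starts with an empty environment, so the wrapper must rebuild it explicitly.
# All variable names and paths below are hypothetical examples.
import os
import subprocess

# Variables the pilot chooses to hand over to the payload; nothing is
# inherited automatically, so they are whitelisted by hand.
WHITELIST = ["X509_USER_PROXY", "VO_LHCB_SW_DIR", "TMPDIR"]

payload_env = {
    # Even a basic PATH is not inherited after the identity switch.
    "PATH": "/usr/local/bin:/usr/bin:/bin",
    "HOME": "/tmp",
}
for name in WHITELIST:
    if name in os.environ:
        payload_env[name] = os.environ[name]

# Run the user payload with only the reconstructed environment.
subprocess.call(["/bin/sh", "payload.sh"], env=payload_env)
```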

 

F.Hernandez added that Sites should be able to put whatever they want in the jobs’ environment; this should not be discarded, and gLExec should not do that.

 

Ph.Charpentier proposed that the usage and usefulness of gLExec be clarified outside the meeting.

 

New Action:

M.Schulz will come back to the MB with an explanation of the security issues and next steps with gLExec.

 

 

7.   Update on the WLCG Technical Forum (Slides) – I.Bird

 

MB Members should send feedback to I.Bird about the proposal described in the slides and about membership.

 

New Action:

Experiments and Sites should comment on the TF proposal and express suggestions about membership to the Forum.

 

8.   AOB

 

 

No AOB

 

9.    Summary of New Actions

 

 

New Action:

9 Jun 2009 - Sites should report to the MB whether now, after the GDB presentations, the situation of the data rates is clear.

 

New Action:
Experiments should send to J.Gordon the DNs of the people that can read the details of the users in the CESGA Portal.

New Action:
Each Tier-1 Site should report the status of HEPSPEC benchmarking.

New Action:
A.Aimar will schedule a presentation at the F2F meeting in June, where the Experiments will explain their VO SAM tests and the tests reported in their dashboards.

 

New Action:

Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar.

 

New Action:

M.Schulz will come back to the MB with an explanation of the security issues and next steps with gLExec.

 

New Action:

Experiments and Sites should comment on the TF proposal and express suggestions about membership to the Forum.