WLCG Management Board

Date/Time

Tuesday 21 April 2009 – MB Meeting – 16:00-17:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=55735

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 26.4.2009)

Participants

A.Aimar (notes), O.Barring, I.Bird (chair), K.Bos, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Lamanna, G.Merino, A.Pace, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, R.Tafirout, J.Templon

Invited

R.Wartel

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 28 April 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters Arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions) 

 

  • VOBoxes SLAs:
    • CMS: Several SLAs still to approve (ASGC, IN2P3, CERN and PIC).
    • ALICE: Still to approve the SLA with NDGF. Comments exchanged with NL-T1.
  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. The information is best provided in the form of the LHCb dataflow document (Dataflow from LHCb).

The Experiments agreed to present their dataflow and rates at May’s GDB.

  • M.Schulz will summarize the situation of the User Analysis WG in an email to the WLCG MB.

Not done.

  • I.Bird will look for the chairperson and will also distribute a proposal for the mandate of the WLCG Technical Forum, reminding the MB of the GDB’s mandate.

Will be for next week.

  • 14 Apr 2009 - CNAF reports on how they plan to handle security incident reports and periodic tests.

L.Dell’Agnello summarized the phone call with R.Wartel discussing CNAF failing the recent test alarms. The decision is that CNAF’s local operators and grid managers are going to be trained in order to cover future grid incidents. A written procedure is going to be made available to the people involved and CNAF will internally simulate the test alarms.

I.Bird asked whether the Security team intends to re-run the challenges.
R.Wartel replied that a new campaign will be launched later in 2009, but not in the coming months. He added that in particular CNAF needs clearer contact points for security issues and internal grid security procedures.

L.Dell’Agnello stated that in 2 weeks a document will explain the changes that CNAF is going to implement. In addition one new security expert will also be involved in grid security issues.
I.Bird proposed that an action be added at the MB level.

New Action
5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

On User Accounting:

  • Tier-1 Sites should start publishing the UserDN information.

I.Bird asked that the APEL portal should have this information available.
J.Gordon replied that it should already be possible to see who is publishing the information.

  • Countries should comment on the policy document on user information accounting
  • The CESGA production Portal should be verified. J.Gordon will check and send the information again to the MB on which portal to use (prod or pre-prod).
  • J.Gordon and R.Pordes will write a requirements document on user information accounting.

Distributed to the MB list.

  • Actions for moving to the new CPU unit
    • Convert the current requirements to the new unit.

Done by S.Foffano

    • Tier-1 Sites and main Tier-2 Sites buy the license for the benchmark and benchmark their equipment with the new unit.

 

    • A web site at CERN should be set up to store the CPU values from WLCG Sites.

To be done.

    • A group to prepare the plan of the migration regarding the CPU power published by Sites through the Information System.

 

    • Pledges and Requirements need to be updated.

 

3.   LCG Operations Weekly Report (Slides) – O.Barring

 

Summary of the status and progress of the LCG Operations since the last MB meeting. The daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      GGUS Tickets

The GGUS ticket rate is back to normal. A few unscheduled interventions occurred during this period, but no serious events that triggered Service Incident Reports.

 

Once again Sites (e.g. IN2P3) stressed that Experiments should submit their requests only via GGUS tickets.

 

VO concerned    USER    TEAM    ALARM    TOTAL
ALICE              1       0        0        1
ATLAS             20      25        1       46
CMS                3       0        0        3
LHCb               8       3        1       12
Totals            32      28        2       62

 

Two alarm tickets were opened this week:

-       ATLAS alarm to FZK: (2009-04-16 18:07) The disk buffer in front of ATLASMCTAPE at FZK was full and could not catch up with the incoming rate. Details in slides 4 and 5.
It was initially reported as a T1 to T2 transfer problem, which caused confusion because such a problem would not merit an alarm ticket for FZK. However, it stopped the ATLAS data production jobs and was therefore considered an alarm by ATLAS.

 

A.Heiss asked whether that ATLAS problem should have been an alarm or not. Maybe the WLCG could discuss it at the GDB. A T2-T1 transfer, as initially thought, should not have raised an alarm.

 

K.Bos noted that it is the Experiment that knows whether an issue is worth an alarm or not. Even a data transfer or MC production can be crucial, and the Experiment should judge the merit of the incident. A general clarification at the GDB would be useful.

 

-       LHCb alarm to FZK: (2009-04-11 18:36) All LHCb jobs to ce-2-fzk.gridka.de failing. Details in slide 6.
Apparently a shared library was not found and all jobs were failing. LHCb almost banned the Site at some point.
The solution was simply to add the missing library at the required location.

3.2      Sites Availability

Slide 7 shows the Tier-1 SAM availability from the Experiments’ perspective.

One can see that LHCb (bottom-right matrix in slide 7) had problems with a few Sites.

 

LHCb reported the problems during the Operations meetings:

-       WMS submission failures were traced to problems with the short CRL for certificates created by the CERN CA (thanks to M.Jouvin, GRIF). Fixed

-       The CNAF BDII was publishing wrong information, making match-making impossible. Fixed

-       CERN CVS system failing: Reason unclear. Fixed?

-       The job failures yesterday at NIKHEF and IN2P3 are now explained by the pre-pended "root:" string in the returned tURL.

Ph.Charpentier noted that this issue is still unclear and is not due to this reason.

-       The problem of jobs crashing while accessing the LFC@CERN is still under investigation, but it seems that the thread pool in the LFC becomes exhausted due to the way CORAL is accessing it. Understood?

Ph.Charpentier noted that it seems to be due to the “suboptimal” access to the LFC by CORAL.

3.3      Sites Issues and News

There was some confusion about the effect of using ‘At Risk’ for transparent interventions at the Sites: it is NOT counted as site downtime, and Sites should be aware of this.

 

BNL: degraded efficiency due to a large number of tape staging requests from the ATLAS production tasks (pile-up and HITS merging); this caused a high load on the dCache/pnfs server, resulting in an unacceptably high failure rate for DDM transfers.

 

CERN: Good news on Castor Oracle BIGID problem. From https://savannah.cern.ch/support/?106879

 

After joint work with Sebastien and excellent feedback from some people from Oracle, including Oracle development, it now looks clear that the problem is linked to the usage of "DML returning" statements accessed from OCCI. It works for a single row, but with different types and combinations of single-row / multiple-row statements it can fail and lead to issues like the Big Id one.

Oracle has opened a documentation bug (public and accessible with a Metalink account) about the issue: “OCCI does not support 'returning into'…”

 

But for the time being a workaround must be found by the CASTOR team.
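
For illustration only, a minimal sketch of the "DML returning" pattern at issue when accessed through OCCI; the table, column and connection details below are invented, and this is not the CASTOR code:

    // Hypothetical example: single-row DML with a RETURNING clause via OCCI.
    #include <occi.h>
    #include <iostream>
    using namespace oracle::occi;

    int main() {
        Environment* env  = Environment::createEnvironment();
        Connection*  conn = env->createConnection("user", "password", "db");

        Statement* stmt = conn->createStatement(
            "INSERT INTO id_table (name) VALUES (:1) RETURNING id INTO :2");
        stmt->setString(1, "file1");
        stmt->registerOutParam(2, OCCIINT);   // OUT bind receiving the returned id
        stmt->executeUpdate();
        std::cout << "new id = " << stmt->getInt(2) << std::endl;
        // Per the reports above, this works for a single row, but other types or
        // single-row / multi-row combinations can misbehave (the Big Id issue),
        // hence the need for a workaround in CASTOR.

        conn->terminateStatement(stmt);
        env->terminateConnection(conn);
        Environment::terminateEnvironment(env);
        return 0;
    }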

 

J.Gordon asked that, for the SAM matrices in slide 7, GridView specify which VO is represented.

 

Upon request from K.Bos, Qin Gang reported some news about ASGC:

-       There are now 2 LTO3 tape drives online.

-       The installation of the other six LTO4 tape drives will be about two weeks late.

 

4.    ALICE 2009Q1 QR Report (Slides) – Y.Schutz

 

Y.Schutz presented the 2009Q1 quarterly report for ALICE.

4.1      Data Taking and Processing

Because of planned interventions on the experiment, such as cabling modification and installation of additional detectors, ALICE data taking stopped in October 2008.

 

There will be short periods of data taking:

-       cosmic rays with only a subset of detectors, starting in April and not generating a large amount of data;

-       cosmics with the complete detector will resume in June 2009

 

The full (cosmic) data processing chain will be reactivated in June 2009 and will include:

-       On line calibration and QA

-       Collection of calibration data in OCDB

-       On demand, not automatic, replication to Tier-1 Sites

-       Data reconstruction at Tier-0, on demand re-reconstruction at Tier-1 Sites

-       Data analysis on the Grid and on the existing analysis facilities CAF and GSIAF, and possibly also LAF in Lyon.

 

ALICE also continues to test data transfers with periodic tests of FTS/FTD from Tier-0 to Tier-1 Sites.

4.2      Monte Carlo Data

Continuous production is in progress and is prioritized according to the LHC schedule:

-       Small first physics productions

-       pp minimum bias (large: 100 million events)

-       pp various physics signals (20 different cycles)

-       AA heavy ion data with lowest priority

 

End-user analysis is ongoing in ALICE and the number of users is increasing; presently some 120 regular ALICE users are doing analysis.

4.3      Software: AliRoot

The software for all detectors is ready and includes geometry, particle transport, raw data format, calibration, QA, reconstruction, etc. The new improved TPC tracking (HLT) and TRD reconstruction have been added.

 

Pile-up simulation and vertex reconstruction are implemented and being tested. Improvements and new implementations of the trigger simulation are continuously under way.

 

There are now regular weekly reviews of new MC productions in order to monitor the usage of resources.

4.4      Analysis Facilities

In the ALICE analysis software the framework has been finalized, with only minor fixes remaining. The Physics Working Group continuously develops and consolidates analysis algorithms, and the ALICE Analysis Train is now part of the Grid processing.

 

There are now several facilities being used by ALICE:

 

CAF and GSIAF

-       Regular ROOT/AliRoot and core PROOF updates

-       In production; a high-availability, high-demand service; all new MC productions are validated on CAF

 

LAF

-       Lyon Analysis Facility under development

-       A PROOF-enabled farm is in preparation

4.5      ALICE Services

The stable version of AliEn is in operation.

 

Job submission is done through the WMS. Submission code updates and WMS patches are increasing the service stability. New gLite 3.2 WMS instances have been added (4, at CERN, LAL and IPNO).

 

The CREAM CE deployment is ongoing; the most recent sites are CERN, Legnaro, IHEP and Nantes. ALICE will likely meet the deadline (end of June) for parallel deployment at all ALICE sites.

 

SLC5 is not in the plan but is highly desirable for ALICE. All ALICE software has been ported to SL5.

 

Participation in STEP09 is being planned among the activities shown in the timeline in the attached slides.

 

 

Deployment of storage continues; the individual site validation model is working very well.

4.6      Hardware

ALICE would like to have a large CASTOR2 disk pool for RAW data registration. Work has started between IT/DM, IT/FIO and ALICE on setting up and testing a CASTOR2 instance with xrootd access for RAW data; the size of the pool will be O(1.5 PB).

 

It would use CASTOR v2.1.8 with the latest xrootd developments. The current progress: a test pool is being used to tune:

-       writing from the DAQ P2 buffer (via xrootd),

-       reconstruction Grid jobs,

-       copying from disk to tape,

-       and, later, adding T0 to T1 transfers via FTD/FTS.

4.7      Milestones

The present and upcoming milestones for ALICE are:

-       MS-129 Mar 09: Analysis train operational. Done; new wagons (algorithms from the physicists) are added and tested as they become available.

-       MS-130 Jun 09: CREAM CE deployed at all ALICE sites

-       MS-131 26 Jun 09: AliRoot release ready for data taking

 

I.Bird asked whether ALICE received any feedback, regarding their new requirements, from the Resources Scrutiny Group.

Y.Schutz replied that he had just answered the questions received from the RSG.

H.Renshall added that the RSG will meet on the following day.

 

5.   STEP09 Metrics – A.Aimar

 

At the WLCG Workshop it was said that there should be clear metrics to measure STEP09’s achievements (tape rates, etc.).

Collecting common metrics, attempted in the past, has failed because Sites have different configurations and can only collect different measurements.

 

Should we just have a list of URLs where Sites publish the metrics they can collect?

 

H.Renshall added that ATLAS is defining the rates they expect from the different Sites.

 

I.Bird stated that there should be a way to see what the rates reached by the Sites are.

 

F.Hernandez noted that at IN2P3 these MSS metrics are only visible in the intranet, not publicly.

 

New Action:

Next week there will be a round table at which Sites should explain how they are going to show the rates they reach, in order to prove that they fulfil the required tape rates in STEP09.

 

F.Hernandez added that the Sites have difficulty knowing what kind of data is written by the Experiments: Sites can only obtain aggregated figures, perhaps by Experiment, without distinguishing the kind of data.

 

K.Bos added that during the ATLAS challenge there were other ATLAS tape activities from another community, and ATLAS could not distinguish the tape activity within the Experiment itself.

It is a common problem that Experiments have to solve.

 

6.   Update High Level Milestones (HLM_20090406.pdf) – A.Aimar

 

The MB reviewed and commented on the future milestones in the HLM table attached.

6.1      Sites SLAs

ALICE and CMS still have to approve the SLAs at some Tier-1 Sites.

 

WLCG-07-05b  Jul 2007  VOBoxes Support Accepted by the Experiments
                       VOBoxes support level agreed by the experiments

                       ALICE: n/a at ASGC, PIC, TRIUMF, BNL, FNAL
                       ATLAS: n/a at NDGF, PIC, FNAL
                       CMS:   n/a at NDGF, SARA NIKHEF, TRIUMF, BNL
                       LHCb:  n/a at ASGC, NDGF, TRIUMF, BNL, FNAL

6.2      Pilot Jobs Frameworks

The Experiments’ frameworks are followed by M.Litmaath, who reports monthly at the GDB.

A.Aimar will update the status below when there is news.

 

WLCG-08-14  May 2008  Pilot Jobs Frameworks studied and accepted by the Review working group
                      Working group proposal complete and accepted by the Experiments.

                      ALICE        ATLAS        CMS        LHCb: Nov 2007

 

6.3      Tier-2 and VO Sites Reliability Reports

A.Aimar will check the percentage of Tier-2 Sites above 95%.

 

WLCG-08-09  Jun 2008  Weighted Average Reliability of the Tier-2 Federations above 95% for 80% of Sites
                      Weighted according to the Sites’ CPU resources
                      See separate table of Tier-2 Federations
                      80% of the Sites above 95% reliability

 

From April 2009 we will start reporting on the VO-specific SAM Site Reliability.

 

WLCG-08-11  Apr 2009  VO-Specific Tier-1 Sites Reliability
                      Considering each Tier-0 and Tier-1 site (and by VO?)
                      To be reported monthly: Apr 2009, May 2009, Jun 2009

6.4      SL5 Deployment

 

WLCG-09-22  Jul 2009  SL5 Deployed by the Sites (64-bit nodes)
                      Assuming the tests by the Experiments were successful; otherwise a real gcc 4.3 porting of the WN software is needed.

6.5      Tier-1 Sites Procurement

Will be updated after the RRB discussions.

                                                         

WLCG-09-01  Sept 2009  MoU 2009 Pledges Installed
                       To fulfil the agreement that all Sites procure their MoU pledges by April of every year

6.6      SCAS/glExec Deployment

SCAS/glExec was certified weeks ago and is available for deployment.

 

M.Schulz noted that Sites should start installing it; just two or three Sites are not sufficient. Volunteer Sites, at Tier-1 or large Tier-2, are needed for stress testing by the Experiments.

 

Ph.Charpentier added that LHCb had started testing at FZK and NL-T1.

 

F.Hernandez recalled that, as reported at the GDB by A.Retico, some Sites (e.g. IN2P3) need to adapt the current solution to their setup: glExec cannot be installed on a shared installation of the WNs.

 

I.Bird noted that the deployment at all Sites (WLCG-09-19) should maybe be moved forward. After the next GDB the issue will be clearer, and the MB will then agree on the milestone date.

 

J.Gordon concluded that he will ask A.Retico to proceed with more installations so that the issues at each Tier-1 Site are collected.

 

WLCG-09-17  Jan 2009         SCAS Solutions Available for Deployment
                             Certification successful and SCAS packaged for deployment
                             Done in March 2009

WLCG-09-18  Apr 2009         SCAS Verified by the Experiments
                             Experiments verify that the SCAS implementation is working (available at CNAF and NL-T1)
                             ALICE: n/a     ATLAS:     CMS: n/a ?     LHCb:

WLCG-09-19  09-18 + 1 Month  SCAS + glExec Deployed and Configured at the Tier-1 Sites
                             SCAS and glExec ready for the Experiments.

6.7      Accounting Milestones

J.Gordon reported that WLCG-09-02 is done.

 

J.Gordon reported that Sites are publishing the information now (WLCG-09-03). Some still publish “0” physical CPUs; we are in a transition period.

 

WLCG-09-02  Apr 2009  Wall-Clock Time Included in the Tier-2 Accounting Reports
                      The APEL Report should include CPU and wall-clock accounting
                      APEL

WLCG-09-03  TBD       Tier-2 Sites Report Installed Capacity in the Info System
                      Both CPU and Disk Capacity is reported in the agreed GLUE 1.3 format.
                      % of T2 Sites Reporting

WLCG-09-04  TBD       User Level Accounting
                      (verify with the Experiments)
 

6.8      STEP09 Tier-1 Validation

Each Experiment should define its Sites’ validation criteria and the status should be reported and tracked in the HLM.  

 

ID           Date      Milestone
WLCG-09-23   Jun 2009  Tier-1 Validation by the Experiments
                       (Tier-1 Sites: ASGC, CC IN2P3, CERN, DE-KIT, INFN CNAF, NDGF, PIC, RAL, SARA NIKHEF, TRIUMF, BNL, FNAL)

                       ALICE: n/a at ASGC, PIC, TRIUMF, BNL, FNAL
                       ATLAS: n/a at NDGF, PIC, FNAL
                       CMS:   n/a at NDGF, SARA NIKHEF, TRIUMF, BNL
                       LHCb:  n/a at ASGC, NDGF, TRIUMF, BNL, FNAL

6.9      CREAM CE Rollout

Milestones proposed by M.Schulz. WLCG-09-25 is done.

Some Sites have already installed the CREAM CE (WLCG-09-26).

 

WLCG-09-25  Apr 2009  Release of CREAM CE for deployment

WLCG-09-26  May 2009  All European T1 + TRIUMF and CERN at least 1 CE. 5 T2s supporting 1 CE

WLCG-09-27  Jul 2009  2 T2s for each experiment provide 1 CREAM-CE each.
                      ALICE     ATLAS     CMS     LHCb

WLCG-09-28  Sep 2009  50 sites in addition to the ones above
 

6.10    SRM Milestones

I.Bird noted that these milestones should be on hold until the SRM developers meet. At the last GDB what is missing for each of the implementations was presented.

 

J.Gordon added that the Experiments’ feedback is needed and was requested at the GDB.

 

WLCG-09-05  Dec 2008  SRM Short-Term Solutions Available for Deployment
                      CASTOR     dCache     DPM     StoRM     BestMan

WLCG-09-06  TBD       SRM Short-Term Solutions Deployed at Tier-1 Sites
                      Installation at the Tier-1 Sites

6.11    FTS Milestone

All Sites are running FTS on SL4 by now.

 

WLCG-09-07  Mar 2009  FTS Deployed on SL4 at the Tier-1 Sites
                      FTS is ready to be installed on SL4 at the Tier-1 Sites

6.12    Metrics and Monitoring Milestones

As discussed earlier, the Sites should tell A.Aimar how they will make their MSS metrics available.

 

WLCG-09-09  TBD  Tier-1 Sites Define Their MSS Metrics
                 Tier-1 Sites specify which metrics are going to be collected to demonstrate the dataflow supported

WLCG-09-10  TBD  Tier-1 Sites Show Their MSS Metrics
                 Tier-1 Sites specify where their MSS metrics are available

Here is the new table, updated after the discussion. HLM_20090427.pdf

 

 

7.   AOB

 

 

No AOB.

 

8.   Summary of New Actions

 

 

New Action
5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.