LCG Management Board

Date/Time:

Tuesday 6 March 2007 - 16:00 – 18:00 – F2F Meeting at CERN 

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=11628

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 9.3.2007)

Participants:

A.Aimar (notes), I.Bird, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, Di Quing, F.Donno, T.Doyle, C.Eck, M.Ernst, S.Foffano, J.Gordon, F.Hernandez, M.Kasemann, J.Knobloch, E.Laure, M.Lamanna, M.Litmaath, H.Marten, P.Mato, G.Merino, G.Poulard, L.Robertson (chair), J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 13 March 2007 - 16:00-17:00 – Phone Meeting

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

No comment received. Minutes approved.

1.2         Announcements

C.Eck introduced Sue Foffano to the MB. She joined IT/LCG on the 1st March 2007 and will become the new LCG Resource Coordinator at the end of 2007.

1.3         Documents Distributed

The documents below were distributed for information to the MB:

 

-          Experiments Targets for 2007 (UPDATED) (ATLAS CMS targets 2007).

Action:
9 Mar 2007 - ALICE and LHCb send their Targets for 2007 to A.Aimar.


-          Process for reporting resource requirements and installed capacity (UPDATED)
(Process for Reporting Site Capacity and Usage Data_v6_05mar07).

-          Reliability and Availability Data Summary for February 2007
(Summary for February 2007).

Update: The Summary was redistributed on 12 March and site reports are expected by Friday 16 March 2007.

Action:

16 Mar 2007 - Tier-0 and Tier-1 sites should send to the MB List their Site Reliability Reports for February 2007.

1.4         Benchmarking

F.Hernandez raised the issue of benchmarking CPU capacity and of what should be reported as installed CPU capacity.

 

L.Robertson informed the MB that he had talked to H.Meinhard, Chair of the HEPIX working group on Benchmarking. That working group is just starting its activity, but H.Meinhard will present to the MB CERN’s procedure for benchmarking new hardware.

 

J.Templon said that SARA is running the SPEC Benchmarks, executing one benchmark application per core. CERN and FZK confirmed that they also use that same procedure.

 

Concerning which units to report, the MB confirmed that sites should use the SPECint results from the vendors and, when these are not available, use the SPEC benchmarks to assess new hardware.
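
As an illustration of the per-core procedure described above, the sketch below launches one benchmark instance per CPU core in parallel and sums the per-core scores. The runspec invocation, the config file name and the score parsing are assumptions, not the actual SPEC harness or the sites’ tooling.

# A minimal illustrative sketch (not the actual SPEC harness): it launches one
# benchmark process per CPU core in parallel, as in the per-core procedure
# described above, and sums the per-core scores. The command "runspec" with a
# site config file and the regular expression used to extract a score are
# assumptions and would need to match the local SPEC installation.

import multiprocessing
import re
import subprocess

BENCHMARK_CMD = ["runspec", "--config=site.cfg", "int"]  # placeholder invocation


def run_one(core_id):
    """Run one benchmark instance, intended to keep a single core busy."""
    out = subprocess.check_output(BENCHMARK_CMD).decode("utf-8", "replace")
    match = re.search(r"SPECint\S*\s+([\d.]+)", out)      # assumed output format
    return float(match.group(1)) if match else 0.0


if __name__ == "__main__":
    cores = multiprocessing.cpu_count()
    scores = multiprocessing.Pool(cores).map(run_one, range(cores))
    # The capacity reported for the node is the sum over all cores.
    print("cores=%d, per-core scores=%s, total=%.1f" % (cores, scores, sum(scores)))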

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

  • 22 Feb 2007 - Experiments should verify and update the Megatable (the Tier-0-Tier-1 peak value in particular) and inform C.Eck of any change.

C.Eck reported that LHCb’s values are now correct. The values from CMS and ALICE still need to be updated.

 

  • 27 Feb 2007 - H.Renshall agreed to summarize the experiments’ work for 2007 in an “Overview Table of the Experiments’ Activities”.

Not done. H.Renshall presented the Experiments Plans at the GDB on the following day.

 

  • 6 Mar 2007 - L.Robertson will organize a meeting with someone from the accelerator to discuss the planning until 2020.

 

L.Robertson discussed the issue with J.Engelen. Beyond 2009 there are several open options for the evolution of the LHC.

 

J.Engelen will speak to the DG about the evolution of the CERN computing facilities. The issue could be postponed until it is clear how the LHC will be upgraded in the future. In the meantime the requirements could be based on evolutionary estimates drawn from past experience.

 

M.Kasemann said that he agrees with the evolutionary approach. He also stated that the Experiments should explicitly support IT in whatever approach is selected for the long-term planning and for a new CC.

 

3.      Update and Discussion on the SRM 2.2 Progress (Slides) – F.Donno

 

F.Donno presented the progress of the SRM 2.2 implementations.

3.1         SRM 2.2 Implementations

Slide 2 recalls the kinds of tests executed and notes that they are run as cron jobs 5 times per day.

 

During February (slide 3) the testing was in Phase 2:

-          Perform tests on use-cases (GFAL/lcg-utils/FTS/experiment specific), boundary conditions and open issues in the spec that have been agreed on.

 

Phase 3, started in March (and lasting until “satisfaction”), consists of:

-          Add more SRM 2.2 endpoints (some Tier-1 sites?)

-          Stress testing of the installations

 

Slides 5 to 9 show the test results for each SRM implementation. One can see that the “Cross Copy and Interoperability” tests are those with the most issues (especially CASTOR).

 

Update: The plots in the attached slides now show the availability during last month and not since October 2006 (changed following the MB’s request).

 

There has not been much improvement compared to January 2007:

-          DPM: version 1.6.3 available in production. SRM 2.2 features still not officially certified. Implementation stable. Use case tests are OK. Copy not available but interoperability tests are OK. A few general issues remain to be solved.

-          DRM and StoRM: Copy in PULL mode not available in StoRM. Stable implementations. Some Use case tests still not passing and under investigation.

-          dCache: Stable implementation. Copy is available and working with all implementations except CASTOR and DRM. Working on some use case tests.

-          CASTOR: The implementation is still rather unstable. A lot of progress during the last 3 weeks. Main instability causes found by David Smith (race conditions, unintended mixing of threads and forks, etc.). Various problems found and fixed by S. De Witt and G. Lo Presti. Several use cases are being resolved.

3.2         SRM 2.2 Interface and Clients

The SRM 2.2 specifications still have some unclear issues:

-          Order of some operations and concurrency.

-          Overwrite mode not supported by dCache at the moment.

-          Get agreement for clean up procedures in case of aborted operations.

-          Moving complete directories

 

FTS

-          SRM client code has been unit-tested and integrated into FTS

-          Tested against DPM, dCache and StoRM. For CASTOR and DRM the tests are just started.

-          Released on the development test bed.

-          Experiments could do tests on the dedicated UI that has been set up for this purpose.

-          New dCache endpoints are now setup at FNAL for stress test.

 

GFAL/lcg-utils

-          New RPM files available on test UI. No outstanding issues at the moment.

-          Still using old schema.

3.3         GLUE Schema

The new specifications of GLUE 1.3 are available (http://glueschema.forge.cnaf.infn.it/Spec/V13). Not everything originally proposed was accepted, but the important changes were accepted and implemented.

 

The LDAP implementation is available on the test UI. Static information providers are available on the test UI for CASTOR, dCache, DPM and StoRM. The clients now need to adapt to the new schema.
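
As a rough illustration of how a client can inspect what a site publishes, the sketch below queries a BDII over LDAP for GlueSA (storage area) entries using the standard ldapsearch tool. The BDII host, port, search base and attribute names are assumptions based on typical GLUE 1.x values and should be checked against the GLUE 1.3 specification linked above.

# A hedged sketch of how a client might inspect what a site publishes under
# the GLUE schema, by querying a BDII over LDAP with the standard ldapsearch
# tool. The BDII host, the port (2170 is the usual BDII port), the search base
# and the attribute names shown are assumptions / typical GLUE 1.x values;
# clients adapting to GLUE 1.3 should check them against the specification.

import subprocess

BDII = "ldap://lcg-bdii.example.org:2170"   # hypothetical BDII endpoint
BASE = "o=grid"


def query_storage_areas(bdii=BDII, base=BASE):
    """Return the raw LDIF describing published GlueSA (storage area) entries."""
    cmd = [
        "ldapsearch", "-x", "-LLL",
        "-H", bdii,
        "-b", base,
        "(objectClass=GlueSA)",
        "GlueSALocalID", "GlueSAPath",       # attributes of interest (assumed)
    ]
    return subprocess.check_output(cmd).decode("utf-8", "replace")


if __name__ == "__main__":
    print(query_storage_areas())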

3.4         Update on the GSSD Working Group

The members of the group agreed to organize the work on specific issues.

 

SRM v1 to v2 plan

-          Some testing activity started with DPM and CASTOR. The sites that will participate in the testing activities and have committed to them are IN2P3, FZK, BNL, GRIF/LAL, UK Tier-2s, and RAL.

-          A draft report is being compiled.

 

Experiments input/Tier-1s input

-          LHCb input completed during last pre-GDB meeting

-          Phone conf with CMS representatives. Very good progress. Need to better refine the input received so far.

-          ATLAS will be next target.

-          Discussing with sites the implications of the requests received.

 

Monitoring utilities

-          A draft report has been compiled with the possibilities offered by DPM and dCache.

-          It will be circulated to the list and to the DPM/dCache developers for input and corrections.

-          INFN will proceed with a prototype monitoring tool that can be included in SAM/GridView. No further effort will be requested from the MSS/SRM developers.

 

L.Robertson asked which sites will test the 1.1 to 2.2 migration and perform the stress tests.

F.Donno replied that testing of dCache will start first because the dCache implementation is ready before the others. Candidate multi-VO sites are IN2P3, FZK and SARA. For these tests to be effective it is important that the dCache team helps.

 

IN2P3 and FZK have the resources for such testing and agreed to install dCache 1.8, which includes SRM 2.2.

BNL asked to be included in the tests and also offered help for the dCache migration. DESY will also participate.

 

Ph.Charpentier raised the possible problems (change of PNFS, SA path, etc.) that could appear during the migration of the experiments’ data. M.Litmaath replied that although sites have different organizations the migration should not be too difficult if it is organized by each VO talking directly with each site. The goal of the GSSD study group is also to identify and propose solutions for the migration of the different SRM implementations.

 

4.      Criteria for Deploying the New WMS and New CE (Slides; document) – I.Bird

 

I.Bird presented the criteria for deploying the coming gLite components (WMS and CE) including the LCG requirements that each component must fulfil.

 

4.1         GLite WMS

The effort to put this component into production readiness started in July 2006 and the performance was sufficient for the CMS CSA06 tests.

After that achievement, many ongoing issues of reliability and manageability prevented making this the production version and replacing the LCG-RB.

 

Some of these issues are still not resolved and the WMS is now out of certification, back with the developers.

 

The CERN team cannot take the responsibility for driving the WMS improvements needed to achieve certification. INFN have agreed that this should be their responsibility and that the Deployment Team will define the criteria and the requirements for taking the WMS back into certification.

4.2         WMS Performance Requirements by the Experiments

The table below shows the performance expected by ATLAS and CMS in 2007 and 2008. (Note: values updated wrt the slides)

 

 

                          CMS                          ATLAS

Performance
  2007 Dress rehearsals   100K job/day in CSA07        60K successful jobs/day + analysis load
  2008                    200K jobs/day through WMS    100K jobs/day
                          <10 WMS                      <10 WMS

Stability                 Not specified                <1 restart of WMS or LB every month (== LCG RB)

 

ALICE and LHCb have similar requirements.

4.3         Summary of LCG Requirements on WMS

Based on the table above, the LCG Requirements are:

 

-          Performance:

2007 dress rehearsals: 100K successful jobs/day

2008: 200K successful jobs/day

using <10 WMS entry points

-          Stability:

<1 restart of WMS or LB every month under this load

4.4         GLite WMS Acceptance Criteria

A single WMS machine should demonstrate submission rates of at least 10K jobs/day sustained over 5 days, during which time the WMS services including the L&B should not need to be restarted.

-          This performance level should be reachable with both bulk and single job submission.

-          During this 5-day test the performance must not degrade significantly due to filling of internal queues, memory consumption, etc.; i.e. the submission rate on day 5 should be the same as that on day 1.

-          Proxy renewal must work at the 98% level: i.e. <2% of jobs should fail due to proxy renewal problems (the real failure rate should be less because jobs may be retried).

-          The number of stale jobs after 5 days must be <1%.

 

The L&B data and job states also must be verified:

-          After a reasonable time after submission has ended, there should be no jobs in "transient" or "cancelled" states

-          If jobs are very short no jobs should stay in "running" state for more than a few hours

-          After proxy expires all jobs must be in a final state (Done-Success or Aborted)
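
As a purely illustrative aid (and not the test suite by A.Sciaba mentioned below), the toy sketch below shows how the numeric thresholds above might be checked over a set of job records; the record format, field names and the 10% degradation margin are invented for illustration.

# A toy check of the numeric acceptance thresholds listed above; this is NOT
# the actual test suite, and every field name and record format here is
# invented purely for illustration.

PROXY_FAILURE_LIMIT = 0.02   # <2% of jobs may fail due to proxy renewal
STALE_LIMIT = 0.01           # <1% of jobs may be stale after 5 days
FINAL_STATES = ("Done-Success", "Aborted")


def evaluate(jobs):
    """jobs: list of dicts with hypothetical keys 'state', 'failure_reason', 'day'."""
    total = len(jobs)
    proxy_failures = sum(1 for j in jobs if j.get("failure_reason") == "proxy-renewal")
    stale = sum(1 for j in jobs if j["state"] not in FINAL_STATES)
    rate_day1 = sum(1 for j in jobs if j["day"] == 1)
    rate_day5 = sum(1 for j in jobs if j["day"] == 5)
    return {
        "proxy_failure_ok": proxy_failures / float(total) < PROXY_FAILURE_LIMIT,
        "stale_ok": stale / float(total) < STALE_LIMIT,
        # "no significant degradation": day-5 throughput within 10% of day 1 (assumed margin)
        "throughput_ok": rate_day5 >= 0.9 * rate_day1,
    }


if __name__ == "__main__":
    sample = [{"state": "Done-Success", "failure_reason": None, "day": d}
              for d in (1, 2, 3, 4, 5) for _ in range(10000)]
    print(evaluate(sample))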

 

To verify these criteria, the test suite written by A.Sciaba, currently used by him and S.Campana, will be taken as the baseline.

 

J.Templon asked how “deployability” and maintainability are checked, because performance and stability are not sufficient for the sites.

I.Bird replied that the installations and updates will be easily deployable and fully configurable, e.g. via YAIM or QUATTOR, without special ad-hoc modifications by the sites.

4.5         GLite CE Acceptance Criteria

Performance

2007 dress rehearsals

-          5000 simultaneous jobs per CE node.

-          50 user/role/submission node combinations (Condor_C instances) per CE node

 

End 2007

-          5000 simultaneous jobs per CE node (assuming same machine as 2007, but expect this to improve)

-          1 CE node should support an unlimited number of user/role/submission node combinations, from at least 10 VOs, up to the limit on the number of jobs. (might be achieved with 1 Condor_C per VO with user switching done by glexec in blah)

 

J.Templon asked to specify that the 10 VOs are simultaneously operating.

 

Ph.Charpentier asked why the limit of 50 user/role/submission combinations is introduced.

I.Bird replied that this is a value proposed for the Dress Rehearsal; for end 2007 that limitation is removed.

 

Reliability

-          Job failure rates due to CE in normal operation < 0.5%;

-          Job failures due to restart of CE services or CE reboot <0.5%.

 

2007 dress rehearsals

-          5 days unattended running with performance on day 5 equivalent to that on day 1

 

End 2007

-          1 month unattended running without performance degradation

 

4.6         Summary

WMS:

-          Propose the above as LCG requirements – clear statement from CMS, but not yet from ATLAS

-          Discussed with certification team, deployment testers, EIS testers, developers

 

CE:

-          Propose these requirements as LCG requirements – based on LCG-CE and deployment experience

-          Discussed with certification team, deployment testers, and developers

 

LFC:

-          A similar document will be proposed for LFC soon.

 

5.      Update on SL4 Migration of the Middleware – (Slides from GDB) M.Schulz

 

 

Build with ETICS - On 16 February the first build without any error came out of ETICS, but the RPM files built are neither installable nor complete. Work is ongoing to resolve the dependency problems and build valid RPMs.

 

WN and UI Installation – The installation of the WN and UI has been done in parallel (without ETICS), with the dependencies resolved.

 

UI Testing - The UI has been under test for one week. It is still far from a deployable solution.

 

WN Testing - SAM runs successfully with a CE on SL3 and a WN on SL4 with the new versions of VDT, etc.

 

Estimates - Estimates are difficult, but roughly 2-3 days to solve the dependencies and then 2 more weeks to obtain deployable RPMs.

 

Convention to Identify a Node – A SAM test is defined to verify that the OS version is published in a clear way (distinguishing nodes running SL3 vs SL4 and 32-bit vs 64-bit), following the established conventions.
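
As an illustration only, a check of the kind such a SAM test might perform could look like the sketch below; the published string format is invented here, and the actual convention is the one referred to above.

# A sketch of the kind of check such a SAM test could perform on a worker
# node: read the OS release and the machine architecture to distinguish SL3
# from SL4 and 32-bit from 64-bit. The string format is illustrative only;
# the real convention is defined elsewhere.

import platform


def node_flavour():
    """Return e.g. 'SL4-x86_64 (kernel ...)' based on the running OS and architecture."""
    release = platform.release()              # kernel release, e.g. '2.6.9-55.EL'
    try:
        with open("/etc/redhat-release") as f:
            distro = f.read().strip()         # e.g. 'Scientific Linux SL release 4.5'
    except IOError:
        distro = "unknown"
    major = "SL4" if " 4" in distro else ("SL3" if " 3" in distro else "SL?")
    arch = platform.machine()                 # 'i686' (32-bit) or 'x86_64' (64-bit)
    return "%s-%s (kernel %s)" % (major, arch, release)


if __name__ == "__main__":
    print(node_flavour())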

 

Fallback Solution - The fallback solution is to have SL3 binaries running on SL4. There are tarballs ready for the CE and the WN on the PPS.

 

Ph.Charpentier asked that the SL4 client libraries (LFC, GFAL, etc.) be passed to the Applications Area as soon as they are available. The experiments need to link and test their applications with the new versions of the middleware components.

 

6.      Report on Job Priority Implementations (site by site) (Slides) – J.Templon

 

J.Templon summarized the status of the implementations of Job Priorities at the Tier-1 sites.

 

He distributed a questionnaire asking about:

-          Implementation of the mapping of VOMS credentials to scheduler credentials

-          Assignment of scheduler shares to these credentials

-          E.g. What is the production share for ATLAS?

-          Whether the sites attempt to publish data to the information system

 

In addition he checked whether the data from the site was correctly published in the information system.

No site, except SARA, does it.
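
Purely as an illustration of what the questionnaire above asks about, the sketch below shows a hypothetical mapping from VOMS FQANs to scheduler shares and a toy form of what a site might publish; the FQANs, share values and output format are invented, and each batch system and information provider has its own concrete syntax.

# A purely illustrative example of the two ingredients the questionnaire asks
# about: a mapping from VOMS credentials (FQANs) to local scheduler shares,
# and what a site would then publish to the information system. The FQANs,
# share values and published strings are invented.

# VOMS FQAN                      -> fraction of the ATLAS share at this site
SCHEDULER_SHARES = {
    "/atlas/Role=production": 0.75,   # e.g. 75% of ATLAS capacity for production
    "/atlas":                 0.25,   # remainder for ordinary ATLAS users
}


def publish_shares(shares=SCHEDULER_SHARES):
    """Print the share per FQAN in a form an information provider might publish."""
    for fqan, share in sorted(shares.items()):
        print("VOView %-30s share=%d%%" % (fqan, round(share * 100)))


if __name__ == "__main__":
    publish_shares()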

 

 

-          No replies from FZK, NDGF and FNAL.

-          Eight sites claim to have VOMS mappings (ASGC answered not clearly)

-          Six sites have mappings to shares, one an “equivalent” scheme, two gave unclear answers (CNAF and ASGC)

-          Three sites claim to publish to the IS, the other six said no. Of the three, one does not in fact publish and one publishes incorrectly

 

IN2P3 does not publish because they are not sure whether it is compatible with the way they were doing it before.

They should move to the new system soon.

CNAF was not sure whether they publish correctly. They do not; the information is not available in the information system.

From ASGC’s mail it is unclear what they are publishing.

 

J.Templon also noted that INFN Padua is publishing correctly and therefore CNAF could check with and learn from them.

 

I.Bird asked who will follow up the JP status and make sure that all sites get their JP setup correct and start publishing information.

J.Templon will ask again for a clear reply on the status. Nobody replied that the documentation was insufficient.

 

J.Gordon proposed to select a small number of sites and make those work.

 

H.Marten asked whether the proposed solution is compatible with the current GLUE schema.

J.Templon replied that the solution proposed is compatible with the current and future GLUE schemas.

 

M.Schulz added that they are waiting for an update of APEL for the accounting, in order to be able to understand and display accounting for groups and roles.

J.Gordon replied that patch 1083, which includes the fix, should be in certification.

 

F.Hernandez explained that IN2P3 could publish the VOMS information with their current solution (they do not do it yet). But in order to use the dynamic scheduler they want to verify that the information is compatible with their current system.

J.Templon proposed to discuss and solve the issue off-line.

 

J.Templon agreed to follow up the issue of Job Priorities until the sites implement it and publish the information correctly.

Sites should send him information and issues; he will report to the MB regularly.

 

7.      AOB: Lessons for the naive Grid user (Slides) – T.Doyle

 

At the end of 2006 S.Lloyd and T.Doyle developed a prototype for ATLAS analysis tests on the UK sites.

The presentation summarizes their feedback about the usability of the Grid services.

 

The aim was to emulate the ATLAS ‘user experience’ of the (UK) Grid.

 

The method followed was to send one job of each type per hour to each UK site:

  1. “Athena Hello World” – checks ATLAS code installed and available to WNs
  2. “New Athena Package” – checks that CMT, gmake etc work correctly
  3. “User analysis” – attempts to copy AOD data (Z->ee) from local SE to WN and calculate Z mass
  4. Coming soon – as test no 3 but POSIX IO from SE to WN

 

The correct answer is then checked from the job output: 10 (events) for tests 1 and 2, and 80-100 GeV for test 3.
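
For illustration, the output check described above could be sketched as below; the patterns used to extract the numbers from the job output are assumptions, and the real harness by S.Lloyd and T.Doyle will differ.

# A sketch of the kind of output check described above: tests 1 and 2 should
# report 10 events, test 3 should reconstruct a Z mass between 80 and 100 GeV.
# The way the numbers are pulled out of the job output is an assumption.

import re


def check_output(test_number, stdout_text):
    """Return True if the job output contains the expected answer."""
    if test_number in (1, 2):
        m = re.search(r"events\s*[:=]\s*(\d+)", stdout_text, re.IGNORECASE)
        return bool(m) and int(m.group(1)) == 10
    if test_number == 3:
        m = re.search(r"Z\s*mass\s*[:=]\s*([\d.]+)", stdout_text, re.IGNORECASE)
        return bool(m) and 80.0 <= float(m.group(1)) <= 100.0
    return False


if __name__ == "__main__":
    print(check_output(1, "processed events: 10"))      # True
    print(check_output(3, "fitted Z mass = 91.2 GeV"))  # True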

 

Slide 3 below shows the status of the sites.

 

Legend:

A = Aborted by the RB

C = Currently still running

X = Cancelled after 8h

F = No answer

S = Success

 

The 3 tests are executed on all UK Grid sites.

 

The current status is 64% success rate.

 

These tests helped to:

-          identify and fix many problems at individual sites (GridPP DTeam)

-          expose other ‘generic’ system failures that need to be addressed before the system is fit for widespread use by inexperienced users

7.1         Issues with the RB

The tests use RBs in the UK at RAL (2) and Imperial:

-          The RBs break about once a week; all jobs are lost or left in limbo and it is never clear to the user why

-          One could switch to a different RB but the users don’t know how to do this

-          Barely usable for bulk submission – too much latency

-          Can barely submit and query ~20 jobs in 20 mins before next submission

         Users will want to do more than this

-          Cancelling jobs doesn’t work properly – often fails and repeated attempts cause RB to fall over

         Users will not cancel jobs

-          They used the EDG RB, which is deprecated, but the gLite RB is not currently deployed.

 

I.Bird noted that the failures at CERN do not occur with that frequency.

T.Doyle replied that it may be due to deployment problems in the UK and issues at the sites, not to the middleware component itself.

7.2         Issues with the Information System

-          lcg-info is used to find out what version of the ATLAS software is available before submitting a job to a site, but it is too unreliable and the previous answer needs to be kept track of.

-          It seems unreliable (e.g. an ldap query typically gives a quick, reliable answer but lcg-info doesn’t)

-          The lcg-info command is very slow (querying *.ac.uk or xxx.ac.uk) and often fails

-          Different BDIIs seem to give different results and it is not clear to users which one to use (if the default fails)

-          Many problems with UK SE's have made the creation of replicas painful - it is not helped by frequent BDII timeouts

-          The FCR freedom of choice tool causes some problems because sites fail SAM tests when the job queues are full

7.3         Issues with UI and Proxies

User Interface

-          Users need local UIs (where their files are)

-          These can be set up by local system managers but generally these are not Grid experts

-          The local UI setup controls which RB, BDII, LFC, etc. all the users of that UI get, and these appear to be pretty random

         There needs to be clear guidance on which of these to use and how to change them if things go wrong

 

Proxy Certificates

-          These cause a lot of grief as the default 12 hours is not long enough

-          If the certificate expires, the cause is not always clear from the error messages when running jobs fail

-          They can be created with longer lifetimes but this starts to violate security policies

         Users will violate these policies

-          Maybe MyProxy solves this but do users know?
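
As an illustration of a possible mitigation for the proxy problem above, the sketch below warns the user when the remaining proxy lifetime is shorter than the expected job duration; it assumes voms-proxy-info accepts a -timeleft option (as grid-proxy-info does) that prints the remaining seconds, and the 12-hour figure is only an example.

# A small illustrative helper for the proxy problem described above: warn the
# user when the remaining proxy lifetime is shorter than the expected job
# duration, instead of letting jobs fail with obscure errors. It assumes
# voms-proxy-info accepts a "-timeleft" option that prints the remaining
# seconds; the 12-hour threshold is just an example.

import subprocess
import sys


def proxy_seconds_left():
    out = subprocess.check_output(["voms-proxy-info", "-timeleft"])
    return int(out.decode().strip())


def warn_if_short(expected_job_hours=12):
    left = proxy_seconds_left()
    needed = expected_job_hours * 3600
    if left < needed:
        sys.stderr.write(
            "Proxy expires in %.1f h but the job may run for %d h: "
            "renew it (or register it in MyProxy) before submitting.\n"
            % (left / 3600.0, expected_job_hours))
    return left


if __name__ == "__main__":
    warn_if_short()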

 

I.Bird noted that most of the explanations are in the Users Guide and that the problems are solved by the experiments’ applications.

T.Doyle recognized that the guide is much improved but the users should have some preconfigured environment to use.

 

J.Templon confirmed that MyProxy messages are not easy to understand.

7.4         Issues with GGUS

GGUS is used to report some of these problems but it is not very satisfactory. The initial response is usually quite quick, saying the ticket has been passed to X, but the response after that is very patchy. Usually there is some sort of acknowledgement but rarely a solution, and often the ticket is never closed even if the problem was transitory and is now irrelevant.

 

There are two particular cases which GGUS does not handle well:

a)      Something breaks and probably just needs to be rebooted: the system is just too slow and it's better to email someone (if you know whom)

b)      Something breaks and is rebooted/repaired etc but the underlying cause is a bug in the middleware: this doesn't seem to be fed back to the developers

There are also, of course, some known problems that take a long time to be fixed (e.g. the Globus port range bug, rfio libraries, etc.).

 

More generally, the GGUS system is working at the ~tens of tickets/week level but may not scale as new users start using the system.

7.5         Conclusions

The Grid is a great success for Monte Carlo production. However, it is not in a fit state for a basic user analysis.

The tools are not suitable for bulk operations by normal users. Current users therefore set up ad hoc scripts that can be misconfigured.

 

‘System failures’ are too frequent (largely independent of the VO, probably location-independent)

The User experience is poor

-          Improved overall system stability is needed

-          Template UI configuration (being worked on) will help

-          Wider adoption of VO-specific user interfaces may help

-          Users need more (directed) guidance

 

There is not long to go

-          Usability task force required?

 

I.Bird replied that usability is addressed inside the experiments by configuring the applications for the “typical tasks” that a physicist needs to perform in order to access and work with the Grid.

 

M.Lamanna added that many of the users should and will work at a higher level. The experiments decided long ago to attack the problem with tools like Ganga that isolate the user from all details and difficulties. Usability should be considered in the frameworks of the experiments, not for individuals using low-level services directly:

-          ATLAS and LHCb are extensively using Ganga and users are more and more on board.

-          CMS has developed CRAB and has a similar experience. This approach is proving very successful.

 

G.Poulard confirmed that in ATLAS, by using Ganga, the users’ situation is much improved, and the approach of using the low-level services directly is discouraged.

 

I.Bird added that VO-specific testing should be increased in the SAM testing, but this requires that the experiments provide those tests.
This would automatically verify the availability of sites and their usability by the experiments.

 

8.      Summary of New Actions

 

 

 

 

 

 

 

 

The full Action List, current and past items, will be on this wiki page before the next MB meeting.