Notes of GDB meeting, July 11th 2012 DRAFT


---++ Agenda


* http://indico.cern.ch/conferenceDisplay.py?confId=155070

Welcome (Michel)

* 2nd Wednesday of each month until mid-2013 (no exceptions)

* Pre-GDBs: one every month over the coming months. Each will be confirmed with the meeting notice 3+ weeks in advance

* August meeting is cancelled

* October meeting (and pre-GDB) is planned in Annecy

* September meeting (pre-GDB on data/storage protocols)

* IPv6

* LS1 plans

* WG presentations

* HW trends (Bernd)

* Other issues to follow up: see actions in progress in the TWiki area

* Data storage working groups

* EMI-2 WN testing to be re-discussed in September

* Extended run confirmed – still waiting for details. Consequences for pledges are unclear, but it is probably too late for any significant adjustments

CVMFS deployment status (Ian Collier (presenting)/Stefan Roiser)

- ATLAS – 78/104 WLCG/OSG sites have deployed. Plan to have all sites migrated by Q4 2012 – WN disk space is the main obstacle.

- LHCb – 36/86. Officially requests all sites supporting the VO to deploy CVMFS. Tier-2 sites with CVMFS will be preferred for higher workloads.

- CMS (5 T1s migrated) encourages sites to move to CVMFS

Work on the shared cache client (Jacob?) is ongoing. A slow ramp-up is planned – new features expected by the end of August.

NFS client (Jacob?): a handful of sites are testing. Still problems. Would like to have it in the end-of-August beta release.

DE is mainly constrained by small WN disks. One site has problems when the cache is 50% full of pinned files. Reported as a bug; not easily fixed in the current branch. A new branch will handle this differently and reduce catalogue space requirements. Typical experience is that a 10GB ATLAS cache contains 3GB of pinned files – this needs monitoring, since problems appear once it reaches 5GB.
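For illustration, a minimal sketch of the kind of cache monitoring described above, assuming the default cache location (/var/lib/cvmfs) and the CVMFS_QUOTA_LIMIT setting in /etc/cvmfs/default.local; the paths and the 50% warning threshold are assumptions, not part of the discussion:

<verbatim>
#!/usr/bin/env python
# Minimal sketch: warn when the local CVMFS cache approaches its configured quota.
# Paths and the 50% threshold are illustrative assumptions.
import os
import re

CONFIG = "/etc/cvmfs/default.local"   # site-local CVMFS configuration (assumed location)
CACHE = "/var/lib/cvmfs"              # default cache base (CVMFS_CACHE_BASE)
THRESHOLD = 0.5                       # fraction of the quota at which to warn

def quota_mb(path):
    """Read CVMFS_QUOTA_LIMIT (in MB) from the config file, if present."""
    try:
        with open(path) as f:
            for line in f:
                m = re.match(r"\s*CVMFS_QUOTA_LIMIT\s*=\s*(\d+)", line)
                if m:
                    return int(m.group(1))
    except IOError:
        pass
    return None

def cache_usage_mb(path):
    """Sum the size of all files currently in the cache directory."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # files may disappear while we walk
    return total / (1024.0 * 1024.0)

if __name__ == "__main__":
    quota = quota_mb(CONFIG)
    used = cache_usage_mb(CACHE)
    if quota and used > THRESHOLD * quota:
        print("WARNING: CVMFS cache %.0f MB used of %d MB quota" % (used, quota))
    else:
        print("CVMFS cache usage: %.0f MB (quota: %s MB)" % (used, quota))
</verbatim>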

Do experiments have to push sites, or will WLCG do it? (Ian) The Operations Team (see Maria Girone’s talk later) will follow up on these kinds of issues.

Sites are requested to volunteer for testing the new version, as progress is limited without test sites. The new branch has significant features that may resolve outstanding issues. Alessandra highlights the need for simple instructions; the UK has an example.

Summary of Pre-GDB CE Extensions (David Salomoni)

See the Agenda.

Goals:

- Review extensions – particularly whole node/multi-core (n-m) + memory requirements.

- Agree development plan and timeline

Implies changes in both the CE AND experiment frameworks. Focus on a GLUE 2 solution (not 1.3). Agreed it is best to publish features at queue level rather than site level. CE developers are willing to develop this type of extension.

- ARC CE already supports whole-node / fixed or multi-core through RTE scripts.

- CREAM – supports whole-node / fixed multi-core. Need to know how to specify requests in JDL, etc.

- OSG – in the middle of a decision-making process for long-term CE plans. Whole-machine and n-core jobs have already been possible on all OSG LRMSs. Do we need a variable number of cores?

Environment variables are needed to tell the job what resources it may actually use. Details of the spec for the environment variables are available on the wiki page. LOOKING FOR A TEST LSF SITE. NIKHEF is working on Torque tests. Question: why a script here and files elsewhere? See Tony’s presentation at the last GDB and Ulrich’s WM TEG wiki page. Please send any comments on the implementation to the WM TEG mailing list.
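As a rough illustration of the idea, a minimal sketch of a payload reading its allocated resources, assuming a machine/job features style approach where $JOBFEATURES points to a directory of per-job resource files; the file names (allocated_cpu, mem_limit_MB) are hypothetical, and the actual specification is the one on the WM TEG wiki page:

<verbatim>
#!/usr/bin/env python
# Minimal sketch of a payload discovering its allocated resources.
# JOBFEATURES and the file names used below are illustrative assumptions;
# see the WM TEG wiki page for the actual specification.
import os

def read_feature(directory, name, default=None):
    """Read a single-value resource file from the (job/machine) features directory."""
    try:
        with open(os.path.join(directory, name)) as f:
            return f.read().strip()
    except (IOError, TypeError):
        return default

jobfeatures = os.environ.get("JOBFEATURES")       # directory published by the site (assumed)
ncores = int(read_feature(jobfeatures, "allocated_cpu", 1) or 1)   # hypothetical file name
mem_mb = read_feature(jobfeatures, "mem_limit_MB")                 # hypothetical file name

print("Running with %d core(s), memory limit: %s MB" % (ncores, mem_mb))
</verbatim>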

The main issue is how to let the CE know what the site supports. See the slides for the enhancement steps needed to get the whole chain working. CMS is prepared to test the whole chain. LHCb is willing to test workload management. A limited number of volunteer sites is needed. Larger tests will start soon after the summer.

ATLAS has no requirement for the (n-m) feature and believes it is extra complexity that is not needed at the moment. Should this be implemented at the level of the CE rather than in PanDA? The WM TEG believes it should be.

Jeff – VOs need to realize that they must relate the pilot factory to the payload. Sites will not be happy if they have to reserve cores for which the pilot factory then does nothing. Some VOs handle this, but others leave pilots hanging around doing nothing.

More at September GDB

Post EMI Discussion (Michel)

Initial meeting with EMI/EGI and OSG. Jamie is concerned the membership is not exactly right. The aim is to identify issues relating to WLCG and the end of EMI and supporting projects (not to attempt to solve them), and to foster an operational relationship between the European and North American parts of WLCG. Mainly focused on the middleware – EGI was not much discussed.

Issues:

- Globus – support and packaging need to be addressed. Support can only come from the community. Packaging from IGE on a best-efforts basis for the medium term. Need to identify who makes up the support community.

- EMI MW – most WLCG components have a long-term support plan. HUC products are not part of the EMI list – discussion in progress.

- OSG discussion in progress with external providers.

- MW validation/provisioning – reference repositories? Baseline definition, etc. Markus will write an initial document on MW and identify areas where OSG and Europe could collaborate.

Jamie is concerned about the loss of HUC support (25 FTE have been expended in this area). HUC has made a significant contribution to the experiments – there will be a big impact when it goes. HUC wasn’t discussed at the post-EMI discussion. A discussion with all four experiments to agree what is needed is urgently required. Jamie will organize a meeting by September – mainly an internal CERN discussion.

IS Evolution Follow-up (Maria Alandes Pradillo)

- How to identify the best top-level BDIIs: a proposal.

Exploratory method – via the SAM system. 526 CEs/WNs have been queried. Most WNs use only 1 BDII; only 11% list at least 3 BDIIs. The most heavily used BDIIs will be checked for quality of service and configuration, and the sites concerned will be contacted. A pragmatic solution. It was pointed out that it is important to look at the detail – e.g. check for aliases, etc. The general consensus was that this is a good idea.
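For illustration, a minimal sketch of the kind of per-WN check described above, assuming the node's top-level BDIIs are listed in the conventional LCG_GFAL_INFOSYS variable (comma-separated host:port entries); this is not the actual SAM probe:

<verbatim>
#!/usr/bin/env python
# Minimal sketch of a WN-side check: how many top-level BDIIs does this node know about?
# Assumes the conventional LCG_GFAL_INFOSYS variable (comma-separated host:port list);
# the real SAM probe is not reproduced here.
import os

infosys = os.environ.get("LCG_GFAL_INFOSYS", "")
bdiis = [b.strip() for b in infosys.split(",") if b.strip()]

print("Configured top-level BDIIs: %d" % len(bdiis))
for bdii in bdiis:
    print("  %s" % bdii)
if len(bdiis) < 3:
    print("WARNING: fewer than 3 BDIIs configured - single point of failure")
</verbatim>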

Configuration information needs to be built around the baseline services page – e.g. optimal configurations and practices.

glexec Deployment (Maarten)

The number of CEs has increased by a few: 83 → 85. Some sites disappeared/appeared.

GLexec wiki page recently updated.

OPS tests for T1s are OK (except BNL). LHCb and CMS are OK. ATLAS is OK, but BNL is not set up and IN2P3 has errors.

Progress is very slow. Plan to do another broadcast after the holidays to stimulate more progress. CMS is raising tickets against sites. Hope to see a ramp-up over the coming months. Some sites respond to CMS – some do not. Eventually this will become a critical test for CMS.

Federated Storage (Fabrizio Furano)

Federated storage is one of the main recommendations of the TEG.

Need to make the concept better known, understand the issues and reduce the differences amongst VOs.

The WG should be composed of:

- Tech providers

- Experiments (1/2)

- Sites – two (or three) site managers

Various groups have been contacted looking for volunteers; experience in the topic is needed. Hoping for a first meeting in time for the federated data stores workshop in Lyon.

Discussion of how the ARC system fits into this. A project is ongoing at the moment for WAN data access to ARC systems; it would like to integrate with this federation project.

DPM Community (Oliver Keeble)

The DPM/LFC project is heavily reliant on European funding, which is soon ending. The project has never been in better shape; many new features are available now or coming soon. A collaboration of stakeholders is proposed. CERN does not run a DPM instance. 36PB deployed (10 instances over 1PB). Over 200 sites in 50 regions. Over 300 VOs have access. France and the UK are the biggest users by capacity; also the next 5.

LFC: ATLAS and LHCb plan to move away, but with no timescale. Numerous non-HEP VOs also use it. Ongoing support is still needed.

Appealing to these big user regions to come forward to discuss a support model. Sites are invited to express their interest in a collaboration (with an MoU). The timescale is to identify interest by the end of the year.

Plan B: CERN will not abandon DPM at the end of EMI. Plan B would be to manage critical bug fixes and security fixes, but with no significant development (unlike the recent past).

SHA-2 and RFC Proxy Support (Maarten)

IGTF plans: CAs will be free (or even recommended) to commence issuing certificates with SHA-2 from 1st January 2013. It looks impossible to prepare for SHA-2 in 5 months (particularly with the extended run). All EMI-2 products should support RFC proxies (little uptake so far, the WMS is not supported, dCache has issues). Many weeks yet before the affected products can be endorsed by UMD. The status of central services is also uncertain.
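As an aside, a minimal sketch of how a site might check whether a given certificate is SHA-1 or SHA-2 signed, by asking the openssl command line for the signature algorithm; the certificate path is illustrative:

<verbatim>
#!/usr/bin/env python
# Minimal sketch: report the signature algorithm of an X.509 certificate,
# to spot SHA-1 vs SHA-2 signed certificates. The default certificate path is illustrative.
import subprocess
import sys

def signature_algorithm(cert_path):
    """Return the signature algorithm string reported by openssl for a PEM certificate."""
    out = subprocess.check_output(
        ["openssl", "x509", "-in", cert_path, "-noout", "-text"])
    for line in out.decode("utf-8", "replace").splitlines():
        line = line.strip()
        if line.startswith("Signature Algorithm:"):
            return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/grid-security/hostcert.pem"
    algo = signature_algorithm(path)
    print("%s: %s" % (path, algo))
    if "sha1" in algo.lower():
        print("NOTE: still SHA-1 signed")
</verbatim>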

The planned phases and milestones indicate we cannot deploy SHA-2 CAs until January 2014! Can we persuade IGTF to postpone for 1 year?

PLAN B

Issue a short-lived SHA-1 certificate to any WLCG member who cannot get a SHA-1 certificate from their home CA. This is a significant effort and will need to start soon if we are to be ready by 2013. There is some concern about whether central issuing of host certificates would be viable.

We need IGTF to reconsider timeline by September if we are to deploy a new CA by January. We will have major problems if a major CA moves to SHA-2 in January. Dave Kelsey will pursue this matter with IGTF.

Introduction to ARGUS and central banning (Valery Tschopp)

Presented an overview of the ARGUS authorisation engine, the PAP policy language and the PAP admin tool.

CERN is running a central ARGUS service. Sites should check how they would link their local system to the central banning service. How can OSG benefit from the central banning list without running ARGUS? They could use the PAP CLI or download the raw XML and parse it to authorize individual users. However, a way of providing the full list to OSG is being worked on.
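A rough sketch of the "download the raw XML and parse it" option, assuming banned subject DNs appear as XACML AttributeValue elements; the policy URL and the parsing heuristic are illustrative assumptions, not a description of the actual CERN service:

<verbatim>
#!/usr/bin/env python
# Rough sketch of the "download the raw XML and parse it" option.
# The policy URL and the assumption that banned DNs appear as XACML
# AttributeValue elements are illustrative only.
try:
    from urllib.request import urlopen          # Python 3
except ImportError:
    from urllib2 import urlopen                 # Python 2
import xml.etree.ElementTree as ET

POLICY_URL = "https://argus.example.cern.ch:8150/pap/policies"   # hypothetical endpoint

def banned_subjects(xml_text):
    """Collect the text of all AttributeValue elements, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    values = []
    for elem in root.iter():
        if elem.tag.endswith("AttributeValue") and elem.text:
            values.append(elem.text.strip())
    return values

if __name__ == "__main__":
    xml_text = urlopen(POLICY_URL).read()
    for dn in banned_subjects(xml_text):
        print(dn)
</verbatim>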

Operations Coordination Team (Maria Girone)

  • The TEG recommends a service coordination/commissioning team. Computing as a Service is the goal by the end of LS1. The team needs representation/knowledge of sites/regions. Persistent WG. Volunteers are needed from sites (both Tier-1s and Tier-2s).
  • Will relate to existing structures such as daily ops. Short-term task forces will be created to address specific deployment/decommissioning issues. Members of the team will take turns as Service Coordinator On Duty (SCOD). Looking for a team of 15-20 core members – being in the team is not a full-time commitment, but members must be able to take on actions (the SCOD role is perhaps 20% when on duty). The T1SCM will be replaced with a fortnightly Operations Coordination meeting.

Jobs with High Memory Requirements

  • CMS (Steve Gowdy) – CMS jobs kill themselves if they exceed 2.3GB real memory. The vast majority of jobs do not greatly exceed 2GB.
  • ALICE (Maarten) – discussed problems with allocating/deallocating memory in order to keep the memory footprint down. Problems are particularly bad when compiled under SL5 – SL6 is much better.
  • LHCb – highlights that different sites have different policies, which makes the situation difficult to manage. IN2P3 explained why constraints on VMEM per job are required by sites.
  • ATLAS – physical memory requirement is still 2GB/core, but some jobs still need > 3.5GB VMEM.
Is some discussion needed at HEPiX level? It may be possible to set up cgroups to limit resources per job (needs SL6). What about allocating 2GB per slot – jobs that need more RAM request more slots? The experiments are concerned that the sites all follow different strategies. Sites see jobs that, if not capped, would cause disruption to the service. VOs would like few or no limits. Conclusion: look at PSS to see if it is a useful metric that can be used by the batch system; batch system experts are asked to look at this. Maarten proposes we set up a TWiki page to survey site memory-management policies.
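For reference, a minimal sketch of how PSS could be measured for a process, by summing the Pss fields in /proc/<pid>/smaps; Linux-specific and purely illustrative of the metric discussed, not a batch-system implementation:

<verbatim>
#!/usr/bin/env python
# Minimal sketch: sum the Pss fields from /proc/<pid>/smaps to get the
# proportional set size (PSS) of a process, the metric mentioned above.
# Linux-specific and illustrative only - not a batch-system implementation.
import os
import sys

def pss_kb(pid):
    """Return the total PSS of a process in kB, read from /proc/<pid>/smaps."""
    total = 0
    with open("/proc/%d/smaps" % pid) as f:
        for line in f:
            if line.startswith("Pss:"):
                total += int(line.split()[1])   # values are in kB
    return total

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print("PID %d PSS: %.1f MB" % (pid, pss_kb(pid) / 1024.0))
</verbatim>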