
Summary of GDB meeting, February 12, 2014 (CERN)

Agenda

https://indico.cern.ch/event/272618/

Welcome - M. Jouvin

See slides for future (pre-)GDB planning

  • Once per month, except July and August
  • CNAF Bologna will host the March GDB and a pre-GDB on batch systems (remote participation is possible)
  • Some topics for future meetings: cloud, data federations, volunteer computing, IPv6 management... please do not hesitate to suggest other topics

The next WLCG workshop will take place in Barcelona around the summer, dates to be decided; the following one will be co-located with CHEP 2015. Michel proposes 7-9 July, but some people would not be able to attend; P. Charpentier proposes to organise a Doodle poll.

See actions in progress and Ops Coordination news; some highlights:

  • More news on Site Nagios testing after July
  • Job handling with high memory requirements
  • GFAL/FTS3 client migration

See upcoming meetings and interesting workshops in the slides (Big Data, Federation Workshop, HEPiX, EGI Community Forum, ...)

HEP SW Collaboration - I.Bird

  • Performance is nowadays a limiting factor in experiment SW: more and better physics would be possible if the SW were optimal and more computing resources were available
  • Performance can be gained by exploiting multi-core, multi-socket and multi-node architectures (already in use), and also vectors, pipelining and parallelism (areas to be investigated now, which will require significant re-engineering, not only of the SW but also of frameworks and data structures)
  • Some libraries and toolkits, like Geant, will impact many more people if they can be improved, so these are the ones to be targeted first
  • Concurrency Forum started 2 years ago to discuss all these issues
  • HEP SW Collaboration to develop open scientific software packages (although it aims at working with other sciences and industrial partners)
    • Common core frameworks and tools
    • Gives credit and recognition to collaborators
    • Defines roadmaps
    • Elaborates proposals for acquiring resources (like Horizon 2020)
  • Status: ongoing discussions with other HEP labs (all computing project leaders and all existing SW collaborations), upcoming meeting during the Concurrency Forum Workshop in April, to be followed up by a meeting in the US at the end of the year

P. Charpentier asks what happens to the LCG Applications Area: will it disappear? I. Bird answers this hasn't been discussed; the scope of the initiative needs to be discussed first. Things like ROOT and GEANT will be discussed first, and then, if this is successful, maybe the Applications Area doesn't have to be maintained. M. Jouvin encourages everybody to join this discussion since it has benefits for the whole HEP community, not only CERN. I. Bird adds that the format of the collaboration is indeed still an open issue to be discussed.

IPv6 Update - D.Kelsey

  • F2F IPv6 workshop meeting in January 2014
    • The OpenStack Havana release doesn't work very well with IPv6; CERN is doing a lot of internal development to work around this. The future Neutron release should be better.
    • IPv6 deployment at CERN is progressing well; deployment should be campus-wide by March. DESY, KIT and Imperial also reported on their progress and problems deploying IPv6.
  • Testing news: T. Wildish (CMS) manages the testbed and a CHEP paper reporting on the results has been written, with a successful mesh of transfers. A. Dewhurst (ATLAS) has worked on modifying SE endpoints in AGIS to be IPv6 compliant, with tests via HammerCloud. Future steps are to identify use cases, understand which services need to be dual-stack (a quick dual-stack check is sketched after this list), and continue tests with PhEDEx, FTS, dCache and SRM in production.
  • Production tests need to be accepted by WLCG and the experiments, since availability may suffer, both due to instabilities and due to the time sysadmins will have to dedicate to IPv6 instead of other tasks. S. Campana will present this at a future WLCG MB.
  • See table showing IPv6 compliance on grid services, experiment applications, etc.
  • dCache has some issues with IPv6 (there is a technical clarification of the challenges in making GridFTP compatible with IPv6)
  • Future steps: a WLCG survey to understand when sites run out of IPv4 addresses and when they will be able to support IPv6; a pre-GDB with Tier 1s to plan the move to dual-stack and a production state
  • See next IPv6 meetings in the slides (phone conferences, F2F meeting at CERN, pre-GDB)
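
As a quick illustration of the dual-stack question above, here is a minimal Python sketch that checks whether a host publishes both IPv4 and IPv6 addresses in DNS, a first prerequisite for a dual-stack service. The hostname and port are placeholders, not real WLCG endpoints.

    # Minimal sketch: list the address families a host resolves to.
    # A dual-stack candidate should show both AF_INET (IPv4) and AF_INET6 (IPv6).
    import socket

    def address_families(host, port=443):  # placeholder host/port
        try:
            infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            return set()  # name does not resolve at all
        return {info[0] for info in infos}

    families = address_families("se.example.org")
    print("IPv4 (A record):   ", socket.AF_INET in families)
    print("IPv6 (AAAA record):", socket.AF_INET6 in families)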

I. Bird encourages sites to help with this task: this is a year in which there is still an opportunity to change things without breaking anything. M. Jouvin adds that this is a coordinated effort managed by the task force, and sites should collaborate with it. M. Schulz suggests joining forces between WLCG and HEPiX; D. Kelsey confirms that they are indeed working together.

Future of Scientific Linux in light of recent CentOS project changes - J.Polok and T.Oulevey

  • In 2014, CentOS and Red Hat joined forces.
  • Things that will not change: the CentOS Linux platform is not changing, though it will be more open. The development process remains the same, as does the sponsorship approach.
  • Things that will change: everything will be open source (build, test and delivery).
  • Impact on SLC5 and 6: each package has its own git repo; there are no source packages, everything is built from git.
  • CERN and Fermilab are discussing options: 1) rebuild from source, 2) create a CentOS variant (create a SIG), or 3) adopt the CentOS core (SL becomes an add-on repository).
  • Schedule: CentOS 7 Beta in preparation, RHEL7 production due in the summer, source RPMs not guaranteed after the summer

There is a technical discussion on source RPMs, whether spec files will be available, and how this could impact SLC. T. Oulevey explains that spec files will be available and there will be a mechanism to build source RPMs out of git, but this has an impact on the way SLC packages are currently built at CERN, and the current process will have to be changed. D. Kelsey asks for the timescale to take a decision; T. Oulevey says around 8 months. H. Meinhard adds that since there is a risk of not having an easy way to create source RPMs after the end of May, this decision cannot wait for the HEPiX meeting. T. Bell says that we have to understand the impact on SLC 5 and 6 very soon; other decisions can probably wait until HEPiX.

Actions in Progress

Handling of Jobs with High Memory Profiles - J.Templon

  • LHC jobs have high memory usage; the Nikhef limit is 4 GB.
  • The pvmem limit is defined at queue level; exceeding it gives the job an out-of-memory error that it can handle (see the sketch after this list)
  • If a job needs more memory, this can be requested in the JDL
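
To make the pvmem point concrete, here is a minimal sketch (an emulation, not the actual batch system mechanism) of what a per-process virtual memory cap looks like from inside a job: allocations beyond the limit fail with an error the job can handle, instead of the job being killed outright.

    # Emulate a pvmem-style cap with RLIMIT_AS (Linux/Unix only); in production
    # the limit would be set by the batch system on the queue, not by the job.
    import resource

    limit = 512 * 1024 * 1024  # hypothetical 512 MB address-space cap
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    try:
        data = bytearray(1024 * 1024 * 1024)  # try to allocate 1 GB
    except MemoryError:
        # The job gets an out-of-memory error it can react to: checkpoint,
        # clean up, or exit with a meaningful status instead of being killed.
        print("allocation refused by the memory limit; exiting cleanly")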

There is some technical discussion on the way physical memory can be limited, whether it actually matters at all to limit virtual memory, how this affects the performance of the job, etc. M. Litmaath comments that there are many reasons for sites to move away from Maui, and this seems to be another good one. The technical discussion will continue at the pre-GDB on batch systems next month.

Ops Coordination Report - S. Campana

  • gLExec deployment: the SAM test will become critical
  • perfSONAR deployment: 20 sites have not installed it and 20 sites run old versions
  • Tracking tools evolution: experiments miss many functionalities in JIRA
  • SHA-2 migration: new VOMS server host certificate, progress on providing VOMS-admin instance
  • Machine/Job features: prototype ready for bare metal
  • Middleware Readiness: a task force to ensure middleware is well verified before it is deployed widely at all sites
  • Enforcement of Baseline Versions: defined but not enforced; Pakiti is a useful tool
  • WMS decommissioning: the CMS WMS will be shut down at the end of April; WMS will no longer be a WLCG-supported service by then
  • Multi-core deployment: CMS and ATLAS have different usage
  • FTS3 deployment: there is consensus on the model. 2 or 3 servers in production share configuration, and each experiment distributes its load among the servers (a sketch follows this list)
  • Experiment Computing Commissioning: no common commissioning exercise needed, many things already in hands of task forces
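
On the FTS3 point, here is a minimal sketch of one possible way an experiment could spread its load over equivalent, identically-configured servers: hash the source/destination pair so each link consistently maps to one server. The endpoints are hypothetical, and this is an illustration, not the mechanism agreed in the task force.

    # Hypothetical load distribution across equivalent FTS3 servers:
    # hashing the link keeps all transfers for a given source/destination
    # pair on the same server while balancing links across servers.
    import hashlib

    FTS3_SERVERS = [  # placeholder endpoints, not real hostnames
        "https://fts3-a.example.org:8446",
        "https://fts3-b.example.org:8446",
        "https://fts3-c.example.org:8446",
    ]

    def pick_server(source_se, dest_se):
        digest = hashlib.md5(f"{source_se}->{dest_se}".encode()).hexdigest()
        return FTS3_SERVERS[int(digest, 16) % len(FTS3_SERVERS)]

    print(pick_server("srm://site1.example.org", "srm://site2.example.org"))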

SAM Test Scheduling - L. Magnoni

  • SAM Test functionality: grid services, job submission and WNs (not easy)
  • Job submission timeouts: SAM job tests may time out because the VO is out of its share
  • Timeout analysis shows most happen on the WMS side
  • Decoupling the job submission role from the WN tests would improve things, and proper timeout configuration would reduce the effect, but the WMS is to be decommissioned
  • SAM WN test submission via other frameworks: HammerCloud is the first candidate, but this is long term

There is a technical discussion on the timeouts, with remarks on the reported behaviour that the WMS waits 45 minutes to dispatch the job (the default time is 2h in the WMS). M. Litmaath explains that if this is the case, it means the WMS doesn't find a resource in the Information System and cannot do the matchmaking. M. Schulz reminds everyone that we are migrating away from the WMS and that effort should be put into future post-WMS solutions. M. Jouvin adds that this issue may occur as well with Condor-G submission: there may be timeouts if the job cannot be pushed to the site.

The new WLCG Transfer dashboard - A.Beche

  • Federated approach: FTS, ALICE, FAX and AAA (4 different UIs) to retrieve the information
  • Data volume: 3 different schemas in the DB
  • Two kinds of data: raw and statistics
  • Future work: better integration of the 4 different UIs
  • New architecture providing greater flexibility (easier to use new technology)

Data Preservation Update - J.Shiers

  • Executive summary: Things are going well, and will get better.
  • Report on workshop on "full cost of curation".
    • At the level of around 5 EB by the mid-2030s
    • Estimated at 2 M$/yr (ie a few percent of WLCG)
  • Converged on a model for required services, all of which exist in some form already. Includes 'open data' which simplifies things.
  • These services and necessary opportunities can be mapped onto projects (eg DPHEP portal).
  • Relevant H2020 calls are summarised
  • DP is part of a worldwide movement where HEP should be represented
  • DP must be embedded in strategic plans (eg the "Medium Term Plan")
  • An RDA WG on bit preservation will be launched
  • List of actions to be taken, including RDA, pilots, proposals, collaborations

WLCG Monitoring Consolidation

  • objective - reduce costs, do the same thing with fewer people.
  • project has entered the deployment phase.
    • necessary tasks have been identified
  • now have to evolve the system into a production state
    • will maintain in parallel with existing system.
    • prototype is already deployed
  • 4 main areas of work
    • app support - keep things going, new functionality
    • operations - reduce cost, handle SAM/EGI, use the agile infrastructure
    • merging apps - various merge projects, eg SAM & SSB
      • now share storage and visualisation
      • can also handle capacities/pledges and glue validator
    • tech evaluations
      • mostly finished, but will be kept up to date

Jeff - what's the status of Nagios at sites?

Pablo - PIC will provide something which will feed UI output into a local site Nagios. Prototype end of April

Michel suggests Jeff contact PIC with requirements

LHCOPN/ONE Evolution Workshop

  • https://indico.cern.ch/event/289679/
  • WLCG - network is stable and scalable
  • main problem is connectivity to East Europe, Asia & Africa
  • Dilution of the tier system produces more interconnected traffic flows, including WAN access and data placement. T2s are converging on T1s, so they need more connectivity.
  • List of actions relating to LHCOPN and LHCONE
    • evolutionary change for LHCONE, related to support and monitoring
  • no clear plan for point-to-point connections

Comment on communication between the network community and WLCG, and the possibility of building something not explicitly suited to WLCG.

Ian - don't see an immediate problem that we are pushing providers to resolve.

Shawn - we are having the necessary conversations, need to allow the providers to give us the same thing cheaper.

Ian - WLCG is not a free-for-all in network R&D activities. We need to be clear about the problem.

Markus - we're more into over-provisioning than complexity. Look at the 100 Gb links in the USA.

perfSONAR Update

  • 85% of WLCG sites have installed perfSONAR-PS; there are issues to resolve at a significant fraction of them.
  • modular dashboard obsolete.
  • The full mesh config was a challenge; a single shared config is used for synchronisation (a sketch of consuming such a config follows this list)
  • Tests are running, giving throughput and latency (with packet loss)
  • TODO
    • complete installation across WLCG
    • dashboard dev
    • config management
    • alerting is not there yet (difficult to know where to send alerts before the problem has been debugged)
    • application integration
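
As an illustration of the shared-configuration approach mentioned in the list above, the sketch below reads a single shared mesh file and derives the test partners for one instance. The URL and the flat "members" schema are assumptions for illustration; the real perfSONAR mesh configuration format is richer.

    # Minimal sketch: every perfSONAR instance reads the same shared JSON
    # config, so the full mesh stays in sync without per-site edits.
    import json
    import urllib.request

    MESH_URL = "https://example.org/wlcg-mesh.json"  # placeholder location

    def test_partners(my_host):
        with urllib.request.urlopen(MESH_URL) as response:
            mesh = json.load(response)  # assumed schema: {"members": [...]}
        return [host for host in mesh["members"] if host != my_host]

    print(test_partners("psonar.example.org"))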

-- MariaALANDESPRADILLO - 12 Feb 2014
