
Summary of GDB meeting, February 12, 2014 (CERN)

Agenda

https://indico.cern.ch/event/272618/

Welcome - M. Jouvin

See slides for future (pre-)GDB planning

  • Once per month, except July and August
  • CNAF Bologna will host the March GDB and a pre-GDB on batch systems (remote participation is possible)
  • Some topics for future meetings: cloud, data federations, volunteer computing, IPv6 management... please, do not hesitate to suggest other topics

The next WLCG workshop will be held in Barcelona around the summer, dates to be decided; the following one is to be co-located with CHEP 2015. Michel proposes 7-9 July, but some people will not be able to attend; P. Charpentier proposes to organise a Doodle poll.

See the Actions in Progress and Ops Coord reports later in the agenda; some highlights:

  • More news on Site Nagios testing after July
  • Job handling with high memory requirements
  • GFAL/FTS3 client migration

See upcoming meetings and interesting workshops in the slides (Big Data, Federation Workshop, HEPiX, EGI Community Forum, ...)

HEP SW Collaboration - I.Bird

Computing will not scale easily to post LS1 LHC conditions (higher pileup, reduced bunch spacing): need to radically evolve the SW

  • Performance is nowadays a limiting factor in experiment SW: more and better physics would be possible if optimal SW and more computing resources were available
  • Performance can be gained by exploiting multi-core, multi-socket and multi-node architectures (already in use), and also vectorisation, pipelining and parallelism (areas to be investigated now; this will require significant re-engineering, not only of the SW but also of frameworks and data structures)
  • Some libraries and toolkits, like Geant, will impact many more people if they can be improved, so these are the ones to be targeted first
  • Concurrency Forum started 2 years ago to discuss all these issues
  • HEP SW Collaboration to develop open scientific software packages (while also aiming to work with other sciences and industrial partners)
    • Common core frameworks and tools
    • Gives credit and recognition to collaborators
    • Defines roadmaps
    • Elaborates proposals for acquiring resources (like Horizon 2020)
  • Status: ongoing discussions with other HEP labs (all computing project leaders and all existing SW collaborations); an upcoming meeting during the Concurrency Forum Workshop in April, to be followed by a meeting in the US at the end of the year

GEANT and ROOT are two key tools for HEP: try to achieve more commonality between them

  • Need a more modular infrastructure: a lot of work to achieve it!

Kickoff meeting, April 3-4, is really about the organization of the collaboration

  • The 3 days before are the technical discussions: Concurrency Workshop, March 30-April 2

Discussion

  • Philippe: what is the intent regarding LCG Application Area? Will it be replaced by the new collaboration?
    • Ian: too early to say exactly... but this may be one possible outcome... if it makes sense!
  • The people concerned are not in the GDB, but GDB members must be ambassadors for the initiative, convincing the appropriate people that this is an important and open initiative!
    • Invitation has been widely circulated...

IPv6 Update - D.Kelsey

CERN news

  • Progressing well with IPv6 deployment: firewall and DNS ready, campus-wide deployment foreseen in March
    • Still some issues with DHCPv6
  • OpenStack Havana still does not handle IPv6 well: CERN is developing its own control

Other sites

  • DESY ready for deployment
  • KIT has a test cluster
  • Imperial very active with dual-stack testing

Applications

  • Application survey making progress; a few critical applications still without IPv6 support: dCache, EOS, Frontier (a basic dual-stack connectivity check is sketched after this list)
  • Xrootd v4 (required for IPv6 support) still not available?
  • dCache: 2.6 not IPv6-ready, 2.8 successfully tested at NDGF
    • Caveat: gridftp1 through FTP door doesn't work
    • gridftp is the difficult part of IPv6 compliance: other protocols (e.g. http) are easier
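
As an aside, a minimal Python sketch of the kind of basic dual-stack check behind such surveys (the hostname and port below are placeholders, not real endpoints):

    import socket

    def reachable(host, port, family):
        # Try every address returned for this family until one accepts a TCP connection.
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except OSError:
            return False                      # e.g. no AAAA record for AF_INET6
        for af, socktype, proto, _, addr in infos:
            try:
                with socket.socket(af, socktype, proto) as s:
                    s.settimeout(5)
                    s.connect(addr)           # succeeds only if the service answers on this family
                    return True
            except OSError:
                continue
        return False

    host, port = "se.example.org", 2811       # hypothetical gridftp endpoint
    print("IPv4 reachable:", reachable(host, port, socket.AF_INET))
    print("IPv6 reachable:", reachable(host, port, socket.AF_INET6))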

Transfer testbed

  • The mesh of transfers was successful before CHEP, but several unrelated issues appeared afterwards as people were no longer looking at it
    • Not IPv6 issues per se

Experiment readiness

  • ATLAS (A. Dewhurst)
    • AGIS: flags added for IPv6-ready storage endpoints
    • Frontier is a problem: 2.8 not supporting IPv6, 3.1 should be ok

Next steps

  • PhEDEx/SRM/dCache tests on testbed
  • Move to use some production endpoints at volunteer sites
    • Need clearly defined use cases from experiments about the IPv6-only clients
    • Some sites are ready to participate, but only if it is accepted by experiments and funding agencies that availability may suffer (directly or indirectly): similar to MW readiness testing, work in progress by Simone, to be discussed at the MB
  • Pre-GDB on IPv6 management on June 10: aimed primarily at T1s but others welcome
    • Encourage IPv6 support and move to dual-stack WLCG services
    • Try to do as much as possible before restart of data taking
  • Also monthly meetings by Vidyo (13/2, 13/3) and a 2-day F2F meeting in April (post-GDB)

I. Bird encourages sites to help with this task. This is a year in which there is still an opportunity to make changes without breaking things.

  • M.Jouvin adds that this is a coordinated effort to be managed by this task force, sites should collaborate with the task force.
  • M. Schulz suggests joining forces between WLCG and HEPiX. D. Kelsey confirms that they are indeed working together.

Future of Scientific Linux in light of recent CentOS project changes - J.Polok and T.Oulevey

At the beginning of January, CentOS announced that it is joining forces with RH as part of RH's open source team

  • Governance published
  • CentOS remains a separate group and product at RH; RH remains the upstream vendor
  • Things that will not change: the CentOS Linux platform itself, although it will be more open. The development process remains the same, as does the sponsorship approach.

Key changes

  • RH is offering to sponsor the build system and content delivery network
  • Everything will be open source (build, test and delivery).
  • git.centos.org remains the central place for CentOS developments and is open to everybody
  • Possibility of SIGs (Special Interest Groups)

Impact on SL (SL5/6)

  • There may no longer be SRPMs: direct use of the Git sources instead, not really a problem
  • Hope to make the changes transparent

SL(C)7: 3 options discussed between CERN and FERMILAB

  • Continue to build from source as previously
  • Create a Scientific CentOS SIG
    • SIG board defines the rules and can diverge from core
    • Right to override packages, use specific logos...
  • Use CentOS7 as SL7: in fact SL now has very few differences from CentOS (in particular OpenAFS)
    • The main issue was the openness of the process, but this will improve greatly with the recent changes
  • No decision yet
  • RHEL7 final expected this summer, CentOS7 beta soon

There is a technical discussion on source rpms and whether spec files would be available and how this could impact SLC.

  • T.Oulevey explains that spec files will be available and there will be a mechanism to build source rpms out of git, but this has an impact on the way things are currently done at CERN to build SLC packages and the current process will have to be changed.

D. Kelsey asks for the timescale to take a decision.

  • T. Oulevey says a few months.
  • H. Meinhard adds that, since there is a risk of not having an easy way to create source RPMs after the end of the summer, the decision cannot wait for the HEPiX meeting.
  • H. Meinhard also recalls that the FNAL/CERN common initiative 10 years ago had a very positive impact on the community and its computing, and that, as a community, we should again support a common decision by these two key players
  • T. Bell says that we have to understand very soon the impact on SLC5 and 6; other decisions can probably wait until HEPiX.
  • Agreement to share information (if any) at the next GDBs to help prepare the final decision

Actions in Progress

Handling of Jobs with High Memory Profiles - J.Templon

Many jobs in all LHC experiments require up to 4 GB of vmem (see plots based on NIKHEF statistics)

  • Mainly ALICE has a significant number of jobs requiring more than 4 GB of vmem
  • NIKHEF uses a pvmem limit of 4 GB

Why use PVMEM

  • Translates into a ulimit in the process (see the sketch after this list)
  • The job gets an out-of-memory error rather than a kill: this gives it a chance to handle the condition
  • Doesn't take into account overhead outside the application space (wrappers...)
  • Limit set at the queue level
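
As a rough illustration of what the pvmem limit means inside the job (a sketch only, not NIKHEF's actual configuration): the limit ends up as an address-space ulimit, so an allocation beyond it fails instead of the job being killed, e.g. in Python:

    import resource

    FOUR_GB = 4 * 1024**3

    # What a pvmem-style limit effectively does: cap the virtual address space
    # of the process (RLIMIT_AS), as the batch system or a job wrapper would.
    resource.setrlimit(resource.RLIMIT_AS, (FOUR_GB, FOUR_GB))

    try:
        buf = bytearray(5 * 1024**3)   # request 5 GB: exceeds the 4 GB limit
    except MemoryError:
        # The allocation is refused instead of the job being killed outright,
        # giving the application a chance to handle the condition.
        print("vmem limit hit, cleaning up gracefully")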

If a user needs more than 4 GB, there are several ways to request it

  • In particular through the JDL... but in fact the result is not necessarily consistent (see slides)
  • The user asks for mem rather than vmem: not necessarily a major problem as WNs now have a good amount of memory

Big Damned Jobs vs. many processes

  • MAUI doesn't properly understand a request for 32 GB and 8 cores: understood as 32 GB per core (256 GB!)
  • Not clear how to work around it, or whether it is worth the effort

Discussion on the way physical memory can be limited, whether it actually matters at all to limit virtual memory, how this affects the performance of the job, etc.

  • In particular, cgroups should be investigated as a possible way to limit physical memory (the limit that really matters for performance) only when there is contention for it, rather than all the time (which is the reason such a limit is not used today); see the sketch below
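
A minimal sketch of the cgroups idea mentioned above (assumptions: cgroup v1 memory controller mounted under /sys/fs/cgroup/memory, one cgroup per job created by the batch wrapper; the hierarchy name and job id are hypothetical):

    import os

    CGROUP_ROOT = "/sys/fs/cgroup/memory/wlcg_jobs"   # assumed mount point / hierarchy
    JOB_ID = "job12345"                               # hypothetical job identifier
    LIMIT_BYTES = 4 * 1024**3                         # 4 GB of physical memory

    job_cg = os.path.join(CGROUP_ROOT, JOB_ID)
    os.makedirs(job_cg, exist_ok=True)

    # The soft limit is only enforced when the host runs short of memory, which
    # matches the idea of constraining physical memory under contention only.
    with open(os.path.join(job_cg, "memory.soft_limit_in_bytes"), "w") as f:
        f.write(str(LIMIT_BYTES))

    # Attach the current process (e.g. the job wrapper) to the cgroup.
    with open(os.path.join(job_cg, "tasks"), "w") as f:
        f.write(str(os.getpid()))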

MAUI limitations

  • M. Litmaath comments that there are many reasons for sites to move away from MAUI, so this seems to be yet another good reason to do so.
  • This will be discussed at the pre-GDB on batch systems next month.

Ops Coordination Report - S. Campana

~40 people, good representation of both sites and experiments

  • Lots of discussions

glexec

  • CMS SAM tests not yet critical, due to the SAM test scheduling issue
  • 20 sites still haven't deployed it, but these are mostly not CMS sites

perfSONAR: the core WLCG tool for network monitoring

  • Still 20 sites without it
  • 20 sites with obsolete/old versions

Tracking tools

  • Savannah-to-JIRA migration in GGUS still not done because of missing features in JIRA: recent agreement that they are not critical and we will live without them
    • Fixing JIRA is out of the scope of the TF

SHA-2

  • Progress in VOMS-admin: experiments need to adapt to it
  • The campaign to update the VOMS server host certificates will start soon
  • When SHA-2 migration is over, will tackle RFC proxy compliance

Machine/job features

  • Prototype available for testing; testing has started at some sites (a usage sketch follows this list)
  • Good collaboration with OSG to integrate bi-directional communication
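
For illustration, a sketch of how a job payload might consume these features, assuming the prototype follows the HEPiX approach of one small text file per key in directories pointed to by $MACHINEFEATURES and $JOBFEATURES (the key names used here are illustrative):

    import os

    def read_feature(env_var, key):
        # Each feature is a small text file in the directory named by the environment variable.
        base = os.environ.get(env_var)
        if not base:
            return None
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except OSError:
            return None

    hs06 = read_feature("MACHINEFEATURES", "hs06")           # machine power (illustrative key)
    mem_limit = read_feature("JOBFEATURES", "mem_limit_MB")  # per-job memory limit (illustrative key)
    print("HS06:", hs06, "- memory limit (MB):", mem_limit)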

Middleware Readiness: WG to coordinate activities around ensuring middleware is well verified before it is deployed widely at all sites

  • Long-lived WG rather than a defined-term TF
  • Use experiment testing framework, e.g. HammerCloud
  • Sites will deploy "testing production instances": monitoring will distinguish them from normal production instances
  • A demonstrator is being set up in ATLAS
  • Discussing how to reward sites: discussion at next MB

Baseline versions

  • Enforce defined baseline versions: need a tool for reporting sites failing the baseline versions
  • Proposal to look at Pakiti as a way to report baseline compliance

WMS decommissioning

  • No change in plans: CMS/LHCb WMS will be shut down end of April

Multicore deployment

  • ATLAS and CMS foresee different usage of multicore job slots: TF is not discussing which model is the best one
    • ATLAS: both single core and multi-core jobs, sites in charge of doing the backfilling
    • CMS: always using multicore slots, doing the filling optimization
    • LHCb working on an approach similar to CMS
  • TF goal: find a way to handle both approaches
  • MAUI is a great concern regarding multicore job support

FTS3: consensus on development model

  • 3 to 4 servers at maximum: shared configuration but no state sharing/synchronization

Common Commissioning Exercise (STEP14): consensus that there is no need for it

  • Production activity is at a sufficient level
  • Specific commissioning activities coordinated by TFs, Ops Coord: concurrent testing for some specific cases still remains an option if possible and necessary

SAM Test Scheduling - L. Magnoni

The SAM test framework is based on Nagios for check scheduling

3 categories of tests

  • Public grid services: FTS, SRM, LFC...
  • Job submission: ability to execute a job in a given time window
  • WN: ability to use various services from a WN

Currently WMS is used to submit job submission tests and WN tests

  • Move in progress to Condor-G submission
  • WN tests are the difficult part, as the job payload has to bundle the Nagios aspects, an MTA, ...

Test timeouts: the WN test needs a specific group/role, inherited from the job submission test, and may time out because the VO is out of its share

  • Job submission may thus time-out too: will result in MISSING status rather than critical
  • One timeout for each job state: in fact 4 different timeouts
  • Timeouts represent 1/3 to 1/2 of the job submission test failures for CMS and ATLAS
  • Most of the timeouts happen in the waiting state at the WMS (~80% for ATLAS and CMS): should probably increase the timeout at the WMS (currently 45 min)
    • Very low level of timeouts at site level
    • Waiting state in WMS is purely related to match-making operations
    • An overloaded CE sets its status to Draining where the requirement is for Production state
  • Concentrate effort on issues independent from the WMS itself, as the main goal is to move asap to Condor-G submission
    • If the timeout in the waiting state is related to the CREAM CE throttling mechanism, this will also affect Condor-G or direct submission
    • 45 min is clearly too short a timeout for this particular situation: it may take a few hours to recover from an overload

In the future we may want to decouple job submission tests from WN tests

  • glexec for WN tests
  • Submit with experiment frameworks: long term...
    • HC is a good candidate to ship WN tests but not Job Submission tests
  • Need to keep a generic framework with not too many differences from what we have today...
    • Looking at Zabbix as an alternative, more scalable scheduler
    • Having a different framework for WN tests and others may add dependencies between different infrastructures: not necessarily desirable

The new WLCG Transfer dashboard - A.Beche

The original dashboard was designed for FTS: difficult to incorporate data for other transfer methods like Xrootd

  • Some specific data not relevant to FTS
  • Data duplication: tool specific dashboards + WLCG one

Current situation

  • 4 dashboards with 4 different UIs: FTS, Alice, FAX and AAA
  • Data Volume: 3 different schemas in the DB (AAA and FAX use the same schema)
  • Two kinds of data: raw and statistics
  • Data retention policy depends on the tool: FAX/AAA data is kept a long time to feed the data popularity calculation

The new architecture will be an aggregator on top of the tool-specific dashboards

  • No data duplication
  • Access to tool specific data through the tool specific dashboard
  • Better integration of the 4 different UIs
    • New GUI views: map view
  • Easier use of new technologies

Data Preservation Update - J.Shiers

Things are going well... and will probably get better!!!

Full Costs of Curation workshop: HEP context, focus on medium-term plans to ensure continuity with the next experiments

  • 100 PB today, 5 EB in 15 years
  • Start with a 10 PB archive now, increase every year with a flat budget
  • Estimated at ~2 M$/year: affordable compared to experiment costs and WLCG costs (~100 M€/year)
  • A team of 4 people could "make significant progress"
  • Need to get sites other than CERN involved in the budget effort/forecast; need support from the RRB

DPHEP Portal model built on top of

  • Digital library tools and services (Invenio, INSPIRE, CDS...): already well organized collaboration taking care of backward compatibility when there is a major technical evolution
    • Well funded in FP7 and will continue to be funded in H2020
  • Sustainable SW coupled with virtualization techniques: mostly exists (e.g. CernVM and CVMFS) with sustainability plans
    • Sustainable bit preservation as a service is also fundable: based on RDA WG RFCs for interfaces
  • Open data/access: more standard description of events....
    • Well identified goal for H2020
  • DPHEP portal built in collaboration with other disciplines and a proposal is being prepared
  • These services and the necessary opportunities can be mapped onto projects (e.g. the DPHEP portal).
    • Relevant H2020 calls are summarised in slides

Next actions

  • Launch (and drive) a RDA WG on bit preservation at next RDA meeting in March
  • Prepare an H2020 project on bit preservation for the sept. call: funding to start in 2015
  • Strengthen multi-directional collaboration (APA, SCIDIP-ES, EUDAT)
  • Agree on functionality for the DPHEP portal
  • Do a demonstrator of using CERNVM to preserve old environment for current experiments (CMS)
  • (Self-)Audit, reusing OAIS, ISO16363
  • Establish core team of ~4 people this year

SWOT analysis (see slides)

  • Strengths and opportunities clear
  • Weaknesses: manpower needed, but there are strong support statements from various bodies plus H2020 funding opportunities
  • Threats: commitment from funding agencies, non-CERN experiments/sites agreement

Conclusion

  • DP is part of a worldwide movement where HEP should be represented
  • DP must be embedded in strategic plans (e.g. the "Medium Term Plan")

WLCG Monitoring Consolidation

A lot of work done, good feedback from experiments but site feedback weaker

  • Prototype phase completed with a detailed report
  • Now entering deployment phase
    • The prototype is already deployed and will evolve to become the production system
    • It will be maintained in parallel with the existing system.

4 different sets of tasks

  • Underlying application support (job, transfer...): basically support of the current system
  • Running the services: main action was the move of services to Agile Infrastructure, almost done
    • Also the move of SAM EGI to GRNET
  • Merging applications with similar functionalities or significant overlap: reduce the number of apps to maintain
    • Merging testing frameworks (Nagios and HC): still in discussion
    • SAM & SSB: very good progress, in particular for storage and visualization
    • Keep the two UIs' look and feel (SSB and SAM) over the same storage/visualization framework
  • Technology evaluation: a good part of the work done during the evaluation phase, close collaboration with AI monitoring

Next steps

  • Downtime in avail/reliability reports
  • Report generation

Jeff - what's the status of integration with Nagios at sites (Nagios used as a "UI"), see September discussion

  • Pablo - PIC will provide something which will feed UI output into a local site Nagios. Prototype end of April
  • PIC and NIKHEF are encouraged to contact each other to ensure that the NIKHEF use case is covered by the envisioned plugin: report at a future GDB if needed

LHCOPN/ONE Evolution Workshop

Very good participation

  • ~60 people
  • ~11 NRENs
  • Europe, N.A. and Asia

WLCG viewpoint expressed by I. Bird

  • The network has been working well and LHC needs are expected to fit within the technology evolution
  • Network is central to the evolution of WLCG computing
  • LHCOPN must remain a core component to connect T0 and T1s
  • Main issue is connectivity to Asia and Africa

Experiments usage may differ but conclusions are the same

  • WAN more and more important
  • Reduced difference between T1s and T2s
    • This produces more interconnected traffic flows, including WAN access and data placement
  • Need for more connectivity at T2s

Sites: diverse situations, in particular with respect to LHCONE

  • T1s happy with LHCOPN
  • T2s will increase their connectivity: several in the US are planning 100G for WAN
  • Main demands
    • Better network monitoring
    • Better LHCONE operations, in particular for troubleshooting problems

LHCONE P2P service: NSPs should be ready soon to provide a production service

  • CMS may exploit the service: early prototype being developed for PHEDEX
  • Sites don't have a clear plan/need for it: the usual over-provisioning vs. complexity balance
  • Billing is very unclear
  • NSPs interested to get WLCG testing the service

Actions

  • Keep LHCOPN as it is: increase bandwidth if really necessary and affordable
  • Several sites would like to move T1-T1 traffic to LHCONE, as the topology may be more optimal than LHCOPN (routing through CERN)
  • Several sites are planning to use LHCONE as their backup for LHCOPN: it will often provide better performance than a slow backup link
  • LHCONE L3VPN
    • Improve support, in particular with a global tracking system and a cross-organization team
    • Improve monitoring: remains based on perfSONAR
    • Improve usage efficiency of transatlantic OPN/ONE resources
    • Sites asked to announce only LHC prefixes: a requirement for bypassing firewalls, one of the key features of LHCONE (science data only)
  • LHCONE P2P: still lacking interested sites and developers, lack of time to really discuss further plans
  • To be followed up at the next meeting in April (Rome)

Comment on the communication between the network community and WLCG, and on the possibility of building something not explicitly suited to WLCG.

  • Ian: don't see an immediate problem that we are pushing providers to resolve.
  • Shawn: we are having contacts with Network Providers and network experts, need to express more clearly what we need
  • Michel: as Ian said, it is not clear that we really need something requiring specific R&D. Maybe we should reverse the perspective and see what we can do, with the limited manpower we have, to help them in their R&D activities, even though we don't see a clear use case in WLCG
  • Markus: over-provisioning remains an alternative to complexity. Look at 100Gb links in USA...

perfSONAR Update

Current situation

  • perfSONAR PS 3.3.2 released Feb. 3: bug fixes, improvements
  • Modular dashboard prototype orphaned but still in GitHub
  • Deployed in 85% of sites

Evaluating a new dashboard solution: MaDDash by ESnet

  • A product of the perfSONAR-PS project
  • Missing the ability to edit the mesh, and metric numbers are not displayed at the highest level
  • Drill-down capabilities to get more details
  • Exploring OMD to complement MaDDash to do the perfSONAR service testing
    • Based on Nagios + Check_MK
    • Very promising to monitor PS services, still missing the capability to monitor the mesh configuration
    • Graphs created automatically with checks

Mesh configuration status

  • 3.3 has all the functionality for the mesh built in
  • Plan to automate mesh generation from GOCDB or similar sources: currently manual and labor-intensive (see the sketch below)
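
A toy sketch of what automated mesh generation could look like (the host list would in reality be looked up per site in GOCDB or OIM, and the output structure below is purely illustrative, not the actual perfSONAR-PS mesh-configuration schema):

    import itertools
    import json

    # In reality these hosts would be extracted per site from GOCDB/OIM.
    ps_hosts = [
        "ps-bandwidth.site-a.example",
        "ps-bandwidth.site-b.example",
        "ps-bandwidth.site-c.example",
    ]

    # Build an all-pairs (full mesh) bandwidth test definition.
    mesh = {
        "description": "WLCG full mesh (illustrative structure)",
        "tests": [{"type": "bwctl", "members": [a, b]}
                  for a, b in itertools.combinations(ps_hosts, 2)],
    }

    print(json.dumps(mesh, indent=2))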

WLCG deployment details

  • Sites organized in regions
  • Every site expected to deploy perfSONAR
  • Every site expected to check PS results every day: need to reach complete coverage

Debugging network problems: still triggered by a human!

  • PS clearly helps but not necessarily enough: need to correlate with other facts/metrics at site

Known issues not yet addressed

  • 10G vs. 1 G (vs. 100G?)
  • Improve end-to-end path monitoring
  • Alerting: difficult to converge on what, when, who
    • Difficult to know where to send alerts before the problem has been debugged
  • IPv6 monitoring: preliminary work at Imperial

Network metrics collected by PS may be consumed for other uses

  • Characterizing paths with a cost to optimize placement decisions (ANSE project)
  • Select data sources based on network connectivity

OSG proposing to host a network service for WLCG

  • Need to revise current metrics: are they appropriate? Are they enough?

PS-PS and PS-MDM should converge eventually

  • Not clear if PS-MDM is still developed
  • PS-PS will incorporate some views provided by PS-MDM

-- MariaAlandesPradillo - 12 Feb 2014
