WLCG Operations Planning - April 17, 2014 - minutes

Agenda

Attendance

  • Local: Andrea Sciabà (chair), Nicolò Magini (secretary), Maarten Litmaath (ALICE), Stefan Roiser (LHCb), Marian Babik, Felix Lee (ASGC), Vincent Brillault, Pablo Saiz, Marcin Blaszczyk, Maria Alandes, Maria Dimou, Alessandro Di Girolamo (ATLAS), Hassen Riahi, Alberto Aimar, Domenico Giordano, Markus Schulz
  • Remote: Yury Lazin (RRC-KI-T1), Renaud Vernet, Shawn McKee, Vanessa Hamar (IN2P3-CC), Frederique Chollet (IN2P3-CC), Christoph Wissing (CMS), Thomas Hartmann (KIT), Burt Holzman (FNAL), Alessandro Cavalli (CNAF), Alessandra Forti, Valery Mitsyn (JINR-T1)

Agenda items

News

  • WLCG workshop in Barcelona (7-9 July)
    • registration is now open (see Indico page)! Registration fee: EUR 115 (+ 40 for social dinner), deadline June 9
    • Agenda still to be defined: there is a rough draft, discussed at the last GDB, any input and suggestion is very welcome
  • Task forces
    • Two task forces to be evaluated for closure: xrootd and perfSONAR. A new task force on network monitoring is under consideration.

Experiments Plans

ALICE

  • KIT jobs behavior:
    • The cause of the overload of the KIT firewall and the OPN link to CERN has been found:
      • for analysis jobs the location of the WN was not propagated to the central services
      • the client then ended up using not only the local replica, but close replicas as well
    • Over the weekend a patch was developed for TAlienFile in ROOT
      • it now sends the location information explicitly
    • The patch first became available in Tuesday's analysis tag and started being used by a few trains and users
      • the results were very good
    • The expectation is for the vast majority of analysis jobs to be using the patched code after a few days
      • this will be monitored
    • The job cap should then be increased again gradually over the coming days
    • Our thanks to the teams at KIT for their efforts, and to the other experiments for their patience!

  • Plans for the next 3 months:
    • First, continuous activities in preparation for Quark Matter 2014 (May 19-24, GSI Darmstadt).
    • Then reprocessing RAW data with highest priority.
      • Starting from 2011 (p+p) and some selected periods (Pass2) from 2012.
    • Accompanied by the associated MC for these periods.
      • Anchored to the new calibration resulting from the preceding RAW data pass.
    • Pb-Pb reprocessing is not in the plans at this moment.
    • User analysis should be less intense after QM'14.
    • CERN:
      • Conclusion of SLC6 job efficiency investigations.
      • Increased use of the Agile Infrastructure.

  • Alessandro comments that the KIT incident is an opportunity to review whether the monitoring covers all data access (transfers, remote and local jobs) by all VOs, in order to understand the load on a storage system. Collect links to internal VO monitoring pages; check what is in the common Dashboard monitoring and what can be integrated. Maarten mentions that for ALICE remote access is monitored in MonALISA. Pablo comments that Dashboard has no resources right now for local access monitoring, but it can be added to the wish list. The discussion will continue offline on the WLCG Ops Coord monitoring mailing list.
  • Alessandro asks about the metrics AliEn uses to define two sites as 'close' for remote processing; according to Maarten and Pablo, they include RTT measurements, the domain name, and network tests between the VOBOXes.
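To illustrate the kind of metric discussed above, here is a hypothetical sketch of a closeness score combining RTT measurements with domain matching. This is not AliEn's actual code; the function name, weights, and numbers are illustrative assumptions only.

```python
# Hypothetical closeness metric between two sites, combining the two
# inputs mentioned in the discussion: measured RTT between the VOBOXes
# and whether the sites share a DNS domain. Lower score = closer.
# (The real AliEn logic may combine these quite differently.)

def closeness(site_a, site_b, rtt_ms):
    """site_a/site_b: dicts with a 'domain' key; rtt_ms: measured
    round-trip time between their VOBOXes, in milliseconds."""
    score = rtt_ms
    # Assume sites sharing a DNS domain sit on the same campus/LAN.
    if site_a["domain"] == site_b["domain"]:
        score *= 0.1
    return score

# Illustrative example: two services in the same domain vs. a remote one.
kit_ce = {"domain": "kit.edu"}
kit_se = {"domain": "kit.edu"}
cern_se = {"domain": "cern.ch"}

print(closeness(kit_ce, kit_se, 1.0) < closeness(kit_ce, cern_se, 15.0))  # True
```

With such a metric, a client that knows the worker node's location can prefer the local replica, which is exactly the information the TAlienFile patch now propagates.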

ATLAS

  • Alessandro presents the status and plans of ATLAS for the next months: main points are new Tier-0, new production system, Rucio migration, database access, request to deploy FAX xrootd and HTTP/WebDAV at all sites, multicore. See slides for details.

  • Alessandro clarifies that the SAM tests are not affected by the Rucio migration, because they don't interact with the catalogs, only the storage.
  • Alessandro clarifies that the timeline for sites to enable multicore for ATLAS is NOW, if the sites can perform dynamic provisioning. Sites that cannot should contact ATLAS central operations (Alessandra Forti) and discuss on a case-by-case basis.
  • Alessandro explains that ~90% of the sites have joined FAX; 2 T1s (NDGF and IN2P3) and a few T2s are missing. 85% of the sites have enabled HTTP/WebDAV for the Rucio renaming, but the service is not yet production-quality for remote access at all sites (e.g. EOS), since the requirements are different.
  • Stefan asks about the reasoning behind the request for a dedicated LSF cluster for the ATLAS Tier-0, since this will introduce a partitioning of the resources. Alessandro answers that during 2012 data taking there were 10 ATLAS alarms for LSF, where the problems were either not identified or were caused by other users; according to IT-PES, a smaller instance can cope better with the load. CMS is instead moving to AI for the Tier-0, possibly introducing partitioning in a different way. The discussion is adjourned to the next meeting since no Tier-0 representative is attending the current meeting.
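The Rucio renaming mentioned above is typically performed with the WebDAV MOVE method, which renames a file server-side without copying data. A minimal sketch of what such a request looks like, with hypothetical host and paths (a real deployment would also need X.509 proxy authentication, which is omitted here):

```python
def webdav_rename(host, old_path, new_path):
    """Build the request line and headers for a WebDAV MOVE, the HTTP
    method used for server-side renames. Host and paths are hypothetical
    examples, not real endpoints."""
    return {
        "method": "MOVE",
        "url": "https://%s%s" % (host, old_path),
        "headers": {
            # Per RFC 4918, the new name goes in the Destination header.
            "Destination": "https://%s%s" % (host, new_path),
            "Overwrite": "F",  # refuse to clobber an existing file
        },
    }

req = webdav_rename(
    "se.example.org:443",
    "/dpm/example.org/atlas/old_name",
    "/dpm/example.org/atlas/rucio/new_name",
)
print(req["method"], req["headers"]["Destination"])
```

Because MOVE is metadata-only on the server, a site can rename its whole namespace for Rucio without moving any bytes, which is why HTTP/WebDAV support was requested even at sites that do not yet offer it for remote data access.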

CMS

  • DBS2 has been switched off and will most likely remain off
  • Decommissioning of CE tags for individual CMSSW releases
    • Still to come in April
  • glexec SAM test
    • Remaining scheduling issues solved
    • Want to go ahead and make it critical at the beginning of May
  • CSA14
    • Tier-1 Tape Staging test in May
    • Analysis Challenge in Summer (July & August)
    • Preparing samples
  • Multi Core
    • First production through multi core pilots at PIC
    • Technical tests ongoing at other sites
  • xrootd (AAA)
    • Related SAM tests not yet critical
    • Scale testing of AAA (xrootd federation) ongoing, concentrating on European sites
  • Migration from Savannah to GGUS ongoing
    • The Transfer and Workflow teams have started to use it
    • Minor issues being addressed with the GGUS team
    • Usage to be expanded over the next weeks
  • FTS3
    • Finish moving sites to FTS3 for Debug transfers
  • Condor_g mode for SAM
    • SAM gLiteWMS decommissioning for end June confirmed?

  • Marian confirms that the June deadline for the new Condor_g SAM probes is correct, though the current schedule is 2-3 weeks late. A prototype is available, and the schedule for deployment to preproduction is being established.

LHCb

  • Data Processing
    • VAC
      • Used in production with several hundred VMs at Manchester, Oxford and Lancaster
      • Infrastructure moved to CERNVM3 / SLC6
    • Unification / rewrite of the pilot framework in LHCbDIRAC
      • same infrastructure to be used at WNs, VAC, BOINC, CLOUD
      • including machine / job features
    • WMS decommissioning
      • WMS server decommissioning at CERN went without problems
      • Currently LHCb is submitting < 4 % of its pilots through WMS to small sites
        • Decommissioning of remaining sites will continue on low priority
  • Data Management
    • Tier2Ds (D==Disk)
      • Many Tier2Ds are using DPM as storage technology
    • FTS3
      • The service has been used 100% in production by LHCb for several months
      • The client has so far been used in "FTS2 mode" within LHCbDIRAC.
        • Planning to use the REST interface (Python only, avoiding Boost and other C++ dependencies)
    • File Access
      • The upcoming release of LHCbDIRAC will contain the ability to use natively built xrootd tURLs without going through SRM.
      • The next step will be to integrate HTTP/WebDAV access; this should require less effort given the work already done, but so far fewer endpoints are available.
    • LCG file catalog to DIRAC file catalog migration
      • The migration procedure is currently being prepared; no estimate is available yet on the final schedule for the migration. The objective is still before the end of the year.
    • CASTOR -> EOS migration
      • LHCb is using EOS already for production data for several months
      • The last bit missing was user data migration, which is scheduled for April 22. Many thanks to DSS for their support!
  • Infrastructure
    • IPv6
      • Test infrastructure currently being set up in LHCb
      • New LHCb representative to IPv6 WLCG TF and Hepix - Raja Nandakumar
    • perfSonar
      • waiting for dashboard interface to consume data
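The FTS3 REST interface that LHCb plans to adopt accepts transfer jobs as JSON over HTTPS. Below is a minimal sketch of building such a payload in pure Python; the SURLs are hypothetical, and the field names follow the FTS3 REST documentation but should be verified against the deployed server version.

```python
import json

def build_fts3_job(source_surl, dest_surl, checksum=None):
    """Build a transfer-job payload in the JSON shape accepted by the
    FTS3 REST interface (POST /jobs). Field names per the FTS3 REST
    docs; check them against the deployed server version."""
    transfer = {"sources": [source_surl], "destinations": [dest_surl]}
    if checksum:
        transfer["checksum"] = checksum
    # "params" carries job-level options; "retry" is one documented example.
    return {"files": [transfer], "params": {"retry": 3}}

# Hypothetical SURLs, for illustration only.
job = build_fts3_job(
    "srm://source.example.org/path/file",
    "srm://dest.example.org/path/file",
)
print(json.dumps(job, indent=2))
```

In production, this JSON would be POSTed to the FTS3 server's REST endpoint authenticated with the user's X.509 proxy, which is exactly what lets LHCbDIRAC drop the Boost/C++ client dependencies in favour of plain Python HTTP calls.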

  • Stefan clarifies that the timeline for sites to deploy xrootd access is when they are ready.
  • Stefan confirms that the timeline for the switchover to DIRAC file catalog is before the end of the year, and LHCb intends to use it in Run2.
  • Shawn comments that the dashboard in the next perfSONAR release in May will have a REST API, allowing perfSONAR data to be gathered and exposed to WLCG.

Report from WLCG Monitoring Consolidation

  • Pablo presents the status of the monitoring consolidation project: recent updates, with focus on SAM3 validation and the site nagios plugin; next steps. See slides for detail.

  • Feedback to be submitted to the monitoring consolidation e-group for discussion. Requests/issues on JIRA tracker.

Ongoing Task Forces and Working Groups Review

WMS decommissioning

  • CERN WMS instances for experiments have been drained completely without incident
  • SAM instances to be decommissioned by the end of June
    • depending on successful validation of the new job submission methods developed for SAM
      • direct CREAM submission with payload
      • Condor-G

SHA-2

  • future VOMS servers campaign plans:
    • reminder broadcast to be sent around May 6 (original "deadline")
    • let SAM preprod instances get their proxies from the new servers to measure the readiness across WLCG
      • open tickets for failing services?
    • let SAM production instances use the new servers as of a hard date
      • Mon June 2?
    • send another broadcast for sites and experiments to reconfigure their UI-based services
      • remove references to the old servers
    • switch off the old servers on Tue July 1

  • Andrea suggests waiting to see how many sites are impacted before opening tickets
  • Agreed to keep the TF open to track this campaign

Middleware readiness

  • Maria Dimou explains that the "baseline version" number doesn't necessarily reflect all individual updates of the packages in the dependencies. Need to understand how to publish this.

FTS3

  • FTS3 successfully managing majority of experiment transfers, agreed with all experiments on FTS2 decommissioning by August.
    • ATLAS and LHCb already 100% on FTS3, CMS to complete migration to FTS3 by June.
  • Understand transfer performance with FTS3: mostly validating FTS3 and Dashboard monitoring plots to see if experiment operations have all they need to understand FTS3 transfer behavior.
    • Expected result: validate optimizer performance for bulk of transfers; spot corner cases of problematic transfers which require dedicated debugging (e.g. at network level)
  • Validate submission with new clients/REST API and related new functionality. Timeline for integration of new features varies by experiment.

perfSONAR

  • Shawn presents the final report of the perfSONAR task force: deployed at 205 sites by the April 1st deadline, with only 8 sites missing, but 64 still running old versions. Lessons learned and important remaining issues. See slides for details.

xrootd deployment

  • Domenico presents the final report of the xrootd task force: deployment status, monitoring. Items to be followed up include keeping alive the communication between the two federations, the deployment of the dCache-XRootD monitoring plugin and the registration of the endpoints in the information systems. See slides for details.

  • Andrea comments that both the perfSonar and the xrootd task forces have reached their goals. He proposes to close them and create a new task force or working group to follow up on remaining issues, increasing the scope of network monitoring beyond perfSonar and xrootd in the long term (e.g. HTTP/WebDAV, FTS3 monitoring). The working group would be the forum to establish the proper procedure to follow up network issues. Shawn and Marian to lead this activity, they are asked to propose the mandate and goals and present them at the next WLCG Ops coord meeting on May 8th. Markus suggests to call the group something like 'data access monitoring' instead of 'network monitoring' to clarify the goals.

Tracking tools evolution

  • Move from Savannah to JIRA done for the GGUS Shopping List tracker on 9th of April
    • Migration of experiment tracker outside scope of this TF

IPv6

  • Andrea presents the status of the IPv6 task force: experiments status and plans, highlights from the HEPIX IPv6 F2F meeting. See slides for details.

Machine/Job Features

  • Batch infrastructure
    • Support for ALL batch system types available, including SLURM (many thanks to NDGF)
    • Deployment plan is to test on two sites initially and then roll out to remaining sites
      • LSF: CERN (done) & second site contacted
      • Condor: USC & second site contacted
      • SGE: Gridka (done) & Imperial (done)
      • Torque/PBS: NIKHEF (done) & second site contacted
      • SLURM: script being developed
  • Cloud infrastructure
    • Setting up a prototype infrastructure on CERN/OpenStack (similar to what was done for CERN/LSF)
    • based on CouchDB plus administration tools which are currently being written
    • Later move to more / other IaaS infrastructures
  • Client (mjf.py)
    • First version available at WLCG repository and LCG/AA afs area for use by sites / experiments
  • Bi-directional communication
    • Currently under discussion and finalizing the structure
  • see also GDB talk
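The machine/job features mechanism above publishes one file per key in directories pointed to by the $MACHINEFEATURES and $JOBFEATURES environment variables. Below is a minimal sketch of the reader side (not the actual mjf.py client); the key names "hs06" and "jobslots" are taken from the MJF proposal and should be checked against the current specification.

```python
import os
import tempfile

def read_features(directory):
    """Read a machine/job-features directory: per the MJF proposal,
    each key is a file whose content is the value."""
    features = {}
    if not directory or not os.path.isdir(directory):
        return features
    for key in os.listdir(directory):
        with open(os.path.join(directory, key)) as handle:
            features[key] = handle.read().strip()
    return features

# Simulate what a site would publish on a worker node (illustrative
# key names from the MJF proposal: HS06 power and number of job slots).
mf_dir = tempfile.mkdtemp()
for key, value in {"hs06": "11.2", "jobslots": "8"}.items():
    with open(os.path.join(mf_dir, key), "w") as handle:
        handle.write(value + "\n")

os.environ["MACHINEFEATURES"] = mf_dir
machine = read_features(os.environ.get("MACHINEFEATURES"))
print(machine["jobslots"])  # -> 8
```

The same reader works for $JOBFEATURES, which is what allows one client to serve batch worker nodes, VAC, and cloud VMs alike, as in the LHCb pilot unification above.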

gLExec

  • 79 tickets closed and verified, 16 still open (no change)
    • slow progress with a few cases
  • the current status was presented in the April 15 Management Board
    • presentation
    • the WLCG project leader proposed the task force should carry on
      • details on pages 12 and 13 of the presentation
  • Deployment tracking page

  • Maarten explains that the plans for the task force are to gather experience after CMS makes the gLExec test critical; then discuss with LHCb how to ramp up; then follow up with the other experiments.

Multicore deployment

  • First review of all the batch systems completed
  • Had a first phase wrap up presentation https://indico.cern.ch/event/305626/contribution/0/material/slides/0.pdf
    • CMS is not running multicore yet, or at least not extensively enough to assess its impact on sites
    • ATLAS still has a wavelike submission pattern, which is most disruptive
    • So far the most successful model for scheduling without walltime and/or a steady stream of multicore jobs is (dynamic) partitioning, especially at sites that can, one way or another, limit the number of cores being drained at a time.
      • FZK has done it with SGE native features
      • Nikhef has done it with some creative scripting
    • Backfilling is not yet possible.
    • Problem with passing parameters to batch systems in CREAM
      • Nikhef has shared their BLAH scripts to pass parameters to Maui/Torque
        • Support is on a best-effort basis and depends on how exotic the requests are.
      • SGE works out of the box
      • Most SLURM and HTCondor sites use ARC-CE
  • Next steps
    • CMS and ATLAS testing together
    • Trying to use other parameters like walltime at sites.
    • Second round of presentations in the new conditions
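The drain-limiting policy described above can be sketched as a small selection function: to open multicore slots, nodes are drained of single-core jobs, but only up to a fixed number of nodes at a time so the farm is never left mostly idle. This is an illustrative sketch, not any site's actual scheduler logic; names and numbers are assumptions.

```python
def select_nodes_to_drain(nodes, needed_multicore_slots, max_draining):
    """nodes: list of dicts with 'name', 'free_cores', 'draining'.
    Returns the names of additional nodes to put into draining state,
    never exceeding max_draining nodes draining in total."""
    already = sum(1 for n in nodes if n["draining"])
    budget = max(0, max_draining - already)
    # Prefer nodes closest to empty: the least running work is wasted.
    candidates = sorted(
        (n for n in nodes if not n["draining"]),
        key=lambda n: -n["free_cores"],
    )
    chosen = []
    for node in candidates:
        if len(chosen) >= budget or needed_multicore_slots <= 0:
            break
        chosen.append(node["name"])
        needed_multicore_slots -= 1
    return chosen

# Illustrative farm state: wn03 is already draining, so with
# max_draining=2 only one more node may start draining.
farm = [
    {"name": "wn01", "free_cores": 6, "draining": False},
    {"name": "wn02", "free_cores": 1, "draining": False},
    {"name": "wn03", "free_cores": 3, "draining": True},
]
print(select_nodes_to_drain(farm, needed_multicore_slots=2, max_draining=2))  # ['wn01']
```

Capping the number of simultaneously draining nodes is what makes dynamic partitioning robust against the wavelike submission pattern: a burst of multicore demand raises the queue depth but cannot idle more than max_draining nodes' worth of cores.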

WLCG HTTP Proxy Discovery

  • No report

AOB

  • The next meeting on May 8th will be a regular WLCG Operations Coordination meeting.

-- NicoloMagini - 14 Apr 2014
