WLCG Operations Planning - April 17, 2014 - minutes

Agenda

Attendance

  • Local:
  • Remote:

Agenda items

News

  • WLCG workshop in Barcelona (7-9 July)
    • registration is now open (see the Indico page)! Registration fee: EUR 115 (+ EUR 40 for the social dinner); deadline June 9
    • Agenda still to be defined: there is a rough draft, discussed at the last GDB; any input and suggestions are very welcome
  • Task forces
    • Two task forces to be evaluated for closing: xrootd and perfSONAR. A new task force on Network monitoring?

Experiments Plans

ALICE

  • KIT jobs behavior:
    • The cause of the overload of the KIT firewall and the OPN link to CERN has been found:
      • for analysis jobs the location of the WN was not propagated to the central services
      • the client then ended up using not only the local replica, but close replicas as well
    • Over the weekend a patch was developed for TAlienFile in ROOT
      • it now sends the location information explicitly (see the illustrative sketch at the end of this ALICE section)
    • The patch first became available in Tuesday's analysis tag and started being used by a few trains and users
      • the results were very good
    • The expectation is for the vast majority of analysis jobs to be using the patched code after a few days
      • this will be monitored
    • The jobs cap should then get increased again gradually over the coming days
    • Our thanks to the teams at KIT for their efforts, and to the other experiments for their patience!

  • Plans for the next 3 months:
    • First, continuous activities in preparation for Quark Matter 2014 (May 19-24, GSI Darmstadt).
    • Then reprocessing RAW data with highest priority.
      • Starting from 2011 (p+p) and some selected periods (Pass2) from 2012.
    • Accompanied by the associated MC for these periods.
      • Anchored to the new calibration resulting from the preceding RAW data pass.
    • Pb-Pb reprocessing is not in the plans at this moment.
    • User analysis should be less intense after QM'14.
    • CERN:
      • Conclusion of SLC6 job efficiency investigations.
      • Increased use of the Agile Infrastructure.
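The sketch below is purely illustrative (Python, not the actual TAlienFile/ROOT patch): it shows the effect of the fix, namely that once the client reports its own site, replica selection can put the local copy first instead of also using "close" remote replicas. All names and the data layout here are hypothetical.

    # Illustrative only: order replicas so the copy at the client's own site comes first.
    def order_replicas(replicas, client_site):
        """replicas: list of (site, turl) tuples; client_site: e.g. 'KIT'."""
        local = [r for r in replicas if r[0] == client_site]
        remote = [r for r in replicas if r[0] != client_site]
        return local + remote

    replicas = [('CERN', 'root://eos.cern.ch//data/f.root'),
                ('KIT', 'root://se.kit.edu//data/f.root')]
    print(order_replicas(replicas, 'KIT'))  # KIT copy first, so the OPN link is not loaded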

ATLAS

CMS

  • DBS2 has been switched off and will most likely remain off
  • Decommissioning of CE tags for individual CMSSW releases
    • Still to come in April
  • glexec SAM test
    • Remaining scheduling issues solved
    • Want to go ahead and make it critical at the beginning of May
  • CSA14
    • Tier-1 Tape Staging test in May
    • Analysis Challenge in Summer (July & August)
    • Preparing samples
  • Multi Core
    • First production through multi-core pilots at PIC
    • Technical tests ongoing at other sites
  • xrootd (AAA)
    • Related SAM tests not yet critical
    • Scale testing of AAA (xrootd federation) ongoing, concentrating on European sites
  • Migration from Savannah to GGUS ongoing
    • The Transfer Team and Workflow Team have started to use it
    • Minor issues being addressed with GGUS team
    • Usage to be enlarged over the coming weeks
  • FTS3
    • Finish moving sites to FTS3 for Debug transfers
  • Condor_g mode for SAM
    • SAM gLiteWMS decommissioning for end June confirmed?

LHCb

  • Data Processing
    • VAC
      • Used in production with several hundred VMs at Manchester, Oxford and Lancaster
      • Infrastructure moved to CERNVM3 / SLC6
    • Unification / rewrite of the pilot framework in LHCbDIRAC
      • same infrastructure to be used at WNs, VAC, BOINC, CLOUD
      • including machine / job features
    • WMS decommissioning
      • WMS server decommissioning at CERN went without problems
      • Currently LHCb is submitting < 4 % of its pilots through WMS to small sites
        • Decommissioning of remaining sites will continue on low priority
  • Data Management
    • Tier2Ds (D==Disk)
      • Many Tier2Ds are using DPM as storage technology
    • FTS3
      • The service has been used 100% in production by LHCb for several months
      • The client has been used in "FTS2 mode" within LHCbDIRAC.
        • Planning to use the REST interface (Python only, avoiding Boost and other C++ dependencies); see the sketch at the end of this LHCb section
    • File Access
      • The upcoming release of LHCbDIRAC will contain the ability to use natively built xroot tURLs without going through SRM.
      • The next step will be to integrate http/WebDAV access; this should be less work given what was already done, but so far fewer endpoints are available.
    • LCG file catalog to DIRAC file catalog migration
      • The migration procedure is currently being prepared; no estimate on the final schedule is available yet. The objective is still before the end of the year.
    • CASTOR -> EOS migration
      • LHCb has already been using EOS for production data for several months
      • The last missing bit was user data migration, which is scheduled for April 22. Many thanks to DSS for their support!
  • Infrastructure
    • IPv6
      • Test infrastructure currently being set up in LHCb
      • New LHCb representative to the WLCG IPv6 TF and HEPiX: Raja Nandakumar
    • perfSonar
      • waiting for dashboard interface to consume data
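A minimal sketch of submitting one transfer through the FTS3 REST Python "easy" bindings, as a possible replacement for the "FTS2 mode" client mentioned above. The endpoint and file URLs are placeholders, and the exact module and function names should be checked against the FTS3 REST client documentation.

    # Hedged sketch, assuming the FTS3 REST Python "easy" bindings are available.
    import fts3.rest.client.easy as fts3

    endpoint = 'https://fts3.cern.ch:8446'      # placeholder REST endpoint
    context = fts3.Context(endpoint)            # picks up the X.509 proxy from the environment

    transfer = fts3.new_transfer(
        'srm://source.example.org/lhcb/file',        # placeholder source
        'srm://destination.example.org/lhcb/file')   # placeholder destination
    job = fts3.new_job([transfer])

    job_id = fts3.submit(context, job)
    print(fts3.get_job_status(context, job_id)['job_state'])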

Report from WLCG Monitoring Consolidation

Ongoing Task Forces and Working Groups Review

Middleware readiness

FTS3

perfSONAR

SHA-2

  • future VOMS servers campaign plans:
    • reminder broadcast to be sent around May 6 (original "deadline")
    • let SAM preprod instances get their proxies from the new servers to measure the readiness across WLCG
      • open tickets for failing services?
    • let SAM production instances use the new servers as of a hard date
      • Mon June 2?
    • send another broadcast for sites and experiments to reconfigure their UI-based services
      • remove references to the old servers
    • switch off the old servers on Tue July 1

WMS decommissioning

  • CERN WMS instances for experiments have been drained completely without incident
  • SAM instances to be decommissioned by the end of June
    • depending on successful validation of the new job submission methods developed for SAM
      • direct CREAM submission with payload
      • Condor-G

gLExec

  • 79 tickets closed and verified, 16 still open (no change)
    • slow progress with a few cases
  • the current status was presented in the April 15 Management Board
    • presentation
    • the WLCG project leader proposed the task force should carry on
      • details on pages 12 and 13 of the presentation
  • Deployment tracking page

Tracking tools evolution

  • Move from Savannah to JIRA done for the GGUS shopping list tracker on April 9
    • Migration of experiment trackers is outside the scope of this TF

xrootd deployment

IPv6

WLCG HTTP Proxy Discovery

Machine/Job Features

  • Batch infrastructure
    • Support for ALL batch system types available, including SLURM (many thanks to NDGF)
    • Deployment plan is to test on two sites initially and then roll out to remaining sites
      • LSF: CERN (done) & second site contacted
      • Condor: USC & second site contacted
      • SGE: Gridka (done) & Imperial (done)
      • Torque/PBS: NIKHEF (done) & second site contacted
      • SLURM: script being developed
  • Cloud infrastructure
    • Setting up a prototype infrastructure at CERN/OpenStack (similar to what was done for CERN/LSF)
    • based on CouchDB plus administration tools which are currently being written
    • Later move to more / other IaaS infrastructures
  • Client (mjf.py)
    • First version available in the WLCG repository and the LCG/AA AFS area for use by sites / experiments (see the sketch at the end of this section)
  • Bi-directional communication
    • Currently under discussion; the structure is being finalized
see also GDB talk
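A minimal sketch, in the spirit of the mjf.py client, of how a payload can read Machine/Job Features values when $MACHINEFEATURES and $JOBFEATURES point to local directories containing one small file per key. The key names used (hs06, wall_limit_secs) are examples from the draft specification; the real client also covers other access methods.

    # Hedged sketch, assuming $MACHINEFEATURES / $JOBFEATURES are directories with one file per key.
    import os

    def read_feature(area_env, key, default=None):
        """Return the value of one feature key, or default if the area or key is absent."""
        area = os.environ.get(area_env)
        if not area:
            return default
        try:
            with open(os.path.join(area, key)) as f:
                return f.read().strip()
        except IOError:
            return default

    hs06 = read_feature('MACHINEFEATURES', 'hs06')               # example key: HS06 power of the machine
    wall_limit = read_feature('JOBFEATURES', 'wall_limit_secs')  # example key: wall-clock limit for this job
    print('hs06=%s wall_limit_secs=%s' % (hs06, wall_limit))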

Multicore deployment

  • First review of all the batch systems completed
  • Had a first-phase wrap-up presentation: https://indico.cern.ch/event/305626/contribution/0/material/slides/0.pdf
    • CMS is not running multicore yet, or at least not extensively enough to assess its impact on sites
    • ATLAS still has a wave-like submission pattern, which is most disruptive
    • So far, without walltime information and/or a steady stream of multicore jobs, the most successful scheduling model is (dynamic) partitioning, especially at sites that can - one way or another - limit the number of cores being drained at a time (see the toy sketch at the end of this section).
      • FZK has done it with native SGE features
      • Nikhef has done it with some creative scripting
    • Backfilling not yet possible.
    • Problem with passing parameters to batch systems in CREAM
      • Nikhef has shared their BLAH scripts to pass parameters to Maui/Torque (Nikhef scripts)
        • Support is on a best-effort basis and depends on how exotic the requests are.
      • SGE works out of the box
      • Most SLURM and HTCondor sites use ARC-CE
  • Next steps
    • CMS/ATLAS testing together
    • Trying to use other parameters, like walltime, at sites.
    • Second round of presentations under the new conditions
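A toy sketch (not any site's actual configuration) of the idea behind capping the amount of draining: a node only starts draining towards a multicore slot if fewer than an assumed, site-chosen maximum number of nodes are already draining.

    # Toy sketch only: cap the number of nodes draining for multicore slots at any one time.
    MAX_DRAINING = 2  # assumed site-chosen cap

    def can_start_draining(nodes, max_draining=MAX_DRAINING):
        """nodes: list of dicts with a 'state' field ('draining', 'multicore' or 'single')."""
        draining = sum(1 for n in nodes if n['state'] == 'draining')
        return draining < max_draining

    nodes = [{'state': 'draining'}, {'state': 'single'}, {'state': 'multicore'}]
    print(can_start_draining(nodes))  # True: only one node is draining, the cap is two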

-- NicoloMagini - 14 Apr 2014
