WLCG Operations Planning - July 4, 2013 - minutes

Agenda

Attendance

  • Local: Maria Girone (chair), Andrea Sciabà (secretary), Oliver Keeble, Felix Lee, Maarten Litmaath, Maite Barroso Lopez, Nicolò Magini, Domenico Giordano, Marco Cattaneo, Julia Andreeva, Alessandro Di Girolamo, Ikuo Ueda
  • Remote: Matt Doidge, Stefan Roiser, Vanessa Hamar, Luis Emiro Linares Garcia, Joel Closier, Christoph Wissing, Gareth Smith, Javier Sanchez, Alessandro Cavalli, Stephen Burke, Di Qing, Massimo Sgaravatto, Peter Solagna, Alessandra Forti, Wahid Bhimji, Ian Collier, Isidro Gonzalez Caballero, Alessandra Doria, Robert Frank, Peter Gronbech, Leslie Groer

Agenda items

News

Maria makes some important announcements:
  • the OCCT is collaborating with the HEPiX IPv6 task force on WLCG application testing
  • the recently launched WLCG monitoring consolidation project in IT-SDC has representatives for the experiments and for the OCCT; the latter will provide the site perspective. To ensure proper communication, an internal mailing list will be created in the OCCT, open to site experts who want to provide input from Operations to Monitoring and to be used for bidirectional communication. Pepe Flix will be the moderator and act as primary link for OCCT (with Maria as backup) in the WLCG monitoring consolidation project.

New Task Forces: Proposals

Machine/Job features

Stefan describes the setup of the task force (see the slides for details). People interested in participating should contact him; a Doodle poll has been set up to decide the date of the "kick-off" meeting. The mailing list is wlcg-ops-coord-tf-machinejobfeatures@cern.ch.

Maria asks if anybody objects to the creation of this task force; as no one does, the task force is created.

Ongoing Task Forces Review

FTS3

CVMFS

Ian reports a limitation just discovered in the recommended CVMFS version (2.1.11), which doesn't handle the failover between Stratum 1 servers in some circumstances. Jakob fixed it and is testing the fix. For the time being, sites with an older version should not upgrade to 2.1.11 but wait for the next version, and sites already on 2.1.11 should stay on it and wait for the next version as well rather than rolling back. It is decided to update the baseline versions TWiki with a proper warning accordingly.

The deployment for ALICE has begun: GGUS tickets were sent to the ALICE sites and some have already completed it.

Following a question from Ueda, it is stressed that the problem observed some time ago with CVMFS on SL6 (which required remounting it in rw mode) has already been solved by a kernel upgrade.

gLExec

  • The campaign was started on Mon July 1
  • Guenter Grein opened 93 tickets on Maarten's behalf: thanks!
    • 11 tickets are solved and verified already
  • USATLAS has been asked to take care of their sites
    • No plans yet

Maria asks about the status of USCMS: all sites are OK apart from one in Brazil. CERN is also OK now.

Alessandra F. expresses some concern about the deadline of October 1, given that many sites will couple the gLExec deployment to the SL6 migration. Maarten explains that the original tentative date simply was the earliest realistically achievable, while the actual deadline will likely be coupled to the timeline for the WN migration to SL6. In general, sites need not worry about this.

SHA-2

  • LHCb has made good progress: DIRAC has been tested with SHA-2 VOMS proxies and looks ready!

Peter S. asks if there is any news on the migration to VOMS-admin; Maarten says that there is no timeline yet. Andrea asks Maite if there already is a plan to involve the VO managers and suggests e.g. providing test instances of VOMS-admin to allow VO managers to get familiar with the new interface and validate it from their point of view. Maite agrees but fears that it won't be possible before the end of the summer. Joel (as LHCb's VO manager) agrees that the new tools must be properly tested by VO managers.

It is agreed that the VOMS-admin migration will require test instances for the VO managers; IT-PES (represented by Maite) acknowledges the request.

Peter S. informs us that Emir finished preparing the probes to check sites for SHA-2 readiness; they have been running for two days, but no alarms or tickets will be sent to sites until the probe results are thoroughly vetted.

perfSONAR

  • perfSONAR-PS v3.3 has been released and we plan to deploy it at all WLCG sites in the next 3 months.
    • Sites can install with a normal kickstart
    • The mesh configuration hack necessary in 3.2 is no longer needed
  • Sites are invited to upgrade/install following the TWiki. Sites are reminded to publish their perfSONAR-PS instances in GOCDB/OIM in addition to doing the installation
  • The central configuration changed (both location and structure), so sites should contact their regional contact or the pS mailing list so that the proper configuration can be done
  • We are progressively testing the new modular dashboard, including the API. We will use the old and new dashboards in parallel for the next 3 months so that we can provide more feedback.

SL6 migration

  • T0: A large fraction of the SLC6 batch capacity is temporarily unavailable due to VM crashes. We are performing several tests to understand what is going on.
  • T1s Done: 6/15 (ALICE 4/9, ATLAS 4/12, CMS 2/9, LHCb 3/8)
    • PIC went online with 8300 HS06 last week; all experiments have been contacted to insert the new queues in their systems. PIC expects to complete the migration by 15 July.
  • T2s Done: 28/129 (ALICE 7/39, ATLAS 12/89, CMS 17/65, LHCb 8/45)
    • Only 40 remain without any plan or testing going on.
  • EMI-3 testing
    • The CMS problem reported last week was also due to the same voms-proxy-info problem that affected ATLAS (GGUS 94878).
    • voms-proxy-info: patched client is now in the VOMS PT production repository ready to migrate to EMI-3 and then UMD-3. There is no ETA for when this will happen.
    • 8 sites are testing EMI-3; the testing is tracked in the SL6 deployment pages.
  • HS06: sites should rerun the benchmarks because the differences between SL5 and SL6 are notable.
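
As a quick cross-check of the figures above, the completion fractions can be computed directly from the reported counts. The following is a minimal sketch and not part of any task force tooling: the dictionary layout and the helper names `fraction_done` and `summarize` are assumptions made for illustration, and only the counts themselves come from these minutes.

```python
# (done, total) counts as reported in the minutes.
# The data layout and helper names are illustrative assumptions.
T1_DONE = {"all": (6, 15), "ALICE": (4, 9), "ATLAS": (4, 12),
           "CMS": (2, 9), "LHCb": (3, 8)}
T2_DONE = {"all": (28, 129), "ALICE": (7, 39), "ATLAS": (12, 89),
           "CMS": (17, 65), "LHCb": (8, 45)}


def fraction_done(done, total):
    """Return the migrated fraction as a percentage, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * done / total, 1)


def summarize(counts):
    """Map each label to its migrated percentage."""
    return {name: fraction_done(d, t) for name, (d, t) in counts.items()}


if __name__ == "__main__":
    for tier, counts in (("T1", T1_DONE), ("T2", T2_DONE)):
        for name, pct in summarize(counts).items():
            print(f"{tier} {name}: {pct}%")
```

Note that the per-experiment counts do not add up to the overall figure, presumably because sites serving several experiments are counted once per experiment.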

xrootd deployment

Domenico reports on the deployment status of FAX and AAA. In both cases there are about 40 sites (though xrootd is not monitored at all of them). DPM sites don't report external traffic due to an issue under investigation. The needed external plugins are (or will be) available in the new WLCG repository (only the dCache one is missing and some details need finalizing). The GLED collector will be enhanced to allow filtering records by VO; currently, multi-VO sites expose user activity for all VOs.

The task force would like sites to register their xrootd services (both SE entry points and redirectors) in GOCDB/OIM; this has several advantages including the possibility to declare downtimes and to run ad-hoc SAM tests analogous to the current SRM tests. It is not necessary to publish them in the BDII.

The priorities for the coming months will be to activate the detailed monitoring at all sites and, in general, to validate the monitoring workflow; to help more sites join AAA and FAX; and finally to have all xrootd services registered.

It is agreed that the baseline versions table will also cover the various xrootd plugins.

Domenico clarifies that sites interested in providing xrootd access without joining a storage federation also fall under the mandate of the task force.

To solve the problem with DPM, a dedicated instance provided by a helpful site (Edinburgh) will be used for testing. Most probably a change in the code will be needed. It's difficult to estimate a time scale.

Experiments Plans

ALICE

  • CVMFS deployment campaign has been started by Stefan: thanks!
  • AliEn CVMFS usage being tested and further developed
    • We intend to set up a dedicated CREAM CE + WN to speed up testing
  • CERN-IT is looking forward to the use of CVMFS by ALICE after last week's incident involving Torrent
    • see the Monday Ops meeting minutes for details
    • 3 other sites reported similar problems

  • SAM: working on rationalization of the tests
    • import results from MonALISA
      • XRootD
      • VOBOX
    • make NDGF appear as a Tier-1!

Maarten clarifies that the work on SAM will be done over summer.

ATLAS

Alessandro D.G. presents the ATLAS plans for the next months (see the slides for the details). These are the main points:
  • Residual need for shared area soon to be eliminated
  • Simulation validated for multicore
    • Sites encouraged to deploy more queues
  • All sites should deploy perfSONAR
  • All sites should provide WebDAV access for storage management operations (or discuss an alternative)
  • Widely use xrootd for WAN and LAN data access after summer
    • Main use cases for FAX are fail-over for local access failures and breaking jobs-to-data locality
  • Russian Proto-T1 is contributing to production (but no tape yet)
  • Migrate ATLAS central services to OpenStack VMs with SL6 during 2014
  • Start stress testing RUCIO in July and release first official version of JEDI by end of summer

CMS

  • CVMFS
    • Stop sending installation jobs by Sep 2013
    • Require CVMFS at all sites by April 2014
      • Until then cron based installations are an option
    • Interest in using grid.cern.ch for grid clients at opportunistic sites
      • Requires reliable support for updates, CRL and CA certs

  • Multi-core scheduling
    • Succeeded in scheduling several single threaded CMSSW jobs in one pilot
      • Plan to commission more multi-core queues for production usage (during Summer)
    • Start commissioning of scheduling multi-threaded CMSSW applications (End of 2013)

  • CRAB3
    • Integration with Panda backend progressing
    • Tests by Integration team ongoing
    • Beta test with experienced users starting in the next weeks

  • Disk/Tape separation at T1 sites
    • Allows more flexible usage of resources
    • Aim to start testing in Fall this year

  • Xrootd Federation
    • Started using WAN access in production (at small scale)
    • Finish integration of all sites (~90% of the sites by Fall 2013)

  • Dynamic Data Placement (at Tier 2 level)
    • (Semi-)automatic removal from caches

  • Opportunistic Resources
    • non-CMS Grid sites, Cloud resources, HPCs with ssh-like access
    • CMS specific software via Parrot and CVMFS

Maria adds that Simone contacted the CMS Integration coordinators to discuss multicore usage. A closer collaboration among experiments is to be expected in the coming months.

LHCb

Joel reports some news and plans from LHCb (for the details, see the slides). Some highlights:
  • some sites, called Tier2Ds, will also provide disk-only space for analysis jobs
  • LHCb will investigate interfacing to non-batch resources, like the HLT farms and the VAC virtualised infrastructure in Manchester, BOINC and Openstack at CERN (more details at the next pre-GDB)
  • LHCb plans to progressively move all transfers to FTS3 as it becomes available in production at CERN (RAL will be the backup)
  • a list of sites to be monitored has been given to the perfSONAR task force; the information will be used as a site quality metric and to choose which T2 sites to use for data processing campaigns
  • All DIRAC services have been validated with SHA-2 certificates
  • LHCb will soon kick-start an activity to work on algorithms to define actions based on data popularity metrics
  • by the end of the year LHCb plans to use C++11 (gcc 4.6 and newer) which exists only for SL6; by then, the majority of WLCG resources should be on SL6. Tier-1 sites however should provide at least some SL6 resources as soon as possible
  • it is planned to publish several types of information in SAM (e.g. DIRAC functional tests, simplified workflows, resource status), also in cooperation with the WLCG Monitoring Consolidation project

Domenico points out that the upcoming data placement task force will also deal with algorithms and therefore it should be interesting for LHCb to take part in it.

-- AndreaSciaba - 01-Jul-2013

Topic revision: r14 - 2013-07-18 - AlessandraForti
 