WLCG Operations Coordination Minutes - 25 April 2013

Agenda

Attendance

  • Local: Maria Girone (chair), Andrea Sciabà (secretary), Maria Dimou, Ikuo Ueda, Mattia Cinquilli, Joel Closier, Felix Lee, Oliver Keeble, Ian Fisk, Michail Salichos, Alessandro Di Girolamo, Maite Barroso Lopez, Jan Iven, Simone Campana, Maarten Litmaath
  • Remote: Alessandra Forti, Alexey Zheledov, Andreas Petzold, Christoph Wissing, Peter Solagna, Sean Christopher Crosby, Tiziana Ferrari, Di Qing, Matt Doidge, Thomas Hartmann, Gareth Smith, Burt Holzmann, Luis Linares, Renaud Vernet, Yannick Patois, Ian Collier, Rob Quick, Peter Gronbech, Andrea Sartirana, Jeremy Coles, Onno Zweers

News (M. Girone)

At the next GDB on May 8 there will be a status report from WLCG operations and dedicated talks about SL6 and xrootd deployment.

There are new official contacts for Frontier/squid: Alastair Dewhurst (ATLAS), Luis Linares (CMS) and John Artieda (CMS).

Maria asks what is the status of the proto-Tier-1 sites for ALICE (KR-KISTI-GSDC-01, Korea), ATLAS (RRC-KI-T1, Moscow) and CMS (JINR-T1, Dubna):

  • Alessandro: RRC-KI-T1 has been validated for the reprocessing of 2.76 TeV heavy-ion data and is now fully included in Monte Carlo production, with the output archived at other sites (as the tape system is not yet ready);
  • Christoph: JINR-T1 is being commissioned: the SAM tests are OK and we are now focusing on data transfers, hoping to start the first production workflows in May. There is still no tape; it will be added according to the new disk-tape separation policy;
  • Maarten: KISTI is for all practical purposes a new ALICE Tier-1, even if not yet officially. Only the network capacity is still a bit low. The tape system is also functional.

Middleware news and baseline versions (N. Magini, A. Sciabà)

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Highlights:

  • Since EMI-1 is unsupported from April 30 and the EMI-2 WN and UI have been validated by the four LHC experiments, all sites should upgrade all their service and client nodes to at least EMI-2. EMI-3 is also a valid option unless otherwise noted, but it's not yet available in UMD.
  • Baseline versions are updated for: ARC middleware, BDII, LB and WMS (fully supporting submission to ARC); the EMI-1 CREAM is removed.
  • The EMI-3 version of StoRM is not recommended due to a known issue (GGUS:88607); as it will be compulsory for FTS 3 (because of how space tokens are treated), sites still at the EMI-1 version may choose to skip the EMI-2 version and go for the next EMI-3 StoRM release, which fixes the issue.
  • WLCG VOBOX is available now

  • A generic WLCG repository has been set up
    • http://linuxsoft.cern.ch/wlcg/
      • preliminary contents
    • refreshed hourly from /afs/cern.ch/project/gd/wlcg-repo/
      • preliminary ACLs
      • a README provides update instructions
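For sites wanting to try the new repository, enabling it amounts to a standard yum configuration. A minimal sketch, writing the file to the current directory purely for illustration (a site would place it in /etc/yum.repos.d/; the repo name and the gpgcheck setting are assumptions, not official instructions):

```shell
# Sketch: define a yum repository pointing at the new WLCG repo.
# The [wlcg] name and gpgcheck=0 are illustrative assumptions.
cat > wlcg.repo <<'EOF'
[wlcg]
name=WLCG repository (preliminary contents)
baseurl=http://linuxsoft.cern.ch/wlcg/
enabled=1
gpgcheck=0
EOF

# Show the resulting definition
cat wlcg.repo
```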

Tier-1 Grid services (N. Magini, A. Sciabà)

Storage deployment

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR: v2.1.13-9 and SRM-2.11 for all instances; EOS: ALICE (EOS 0.2.29 / xrootd 3.2.7), ATLAS (EOS 0.2.28 / xrootd 3.2.7 / BeStMan2-2.2.2), CMS (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2), LHCb (EOS 0.2.29 / xrootd 3.2.7 / BeStMan2-2.2.2) | EOSALICE 0.2.20 → 0.2.29 | |
| ASGC | CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.6-1, xrootd 3.2.7-1 | | |
| BNL | dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | Upgrade to dCache 2.2-10 in late May 2013 |
| CNAF | StoRM 1.8.1 (ATLAS, CMS, LHCb) | | Upgrade to StoRM 1.11.1 planned for late May |
| FNAL | dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM), httpd 2.2.3; Scalla xrootd 2.9.7/3.2.7-2.osg; Oracle Lustre 1.8.6; EOS 0.2.29-1/xrootd 3.2.7-2.osg with BeStMan 2.2.2.0.10 | | Upgrade to dCache 2.2.x + Chimera in the next 6 months |
| IN2P3 | dCache 1.9.12-16 (Chimera) on SL6 core servers and 1.9.12-25 on pool nodes; Postgres 9.1; xrootd 3.0.4 | | Upgrade to dCache 2.2-10+ in June 2013 (3rd golden release) |
| KIT | dCache: atlassrm-fzk.gridka.de 1.9.12-11 (Chimera), cmssrm-fzk.gridka.de 1.9.12-17 (Chimera), lhcbsrm-kit.gridka.de 1.9.12-24 (Chimera); xrootd (versions 20100510-1509_dbg and 3.2.6) | | |
| NDGF | dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes | | |
| NL-T1 | dCache 2.2.7 (Chimera) (SURFsara), DPM 1.8.6 (NIKHEF) | | |
| PIC | dCache 1.9.12-20 (Chimera), doors at 1.9.12-23 | | |
| RAL | CASTOR 2.1.12-10 (2.1.13-9 on tape servers), SRM 2.11-1 | | |
| TRIUMF | dCache 2.2.10 (Chimera) | | |

FTS deployment

| Site | Version | Recent changes | Planned changes |
| CERN | 2.2.8 - transfer-fts-3.7.12-1 | | |
| ASGC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None |
| CNAF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| FNAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| KIT | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NDGF | 2.2.8 - transfer-fts-3.7.12-1 | | |
| NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | | |
| PIC | 2.2.8 - transfer-fts-3.7.12-1 | | |
| RAL | 2.2.8 - transfer-fts-3.7.12-1 | | |
| TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | | |

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| BNL | 1.8.0-1 for T1 and 1.8.3.1-1 for US T2s | SL5, gLite | Oracle | ATLAS | Oracle upgrade to 11 in late May 2013 |
| CERN | 1.8.6-1 | SLC6, EMI2 | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | Getting ready for Puppet-managed OpenStack VMs |

Other site news

  • CERN (Jan): even though xrootd 3.3 is released and some groups (OSG, ATLAS-FAX) are encouraging sites to migrate, on the EOS side the upgrade will wait for a future release: a number of bugs in 3.3.1 would make the client unusable for EOS and CASTOR.

Data management provider news

News from EGI operations (T. Ferrari and P. Solagna)

For the details, refer to the slides.

  • Operations Management Board approved an extension for central emergency suspension of users. The next steps will be to define a procedure for handling of compromised certificates and criteria for central emergency suspensions and to define a policy enforcement plan;
  • new workflows have been defined in GGUS to handle tickets with unresponsive submitters or supporters; the deadline for feedback is May 3;
  • the existing support units are being reviewed and software support contacts for each EMI and IGE product are being defined, as part of the Product Teams reorganisation;
  • it was decided that "best effort" support in GGUS will not be accepted: each PT will ensure response to tickets within a given time and changes delivered within a given ETA;
  • in GGUS all supporters will have "rw" access to all tickets; a dashboard will be provided to Support Units to identify urgent tickets;
  • Continuation of support for several products still needs to be clarified, including: WMS, EMI-Common (UI, WN, YAIM, Torque config, emi-nagios), gLite-Infosys (BDII, lcg-info clients); EGI will liaise directly with PTs to get information about release and software support plans;
  • UMD 3.0.0: scheduled for release on May 13, it will include several EMI-3 products: WMS, UI, WN, LB, DPM, ARGUS, BDII, GFAL, lcg_utils. APEL, proxy-renewal, CREAM and ARC are still under verification/staged rollout. VOMS and StoRM should be added soon, once their critical fixes have been verified.

Oliver clarifies that there is no doubt whatsoever on the fact that the gLite-Infosys products will be supported.

Maria D. points out that the usage of the "waiting for reply" status in GGUS was discussed already last autumn and is now fully described in the GGUS "Did you know..." page.

Tiziana adds that Ian Bird and Maria G. were informed of the changes in the security policies and asks if there might be problems for WLCG. Maria G. does not expect any.

Experiment operations review and plans

ALICE (M. Litmaath)

  • central services
    • a cleanup operation in the morning of Tue Apr 16 unexpectedly caused a very high I/O load on the AliEn DB for a few hours, causing lots of jobs and services around the grid to time out
      • the cleanup was interrupted mid afternoon, after which the DB needed a few more hours to roll back
      • normal conditions were restored early evening
    • a mistaken update of the job agent on Thu Apr 18 caused most sites essentially to get drained on Fri (most jobs quickly failed)
      • understood and fixed Fri late afternoon
  • CERN
    • mostly drained Sat Apr 20 late afternoon due to network incident, recovered mid evening
  • KIT
    • new Xrootd servers not completely stable yet
    • concurrent jobs cap needed to be temporarily reduced a few times to avoid SE errors
  • SARA
    • dCache Xrootd interface fixed for writing - thanks!

ATLAS (I. Ueda)

SL6
  • NOTE : migration of WNs to SL6 is NOT "transparent" for ATLAS.
    • some work from ATLAS side is necessary for tagging atlas software releases properly
    • otherwise jobs that cannot run on SL6 WNs can be assigned and will fail
  • ATLAS discourages the migration to SL6 before June 1st 2013
    • the work is on-going
    • This does not mean migration is transparent to ATLAS after June 1st.
  • Details in SL6 TF (thanks to AF)

  • a file-descriptor leak has been observed with the SL6.4 kernel; it seems serious, especially when running many short jobs
    • no such problem observed with SL6.3

cvmfs

  • ATLAS is waiting for a "stable" cvmfs 2.1
    • for the sites using cvmfs on NFS
    • to utilize more repositories than currently in use
    • cvmfs 2.1.9 seems promising, but needs wider validation

Activities:

  • MC production and analysis are running at full steam.
  • processing of the 2012 physics delayed stream is progressing steadily.
  • reprocessing of the 2011 2.76 TeV data, done at the Russian proto-Tier-1 RRC-KI-T1, has finished successfully.
    • the Russian proto-Tier-1 RRC-KI-T1 is now used by ATLAS for MC activity. Since there is no tape backend yet, the output is stored on other Tier-1s' tapes.

CMS (C. Wissing)

  • Processing Activities
    • currently processing smaller MC requests
    • expect to start Upgrade MC soon (smaller requests, a couple of million events)
    • plan to start the 2011 data legacy re-reco pass (2 billion events), including MC redigitization/rereconstruction (3 billion events)
      • CERN (both public queues and HLT cloud) will process data
      • T1 sites will process MC
    • in July, the 13 TeV MC production will start

  • Russian Tier-1 T1_RU_JINR
    • Most SAM tests are already working fine
    • Commissioning of data transfers ongoing

  • Submission of HammerCloud through GlideinWMS
    • Comparison to gLite done - OK
    • Will switch at the beginning of May

  • Global DBparam reset for PhEDEx
    • Start end of April
    • Requires action at all sites

  • Disk/Tape separation for CMS Tier-1s
    • Workshop 7-8 May
    • RAL already being switched to separated mode

  • Standing items/follow up from last week(s)
    • Frontier/Squid operations contacts
      • Luis Emiro Linares Garcia (based at CERN)
      • John Artieda (based FNAL)
    • Updates of Squid configuration to WLCG monitoring
      • About a third of the sites have done it
      • Followed in CMS Computing Operations
    • Fair share configuration for VOMS groups/roles
      • 50% Role=production or Role=t1production, 40% Role=pilot, 10% remaining CMS
      • VO card update almost done
    • Common condor_g submission probe for SAM
      • CMS is still interested
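The 50/40/10 fair-share split requested above would typically be enforced in the site batch scheduler. A minimal sketch for a Maui-style scheduler, assuming the VOMS roles are mapped to hypothetical local unix groups cmsprd, cmspil and cms (the group names are illustrative; actual mappings are site-specific):

```shell
# maui.cfg fragment (sketch only; group names are hypothetical).
GROUPCFG[cmsprd] FSTARGET=50   # Role=production / Role=t1production
GROUPCFG[cmspil] FSTARGET=40   # Role=pilot
GROUPCFG[cms]    FSTARGET=10   # remaining CMS
```

How VOMS roles map to local groups depends on the site's authorisation setup, so the names above must be adapted locally.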

LHCb (J. Closier)

  • glExec:
    • LHCb has made progress with glExec support in Dirac, and we believe there are no technical hurdles to its use
    • The experience with testing at many sites has not been positive, largely due to misconfigured glExec installations
    • We do not believe it is our role to commission glExec at sites, and we do not have the manpower to do so. For this reason, further commissioning of glExec in Dirac is put on hold until such time as glExec is correctly installed throughout WLCG.

  • SLC6:
    • LHCb is ready to migrate its users to SLC6. However we are prevented from making it the default platform for users, because only one of the LHCb Tier1s (SARA) currently has SLC6 CEs in production (discussion ongoing with PIC and RAL). If we were to make SLC6 the default, jobs submitted by users would never be matched unless the input files are at CERN or SARA. We are therefore forced to stay with SLC5 as default, and in particular to instruct our users not to use the lxplus6 (and, from 2nd May, lxplus) aliases at CERN. It would help us with the migration if each of the LHCb Tier1s could make available to LHCb an SLC6 production CE with a minimal number of SLC6 compute nodes

Maite will check why no SLC6 computing resource is exposed to the information system

Ian F. asks if in LHCb binaries are tightly coupled to the OS version or rather binaries compiled on SL5 will run on both SL5 and SL6 and Joel confirms it is the latter. If LHCb users keep using lxplus5 for compilation, it is less urgent for sites to migrate to SL6 from an LHCb perspective.

Alessandra adds that the current timeline for the lxplus migration was discussed since February also in the WLCG MB and finally announced in March on the IT Service Status Board. She will accurately report the situation at the GDB.

Ian F. adds that it looks like no WN was yet moved to SLC6 on lxbatch either, while plans called for migrating nodes to SLC6 in both lxplus and lxbatch clusters in parallel. Alessandra confirms that this is what Helge Meinhard said would happen.

Task Force reports

CVMFS (M. Cinquilli)

For the details, refer to the slides.

The deployment target date for ATLAS and LHCb has almost arrived (April 30). After that day, no new software releases will be installed at sites without CVMFS and LHCb will immediately stop using them altogether, while ATLAS will simply not use sites without any required new software release. For CMS the target is April 1, 2014 but already from September 30 no software installation jobs will be sent (only cron-based pull methods will remain available).

CVMFS 2.1.9 is about to be released; the update is recommended but sites using the NFS export or at the 2.0 version should test it carefully for a few weeks. 2.0 and 2.1 clients can run side-by-side.

Finally, the testing and deployment process is described; in particular, sites should upgrade their nodes in stages. Interested sites are invited to join the "pre-production" effort.

Ian F. mentions the technique successfully used by CMS to install CVMFS on the fly using Parrot.

gLExec (M. Litmaath)

  • see LHCb report above

Maarten recalls that the decision that all sites should deploy gLExec had been taken already two years ago and the WLCG MB should probably reiterate that this needs to happen. It is very difficult for a single individual to debug all gLExec problems at all sites. Maria G. reminds that the need for a clear timeline and milestones was expressed already in January and should be repeated in the MB.

SHA-2 migration (M. Litmaath)

  • SHA-2 CERN CA web site available now for pilot users (experts)
  • SHA-2 VOMS service being set up
    • more news ASAP

While the CA managers are trying to expedite the process to have the new CA included in IGTF, temporary RPMs will allow the CA to be added on any service that needs SHA-2 testing. No configuration change is needed.
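To check that a service or client correctly handles SHA-2 signatures, one can inspect a certificate's signature algorithm with openssl. A minimal sketch, generating a throwaway self-signed SHA-256 certificate purely for illustration (this is not a certificate from the new CA):

```shell
# Create a short-lived, self-signed certificate with a SHA-256 signature.
openssl req -x509 -newkey rsa:2048 -nodes -sha256 -days 1 \
    -subj "/CN=sha2-test" -keyout test-key.pem -out test-cert.pem

# Inspect the signature algorithm; a SHA-2 certificate reports
# e.g. sha256WithRSAEncryption instead of sha1WithRSAEncryption.
openssl x509 -in test-cert.pem -noout -text | grep 'Signature Algorithm'
```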

perfSONAR (A. Forti)

  • the new perfSONAR-PS v3.3 RC3: this is still a release candidate, and we are not ready to tell all sites to move to this version. The "final" release may have some slight tweaks/differences. It is being tested at least in the US and at CERN. If there is no negative feedback, it will become stable in approximately two weeks; at that point we will very actively invite sites to install it (if they do not have perfSONAR) or upgrade.
  • Newly installed PS sites in FR added to modular dashboard: Tom Wlodek added the additional FR sites under the LHC-FR cloud. As soon as the new dashboard has all the needed capabilities, Tom will also replicate all the clouds from the old dashboard to the new one. Both will be maintained until the new one demonstrates it has the complete functionality of the old dashboard
  • LHCONE mesh. This is not being updated and will be replaced by the more general WLCG configurations we will deploy.
  • We will be discussing having OSG host the WLCG instance of the new dashboard (as BNL has done for the old dashboard).
  • According to the procedures, to add a pS probe in SAM/Nagios somebody has to open a ticket and take responsibility for the maintenance of the probe. There is a Nagios probe from ESNET but they cannot be asked to maintain an RPM for WLCG. We will discuss with Emir and the SA team to define the best way to proceed.

Simone proposes to add perfSONAR to the baseline versions table when version 3.3 becomes ready for production.

Rob announced that the new perfSONAR modular dashboard has an alpha release reachable from a URL that we will send to Maria.

SL6 (A. Forti)

  • Created a deployment page to track progress
  • Contacted all T1s: all replied, they are all at different stages of plan/testing/deployment
    • PIC and RAL complained that they have queues set up but no jobs are coming [Christoph: the CMS production team is reluctant to use test queues in production and it would not scale for Tier-2s, but something can surely be arranged for the Tier-1s.]
  • Contacted all T2s: encouraged them to start testing, and to upgrade when ready if they do not support ATLAS
    • Some ATLAS and CMS T2s are waiting for the "official announcement" or for local experiment people to tell them they can go ahead [Christoph: CMS has said several times over the past months that sites are encouraged to move to SL6; the site contacts should have passed the message on.]
  • Added a section on procedures and how to contact experiments in case things don't work properly or to consult on plans and dates
  • New HEP_OSlibs version 1.0.9 requested by Benedict Hegner to add bzip2-devel for SPI.
    • Approach: if it is needed to execute (and even to build) the nightly code then it is needed everywhere.
    • Sites will be alerted when it appears in the new WLCG repository.
  • LHCb worries about the SL6 situation:
    • The lxplus alias moves to SL6 machines on May 6; the amount of SLC6 resources in the central batch service will be increased to cope with the LXPLUS.CERN.CH alias switch.
    • The current Tier-1 statuses for LHCb are summarised below, in order of usability:
      • Usable: SARA, fully migrated.
      • There but not accessible by LHCb: PIC and RAL, which as said above have queues that do not get used.
        • After yesterday's discussion both sites are now working with LHCb to get the queues used.
        • A preliminary investigation indicates that the sites have not added the queues to the BDII, which is necessary for LHCb to access them.
      • Usable two weeks after May 6: the Nikhef migration will start on May 21; new WNs will be installed the week before and should kick in when the SL5 ones are put offline. At that point the CE will publish SL6 WNs.
      • CCIN2P3: pre-production in May, 25% of resources by June 11.
      • CNAF and KIT are working on their infrastructure.
    • Once the PIC and RAL queues are usable, i.e. accessible and with all problems ironed out, a procedure for LHCb will be communicated to the other sites and added to the SL6 TF page.
  • The ATLAS problem with SL6.4 kernels (see the ATLAS report) has been passed to the TF and will be followed up.

FTS 3 integration and deployment (M. Salichos)

  • Started releasing FTS3 in EPEL
  • Released an update, RAL and CERN pilot have been updated

Xrootd deployment (D. Giordano)

  • Xrootd detailed monitoring:
    • Working on consolidation of the monitoring workflow, after the instabilities experienced with the ActiveMQ broker.
      • Effect: the xrootd detailed monitoring data were lost for a few days. Not a big issue in this phase of the deployment.
      • SNOW incident INC284190
      • Detail: as a consequence of the network outage that happened at CERN during the weekend, the ActiveMQ topics serving the xrootd monitoring got decoupled from the queues they should feed. The cause is under investigation; in any case a broker restart cures the problem.

Tracking tools (M. Dimou)

HTTP Proxy Discovery (D. Dykstra)

News from other WLCG working groups

AOB

Action list

  1. Find out why there are no SLC6 resources at CERN published in the BDII and how many SLC6 worker nodes are actually installed in lxbatch.
  2. Add perfSONAR to the baseline versions list as soon as the latest Release Candidate becomes production-ready.
  3. Build a list by experiment of the Tier-2's that need to upgrade dCache to 2.2
    • done for ATLAS: list updated on 03-05-2013
    • done for CMS: list updated on 03-05-2013
    • not applicable to LHCb
    • Maarten: EGI and OSG will track sites that need to upgrade away from unsupported releases.
  4. Inform sites that they need to install the latest Frontier/squid RPM by May at the latest (done for CMS, status monitored)
  5. Maarten will look into SHA-2 testing by the experiments when experts can obtain VOMS proxies for their VO.
  6. Tracking tools TF members who own savannah projects to list them and submit them to the savannah and jira developers if they wish to migrate them to jira. AndreaV and MariaD to report on their experience from the migration of their own savannah trackers.
  7. CMS to appoint a contact point for Savannah:131565 and dependent dev. items for the savannah-ggus bridge replacement by a GGUS-only solution tailored to CMS ops needs.

Chat room comments

-- AndreaSciaba - 22-Apr-2013

Topic revision: r52 - 2013-05-06 - AndreaSciaba
 