WLCG Operations Coordination Minutes, February 18th 2016

Highlights

  • The new Experiments Test Framework (ETF) will be tentatively ready in April. It will contain few user changes.
  • The WLCG workshop and the MB decided to freeze the gLExec deployment.
  • LHCb will discuss the future of the TF with the Multicore Deployment TF experts and advise the sites.
  • All sites are to install the patches for the critical vulnerability announced yesterday.

Agenda

Attendance

  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath (ALICE), Jerome Belleman (T0), David Cameron (ATLAS), Andrea Manzi, Marian Babik, Stephan Lammel (CMS), Julia Andreeva.
  • remote: Alessandra Forti (chair), Alessandra Doria, Michael Ernst, Andreas Petzold, Antonio Maria Perez Yzquierdo, Aresh, Christoph Wissing, Di Qing, David Mason, Pepe Flix, Andrew McNab, Daniele Bonacorsi, Hung-Te Lee, Javier Sanchez, Kyle Gross, Rob Quick, Renaud Vernet, Eric Lancon, Vladimir Romanovski, Andrea Valassi, John Kelly, Massimo Sgaravatto, Jeremy Coles.
  • apologies: Catherine Biscarat (IN2P3)

Operations News

  • Operations Coordination Meetings have been reorganised as of 1st March. See MB slides presented this week:
    • 3PM meetings once a week on Mondays
    • Ops Coord meetings once per month on the first Thursday of the month
      • Topical meetings
      • Written reports still requested, but not necessary to go through them during the meeting
  • Next Ops Coord meetings:
    • March 3rd
    • April 7th
    • May 12th (since May 5th is Ascension Day)
    • June 2nd
    • July 7th
  • The WLCG workshop took place on 1-3 February in Lisbon, with very high participation and interesting discussions. People are encouraged to check the agenda and attached material.
  • The workshop was followed up at the MB on Tuesday. Concrete actions, and probably new TFs and WGs, will be created; more news in the coming weeks.
  • The MB has also agreed to adapt and improve the LCG monthly accounting reports. A pre-GDB will be organised to discuss the accounting in detail and to review the way accounting reports are currently done. The question of comparing CPU usage to pledges was also addressed, and there is universal agreement that wall-time usage should be compared to pledges (the reports will be corrected, as they still compare CPU time to pledges). See MB slides for more details.
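
The "first Thursday of the month" rule for the Ops Coord meetings listed above can be checked with a short script. This is only a sketch: the function names and the exception table are illustrative assumptions, not part of any WLCG tooling. The single exception encoded is the one stated in the list (May 5th moved to May 12th).

```python
import calendar
from datetime import date


def first_thursday(year, month):
    """Return the date of the first Thursday of the given month."""
    # monthcalendar() returns weeks starting on Monday by default;
    # days that fall outside the month are reported as 0.
    for week in calendar.monthcalendar(year, month):
        day = week[calendar.THURSDAY]
        if day != 0:
            return date(year, month, day)
    raise ValueError("month has no Thursday")


# Hypothetical exception table: regular date -> replacement date.
# The only entry is the Ascension Day shift mentioned in the minutes.
EXCEPTIONS = {date(2016, 5, 5): date(2016, 5, 12)}


def meeting_dates(year, months):
    """Apply the first-Thursday rule, then any listed exceptions."""
    regular = (first_thursday(year, m) for m in months)
    return [EXCEPTIONS.get(d, d) for d in regular]


if __name__ == "__main__":
    for d in meeting_dates(2016, range(3, 8)):
        print(d.isoformat())
```

For March to July 2016 this yields the same five dates as the list of next meetings.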

Experiments Test Framework (ETF)

Marian Babik presented ETF, the successor of the SAM/Nagios test framework. It is currently under validation and should tentatively go into production on the 1st of April, provided the validation is completed successfully by that date.

ETF is a complete rewrite of the SAM/Nagios test framework, but it still uses the same plugins, so it is not a major change from the sites' perspective. A few changes could impact sites:

  • Testing with RFC proxies (coordinated by RFC TF)
  • All services in the VO feeds will be tested
  • New HTTP tests (coordinated by HTTP TF)
  • Updated gLExec worker node test to the latest from UMD

Probes' results will be taken only from job outputs, i.e. the WN will no longer send them directly to the message bus.

Middleware News

  • Useful Links:
  • Baselines/New releases:
    • DPM 1.8.10 is now baseline. It is already in UMD3 and was verified by MW Readiness some time ago. It includes bug fixes and improvements in the core and frontend components.
    • perfSONAR 3.5.0 is baseline. The end of life of the previous version (3.4.1) is set to the 8th of April. A security upgrade has also just been released (v3.5.0.7): http://www.perfsonar.net/#20160216-security. Please make sure to have the latest version installed.
  • Issues:
    • The latest version of Java OpenJDK for all platforms disabled support for MD5-signed certificates. This has caused some issues, mainly to LHCb, because of an old certificate used for transfers stored in MyProxy (solved), and to SAM tests towards CREAM (for CMS and ATLAS), because SAM is using a version of HTCondor (8.2.10) which includes an old version of the CREAM CLI that still signs with MD5. This is going to be fixed with the new ETF-Nagios, which currently has HTCondor 8.4.4. Sites have been asked either not to upgrade to the latest Java, or to re-enable MD5 on their Java services for now. After the SAM upgrade we will ask sites to upgrade Java and disable MD5 again.
    • An EGI SVG advisory sent yesterday describes a critical vulnerability of glibc on all platforms (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2015-7547). All running resources MUST be patched by 2016-02-24 21:00 UTC.
  • T0 and T1 services
    • FNAL
      • EOS Upgraded to v 0.3.127, with xrootd 3.3.6-4.slc6 and Bestman 2.3.0-21
      • planned upgrade to dCache 2.13 in April
    • JINR-T1
      • minor dCache upgrade to v 2.10.54
    • IN2P3
      • xrootd 4.2.3-3 and new balanced redirector on tape buffer under test
    • INFN-T1
      • Upgraded Storm to v 1.11.10 on the lhcb, cms and atlas instances.
      • Installed the last production version of the storm-webdav service on the lhcb gridftp servers for the HTTP TF

Marian will look into how complicated it would be to upgrade HTCondor on the old SAM-Nagios hosts, which are still in use.
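
For sites wanting to check whether a certificate or proxy is affected by the MD5 issue described above, one possible approach is to inspect the "Signature Algorithm" field printed by `openssl x509 -noout -text`. The sketch below is an illustration only: the helper names are hypothetical, and it assumes the `openssl` command-line tool is available on the host.

```python
import re
import subprocess
from typing import List

# Matches lines such as "Signature Algorithm: md5WithRSAEncryption"
# in the text dump produced by "openssl x509 -noout -text".
_SIG_ALG = re.compile(r"Signature Algorithm:\s*(\S+)")


def signature_algorithms(openssl_text: str) -> List[str]:
    """Return every signature algorithm name found in the text dump."""
    return _SIG_ALG.findall(openssl_text)


def is_md5_signed(openssl_text: str) -> bool:
    """True if any reported signature algorithm is MD5-based."""
    return any("md5" in alg.lower() for alg in signature_algorithms(openssl_text))


def dump_certificate(path: str) -> str:
    """Run openssl on a PEM file and return its text dump.

    Assumes openssl is installed; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        ["openssl", "x509", "-in", path, "-noout", "-text"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A typical use would be `is_md5_signed(dump_certificate("/path/to/cert.pem"))`, where the path is a placeholder for the file to be checked.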

Tier 0 News

  • Condor: 118 kHS06 out of a total of 817 kHS06 (15%).
  • Larger CREAM CE flavours being deployed.

Jerome clarified that much more memory will be configured for the CREAM CEs. By the summer the LSF vs HTCondor proportion at the T0 is expected to be half-half.
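
The capacity figures above can be put in perspective with a small calculation. It is only a sketch under one explicit assumption that is not in the minutes: the total T0 capacity stays at 817 kHS06 until the summer. The function names are illustrative. The computed share is about 14.4%, which the report quotes as 15%.

```python
def share_percent(part, total):
    """Percentage of the total represented by part."""
    return 100.0 * part / total


def capacity_to_move(current, total, target_fraction=0.5):
    """Capacity still to be migrated to reach the target fraction.

    Assumes the total capacity does not change (an assumption, not a
    statement from the minutes).
    """
    return max(0.0, target_fraction * total - current)


if __name__ == "__main__":
    condor_khs06 = 118.0  # HTCondor capacity reported for the T0
    total_khs06 = 817.0   # total T0 batch capacity reported
    print(f"HTCondor share: {share_percent(condor_khs06, total_khs06):.1f}%")
    print(f"To reach half-half: {capacity_to_move(condor_khs06, total_khs06):.1f} kHS06")
```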

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • mostly high activity
  • disk space
    • about 2.5 PB were recovered thanks to ad-hoc cleanups
    • further cleanups expected by spring, pending agreement on policy changes
    • CASTOR situation for raw data reco looks good, thanks for the support!

ATLAS

  • High activity: reprocessing of some 2012 data completed one week ago, just in time to start re-reprocessing of all 2015 data (expected to last 1-2 more weeks)
  • condor/CREAM issues: CREAM database issue a couple of weeks ago, today core dumps on the pilot factories. Neither issue is yet understood.
  • CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date? GGUS:119557
  • FTS: New release 3.4.x contains a couple of important features for ATLAS, can sites deploy it?

FTS3.4.2 is now verified for Readiness at the CERN pilot. Expected in production at CERN next week. The stratum issue will become an action for firm follow-up.

CMS

  • General Operational Issues
    • Overall good usage at Tier-1 and Tier-2 sites, good job success rate and CPU utilization (incl. HLT and Tier-0 Openstack resources for processing)
    • We are short on disk space and are in contact with sites about readiness of the 2016 pledges
  • Requests for Sites
    • Sites are asked to update to Phedex 4.1.5 (or higher, 4.1.7 is the recommended version) by the end of February. One Tier-1 and about 20 Tier-2 sites still need to upgrade.
    • All Tier-1 sites are running multi-core pilots and Tier-2 sites are now switching, coordinated by the Submission Infrastructure team via GGUS tickets (still to be opened).
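
The Phedex version request above (4.1.5 minimum, 4.1.7 recommended) amounts to a simple version comparison. A minimal sketch follows; the function names and status labels are hypothetical, and it assumes purely numeric, dot-separated version strings.

```python
MINIMUM = (4, 1, 5)      # minimum version requested by CMS
RECOMMENDED = (4, 1, 7)  # recommended version


def parse_version(text):
    """Turn a string such as "4.1.7" into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def classify(version_text):
    """Classify an installed version against the CMS request."""
    version = parse_version(version_text)
    if version < MINIMUM:
        return "upgrade required"
    if version < RECOMMENDED:
        return "acceptable"
    return "recommended"
```

Comparing tuples of integers rather than raw strings avoids mis-ordering versions such as 4.1.10 and 4.1.7.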

Maria A. suggested that sites report on 2016 pledges at the next Ops Coord meeting (March 3rd).

LHCb

  • Stripping for 2015 is almost finished. Cleaning processed RAW files from disk
  • Validation of Turbo and TurCal; Prestaging files for them
  • Validation of Sim09; We hope we can start massive MC production soon

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • a new plan for gLExec has been proposed in Ian Bird's presentation on the
    "Follow-up to the WLCG Workshop in Lisbon" during the Feb 16 MB meeting
  • page 4 says:
    "Freeze deployment of glexec - keep it supported for the existing use,
    but no point to expend further effort in deployment"
  • in principle the minutes of that meeting still need to be approved
  • in practice this will imply closing the remaining open tickets and wrapping up the TF

Machine/Job Features TF

  • Finalized specification and Technical Note document: https://twiki.cern.ch/twiki/bin/view/LCG/MachineJobFeaturesSpec
  • TN in HSF TN consultation process (for formatting, spelling etc.)
  • Next step is to update and complete implementations, starting with Vac/Vcycle for VMs (done), PBS/Torque (started), and HTCondor.
  • Will begin rolling out to sites, initially those that have volunteered to help test updated implementations.
  • Update MJF SAM tests.

HTTP Deployment TF

Information System Evolution


  • Information System discussed at the WLCG workshop:
    • General agreement that it would be desirable to become independent from the BDII, although in practice this needs to be understood.
    • No clear outcome about the new IS. There is a general feeling that a new IS is useful, but this needs in any case to be supported by the experiments. As a follow up at the MB on Tuesday, it was agreed to re-visit the experiment needs for this.
  • An IS TF meeting took place on 11th February:
    • In order to define a strategy for the BDII, EGI was invited to present their plans to support the BDII and it was made clear that EGI plans to support the BDII as many VOs rely on it.
    • It was agreed to assess the feasibility of moving static information to GOCDB/OIM, since experiments like ATLAS are interested in going in this direction.
    • It was agreed to work on a table where all primary information sources for each experiment will be described and identified. This should be a compact version of the Use Cases document and an easy way to understand where information is defined and where it is consumed, highlighting possible inconsistencies and also helping to steer the discussion on how to evolve the IS.
    • It was agreed to investigate whether there is room for collaboration between LHCb and ATLAS after LHCb’s implementation of multiple information collector plugins for the DIRAC CS.
    • It was decided to stop discussing definitions, since this work fits better within the benchmarking working group and the MJF TF.

IPv6 Validation and Deployment TF


Middleware Readiness WG


The JIRA dashboard shows per experiment and per site the product versions pending for Readiness verification. Changes since the Ops Coord. meeting of Jan. 21st are:

  • The 15th MW Readiness WG meeting took place on Jan. 27th. Please read the minutes' summary here.
  • JIRA:MWREADY-107 CMS and ATLAS: CERN FTS 3.4.1 verification completed, some issues identified and fixed in 3.4.2
  • JIRA:MWREADY-114 CMS and ATLAS: CERN FTS 3.4.2 verification started
  • JIRA:MWREADY-108 CMS: GRIF-LLR dpm-xrootd 3.6.0 verification completed.
  • JIRA:MWREADY-109 ATLAS: INFN-T1 Storm 1.11.10 verification ongoing, everything looks ok
  • JIRA:MWREADY-101 CMS: GRIF-LLR gfal2 verification, lots of progress on testing gfal2 + Phedex + stageout
  • Suggested date for our WG meeting is Thursday March 17th 2016 at 15h30 CET. NB!!! Different day-of-the-week & different time!!! The WLCG Ops Coord slot is free. Comments?

Multicore Deployment

Andrea V. said that the Friday Feb. 26th LHCb meeting will discuss the issue and decide on the preferred model (slides 4 and 8 in Antonio's presentation).

Network and Transfer Metrics WG


  • WG has contributed to the International Committee for Future Accelerators (ICFA) Annual networking report (https://cds.cern.ch/record/2130751)
  • WLCG Network Throughput SU: BNL to PIC throughput degradation
    • Root cause was instability of the GEANT Spain fiber channels
    • Issue was reported by ATLAS and involved ESNet, LHCONE, perfSONAR and BNL
  • WLCG Network Throughput SU: FNAL to CERN
    • Issue at ESNet, resolved by LHCOPN ops
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • Meeting held on LHCb DIRAC bridge on January 18th:
    • Ongoing developments on adding additional graphs (latencies, throughput) and bug fixing; the plan is to go to production by Q3 2016
  • Throughput meeting held on January 27th:

RFC proxies

  • NTR

Squid Monitoring and HTTP Proxy Discovery TFs

  • Squid monitoring based on OIM/GOCDB registrations is improving, with Alastair Dewhurst making a bit more progress on getting exceptions added on the monitoring machine
  • CMS is now planning on making a virtual opportunistic computing site, that can find its proxies with a single configuration based on a Web Proxy Auto Discovery service
    • Dave Dykstra is beginning to work on hosting http://wlcg-wpad.cern.ch/wpad.dat on an existing pair of 10 Gbit/s external proxy machines, starting by supporting just a few sites but eventually basing it on the OIM/GOCDB data
  • A separate proxy service is also being added to the same external proxy machines for support of LHC@home and CMS opendata
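
A `wpad.dat` file is a Proxy Auto-Configuration (PAC) script whose `FindProxyForURL` function returns a string such as `PROXY host:port; DIRECT`. As an illustration of how a client might interpret that return value, the sketch below splits it into an ordered fallback list. The function name and the host names in the usage note are hypothetical, and evaluating the PAC script itself is out of scope here.

```python
from typing import List, Optional, Tuple


def parse_pac_result(result: str) -> List[Tuple[str, Optional[str]]]:
    """Split a PAC return string into (kind, target) pairs, in order.

    Entries are separated by ";". Each entry is a keyword such as
    PROXY or DIRECT, optionally followed by a host:port target.
    """
    entries = []
    for item in result.split(";"):
        fields = item.split()
        if not fields:
            continue  # skip empty entries, e.g. after a trailing ";"
        kind = fields[0].upper()
        target = fields[1] if len(fields) > 1 else None
        entries.append((kind, target))
    return entries
```

For example, `parse_pac_result("PROXY squid1.example.org:3128; DIRECT")` gives one proxy entry followed by a direct-connection fallback, which a client would try in that order.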

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea M, Maarten | DONE | Host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. All identified affected services now have compliant certificates and the corresponding tickets have been closed. |
| 2015-12-17 | Recommend site configurations to enforce memory limits on jobs | | DONE | 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D, Maarten and Alessandra F. Status of Jan 12th: A new twiki BatchSystemsConfig was finally decided as a better idea. Tickets opened, answered, recommendation written in the same twiki and MB informed. |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 22.01.2016 | Provide feedback to AFS service managers at CERN on whether the AFS outage OTG:0027970 that happened on 18-19.01 affected any of their critical workflows | All | - | AFS team at CERN is reducing the dependencies and usage of AFS and is collecting existing use cases that are critical for experiments. The outage is a good opportunity to discover unknown use cases | - | DONE |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 22.01.2016 | CMS sites are requested to move to Phedex 4.1.5 (minimum) or 4.1.7 (recommended) on SL6 | CMS | - | Should be done by the end of Feb | - | ONGOING |
| 18.02.2016 | CVMFS monitoring: Taiwan Stratum 1 is always in trouble, better to turn it off than have it out of date? GGUS:119557 | ATLAS | - | - | - | ONGOING |

AOB

The vulnerability issue announced yesterday was raised at the 3pm Ops call and moved here due to lack of time. All positions known at this moment are in the site reports of WLCGDailyMeetingsWeek160215#Thursday. The CERN security team will tell the T0 when to do the batches within the allowed timeframe (before Feb 24th).

-- MariaDimou - 2016-02-16
