WLCG Operations Coordination Minutes, May 18th 2017

Highlights

Agenda

Attendance

  • local: Andrea M (MW Officer + data mgmt), Edoardo (networks), Gavin (T0), Giuseppe (CMS), Julia (WLCG), Maarten (WLCG + ALICE), Marcelo (LHCb), Marian (networks)
  • remote: Alessandra D (Napoli), Alessandra F (WLCG + Manchester + ATLAS), Andrea S (WLCG), Christoph (CMS), Di (TRIUMF), Felix (ASGC), Jeremy (GridPP), Max (KIT), Nurcan (ATLAS), Renaud (IN2P3), Simon (TRIUMF), Thomas (DESY), Victor (JINR), Vladimir (CNAF)

  • apologies:

Operations News

  • WLCG workshop registration deadline has been extended to the 31st of May.
    Participants in the workshop should however register ASAP
    • Main workshop: Mon early afternoon - Wednesday
    • IPv6 hands-on session on Thursday morning
    • Optional visit to Jodrell Bank Observatory Thursday afternoon (max 30 people, first come, first served)

  • the next meeting will be held Thu July 6

Report from the network group regarding MTU negotiation problem between CERN routable IPs and T1

Path MTU discovery (PMTUD) is not working with remote locations because of: a) strict filtering of ICMP packets on the CERN firewall; b) use of private addresses on the internal links; c) use of jumbo frames everywhere on the CERN interconnecting links, except on user services.

The problem with CNAF arose because CNAF has jumbo-enabled servers. Because of c), jumbo packets from CNAF reached the datacentre router facing RAC52, which discarded them because they were too big to be delivered to RAC52; at the same time the router sent back an "ICMP fragmentation needed" packet to CNAF. Because of a) and b), that ICMP packet was dropped on its way to CNAF, which made PMTUD fail.

As a temporary workaround the link facing the CERN firewall has been changed to normal MTU; in this way an external router now sends back the "ICMP fragmentation needed" packets, which makes PMTUD work.

As a long term solution we need to:

  • 1: configure public addresses on the links interconnecting datacentre routers; this task is ongoing
  • 2: allow "ICMP fragmentation needed" packets through the CERN firewall; this is done
  • 3: once 1 and 2 are completed, change the external links' MTU back to 9000
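
The failure mode described in this report can be illustrated with a toy model. The Python sketch below is not a network tool: the `Hop` class, the `discover_path_mtu` function, the hop names and the value 1500 for "normal MTU" are assumptions made for the illustration. Each hop either manages to return an "ICMP fragmentation needed" message to the sender or not (filtered by a firewall, or sent from a private address).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    name: str
    mtu: int                   # MTU of the link leaving this hop (assumed values)
    icmp_returns: bool = True  # False if its ICMP reply never reaches the sender


def discover_path_mtu(sender_mtu: int, path: List[Hop], max_tries: int = 5) -> Optional[int]:
    """Return the packet size that gets through, or None if PMTUD fails."""
    size = sender_mtu
    for _ in range(max_tries):
        # First hop on the path whose outgoing link cannot carry the packet.
        blocking = next((hop for hop in path if hop.mtu < size), None)
        if blocking is None:
            return size        # packet fits on every link
        if not blocking.icmp_returns:
            return None        # packet dropped silently: sender never learns why
        size = blocking.mtu    # sender shrinks packets to the advertised MTU
    return None


# Original situation: jumbo frames reach an internal router whose ICMP reply is lost.
before = [Hop("external router", 9000),
          Hop("datacentre router facing RAC52", 1500, icmp_returns=False)]

# Workaround: the link facing the firewall uses normal MTU, so an external
# router (whose ICMP replies do reach the sender) reports the limit first.
workaround = [Hop("external router", 1500),
              Hop("datacentre router facing RAC52", 1500, icmp_returns=False)]

print(discover_path_mtu(9000, before))      # None
print(discover_path_mtu(9000, workaround))  # 1500
```

In this model the long term solution corresponds to setting `icmp_returns=True` on the internal router, after which the external link can go back to 9000 without breaking discovery.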

  • Julia: timeline for jumbo frames to be allowed externally?
  • Edoardo: it will take 1 or 2 years for old routers to be replaced

Middleware News

  • Useful Links:
  • Baselines/News:
  • Issues:
    • VOMS host certificates were renewed at CERN 5 days before their expiration. Some long-lived VOMS proxies created before the day of the update started to be refused by Grid services. In particular some of the CMS proxies delegated to FTS were affected (a new VOMS proxy and a new delegation to FTS were needed). This issue comes from a VOMS bug that is not yet fixed (GGUS:120463)
    • Fall-out from the RHEL/SL 6.9 openssl update: openssl 1.0.1e-57 by default refuses TLS connections that use DH keys smaller than 1024 bits. Connections from openssl clients to Java-based services will fail if the Java version is too old or if its disabled algorithms are defined incorrectly. Such services need to run a sufficiently recent version of Java to avoid this problem; the latest 1.7 and 1.8 releases are OK
    • ATLAS is currently affected, and CMS was affected some days ago, by EOS being overloaded by GSI authentication. The problem comes from a bug in the Xrootd GSI plugin. The fix is almost ready and will be deployed soon.
    • Heads-up: escalation of privilege vulnerability in Intel® Active Management Technology (AMT), Intel® Standard Manageability (ISM), and Intel® Small Business Technology, broadcast by EGI SVG: https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2017-5689
  • T0 and T1 services
    • BNL
      • FTS upgraded to v3.6.8
    • CERN
      • check T0 report
      • FTS upgraded to v3.6.8
      • EOS for LHCb updated to the Citrine version (an issue with the checksum string returned by Xrootd was discovered after the update and immediately fixed)
    • IN2P3
      • replacement of old ALICE XRootD servers ongoing.
      • Migration to dCache 2.16 for the June downtime
    • JINR
      • dCache major upgrade 2.13.51 -> 2.16.31, Postgres upgrade 9.4.11 -> 9.6.2 on tape instance for CMS
    • RAL
      • Castor nameserver updated to 2.1.16-13. All data now on T10KD drives/media.
      • All Castor instances will be updated to version 2.1.16-13 over the next few weeks.
      • FTS for ATLAS updated to v3.6.8
    • TRIUMF
      • Production dCache will be upgraded to 2.16 soon
      • 2.16 is in MW readiness, IPv4/IPv6 dual stack

Discussion

  • Renaud: UI/WN readiness for CentOS/EL7?
  • Andrea M:
    • the meta-packages exist in preview repositories
    • the CREAM client has been tested by TRIUMF
    • the WN still needs an HTCondor package clash to be resolved
  • Maarten: the official UMD update has been delayed from May to June

  • Alessandra F: when might we make 1.9 the baseline for DPM?
  • Andrea M: investigating an issue at one site that we want to resolve first
  • Alessandra F:
    • the motivation comes from ATLAS wanting to use a JSON file for storage accounting
    • we do not want every site to invent their own scripts for that
    • the DPM devs can come up with a common solution, but only for version 1.9 and later

Tier 0 News

  • Capacity: LSF 550 kHSpec, HTCondor 640 kHSpec, LSF AtlasT0 200 kHSpec. A further 340 kHSpec has just arrived and will be added to HTCondor.
  • New HTCondor CEs are being created with the aim of moving the remaining Grid capacity to HTCondor ~soon.
  • FTS upgraded to 3.6.
  • Castor / EOS new capacity being added during May.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • The activity levels typically have been very high
    • The average was 101k running jobs
  • CERN
    • Some issues with the HTCondor and CREAM services
    • EOS: disk capacity was added to keep up with the high activity, thanks!
    • CASTOR: disk pools were reconfigured for data taking
      • Allowing last parts of 2016 reco to be finished in parallel, thanks!
  • A successful T1-T2 workshop was held in Strasbourg on May 3-5

ATLAS

  • Distributed production is running fine with ~300k running job slots. Reprocessing of data15+data16 has finished, MC16 production is running at full speed, and derivation production on the reprocessed data has started with validation runs.
  • EOS has been overloaded since Friday, when RAW data skimming jobs were launched for heavy ions; the number of jobs is currently throttled in the system.
  • Tier-0 is doing well with all recent data taking, from cosmics to splashes to first collisions. Extension of capacity to 2017 pledges requested and details agreed with IT, in progress.

CMS

  • CMS being prepared for LHC beam operation
    • T0: running new P5-T0 transfer system since April 26th, no known issues
    • Main production and processing activities
      • Still at moderate scale, more will come very soon
      • RE-RECO of 2016 data: pilot requests are being processed
      • Phase I MC requests being processed
  • EOS overloaded by too many I/O-intensive jobs
    • Some known limitation in the GSI authentication capacity
    • HLT nodes are now authenticated by IP (not contributing to the GSI load)
    • Prevent too many high-I/O jobs on CERN resources
  • New input datasets for HammerCloud and SAM being distributed to all sites
  • Preparing for CentOS 7 (with Singularity)
  • IPv6 storage checks started with a subset of sites
  • Concluded a round of Tape staging test with the T1 sites
  • Meta data issue at CERN EOS - GGUS:127322
    • It seems that files 'become' corrupt: the files found are not the ones expected from the path

LHCb

  • Activity levels very high (~100k running jobs)
    • HLT Farm stopped processing MC simulation, preparing for beam operation

  • CERN
    • EOS: download failures, fixed with a new version

Ongoing Task Forces and Working Groups

Accounting TF

The April meeting was dedicated to storage space accounting. Dimitrios Christidis has started to implement the data flow, using the ATLAS storage accounting data for the time being. In parallel we are working with the WLCG Data Management Steering group in order to agree on the storage reporting and the storage topology description in CRIC.

Information System Evolution TF


  • Julia: we are moving forward with the implementation of CRIC

IPv6 Validation and Deployment TF


  • CERN and almost all Tier-1 sites have IPv6 and at least a fraction
    of the storage in dual stack, or will in a matter of weeks
  • No significant middleware issues
    • A problem reported by the DPM team about the GridFTP redirection in Globus has been fixed (GGUS:127285)
  • Organization of the IPv6 tutorial during the Manchester workshop
    • Note: the introduction is Wed late afternoon
    • All site admins (and others) are encouraged to participate (even remotely)
    • Will cover the basics of deploying IPv6 in the network infrastructure
      and dual-stack Grid services (squid, perfSONAR, storage)
    • Hands-on exercises foreseen

Machine/Job Features TF

Machine/Job Features (MJF) is a means to optimize the interaction between a resource provider (batch system, IaaS) and the payload (jobs) by providing more detailed information about the batch system to the job, and about the job to the batch system. This information can be static (e.g. power of the machine, number of cores, local scratch space) or dynamic (e.g. shutdown time of a VM). It consists of a set of text files and Python scripts, and can be considered an "add-on" to the scheduler. https://twiki.cern.ch/twiki/bin/view/LCG/MachineJobFeatures

This mechanism originates from LHCb (primary expert is Andrew.Mcnab@cern.ch), where it has been used successfully at T2 level for some time already: https://etf-lhcb-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fview_name%3Dallservices%26service_regex%3Dorg.lhcb.WN-mjf-%2Flhcb%2FRole%253dproduction%26site%3D

It is suggested to expand the use of MJF to all experiments at T2, and also to T1 level. Installation is quite easy: add the repository https://repo.gridpp.ac.uk/machinejobfeatures/mjf-scripts/ and run yum to install the MJF variant that matches the scheduler in use. Support requires practically no effort.
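
As an illustration of how a payload might consume such text files, here is a minimal Python sketch. It assumes the convention that the environment variables `MACHINEFEATURES` and `JOBFEATURES` each point to a local directory containing one small text file per key, and that a key named `shutdowntime` holds a Unix timestamp; these names are assumptions of the sketch and should be checked against the MJF documentation linked above. The helper functions `read_feature` and `seconds_until_shutdown` are hypothetical and not part of the MJF scripts.

```python
import os
from pathlib import Path
from typing import Optional


def read_feature(base_env: str, key: str) -> Optional[str]:
    """Read one key from the directory named by an environment variable.

    Returns None when the variable is unset or the key file is missing,
    so that a job keeps working on sites without MJF.
    """
    base = os.environ.get(base_env)
    if not base:
        return None
    try:
        return (Path(base) / key).read_text().strip()
    except OSError:
        return None


def seconds_until_shutdown(now: float) -> Optional[float]:
    """Seconds left before the machine shuts down, if that is published."""
    value = read_feature("MACHINEFEATURES", "shutdowntime")  # assumed key name
    if value is None:
        return None
    try:
        return float(value) - now
    except ValueError:
        return None  # file present but not a number
```

A job could call `seconds_until_shutdown(time.time())` before starting a new unit of work and stop cleanly when too little time is left. The sketch only handles local directories.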

The MJF e-group is available to subscribe to: https://e-groups.cern.ch/e-groups/Egroup.do?egroupName=wlcg-ops-coord-tf-machinejobfeatures

  • Julia:
    • last time we agreed to pursue the deployment first at LHCb sites
    • the other experiments are not yet interested
    • that may change with the outcome of the Benchmarking WG w.r.t. the fast benchmark
    • the current deployment campaign is coordinated by Andrew and Victor

Monitoring

MW Readiness WG


This is the status of JIRA ticket updates since the last Ops Coord meeting of 2017-04-06:

  • MWREADY-146 - dCache 2.16.34 verification for ATLAS @ TRIUMF, with IPv6 as well - ongoing
  • MWREADY-128 - A new version of the UI bundle has been released to EGI preview with the new CREAM-CLI for C7. Tested successfully at TRIUMF
  • MWREADY-145 - Dependency clash between the WN bundle and the latest HTCondor (classads vs condor-classads). We will most probably remove the LB libs to solve this issue.
  • MWREADY-9 - /cvmfs/grid.cern.ch/Grid is now mirroring the AFS WLCG Grid Applications area. Requested by LHCb

Network and Transfer Metrics WG


  • perfSONAR 4.0 was released on 17th of April
    • 180 sites have updated so far
    • Some sites reported issues with load after updating, under investigation
  • WLCG/OSG network services
    • New central mesh configuration interface (MCA) will be deployed to production next week - transition will be transparent to all sites
      • MCA was developed by OSG and becomes part of perfSONAR.
    • Monitoring based on ETF is planned to be deployed in ITB
    • OSG collector will be updated to handle multiple backends (datastore, two message buses)
  • LHCOPN grafana dashboards established in collaboration with CERN IT/CS and MONIT team (access restricted to CERN users, public access in the works)
  • Next Throughput call will be on Wed May 24th at 4pm CEST (https://indico.cern.ch/event/640627/)

Squid Monitoring and HTTP Proxy Discovery TFs

  • http://grid-wpad/wpad.dat at CERN is now fully IPv6 compliant. However both the frontier and cvmfs clients prefer IPv4 on dual stack machines, so the IPv6 is not yet getting exercised. CMS has been asked to change their CERN frontier client configuration to prefer IPv6.

Traceability and Isolation WG

Last meeting on May 15th (see https://indico.cern.ch/event/634743/note/):

  • Traceability Challenge ran: all VOs participated
    • Asking VOs to identify a job from hostname + timestamps
    • Issues in the communication channels, being addressed now
    • Planning to run another one in late Autumn
  • Singularity: "SingularityWare, LLC" created by main developer, consequences unknown

Issues with ARC-CEs patching

see the presentation

  • Maarten: we should ensure there is an ARC deployment discussion forum
    • interested WLCG site admins should be able to join
    • the developers ought to take note of the discussions
    • such forums are working fine for dCache, DPM, FTS, ...
  • Max: there exists an ARC mailing list, but not so many WLCG sites are present
    and the developers do not realize the importance or urgency of certain issues
  • Julia: there may be similar concerns for other MW
  • Andrea M: are such problems mentioned elsewhere?
  • Maarten: yes, but not in a consistent way; furthermore, the developers may
    disagree with an RFE by a WLCG site, when nobody in NorduGrid asked for it
  • Julia: we need to ensure the flow of information
  • Andrea M: in the MW Readiness WG ARC is tested, but currently only by CMS and
    there are no stress tests either
    • ergo: some of the reported issues would not have been found
  • Alessandra F: why only CMS?
  • Andrea M: the efforts are voluntary, we can force neither sites nor experiments
  • Alessandra F: we need to avoid a repetition of the gfal-utils saga
  • Julia: there are several issues
    • we need to try and get the right things tested
    • we need to ensure information is made available
    • let's follow up in the next meeting
    • followup on the ARC forum is an action on Ops Coordination

Theme: Providing reliable storage - TRIUMF

see the presentation

  • Julia: how did TRIUMF compare to other sites in the ATLAS tape performance exercise?
  • Simon: from our perspective the performance was OK;
    note that our volume is smaller than what various other T1 have

  • Maarten: is tapeguy ATLAS-specific?
  • Simon:
    • it could be generalized; furthermore, its interface is not tied to dCache
    • as there were not so many options in 2006, we started and kept our own development

  • Vladimir: did you try disabling the power management that you suspect?
  • Simon: we tried different things; there is no clear pattern, the freeze occurs once every several months
  • Vladimir: do you have PowerEdge or PowerVault servers?
  • Simon: PowerEdge; the servers access the storage through a SAN

  • Vladimir: isn't the forced queuing time of 1h too much?
  • Simon: no complaints so far
  • Vladimir: so it is acceptable to ATLAS?
  • Simon: we want to increase the number of files per mount, for increased performance;
    requests should come in bulk
  • Vladimir: CMS have also been seen to recall few files at a time,
    but accumulating to tens of thousands per day;
    is it acceptable to implement a wait time?
  • Maarten: as a site admin you have the right to protect your resources;
    together with the affected experiments some compromise could be agreed
  • Julia: you might even tune the wait time until an experiment complains
  • Vladimir: could we have all T1 "impose" the same wait time?
  • Julia: let's see a few more such presentations and then draw our conclusions
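
The idea behind a forced queuing time, as discussed above, can be sketched as follows. This is not how tapeguy works (its implementation is not described in these minutes); the `Request` tuple layout, the `plan_mounts` function and the tape labels are hypothetical. Requests are held back and grouped per tape, and a tape is only mounted once its oldest request has waited for the whole window, so that more files are read per mount.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (arrival time in seconds, tape label, file name); hypothetical layout
Request = Tuple[float, str, str]


def plan_mounts(requests: List[Request], now: float,
                wait_secs: float = 3600.0) -> Dict[str, List[str]]:
    """Return, per tape that is due for mounting, the files to read from it.

    A tape is due once its oldest pending request has waited wait_secs;
    all requests queued for that tape in the meantime ride along.
    """
    by_tape: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for arrival, tape, name in requests:
        by_tape[tape].append((arrival, name))

    due: Dict[str, List[str]] = {}
    for tape, items in by_tape.items():
        oldest = min(arrival for arrival, _ in items)
        if now - oldest >= wait_secs:
            # Read files in arrival order within the single mount.
            due[tape] = [name for _, name in sorted(items)]
    return due


pending = [(0.0, "TAPE01", "a.root"),
           (1800.0, "TAPE01", "b.root"),
           (3000.0, "TAPE02", "c.root")]
print(plan_mounts(pending, now=3600.0))  # {'TAPE01': ['a.root', 'b.root']}
```

With a shorter `wait_secs` the same three requests would be served sooner but could cost more mounts, which is the trade-off raised in the discussion.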

Action list

| Creation date | Description | Responsible | Status | Comments |
| 01 Sep 2016 | Collect plans from sites to move to EL7 | WLCG Operations | Ongoing | The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites e.g. Triumf are working on a container solution that could mask the EL7 env. for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress. Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which were reported in that meeting. March 2 update: the EMI WN and UI meta packages are planned for UMD 4.5 to be released in May. May 18 update: UMD 4.5 has been delayed to June |
| 03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Jan 26 update: needs to be done in collaboration with EGI |
| 26 Jan 2017 | Create long-downtimes proposal v3 and present it to the MB | WLCG Operations | Pending | May 18 update: EGI collected feedback from sites and propose a compromise - 3 days' notice for any scheduled downtime |
| 18 May 2017 | Follow up on the ARC forum for WLCG site admins | WLCG Operations | Pending | |
| 18 May 2017 | Prepare discussion on the strategy for handling middleware patches | Andrea Manzi and WLCG operations | Pending | |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion |

AOB

Topic revision: r24 - 2018-02-28 - MaartenLitmaath