WLCG Operations Coordination Minutes, September 17th 2015

Highlights

  • All 4 experiments now have an agreed workflow with the T0 for tickets that should be handled by the experiment supporters but were accidentally assigned to the T0 service managers.
  • A new FTS3 bug-fix release, 3.3.1, is now available.
  • A Globus library issue is causing problems with FTS3 for sites running IPv6.
  • No network problems were experienced with the transatlantic link despite 3 out of 4 cables being unavailable.
  • T0 experts are investigating the slow WN performance reported by LHCb and others.
  • A group of experts at CERN and CMS is investigating ARGUS authentication problems affecting CMS VOBOXes.
  • T1 & T2 sites: please observe the actions requested by ATLAS and CMS (also on the WLCG Operations portal).

Agenda

Attendance

  • local: Maria Dimou (Minutes), Andrea Sciaba (chair), Maarten Litmaath, Andrea Manzi, Marian Babik, Christoph Wissing.
  • remote: Antonio Maria Perez Calero Yzquierdo, Renaud Vernet, Ulf Tigerstedt, Rob Quick, Vincenzo Spinoso, Pepe Flix, Alessandra Doria, Vladimir Romanovskiy, Maite Barroso, Catherine Biscarat, Di Qing, Dave Mason, Jeremy Coles, Massimo Sgaravatto, Gareth Smith.
  • apologies: Maria Alandes (Info Sys TF).

Operations News

NTR

Middleware News

  • Baselines:
  • Issues:
    • NTR
  • T0 and T1 services
    • CERN
      • FTS3 will be upgraded to 3.3.1 on Monday 21st of September
    • IN2P3
      • dCache upgrade (2.10.40) on core servers next week (22/09/2015)
    • JINR
      • dCache updated to 2.10.40
    • KIT
      • Update of all dCache instances to the latest version in the 2.13 branch during GridKa's annual downtime, from September 29th till October 1st.
    • NDGF
      • IPv6 was enabled on doors and FAX-node.
    • RRC-KI-T1
      • dCache upgrade to 2.10.39 on pools and gridftp doors

Tier 0 News

Experiment support tickets that reach the T0:

ALICE:

ATLAS:
  • Point to the KB where ATLAS computing support is explained, request the user to follow the channels specified there, and close the ticket. Summary of the guidelines: no GGUS or SNOW; use workbooks, forums, and atlas.support@cern.ch
CMS:
  • Reassign the ticket in the tool where it was first reported: if in GGUS, to the GGUS Support Unit VO Support (CMS); if in SNOW, within SNOW.
    • In the mid term it might be useful to evaluate a technical possibility that would allow CERN supporters to assign a ticket from SNOW back to the GGUS TPM.
LHCb:
  • Reassign tickets in GGUS to the Support Unit VO Support (LHCb). If this is not possible, reassign them in SNOW to the FE LHCb Computing Support Meyrin.

Tier 1 Feedback

  • NDGF-T1 enabled IPv6 (dual stack) for SRM on 14.9.2015. We request that the FTS3 developers take https://its.cern.ch/jira/browse/DMC-681 seriously and fix the issue. As long as we are alone with IPv6 enabled it should be no problem, but there are already T2s with dual stack. Also, ARC did not support delayed passive mode with IPv6, causing all reads and writes to be proxied; this has already been fixed in ARC and a new release is coming soon.

Comments by Andrea M. and Andrea S. at the meeting: Alejandro, the FTS3 developer, is aware of the ticket, but it seems to be an issue with the Globus libraries. A reminder for the Globus developers will be put in the JIRA ticket. Thomas Hartmann (KIT) should be asked to check whether the latest Globus is deployed and whether the problem persists.
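Independent of the library fix, a site can quickly verify whether an endpoint is visible over both protocol families, which is the first precondition for dual-stack transfers. A minimal Python sketch (the hostname is a placeholder; 2811 is the conventional GridFTP control port):

```python
import socket

def address_families(host, port=2811):
    """Return the set of address families (IPv4/IPv6) a host resolves to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return {info[0] for info in infos}

# A dual-stack endpoint should show both families:
families = address_families("localhost")
print(socket.AF_INET in families)   # IPv4 present?
print(socket.AF_INET6 in families)  # IPv6 present?
```

Running this against each SRM/GridFTP door before and after enabling IPv6 gives a cheap sanity check that DNS publishes both AAAA and A records.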

Tier 2 Feedback

Experiment Reports

ALICE

  • high activity
  • CERN
    • team ticket GGUS:116095 about expired CRLs on myproxy.cern.ch
      • IPv6 connectivity issue in the Wigner data center was fixed
    • Accessing CASTOR for reading or writing raw data files:
      • Various constructive meetings between ALICE experts and the CASTOR team.
      • Short- and longer-term ideas were discussed.
      • Reco jobs now download the raw data files instead of streaming them.
        • The effect should become visible when more data is ready for processing.
      • Further ideas involving EOS are being investigated.
      • DAQ and CASTOR experts also retraced how a particular file ended up lost.
      • Thanks for the good support!

ATLAS

  • High rate of activity in both jobs and data transfers
  • The transatlantic network link was at high risk last week due to 3 out of 4 fibres being cut.

Marian will follow up with the WLCG networking group. They will evaluate whether and how they can help by keeping the experiments informed.

  • A large campaign of data transfers created >1M files in the transferring state per FTS server, which seemed to slow down the transfers
  • Procedures and communication for sites upgrading FTS are not great. Do we need a new method?
  • At least two sites were found generating gridmap files from the old VOMS server. This caused problems for ATLAS users with new certificates. Please send a broadcast to all sites asking them to check and change if necessary.

During the discussion it was decided not to send yet another broadcast, given the very large number of broadcasts already sent over several months. The affected sites were one OSG site and Glasgow. Regarding Glasgow pointing at the old VOMS, Jeremy conveyed the site's report as follows: a 'rogue' configuration management tool replaced the current VOMS configuration with the old one. This happened only on the older disk servers, with the rest of the site pointed at the new VOMS2 locations. Glasgow moved over to the new servers a long time ago (and in fact now uses the RPM method of installation to ensure the correct certificates), and at present it is not clear which management tool replaced the new configuration (the site is split between Puppet on the new systems and CFEngine on the old ones). If the investigation uncovers anything of interest to other sites, it will be shared via this meeting.
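The Glasgow incident suggests a simple periodic check sites could run themselves. A hedged sketch (the hostnames and the vomses-style file format below are illustrative; the real check depends on how a site manages its middleware configuration):

```python
import os
import tempfile

# Illustrative hostnames: the old CERN VOMS aliases were voms.cern.ch and
# lcg-voms.cern.ch; the current servers are voms2.cern.ch and
# lcg-voms2.cern.ch.
OLD_SERVERS = ("voms.cern.ch", "lcg-voms.cern.ch")

def stale_vomses_files(vomses_dir):
    """List files under vomses_dir that still mention an old VOMS server."""
    stale = []
    for name in sorted(os.listdir(vomses_dir)):
        path = os.path.join(vomses_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as handle:
            text = handle.read()
        # Substring match is safe here: "voms2.cern.ch" does not contain
        # "voms.cern.ch", so up-to-date entries are not flagged.
        if any(server in text for server in OLD_SERVERS):
            stale.append(name)
    return stale

# Demo on a throwaway directory rather than a real /etc/vomses:
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "atlas-old"), "w") as f:
    f.write('"atlas" "lcg-voms.cern.ch" "15001" "/DC=ch/..." "atlas"\n')
with open(os.path.join(demo_dir, "atlas-new"), "w") as f:
    f.write('"atlas" "lcg-voms2.cern.ch" "15001" "/DC=ch/..." "atlas"\n')
print(stale_vomses_files(demo_dir))  # only the entry pointing at the old server
```

A check like this, run from cron or the site's configuration management tool, would have caught the 'rogue' rollback before users were affected.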

CMS

  • Some issues with DPM Storage Elements in the AAA xrootd data federation
    • They seem to have appeared after updates of regional redirectors to more recent xrootd versions (4.2.x)
    • Investigations are ongoing between the AAA folks, DPM experts and xrootd developers

Andrea M. said that a new version of DPM-xrootd is needed to fix this; the development is ongoing.

  • Last two weeks with rather high slot utilization
    • Reached (again) 120k parallel jobs in the Global Pool
  • File transfer issues from P5 to Wigner have been overcome
    • The CMS DAQ team observes much improved transfer quality after the router exchange
  • Operational issues
    • Ongoing investigations of "UNKNOWN" SAM results for CEs at CERN, GGUS:116069
      • No obvious issues with job submission for production or analysis though
    • Some ARGUS authentication issues are being investigated with the help of IT experts, GGUS:116092

LHCb

  • Data Processing
    • Validation and data quality verification of the data are ongoing. All data is buffered in disk-resident areas -> no staging
  • Operations
    • Ongoing discussion with IT/PES about worker nodes that execute payloads significantly more slowly (e.g. GGUS:116023)
Maite commented on the last point that for the past 2 weeks the T0 batch team has had a lab set up to test and improve this. A sustainable fix for these kinds of issues depends on the latest OpenStack release (Kilo), which will be gradually deployed throughout October. In the meantime the T0 experts will continue working with the 4 experiments on new cases as they are reported, applying measures from the current set of mitigations.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

HTTP Deployment TF

Information System Evolution


  • The WLCG Information System Use Cases document was presented at the MB
  • The MB gave feedback on several areas that need further discussion and agreement within the TF:
    • Future use cases: the use cases document describes the current interactions with the IS. The TF should now investigate what is actually needed, so that we can better understand how the IS could evolve.
    • Static vs dynamic: the MB would like to see a summary of the types of information actually needed by the experiments; probably a more elaborate version of what is already summarised in this twiki under Types of Information, focusing only on the future use cases.
    • "Indicative pledges" per site in REBUS: the TF requested the MB to include "indicative pledges" per site in REBUS. The MB would like to understand why this information is needed and to have a concrete proposal on how it will be collected.
    • Installed capacity: a better definition, and maybe also a better name, is needed for what is today called "installed capacity". The MB would also like to understand why this information is needed and how it will be collected.
    • T3s and opportunistic resources: it would be good to understand how information is going to be collected from T3s and opportunistic resources.
  • OSG, NDGF and EGI will present their plans to provide information about their resources in the future at the next TF meeting. GOCDB will also present the latest features.

IPv6 Validation and Deployment TF


Update on the status of IPv6 deployment in WLCG (from Bruno Hoeft)

Tier-1
| Site | LHCOPN IPv6 peering | LHCONE IPv6 peering | perfSONAR via IPv6 |
| ASGC | - | - | - |
| BNL | not on their priority list | | |
| CH-CERN | yes | yes | LHC[OPN/ONE] |
| DE-KIT | yes | yes | LHC[OPN/ONE] |
| FNAL | yes | yes | LHC[OPN/ONE], but not yet visible in the dashboard |
| FR-CCIN2P3 | yes | yes | LHC[OPN/ONE], but not yet visible in the dashboard |
| IT-INFN-CNAF | - | yes | LHCONE |
| NDGF | yes | yes | LHC[OPN/ONE] |
| ES-PIC | yes | yes | LHCOPN |
| KISTI | started, but no peering implemented | | |
| NL-T1 | no peering implemented | | |
| TRIUMF | IPv6 peering planned for the end of 2015 | | |
| RRC-KI-T1 | - | - | - |

Tier-2
| Site | LHCONE IPv6 peering | perfSONAR |
| DESY | yes | LHCONE |
| CEA SACLAY | yes | - |
| ARNES | yes | - |
| WISC-MADISON | yes | - |
| UK sites | QMUL peers with LHCONE, but not for IPv6 | |
| Prague FZU | IPv6 still working, but the previous contact person left | |
There are additional IPv6 perfSONAR servers at Tier-2 centres, but not via LHCONE.

Marian said that FNAL & IN2P3 can be added to the dashboard now. Also, there are multiple T2 sites that could be added to the table, but they are not in LHCONE. About the Prague comment in the table, the explanation is that the responsible person left, but the site still appears on the dashboard, so we still have their perfSONAR data for now.

Machine/Job Features

Middleware Readiness WG


The WG met yesterday. Full minutes in MWReadinessMeetingNotes20150916. Summary:
  • The new DPM version is being tested via the ATLAS workflow by the Edinburgh Volunteer site.
  • Many new sites showed interest in participating in MW Readiness testing with CentOS 7. It is useful to anticipate the MW behaviour ahead of new HW purchases. DPM validation on CentOS/SL7 is already ongoing at Glasgow.
  • ATLAS and CMS are asked to declare whether or not the xrootd 4 monitoring plugin is important for them. As it is now, it does not work with dCache v2.13.8.
  • Despite the fact that FTS3 runs at very few sites we decided to test it for Readiness. In this context, ATLAS & CMS are now using the CERN FTS3 pilot in their transfer test workflows.
  • PIC successfully tested dCache v.2.13.8 for CMS.
  • CNAF has obtained Indigo-DataCloud effort to strengthen the ARGUS development team. The ARGUS collaboration will meet again early October. The problems faced at CERN with a CMS VOBOX are being investigated in ticket GGUS:116092.
  • The next MW Readiness WG vidyo meeting will take place on Wednesday 28 October at 4pm CET.

Renaud asked about the experiments' experience with WNs on CentOS 7. The current situation is that it doesn't work: the MW packages used on the WN need to be available for the new OS, but that should happen fairly soon. The big work is the usual porting of the experiment SW, followed by careful validation of the physics outputs of reference tasks against the corresponding SL6 results; all that will take a lot of time and probably will not make much progress this year. However, early next year we should pick this up.

Multicore Deployment

Network and Transfer Metrics WG


  • The OSG perfSONAR datastore entered production on September 14th, providing storage and an interface for all perfSONAR results.
  • Publishing of the perfSONAR results using pre-production (ITB) services was successfully established; work is ongoing to resolve an issue with some event types not being published, and production is still pending the SLA.
  • The WLCG-wide mesh campaign with latency testing ramped up to 81 sonars, which caused some instabilities on sonars with 4 GB RAM; we therefore decreased the number of tests performed, and this has improved the situation.
  • The final version of perfSONAR 3.5 is planned to be released on September 28th and will be auto-deployed to all WLCG instances. No issues were found in the testbed, but we plan to update a couple of production instances in advance to check that everything is fine.
  • ESnet and OSG have started development on the perfSONAR configuration interface, an open source project motivated by the existing version developed for WLCG. There has also been interest from GEANT and ESnet in collaborating on an open source project based on the existing proximity service.
  • A follow-up meeting was held to discuss the findings of the FTS performance study led by Saul Youssef (Boston University); a new optimization algorithm was proposed and discussed.
  • The next WG meeting will be on September 30th (https://indico.cern.ch/event/400643/)
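The RAM pressure seen on the smaller sonar hosts follows directly from full-mesh scaling: the number of source-to-destination test pairs grows quadratically with the number of sonars. A quick arithmetic sketch (only the mesh size of 81 is taken from the report above):

```python
def mesh_test_pairs(n_sonars):
    """Ordered source->destination test pairs in a full mesh of n_sonars."""
    return n_sonars * (n_sonars - 1)

# Ramping the latency mesh to 81 sonars:
print(mesh_test_pairs(81))  # 6480 pairs, i.e. 80 continuous tests per host

# Each additional sonar adds 2 * n tests to the mesh, so trimming the
# per-host test list is the natural mitigation for memory-limited hosts.
print(mesh_test_pairs(82) - mesh_test_pairs(81))  # 162 extra pairs
```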

RFC proxies

  • NTR

Squid Monitoring and HTTP Proxy Discovery TFs

  • Alastair is making progress on the next deliverable (a flexible squid registration exception list), but it is not quite ready to be put into production
  • We agreed to change the documentation for squid registration to make it clear that T3s not already registered in GOCDB do not have to register their squids to have them monitored; they can send an email and we'll add an exception

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-09-03 | Status of multi-core accounting | John Gordon | ONGOING | A presentation about the plans to provide multicore accounting data in the Accounting Portal should be given at the next Ops Coord meeting on October 1st (https://indico.cern.ch/event/393617/), since this is a long-standing issue |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm; otherwise we will get hit early next year when the change finally comes in Globus 6.1. A broadcast message has been sent by EGI. The team will now start working on a monitoring script that will show the sites that haven't changed and open GGUS tickets to remind them. |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-09-03 | T2s are requested to change the analysis share from 50% to 25%, since ATLAS runs centralised derivation production for analysis | ATLAS | - | - | a.s.a.p. | ONGOING |
| 2015-06-18 | CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - | | None yet | ~10 T2 sites missing, ticket open |

AOB

-- MariaDimou - 2015-09-14
