WLCG Network Throughput WG

Mandate

  • Ensure sites and experiments can better understand and fix networking issues

Objectives

  • Oversight of the perfSONAR network infrastructure
  • Coordination of the WLCG network performance incidents
  • Detection and follow up on issues seen by the perfSONAR network

Meetings

Bi-weekly meetings, European and North American throughput calls

Members

Shawn McKee (chairperson), Marian Babik (co-chair), ATLAS (Simone Campana), CMS (Nicolo Magini), LHCb (Stefan Roiser, Joel Closier), ALICE (Latchezar Betev, Costin Grigoras), FAX (Ilija Vukotic), FTS (Michail Salichos, Oliver Keeble), Panda (Kaushik De), Rucio (Vincent Garonne), BelleII (Malachi Schram)

perfSONAR contacts: US-ATLAS (Shawn McKee), US-CMS (Jorge Alberto Diaz Cruz), UK-ALL (Alessandra Forti, Duncan Rand), IT-ATLAS (Alessandro de Salvo), IT-CMS (Enrico Mazzoni), CA-ALL (Rolf Seuster), FR-ALL (Frederique Chollet, Laurent Caillat, Frederic Schaer), TW-ALL (Hsin-Yen Chen), ND-ALL (Ulf Tigerstedt), DE-ALL (Guenter Duckeck, Andreas Petzold, DE-KIT: Bruno Hoeft, Aurelie Reymund), ES-ALL (Fernando Lopez, Josep Flix), CERN (Stefan Stancu), LHCOPN/LHCONE (John Shade, ESNet: Mike O’Connor), RU-ALL (Victor Kotlyar), ESnet Science Engagement group (Jason Zurawski), BelleII (Malachi Schram)

Contacts

Primary contact is via the mailing list wlcg-network-throughput-wg@cernNOSPAMPLEASE.ch. Previous mailing lists (wlcg-ops-coord-tf-perfsonar and wlcg-ops-coord-wg-metrics) are still active and are defined as aliases to the primary mailing list. The primary mailing list has two sub-groups, wlcg-perfsonar-support@cernNOSPAMPLEASE.ch and throughput-l@listsNOSPAMPLEASE.bnl.gov, which are used to organize and follow up on the corresponding European and North American throughput calls.

Coming Events

Network Throughput Support Unit

Network Performance Incidents Follow up Procedure

The main motivation for this procedure is to investigate network performance issues with the assistance of the perfSONAR team. The focus is on performance issues, and the primary objective is to confirm whether an observed transfer problem is network related. If it is confirmed to be a WAN issue, the next step is to work with the perfSONAR team to narrow it down to a particular network link and thus help identify who might be responsible for it. The full text of the procedure follows:

  • The new GGUS support unit (WLCG Network Throughput; https://wiki.egi.eu/wiki/GGUS:WLCG_Network_Throughput) can be used to report incidents. The corresponding mailing list is wlcg-network-throughput at cern.ch; initial participation there is the same as for the WG mailing list (transfer systems, experiments, perfSONAR support, ESnet, LHCOPN/LHCONE).

  • Experiments can report potential network performance incidents/degradations to the mailing list. The WLCG perfSONAR support unit will investigate and confirm whether the issue is network related. Once confirmed, it will notify the relevant sites and will try to assist in narrowing down the problem to particular link(s). Affected sites will be contacted and should open an incident with their network providers. Ongoing incidents will be tracked on the WG page.

  • Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and, if necessary, escalate to their network provider while informing the wlcg-network-throughput mailing list. If the problem is confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging. For non-technical (policy) issues, or if unclear, sites should escalate to WLCG operations coordination.

Network Performance Incidents

| Incident | Ticket | Comments |
| EU sites to IHEP/CN | via mailing list | Resolved: Routing issue - ticket with GEANT was opened by IHEP, peerings were updated |
| UFL to IC | via mailing list | Resolved: transfer rates improved before root cause was found |
| US T2s/AMS to CERN | GGUS:139866 GGUS:139874 | Resolved: ESNet network incident impacting US to CERN connectivity (also impacted AMS) |
| SARA to CERN | GGUS:138472 | Resolved: MTU issue on IPv6 suspected, but was just packet loss in the end |
| RAL/SARA to IN2P3 | GGUS:137967 GGUS:137972 GGUS:137994 GGUS:139756 | Resolved: Packet loss on the link due to congestion, IN2P3 has a ticket with RENATER (resolved by upgrading) |
| IN2P3-CC to UTA_SWT2 | via mailing list | Resolved: Possible saturation on LHCONE at/close to IN2P3-CC |
| AGLT2 inbound | via mailing list | On-going: Narrowed down to ESNet -> AGLT2 segment |
| FNAL inbound | GGUS:137632 | Resolved: Bad link was identified by FNAL |
| IHEP-CN - JINR/IHEP-SU | GGUS:136606 GGUS:136332 | On-going: more efficient transit path is missing btw. concerned NRENs, to be followed up in Asia Forum/LHCOPN-LHCONE WS |
| DESY/FNAL | GGUS:135962 | Resolved: Tests didn't indicate any obvious network issue (* but not all relevant network aspects could be tested). |
| UFlorida - Kharkov | via mailing list | Resolved: MTU step-down issue - pmtu discovery ACL was fixed by UF |
| UNI-Freiburg | GGUS:135304 | Resolved: CERN prefixes missing in the routing announcements to SWITCH |
| DESY inbound | GGUS:134470 | Resolved: Network configuration tuned/changed at DESY |
| AGLT2/LHCONE | via mailing list | Resolved: Performance issue to LHCONE sites, narrowed down to US/ESNet segment (module issue) |
| NCP/Pakistan commissioning | via mailing list | On-going: Investigated in collaboration with GlobalNOC (report), proposed routing changes for TEIN3 |
| CYFRONET/RRC-KI | GGUS:131375 | Resolved: MTU step-down (Resolved by PSNC NREN) |
| BEgrid-ULB-VUB UKI-LT2-IC-HEP | GGUS:132286 | Resolved: IceCUBE flows overloading BEgrid-ULB-VUB networks |
| NDGF/BNL from multiple locations | GGUS:131975, GGUS:131981 | Resolved: Issue with FTS at RAL |
| RO-02-NIPNE to multiple locations | GGUS:128489 | Resolved: MTU step-down + load balancing suspected; NREN was contacted by NIPNE |
| PIC to PL Swierk | GGUS:130112 | Resolved: Unable to investigate as no pS at PL Swierk, but error suggesting a storage problem |
| CNAF/RALPP | GGUS:130112 | Resolved: Investigated and resolved as non-network issue |
| Oxford | GGUS:130032 | On-going: Significant issues seen during August (down to 50Mbs), perf improved afterwards but still not at levels seen last year |
| SARA/IC | GGUS:129964 | Resolved: Issue with router firmware at SARA network provider |
| NCP/Pakistan | via mailing list | Resolved: QoS issue, IPv6 performs fine |
| CBPF to CNAF, PIC and IN2P3 | via mailing list to LHCONE ops, GGUS:129561 from LHCb | Resolved: MTU step down issue within RNP |
| T0 to JINR | GGUS:129544 | Resolved: by JINR putting in place new Moscow - Dubna link and fixing asymmetries in routing |
| IN2P3 NIKHEF to UC | via mailing list | Resolved: Univ. of Chicago investigated (root cause unknown) |
| BNL ASGC | ESNet ticket ESNET-20170123-005 | Resolved: Issue opened by WG; Resolved by ESNet |
| IHEP EU | GGUS:125623 | Resolved: by NREN (site was not notified of the ongoing network issue) |
| UNL FNAL | via mailing list | Resolved: UNL investigated (root cause unknown) |
| CERN RRCKI | GGUS:124538 | Resolved: RRCKI re-routed from AMS to BUD, root cause for congested path RRCKI-AMS was not understood |
| MIT inbound throughput | via mailing list | Resolved; MIT opened ticket with Internet2 |
| EELA-UTFSM MWT2_UC | via mailing list | Resolved: gsiftp timeouts, non-network issue |
| McGill BU | GGUS:123285 | Resolved, gridftp timeouts, but re-appeared, network seems to perform well, likely an issue with storage |
| Victoria - Prague | via mailing list | Resolved; grid output retrieval failing; asymmetric paths and MTU step down issues |
| SARA consistent loss | GGUS:121687 | Resolved after SARA migrated to the new data centre |
| RAL consistent loss | GGUS:121687 | Resolved, RAL router upgraded |
| BNL RAL CERN | GGUS:121687 | Resolved, issue with RAL router |
| BNL SARA CERN | GGUS:120957 | Resolved, issue with ESNet router at CERN and saturated link CERN/SARA (was upgraded to 20Gbps) |
| ASGC CERN IJS | GGUS:119820 | Resolved, issue with router at ASGC and IJS firewall |
| CBPF | GGUS:120081 | Resolved: RNP stopped publishing to ESNet CBPF IPs |
| FNAL CERN | GGUS:119551 | Resolved: fixed by ESNet - faulty router interface in New York |
| PIC inbound | via mailing list | Resolved: 10 Gbps WAN link at PIC shared by LHCOPN and LHCONE was completely saturated, causing input discards |
| BNL to PIC | via mailing list | Resolved: LHCOPN link CERN-PIC was flapping a lot due to an issue with the Geant fibre to Spain |
| MAINZ CA | via mailing list | Resolved: MAINZ uses a "commercial" network provider and Canadian sites only peer with R&E networks |
| OU inbound | via mailing list | Resolved: Narrowed down to a faulty switch on site |
| CA EU | GGUS:118748, GGUS:118730 | Resolved: Trans-atlantic channel instability, resolved by re-routing at Canarie |

Security Announcements



  • Security: New SSL vulnerability dubbed Logjam: https://weakdh.org/sysadmin.html. WLCG perfSONAR hosts should NOT be vulnerable to this attack. The Apache configuration installed by the Toolkit disables the cipher suites in question by default.
  • Security: CVE released 2nd of April 2015 for Cassandra, which is used by the perfSONAR measurement archive software, esmond. NO action required to protect the perfSONAR Toolkit since the vulnerable ports are both disabled and firewalled.



Links

  • Deployment Guide:
  • Installation and Configuration Guides:
  • Infrastructure Monitoring:
  • Global Configuration Interface (meshes, tests):
  • Collector and Central Store for all perfSONAR metrics:
  • perfSONAR stream:
  • Dashboards:

Meetings

Presentations

Reports

Report 16/05/2019

Report 07/03/2019

  • perfSONAR infrastructure status - CC7/4.1 campaign ongoing
    • perfSONAR 4.0 and perfSONARs on SL6 have not been supported since Q4 2018 - please update ASAP
    • New baseline version for perfSONAR is the latest release 4.1.6 (fixes important bug causing duplicate testing)
  • WLCG/OSG network services were updated
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Report 07/02/2019

  • perfSONAR infrastructure status - CC7/4.1 campaign ongoing
    • perfSONAR 4.0 and perfSONARs on SL6 have not been supported since Q4 2018 - please update ASAP
    • We have started ticketing sites, starting with T1s and major T2s
  • WG update will be presented at HEPiX in San Diego
  • WLCG/OSG network services were updated
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Report 08/11/2018

  • perfSONAR infrastructure status - CC7/4.1 campaign ongoing
    • Sites were reminded to upgrade to CC7 and review their configuration (preferably by end of October)
    • Still only around 50% of nodes are on CC7 as of today - we'll soon start contacting sites directly
    • Some sites waiting for/deploying new hardware; e.g. SARA deployed 100Gbps perfSONAR (first in Europe), BNL deployed 2x40 Gbps perfSONAR
  • WG update was presented at HEPiX and LHCOPN/LHCONE workshop
  • WLCG/OSG network services working fine
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Report 13/09/2018

  • perfSONAR infrastructure status
    • perfSONAR 4.1 was released a few weeks ago - main new feature is an improved central/remote configuration
    • WLCG broadcast was sent this week to remind sites to upgrade to CC7 and review their configuration (preferably by end of October)
    • Around 50% of sonars are on CC7 as of today
  • WG update will be presented at the upcoming HEPiX
  • WLCG/OSG network services
    • Central configuration service (meshconfig/psconfig) was updated to the version released in 4.1 (officially supported by perfSONAR team)
    • psconfig.opensciencegrid.org is currently unreachable via IPv6 from non-LHCONE sites due to a routing issue; this is being followed up by the network team at MSU
  • NSF funded projects: SAND and IRIS-HEP are starting, both will contribute in different ways to the OSG Network Area - more details will be provided in the HEPiX talk
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Report 07/06/2018

  • perfSONAR infrastructure status
    • perfSONAR 4.1 beta will be released in the coming weeks - main new feature is an improved central/remote configuration
    • CC7 campaign had only modest progress recently - 86 instances on CC7 (from 81 in April, out of total 210)
    • WLCG broadcast will be sent to remind sites to plan an upgrade to CC7 and review their configuration
  • WG update was presented at HEPiX and will be presented at CHEP
  • WLCG/OSG network services
    • Following retirement of OSG GOC, all central services were migrated to AGLT2, which took considerable effort in planning and deployment
    • Transition happened without downtime and was transparent to all sites
    • The one exception is sites using the old OIM/myOSG central configuration URL, which was deprecated during the 3.5 update campaign (meshconfig URLs starting with myosg.grid.iu.edu/pfmesh...)
    • Impacted sites are asked to update their meshconfig-agent.conf following http://opensciencegrid.org/networking/perfsonar/installation/#installation ASAP
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Report 12/04/2018

  • perfSONAR 4.0.2 and CC7 campaign - 210 instances updated to 4.0.2; 81 instances already on CC7
    • WLCG broadcast will be sent to remind sites to plan an upgrade to CC7 and review the firewall port openings
    • perfSONAR 4.1 release, planned for Q2 2018, will no longer ship SL6 packages
  • Attended perfSONAR developers F2F meeting in Amsterdam and presented feedback from OSG/WLCG
  • WG reports planned for upcoming HEPiX and CHEP
  • Networking and perfSONAR were also major topics at the OSG-All Hands (https://indico.fnal.gov/event/15344/)
    • 4 presentations were given on various topics related to the WG
    • One of the outcomes was a proposal to create a dedicated site-based documentation showing all links relevant to a given site
  • WLCG/OSG network services
  • Outreach and other activities:
    • GEANT has added several perfSONAR instances on LHCONE at their major network hubs (ams, gva, lon, par, fra) - both IPv4 and IPv6
    • Advania was added to HNSciCloud test mesh
    • MGHPCC (http://www.mghpcc.org/) plans to deploy up to 22 perfSONARs, currently in discussion how we can help
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Report 01/03/2018

  • perfSONAR 4.0.2 and CC7 campaign - 190 instances updated to 4.0.2; 64 instances already on CC7
    • WLCG broadcast will be sent to remind sites to plan an upgrade to CC7 and review the firewall port openings
    • perfSONAR 4.1 release, planned for Q2 2018, will no longer ship SL6 packages
  • WLCG/OSG network services
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
  • LHCOPN/LHCONE Workshop will take place next week - update on WG activities will be presented (https://indico.cern.ch/event/681168/)
  • perfSONAR developers F2F meeting will take place next week in Amsterdam - feedback from OSG/WLCG will be presented

Report 18/01/2018

  • perfSONAR 4.0.2 - 190 instances updated out of which 53 are already on CC7
    • WLCG broadcast will be re-sent next week to remind sites of the upcoming important dates and new documentation
    • perfSONAR 4.1 release, planned for Q1 2018, will no longer ship SL6 packages
    • EOL for SL6 support in Q3 2018
    • All sites are encouraged to upgrade to CC7 as soon as possible
  • WLCG/OSG network services
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Report 06/12/2017

  • perfSONAR workshop held by JISC in November (slides) - WLCG WG activities mentioned
  • perfSONAR 4.0.2 was released on November 28th
    • WLCG broadcast will be sent this week to notify sites of the upcoming important dates and new documentation
    • perfSONAR 4.1 release, planned for Q1 2018, will no longer ship SL6 packages
    • EOL for SL6 support in Q3 2018
    • All sites are encouraged to upgrade to CC7 as soon as possible
  • WLCG/OSG network services
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
  • HNSciCloud will use perfSONAR results for network performance evaluation of the providers
  • HEPiX WG on SDN/NFV was established and will look into networking R&D topics - sites interested please subscribe via https://listserv.in2p3.fr/cgi-bin/wa?SUBED1=hepix-nfv-wg

Report 02/11/2017

  • WG update was presented at HEPiX and LHCOPN/LHCONE workshop (co-located)
  • perfSONAR 4.0.2 is planned to be released in November
  • WLCG/OSG network services
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
  • HNSciCloud meshes are being created (per provider), will enable tests between HNSciCloud sites and providers

Report 05/10/2017

  • WG update will be presented at HEPiX and LHCOPN/LHCONE workshop (co-located)
  • perfSONAR YouTube channel at https://www.youtube.com/channel/UCjK-P49pAKK9hUrrNbbe0Sg
  • perfSONAR 4.0.1 auto-deployed to 197 instances (21 are already on centos7)
    • Port 443/https is now used as a controller port for pscheduler and needs to be open on central firewalls
    • Some sites suffer from an MA access issue after the upgrade; this is being followed up
  • perfSONAR 4.0.2 is planned to be released in November
    • Brings new SNMP plugin that can be used to retrieve local site router traffic
  • WLCG/OSG network services
    • New documentation is in preparation and will be hosted at https://opensciencegrid.github.io/networking/
    • OSG collector handling multiple backends (Datastore, CERN ActiveMQ and GOC RabbitMQ) now in production
      • GOC will distribute raw data to 3 different locations, FNAL for tape archive, Nebraska for long-term ES storage, Chicago for short-term ES storage
    • Preparing new LHCOPN and perfSONAR dashboards in collaboration with CERN IT/CS and IT/MONIT
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
  • HNSciCloud will create its own perfSONAR mesh to follow up on the network performance btw. providers and sites
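
The firewall requirement above (port 443/https as the pScheduler controller port) can be spot-checked with a plain TCP connection attempt. The following is a minimal sketch only: the function name `tcp_port_open` and the command-line usage are illustrative assumptions, not part of any perfSONAR tooling, and a successful connect only shows that the port is reachable from the machine running the check.

```python
import socket
import sys


def tcp_port_open(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and tries each returned address
        # (IPv4 and IPv6) until one connects or all fail.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Host names are supplied on the command line, e.g. the site's own
    # perfSONAR node as seen from outside the site firewall.
    for host in sys.argv[1:]:
        state = "open" if tcp_port_open(host) else "closed or filtered"
        print(f"{host}:443 {state}")
```

Running the check from a host outside the site is what matters here, since the question is whether the central firewall lets the connection through.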

Report 14/09/2017

  • WG update will be presented at HEPiX and LHCOPN/LHCONE workshop (co-located)
  • perfSONAR 4.0.1 was released and was auto-deployed to 187 instances (21 are already on centos7)
  • WLCG/OSG network services
    • New documentation is in preparation and will be hosted at https://opensciencegrid.github.io/networking/
    • New central mesh configuration interface (MCA) and monitoring (ETF) in production (http://meshconfig.grid.iu.edu; https://psetf.grid.iu.edu/etf/check_mk/)
    • OSG collector handling multiple backends (Datastore, CERN ActiveMQ and GOC RabbitMQ) now in production
      • GOC will distribute raw data to 3 different locations, FNAL for tape archive, Nebraska for long-term ES storage, Chicago for short-term ES storage
    • Central dashboard service (psmad.grid.iu.edu) suffers from a bug which prevents statuses from being shown correctly (and graphs from being retrieved); ESNet is working on a fix
    • Preparing new LHCOPN and perfSONAR dashboards in collaboration with CERN IT/CS and IT/MONIT
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Report 06/07/2017

Report 18/05/2017

  • perfSONAR 4.0 was released on 17th of April
    • 180 sites have updated so far
    • Some sites reported issues with load after updating, under investigation
  • WLCG/OSG network services
    • New central mesh configuration interface (MCA) will be deployed to production next week - transition will be transparent to all sites
      • MCA was developed by OSG and becomes part of perfSONAR.
    • Monitoring based on ETF is planned to be deployed in ITB
    • OSG collector will be updated to handle multiple backends (datastore, two message buses)
  • LHCOPN grafana dashboards established in collaboration with CERN IT/CS and MONIT team (access restricted to CERN users, public access in the works)
  • Next Throughput call will be on Wed May 24th at 4pm CEST (https://indico.cern.ch/event/640627/)

Report 06/04/2017

  • LHCOPN/LHCONE workshop in BNL took place this week (https://indico.cern.ch/event/581520/)
  • perfSONAR 4.0 to be released on 17th of April
    • Sites on auto-updates will get it automatically - no action needed.
    • Sites planning to update perfSONARs to CC7 are encouraged to wait until 4.1 is released.
    • Minimal hardware requirements were shifted: Sites running perfSONARs with less than 4GB RAM and 2 core CPU with clock speed less than 2GHz are encouraged to keep running the old version (3.5.1)
  • WLCG/OSG network services
    • New central mesh configuration interface (MCA) will be deployed to production - transition will be transparent to all sites
      • MCA was developed by OSG, but becomes part of perfSONAR
      • Integrates perfSONAR lookup service with OIM/GOCDB services, so we can now easily add NREN perfSONARs into our meshes
    • Monitoring was updated to cover new features released in 4.0 and is now based on ETF
    • OSG collector was updated to collect additional perfSONAR metrics (such as TCP retransmits, path MTU, etc)
    • LHCOPN traffic and LHCONE simulated link utilisation now available for subscriptions from the netmon brokers
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
    • BNL/ASGC throughput improved by a factor of 10 - details reported at the LHCOPN/LHCONE workshop

Report 26/01/2017

Report 01/12/2016

  • pre-GDB on NETWORKING will take place on 10th of January, preliminary agenda now available at https://indico.cern.ch/event/571501/
    • If you plan to attend register at https://indico.cern.ch/event/571501/ to help us with logistics
    • Please let us know what you would like to see come out from this meeting. If there are additional topics you would like to see in the agenda or modifications to existing items, please let us know.
    • Invitation was sent to all four experiments
  • Next throughput meeting planned 14th of Dec
    • Focus on perfSONAR RC validation
  • perfSONAR team announced that they plan to release 4.0 RC3, which will push final release to next year
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Report 03/11/2016

Report 29/09/2016

  • Network session at the WLCG workshop
    • Q&A session planned, questions will be sent in advance, we encourage all to participate
    • Inder Monga (Director of ESNet) will join the session
  • LHCOPN/LHCONE workshop was held in Helsinki, Sept 19-20 (https://indico.cern.ch/event/527372/)
    • GEANT reported peaks over 100GBps and growth of over 65% from Q2 2015 to Q2 2016
    • ESNet reported that LHCONE traffic has increased 118% in the past year
    • Positive feedback received on the LHC Network Evolution talk
  • pre-GDB on networking focusing on the long-term network evolution planned on January 10th - save the date
  • Throughput meetings were held on 15th Sept:
    • Hendrik Borras (Univ. of Heidelberg) presented early results on the network telemetry based on perfSONAR
  • perfSONAR 4.0 RC1 was released, RC2 planned in October with final release sometime in November
  • We are now using a new mailing list wlcg-network-throughput-wg@cernNOSPAMPLEASE.ch - joint mailing list for European and NA throughput meetings
  • WLCG Network Throughput Support Unit: New cases were reported on IPv6 and are being followed up, see twiki for details.

Report 01/09/2016

  • Network session is planned at the WLCG workshop covering IPv6, LHCOPN/LHCONE status and LHC network evolution
  • LHCOPN/LHCONE workshop will be held in Helsinki, Sept 19-20 (https://indico.cern.ch/event/527372/)
  • pre-GDB on networking focusing on the long-term network evolution postponed to January
  • Throughput meetings were held on July 27 and August 16:
    • Mark Feit from Internet2 presented pScheduler (new test scheduler in perfSONAR 4.0)
    • Xinran Wang and Ilija Vukotic from Univ. of Chicago presented their Network Analytics work
  • OSG datastore and collector have been experiencing problems since the upgrade last week; the issue is being followed up by OSG
  • Plan on migration to the new perfSONAR 4.0 configuration was drafted and will be followed up with OSG
  • We are now using a new mailing list wlcg-network-throughput-wg@cernNOSPAMPLEASE.ch - joint mailing list for European and NA throughput meetings
  • WLCG Network Throughput Support Unit: New cases were reported and are being followed up, see twiki for details.

Report 07/07/2016

  • WG update presented at ATLAS TIM including discussion on the mid-long term network evolution
  • pre-GDB on networking focusing on the mid-long term network evolution will be held in December
  • North American Throughput meeting held on 22nd of June:
    • Andy Lake presented new features planned in perfSONAR 4.0
    • Next meeting end of July, main topic: pScheduler (replaces bwctl)
  • WLCG Throughput meeting held 16th of June:
    • Main topic was re-organization of the meshes, the proposal was agreed and implemented
    • New experiment-based meshes were introduced and are now in effect in the production dashboard (see twiki)
    • Next meeting in Sept. (co-located with LHCOPN/LHCONE)
  • WLCG Network Throughput Support: Several new cases were reported and are being followed, see twiki for details.
  • perfSONAR 4.0 (formerly 3.6) RC to become available end of August; WLCG validation and deployment campaign will follow.
    • Introduces several major changes such as new configuration management and interface as well as migration from BWCTL to pScheduler

Report 02/06/2016

  • WLCG Network Throughput SU:
    • ASGC connectivity - After numerous tests performed in collaboration with ASGC and ESNet (http://etf.cern.ch/perfsonar_asgc.txt), the root cause has been confirmed to be the local N7K router at ASGC. Once the perfSONARs were moved directly to the central router, the measured network performance improved by a factor of 10. Our recommendation is to re-wire all the existing data transfer nodes to bypass the local router, as well as to tune the central router and data transfer nodes to improve their performance for long path transfers (200ms+).
    • Two new tickets received related to packet loss observed at RAL and SARA
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • North American Throughput meeting held on 1st of June:
    • Shawn presented the OSG network area roadmap; main focus will be on developing notification/alerting, supporting higher-level services (analytics) and preparing for SDN
    • Next meeting is on 22nd June - main topic will be perfSONAR 4.0
  • WLCG Throughput meeting will be held on 16th of June - main topic is re-organization of the meshes
  • perfSONAR 4.0 (formerly 3.6) RC expected end of June
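
The recommendation above to tune nodes for long path transfers (200ms+) follows from the bandwidth-delay product: a single TCP stream can only fill a path if its window covers the bytes in flight during one round trip. The sketch below works through that arithmetic. The 10 Gbit/s link speed and the 64 KiB window are illustrative assumptions, not figures from the ASGC investigation, and the function names are hypothetical.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8.0


def window_limited_bps(window_bytes, rtt_s):
    """Maximum throughput of one TCP stream limited by its window size."""
    return window_bytes * 8.0 / rtt_s


if __name__ == "__main__":
    rtt_s = 0.200        # 200 ms round-trip time, as quoted for long paths
    link_bps = 10e9      # assumed 10 Gbit/s path (illustrative only)

    needed = bdp_bytes(link_bps, rtt_s)
    print(f"Buffer needed to fill the path: {needed / 1e6:.0f} MB")

    small_window = 64 * 1024  # assumed untuned 64 KiB window
    limit = window_limited_bps(small_window, rtt_s)
    print(f"Throughput with a 64 KiB window: {limit / 1e6:.2f} Mbit/s")
```

Under these assumptions the path needs roughly 250 MB of buffering per stream, while a 64 KiB window would cap a stream at about 2.6 Mbit/s, which is why untuned hosts perform poorly on long paths even when the network itself is healthy.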

Report 28/05/2016

Report 28/04/2016

Report 07/04/2016

  • WLCG Network Throughput SU:
    • CBPF connectivity (https://ggus.eu/index.php?mode=ticket_info&ticket_id=120081) - resolved
    • ASGC connectivity (https://ggus.eu/index.php?mode=ticket_info&ticket_id=119820) - ongoing
    • Packet loss and high latency for certain packets (queuing issue?) reported by perfSONAR on ASGC to CERN, but not confirmed by the counters
    • Narrowed down to the StarLight to ASGC segment, but unfortunately there are very few sonars in Asia with very limited peering, which will impact further investigation
    • Throughput tests show peaks of 400Mbit/s (200Mbit/s usual) with frequent retransmissions occurring in bunches; we'll try to run tcpdump to understand the root cause
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • Throughput meeting held on 6th April:
  • Update on WG will be presented at HEPiX in DESY
  • WG review will take place at the next WLCG ops coordination on 28th April
  • Added section with useful links to the WG homepage https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics

Report 17/03/2016

  • ICFA SCIC meeting was held at J-Park in February, slides from the report (including WG contribution) can be found at http://icfa-scic.web.cern.ch/ICFA-SCIC/meetings.html
  • LHCOPN/LHCONE Meeting held in Taipei (https://indico.cern.ch/event/461511/)
  • WLCG Network Throughput SU: ASGC connectivity
    • Packet loss and high latency for certain packets (queuing issue?) reported by perfSONAR on ASGC to CERN, but not confirmed by the counters
    • Narrowed down to the StarLight to ASGC segment, but unfortunately there are very few sonars in Asia with very limited peering, which will impact further investigation
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • Throughput meetings held on Feb 24th and March 9th:
    • Soichi Hayashi presented the new configuration interface that will become part of perfSONAR 3.6
    • Shawn presented the way we currently monitor the perfSONAR infrastructure, including OSG production services
  • perfSONAR 3.5.1 released, 184 instances were auto-updated, only 13 instances on 3.4

Report 18/02/2016

  • WG has contributed to the International Committee for Future Accelerators (ICFA) Annual networking report (https://cds.cern.ch/record/2130751)
  • WLCG Network Throughput SU: BNL to PIC throughput degradation
    • Root cause was instability of the GEANT Spain fiber channels
    • Issue was reported by ATLAS and involved ESNet, LHCONE, perfSONAR and BNL
  • WLCG Network Throughput SU: FNAL to CERN
    • Issue at ESNet, resolved by LHCOPN ops
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • Meeting held on LHCb DIRAC bridge on January 18th:
    • Ongoing developments on adding additional graphs (latencies, throughput) and bug fixing, plan is to go to production by Q3 2016
  • Throughput meeting held on January 27th:

Report 21/01/2016

  • WLCG Network Throughput SU: GGUS-118730 Throughput degradation between CA and EU
    • Root cause was instability of the transatlantic link (WIX reported submarine shunt fault), which in turn impacted the GEANT-CANARIE link.
    • perfSONAR network helped to identify the problematic segment and once Canarie was notified the issue was resolved by re-routing.
    • Issue was reported by ATLAS, but many different people were involved (ATLAS, TRIUMF, perfSONAR support, LHCONE, Canarie, WIX).
    • Multiple GGUS tickets were opened, but only one was followed up, something to improve in the future.
    • Experiments: Please check if everyone was notified of the on-going incident and let us know if we need to add additional contacts (wlcg-network-throughput mailing list)
  • OSG perfSONAR production services: Storage failure (OASIS) at GOC has impacted the entire perfSONAR pipeline, initially just the datastore, but later on also collector and publisher. The issue was resolved yesterday and the systems are recovering now. We have proposed changes that would remove dependency on the shared storage.

Report 07/01/2016

  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard), minor instability in the dashboard reported yesterday, being followed up by OSG
  • Additional monitoring metrics will be added to psomd.grid.iu.edu to capture collector's efficiency and report on freshness of the metadata in the OSG Datastore (for each sonar).
  • Proposed re-organization of the WG meetings, split into two areas, perfSONAR operations (throughput calls) and research/pilot projects
    • perfSONAR operations - main scope would be to continue with perfSONAR support, follow up on the existing infrastructure while at the same time start looking into issues already shown by the existing tools and try to fix them at the source. As this scope is well aligned with the existing North American throughput calls, we could alternate the meetings and publish common notes.
    • Research/pilot projects - will have separate on-demand meetings with notes published to WG mailing list
    • F2F meeting once a year, co-located with GDB or other workshop/conference
  • Pilot projects: LHCb DIRAC bridge available online

Report 19/11/2015

  • perfSONAR collector, datastore, publisher and dashboard in production (stable operations)
  • Additional monitoring metrics will be added to psomd.grid.iu.edu to capture collector's efficiency and report on freshness of the metadata in the OSG Datastore (for each sonar).
  • perfSONAR 3.5: 205 sonars were updated, ALL sites are encouraged to enable auto-updates for perfSONAR
  • Pilot projects: ATLAS Panda, perfSONAR stream now in ATLAS Network Analytics (https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ATLASAnalytics), several KIBANA dashboards available - Site link stats. Jorge and Ilija working on cost matrix using the round-trip time and packet loss in Mathis's formula to infer bandwidth (predictions based on this model will follow).
  • Pilot projects: LHCb DIRAC bridge is now functional, processing perfSONAR stream and inserting packet loss metrics in DIRAC, includes mapping to LHCb sites. Henryk, Federico and Stefan are working on this.
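
The cost-matrix idea above relies on the Mathis et al. model, which bounds steady-state TCP throughput by MSS/(RTT*sqrt(p)) times a constant close to 1.22. The following is a minimal sketch of that calculation only, not the code used by the ATLAS pilot; the function name, the MSS of 1460 bytes and the sample RTT and loss values are assumptions chosen for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3.0 / 2.0)):
    """Upper bound on steady-state TCP throughput in bits/s (Mathis et al. model).

    mss_bytes: TCP maximum segment size in bytes
    rtt_s:     round-trip time in seconds
    loss_rate: packet loss probability, must be in (0, 1]
    c:         model constant, sqrt(3/2) in the basic form of the model
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return 8.0 * mss_bytes * c / (rtt_s * math.sqrt(loss_rate))


# Illustrative values (assumptions, not measurements from the WG):
# 1460-byte MSS, 100 ms round-trip time, 0.01% packet loss.
estimate = mathis_throughput_bps(mss_bytes=1460, rtt_s=0.100, loss_rate=1e-4)
print(f"{estimate / 1e6:.1f} Mbit/s")
```

Because the bound scales with 1/sqrt(p), a fourfold increase in packet loss halves the estimate, which is why even small loss rates matter on long round-trip paths. The model is undefined for zero loss, so links with no observed loss need separate handling.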

Report 05/11/2015

  • perfSONAR collector, datastore, publisher and dashboard now in production (stable operations)
  • perfSONAR 3.5: 205 sonars were updated, ALL sites are encouraged to enable auto-updates for perfSONAR
  • Detailed report from the WG presented at GDB
  • Meeting held yesterday, encouraging all mesh leaders to participate
  • Started discussion on the network outage and at risk announcements from NRENs
  • Pilot projects: ATLAS Panda, perfSONAR stream now in ATLAS Network Analytics (https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ATLASAnalytics), several KIBANA dashboards available (MWT2, FZK2). Jorge and Ilija working on cost matrix using the round-trip time and packet loss in Mathis's formula to infer bandwidth (predictions based on this model will follow).

Report 22/10/2015

  • perfSONAR collector, datastore, publisher and dashboard now in production!
  • psmad becomes the official dashboard for perfSONAR meshes
  • perfSONAR 3.5: 183 sonars were updated, ALL sites are encouraged to enable auto-updates for perfSONAR.
  • Detailed report from the WG presented at HEPiX/GDB; we will also present a status update at the November GDB
  • ATLAS started processing perfSONAR stream to create a network “cost-matrix” for use by PANDA with additional use cases in scheduled transfers and dynamic data access
  • LHCb also started processing perfSONAR stream and correlates it with the network and transfer metrics in DIRAC
  • Next WG meetings will be on 4th of Nov and 2nd of Dec

Report 01/10/2015

  • Meeting held yesterday, https://indico.cern.ch/event/400643/
  • Publishing of the perfSONAR results using OSG production service planned for 13th of October (OSG production date)
  • OSG dashboard (psmad.grid.iu.edu) will go into production on the same date. It already shows more recent results than maddash.aglt2.org; one issue still to be fixed is the correct display of tests done in one direction only
  • WLCG-wide meshes campaign finalized with 94 sonars in latency testing, 115 sonars in traceroutes and 104 in throughput.
    • Sonars that were not included in the WLCG-wide meshes were reported to the mesh leaders and will be followed up (currently they reside in the global meshes, once issues are fixed they'll be moved to WLCG meshes)
    • Started re-creating project meshes, Belle II and Dual-stack (IPv4/IPv6 bandwidth), plans for other meshes to be discussed
  • Once infrastructure is in production, we plan to focus on the integration projects, there are ongoing pilot projects for ATLAS and LHCb
  • There is also interest in perfSONAR in the IT Analytics WG as well as from the network community Asia Tier Centre Forum (https://indico.cern.ch/event/395656/)
  • perfSONAR 3.5 was released on Monday 28th Sept, 162 sonars were auto-updated, 68 still on 3.4, all sites are encouraged to enable auto-updates for perfSONAR
  • Next WG meetings will be on 4th of Nov and 2nd of Dec

WLCG perfSONAR service status report on 2015-10-01 04:02:21.078035 =======

Active perfSONAR instances: 250
GOCDB registered total: 193
OIM registered total: 85
perfSONAR-PS versions deployed: 
   3.4.1 : 7
   3.4.2 : 61
   3.5.0 : 162
   Unknown: 18
Incorrectly configured (failing >4 metrics): 5 
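
The version breakdown in the status report above can be summarised programmatically. This is a minimal sketch, assuming the report keeps the indented `label : count` layout shown; `parse_version_counts` is a hypothetical helper, not part of any WLCG or perfSONAR tooling.

```python
import re

# Version section copied from the 2015-10-01 status report above.
REPORT = """\
perfSONAR-PS versions deployed:
   3.4.1 : 7
   3.4.2 : 61
   3.5.0 : 162
   Unknown: 18
"""


def parse_version_counts(text):
    """Return {version_label: count} for every indented 'label : count' line."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"^\s+(\S+)\s*:\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_version_counts(REPORT)
total = sum(counts.values())
share_35 = counts["3.5.0"] / total
print(f"{share_35:.1%} of {total} reporting instances run 3.5.0")
```

Note that the per-version counts sum to 248, while the report lists 250 active instances, so the share is relative to the instances that appear in the version breakdown.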

Report 17/09/2015

  • OSG perfSONAR datastore entered production on 14th of Sept providing storage and interface for all perfSONAR results.
  • Publishing of the perfSONAR results using pre-production (ITB) services was successfully established, working to resolve issue with some event types not being published, production still pending SLA.
  • WLCG-wide meshes campaign with latency testing ramped up to 81 sonars caused some instabilities of the sonars with 4GB RAM, therefore we have decreased the number of tests performed and this has improved the situation.
  • Final version of the perfSONAR 3.5 is planned to be released on 28th of September and will be auto-deployed to all WLCG instances. There were no issues found in the testbed, but we plan to update couple of production instances in advance to check if everything is fine.
  • ESNet and OSG have started developments on the perfSONAR configuration interface - open source project motivated by the existing version developed for WLCG. There has been also interest from GEANT and ESNet to collaborate on an open source project based on the existing proximity service.
  • Follow up meeting was held to discuss findings of the FTS performance study led by Saul Youssef (Boston University); a new optimization algorithm was proposed and discussed.
  • Next WG meeting will be on 30th of Sept (https://indico.cern.ch/event/400643/)
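
The instability seen when the latency mesh reached 81 sonars is consistent with how quickly a full mesh grows: every host tests against every other host, so the number of measurements rises quadratically. A minimal sketch of that arithmetic follows; treating each direction as a separate test is an assumption about the mesh configuration, and `full_mesh_load` is a hypothetical helper.

```python
def full_mesh_load(n_hosts):
    """Count peers and test pairs in a full mesh of n_hosts sonars."""
    if n_hosts < 2:
        raise ValueError("a mesh needs at least two hosts")
    return {
        "peers_per_host": n_hosts - 1,                    # partners each sonar tests against
        "directed_pairs": n_hosts * (n_hosts - 1),        # A->B and B->A counted separately
        "undirected_pairs": n_hosts * (n_hosts - 1) // 2,  # each link counted once
    }


# Mesh sizes mentioned in the reports: 34, 70 and 81 sonars, and the top-100 target.
for n in (34, 70, 81, 100):
    load = full_mesh_load(n)
    print(n, load["peers_per_host"], load["directed_pairs"])
```

Going from 34 to 81 sonars multiplies the per-host partner count by about 2.4 but the mesh-wide number of directed pairs by almost 6, which is why reducing the number of tests per pair relieved the hosts with 4GB RAM.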

Report 03/09/2015

  • Meeting held yesterday, 2nd of September https://indico.cern.ch/event/393102/
  • OSG enabled publishing of the perfSONAR results to the netmon-test-mb.cern.ch from the ITB collector service today. Production setup is still pending SLA.
  • OSG perfSONAR dashboard (psmad.grid.iu.edu) is connected to the OSG datastore and already shows up-to-date content.
  • MadAlert, a new project to analyse meshes and report infrastructure issues vs network problems, is already reporting from psmad (http://maddash.aglt2.org/madalert.html).
  • perfSONAR operations status
    • Latency mesh: 81 sonars (94% efficiency)
    • Traceroute mesh: 112 sonars (90% efficiency)
    • perfSONAR 3.5rc2 was released yesterday and will be auto-deployed to all testbed instances; one issue with PostgreSQL was reported from the UC instance

Report 20/08/2015

  • Established production and validation ActiveMQ brokers at CERN (netmon-mb.cern.ch and netmon-test-mb.cern.ch), they will be used to broadcast data collected by perfSONARs to experiments.
  • OSG will test-enable publishing of the perfSONAR results to the netmon-test-mb.cern.ch from the ITB collector service.
  • Proximity service - developed a mapping matrix that experiments could use to map storages to sonars and thus process the perfSONAR stream. Currently tested by LHCb, which is developing a perfSONAR to DIRAC connector.
  • New project to analyse meshes and report infrastructure issues vs network problems is being developed at AGLT2 (MadAlert http://maddash.aglt2.org/madalert.html). Plan is to continue to develop it targeting an eventual way to automate problem finding.
  • perfSONAR operations status
    • Progress made on the WLCG-wide meshes, latency mesh now with 70 sonars.
    • Validation of the perfSONAR 3.5rc1 started, final release expected in October.
    • ESNet is finalizing the development design document on the perfSONAR configuration interface - open source project motivated by the existing version developed for WLCG.

Report 30/07/2015

  • Successfully tested publishing of the perfSONAR results to the message bus directly from the OSG collector. Discussing possible SLA to run this as a production service in collaboration with OSG.
  • OSG datastore on track to go production at the end of July, this will be a service provided to the WLCG, it will store all the perfSONAR data and provide an API
  • Started testing proximity service, which helps to map sonars to storages and thus enables integration of the network and transfer metrics.
  • Review of the experiments use cases was presented/discussed at the last meeting, see slides for details (https://indico.cern.ch/event/393101/)
  • FTS performance study update - see slides for details (https://indico.cern.ch/event/393101/), observations from the report so far:
    • Peak transfer rates between Europe and North America are less asymmetric than they were last month (to be followed up)
    • Almost all incoming to BNL uses TCP=1 (Alejandro confirmed this is how BNL is configured right now, the other FTS instances use auto-tuning)
    • CMS T1s have better transfer rates compared to ATLAS and LHCb (to be followed up)
    • CMS uses TCP=1 more often than ATLAS and LHCb for large files
    • Transfers with TCP streams=1 time out about 2-3% of the time; however, the timeouts are concentrated at a few sites.
    • Throughput dependence on TCP streams possibly understood (see http://egg.bu.edu/lhc/fts/docs/2015-05-26-status/results_so_far.pdf)
  • perfSONAR operations status
    • Agreed to establish WLCG-wide meshes for top 100 sites (based on the contributed storage and location). This will enable full mesh testing of latencies, traceroutes and throughput (ongoing).
    • ESNet is interested in the perfSONAR configuration interface developed for WLCG; a development design document for an open-source project based on it is currently being discussed.

Report 02/07/2015

  • perfSONAR status
    • Agreed to establish WLCG-wide meshes for top 100 sites (based on the contributed storage and location). This will enable full mesh testing of latencies, traceroutes and throughput
    • Working in collaboration with ESNet to narrow down on an issue affecting latency measurements for long distance testing (US to Europe, Europe to Asia, etc.). A fix has been released and will be auto-deployed to all sites.
    • perfSONAR 3.5 RC is planned to be released next week. The following sites agreed to participate in the validation testbed: Nebraska, BNL, SWT2, AGLT2, MWT, TAMU, IEPSAS-Kosice
    • perfSONAR support involved in debugging the network issues at RAL

  • Successfully tested publishing perfSONAR results directly from the OSG collector (that populates OSG/esmond datastore).
  • Started testing proximity service, which helps to map sonars to storages and thus enables integration of the network and transfer metrics.
  • Next meeting will be on 8th of July (https://indico.cern.ch/event/393101/), planning a detailed update on OSG datastore and FTS performance study.

Report 18/06/2015

  • perfSONAR status
    • Proposed to establish WLCG-wide meshes for top 100 sites (based on their storage contribution and geographical location). This would enable full mesh testing of latencies, traceroutes and bandwidth.
    • Potential bug was identified and submitted to ESNet affecting latency measurements for long distance testing (US to Europe, Europe to Asia, etc.).
  • Currently evaluating the possibility to publish perfSONAR results directly from the OSG collector (that populates OSG/esmond datastore). Set of patches to extend the OSG collector were submitted for consideration.
  • Next meeting will be on 8th of July (https://indico.cern.ch/event/393101/), planning a detailed update on OSG datastore and FTS performance study.

Report 04/06/2015

  • perfSONAR status
    • Detailed report from the WG was presented on Monday at the LHCOPN-LHCONE meeting - LBL Berkeley (US) (https://indico.cern.ch/event/376098/)
    • Both LHCOPN and LHCONE meshes stable now, consistently delivering metrics. RAL shows signs of continuing network problems in both latency and bandwidth.
    • Based on the positive experience in ramping up latency mesh, we plan to establish full WLCG meshes for all types of tests and use it as a baseline for other meshes
    • In collaboration with ESNet, a bug was found in parsing tracepath results, causing significant reduction in efficiency of getting tracepath results. Plan is to revert back to traceroutes and only run low frequency tracepath tests until the issue is fixed.
    • The old mesh configuration interface hosted from grid-deployment.web.cern.ch will be decommissioned on Monday (8th of June). The few sites that still have the old URLs configured have been notified.
  • Network performance incidents process - new GGUS SU (WLCG Network Throughput) already available, more information at https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents
  • Test deployed esmond2mq at CERN (developed in collaboration with LHCb), core functionality works fine, waiting for the OSG datastore to enter production in order to run it continuously
  • Next meeting postponed to 10th of June (https://indico.cern.ch/event/382624/). Plan is to focus it on discussing full WLCG meshes proposal, proximity service and initial report from the FTS performance study.
  • Very special thanks for major contributions to the WG and farewell to Soichi Hayashi (OSG) and Aaron Brown (Internet2).

WLCG perfSONAR service status report on 2015-06-04 04:02:22.794725 =======

Active perfSONAR instances: 240
Registered/monitored perfSONAR instances: 260
perfSONAR-PS versions deployed: 
   3.4.1 : 17
   3.4.2 : 200
   Unknown: 21
Incorrectly configured (failing >4 metrics): 23 

Report 21/05/2015

  • perfSONAR status
    • Security: New SSL vulnerability dubbed Logjam: https://weakdh.org/sysadmin.html. WLCG perfSONAR hosts should NOT be vulnerable to this attack. The Apache configuration installed by the Toolkit disables the cipher suites in question by default.
  • Network performance incidents process - new GGUS SU (WLCG Network Throughput) will become available on 24th of June.
  • Next meeting 3rd of June (https://indico.cern.ch/event/382624/). Plan is to focus it on latency ramp up and proximity service.

Report 07/05/2015

  • perfSONAR status
    • Security: NDT 3.7.0.1 was released, fixing potential security issue in NDT. This shouldn't affect WLCG sites that followed our instructions, since they should have NDT/NPAD disabled. We encourage ALL sites to double check this and also to ensure they have auto-updates enabled. The latest perfSONAR Toolkit version that all sites should be running is 3.4.2-12.pSPS (Latest versions of all sub-components are Toolkit-3.4.2 (3.4.2-12.pSPS), BWCTL-1.5.4-1.el6, OWAMP-3.4-10.el6, NDT-3.7.0.1-2.el6, NPAD-1.5.6-3.el6, esmond-1.0-13.el6, Regular Testing Daemon-3.4.2-4.pSPS, iperf3-3.0.11-1.el6).
    • All meshes migrated from iperf to iperf3 and from traceroute to tracepath. This should improve our bandwidth measurements and enable MTU path discovery.
    • Very good progress in ramping up latency tests, currently with 34 sonars, we're able to consistently get results for all tested links.
  • Network performance incidents process put in place as was agreed at the last meeting (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents)
  • OSG/Datastore validation progressing well, resolved all performance issues and targeting July for production (progress already visible at http://psmad.grid.iu.edu/maddash-webui/).
  • Publishing results to message bus progressing, development has finalized for esmond2mq prototype and we plan to enter pilot phase. Initial version of the proximity service (mapping sonars to storages) in testing.
  • Last meeting held yesterday (https://indico.cern.ch/event/382623/) - focused on FTS performance
    • Hassen Riahi (FTS dashboard) reported on FTS performance for WLCG during the first phase of production (3 months)
    • Initial report on the FTS performance study presented by Saul Youssef (Boston University), common study for ATLAS, CMS and LHCb. Early results already provide valuable insights and also show how we could benefit from integrating FTS and perfSONAR. Agreed to follow up on a regular basis at the next meetings.
  • Next meeting 3rd of June (https://indico.cern.ch/event/382624/). Plan is to focus it on latency ramp up and proximity service.

WLCG perfSONAR service status report on 2015-05-07 04:02:24.706444 =======

Active perfSONAR instances: 235
Registered/monitored perfSONAR instances: 259
perfSONAR-PS versions deployed: 
   3.4.1 : 23
   3.4.2 : 183
   Unknown: 25
Incorrectly configured (failing >4 metrics): 17 

Report 02/04/2015

  • perfSONAR status
    • Security: CVE released today for cassandra, which is used by the perfSONAR measurement archive software, esmond. NO action required to protect perfSONAR Toolkit since vulnerable ports are both disabled and firewalled.
    • perfSONAR 3.4.2 was released and auto-deployed to 163 sonars, there are 42 instances still on 3.4.1. We no longer have any active instances on older versions.
    • We encourage ALL sites that are still on 3.4.1 to check status of their sonars (mainly disk space) and enable auto updates ASAP.
    • Significant improvement observed in getting consistently all the needed metrics after this update. The plan is to resume validation in LHCOPN/LHCONE and continue with a ramp up to full mesh latency tests.
    • Full mesh trace paths now at 80%
  • Network performance incidents follow up (proposal):
    • New mailing list and GGUS SU will be established to follow up, proposed name is wlcg-network-throughput, initial participation will be the same as for the WG mailing list (transfer systems, experiments, perfsonar support, esnet, lhcopn/lhcone).
    • Experiments can report to the GGUS SU/mailing list potential network performance incidents/degradations, WLCG perfSONAR support unit will investigate and confirm if this is network related issue. Once confirmed, it will notify relevant sites and will try to assist in narrowing down the problem to particular link(s). Affected sites will be contacted and should open an incident with their network providers. Tracking of the ongoing incidents will be done on the WG page (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents).
    • Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and if necessary escalate to their network provider while informing the wlcg-network-throughput mailing list. If confirmed to be WAN related, WLCG perfSONAR support unit can assist in further debugging of the problem. For the non-technical (policy) issues or if unclear, sites should escalate to the WLCG operations coordination.

WLCG perfSONAR service status report on 2015-04-02 04:02:22.925555 =======

Active perfSONAR instances: 233
Registered/monitored perfSONAR instances: 259
perfSONAR-PS versions deployed: 
   3.4.1 : 42
   3.4.2 : 163
   Unknown: 26
Incorrectly configured (failing >4 metrics): 27 

Report 19/03/2015

  • WG meeting was held on 18th of March (https://indico.cern.ch/event/379017/)
  • perfSONAR status
    • All sites should be running 3.4.1, final deadline was 16th of February, 5 sites received tickets (3 of them responded)
    • Testing/evaluation of the 3.4.2rc candidate ongoing, additional issues were identified and fixed by the ESNet developers team.
    • Plan is to follow up the testbed for next couple of days, if there are no issues reported, 3.4.2rc will get a green light (once released, this should propagate to all sites within 24 hours)
  • Datastore (esmond) status
    • Esmond testing is ongoing, gathering 100% of the meshes (some with missing data due to issues in 3.4.1)
  • Network performance incidents follow up
    • Procedure was proposed and is still under discussion within the WG.
  • Integration projects
    • Revised proposal for the experiment’s interface to perfSONAR, esmond2mq prototype was developed and tested, feedback will be reported to OSG and ESNet.
  • Next meeting: 8th of April (https://indico.cern.ch/event/382622/)

Report 05/03/2015

  • WG meeting was held on 18th of February (https://indico.cern.ch/event/372546/)
  • All sites should be running 3.4.1, final deadline was 16th of February, 5 sites received tickets (2 of them responded)
  • Follow up campaign to bring all perfSONARs to the correct configuration ongoing, started with LHCOPN/LHCONE instances, several issues found and reported
  • Testbed established to evaluate/test 3.4.2rc (release candidate), which was released last week. Several issues fixed that were reported by us during LHCOPN/LHCONE configuration campaign. One new issue found and reported to the development team.
  • New meshes: IPv6/IPv4 dual stack (led by Duncan Rand), Latin America (led by Renato Santana, Pedro Diniz)
  • Testing and evaluation of the pilot instances for esmond/maddash ongoing (psds.grid.iu.edu, psmad.grid.iu.edu)
  • Production instance of the infrastructure monitoring (psomd.grid.iu.edu) updated with new tests that check completeness/freshness of data in the local measurement archives (high level functional test)

  • Integration of the network and transfer metrics: two pilot projects proposed in the last WG meeting
  • LHCb pilot project to provide experiment agnostic prototype to access central datastore (esmond) and publish available metrics to messaging
  • Extending ATLAS FTS performance study to CMS and LHCb

  • Networking degradation between SARA and AGLT2 under investigation - to be followed up at the next WG meeting
    • Original issue noted when many large file transfers SARA->AGLT2 failed. Cause was FTS timeout since files 2-6GB were moving at 10-100s of Kbytes/sec. Problem reported to this working group.
    • perfSONAR regular tests between T2 and T1 have been paused so manual perfSONAR tests were done showing poor performance (200-500 Kbytes/sec).
    • Saul Youssef's examination of FTS logs indicated possible problematic trans-Atlantic link was involved. Additional reports of poor performance between CERN EOS and MWT2 used same link.
    • Recommended procedure (by LHCONE/LHCOPN working group) is to have either end-site contact their R&E network provider to open a ticket. AGLT2 contacted Internet2 and opened a ticket (ISSUE=2688 PROJ=144)
    • Temporary debug mesh setup to test paths between SARA, CERN and AGLT2,MWT2. See https://maddash.aglt2.org/maddash-webui/index.cgi?dashboard=Debug%20Mesh%20(temp)
    • Internet2 has opened ticket with GEANT(TT#2015022734000453) and the issue is actively being pursued.
      • Work underway getting suitable intermediate perfSONAR instances onto LHCONE to help localize the issue.
  • Next WG meeting will be on 18th of March (https://indico.cern.ch/event/379017/)
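
The FTS timeouts in the SARA to AGLT2 incident above follow directly from the observed rates: at tens to hundreds of Kbytes/sec, a multi-gigabyte file needs hours to days to complete. A minimal sketch of the arithmetic, assuming decimal units (1 GB = 10^9 bytes, 1 Kbyte = 10^3 bytes); the function name is hypothetical and the actual FTS timeout setting is not stated in the report.

```python
def transfer_hours(size_gb, rate_kbytes_per_s):
    """Hours needed to move size_gb gigabytes at a sustained rate in Kbytes/s."""
    if size_gb < 0 or rate_kbytes_per_s <= 0:
        raise ValueError("size must be non-negative and rate must be positive")
    seconds = (size_gb * 1e9) / (rate_kbytes_per_s * 1e3)
    return seconds / 3600.0


# File sizes and rates taken from the ranges quoted in the incident notes.
for size_gb in (2, 6):
    for rate in (10, 100, 500):
        print(f"{size_gb} GB at {rate} Kbytes/s: {transfer_hours(size_gb, rate):.1f} h")
```

Even the best case in the reported range, a 2 GB file at 100 Kbytes/sec, takes more than five hours, so any transfer timeout measured in minutes or a few hours would be exceeded.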

WLCG perfSONAR service status report on 2015-03-05 04:02:24.416548 =======

Active perfSONAR instances: 225
Registered/monitored perfSONAR instances: 249
perfSONAR-PS versions deployed: 
   3.2.2 : 1
   3.3.2 : 1
   3.4.1 : 207
   3.4.2 : 13
   Unknown: 27
Incorrectly configured (failing >4 metrics): 31 

Report 05/02/2015

  • WG still waiting on input from ATLAS on use-cases/requirements for network metrics
  • Meeting to discuss the use cases will be held on 18th of February (https://indico.cern.ch/event/372546/)

  • 2nd broadcast was sent to remind sites to update to 3.4.1 - final deadline is 16th of February - sites that won't update by this date will receive tickets
  • Production version of perfSONAR infrastructure monitoring available at http://pfomd.grid.iu.edu/ (you need to have your certificate loaded in the browser to access)
  • Pilot versions of maddash and datastore (http://pfds.grid.iu.edu) available
  • perfSONAR operations meeting was held last week - minutes available at https://indico.cern.ch/event/369420/
  • Agreed to start full mesh latency testing starting with top-k sites and gradually moving to all sites
  • Follow up campaign to bring all perfSONARs to the correct configuration

WLCG perfSONAR service status report on 2015-02-05 04:02:21.711952 =======

Active perfSONAR instances: 220
Registered/monitored perfSONAR instances: 241
perfSONAR-PS versions deployed: 
   3.2.2 : 1
   3.3.2 : 4
   3.4.1 : 185
   Unknown: 51
Incorrectly configured (failing >4 metrics): 51 

Report 04/12/2014

  • Metrics area meeting held last week, minutes available at https://indico.cern.ch/event/354593/
  • WG waiting on input from transfer systems and experiments on use-cases/requirements for network metrics
  • Strawman planned for early next year
  • Status of perfSONAR presented also at ATLAS jamboree yesterday
  • Update campaign ongoing, hard deadline for all sites to update is 8th January 2015
  • perfSONAR data store configured in ITB; stress testing ongoing

WLCG perfSONAR service status report on 2014-12-04 04:02:19.233227 =======

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.3.2 : 26
   3.4.1 : 118
   Unknown: 66
GOCDB registered total: 190
OIM registered total: 80
Unreachable instances (not monitored): 42
Incorrectly configured (failing >4 metrics): 69 

Report 20/11/2014

  • 107 instances updated to 3.4.1 following the WLCG and EGI broadcasts sent with the new install/update instructions
  • Second broadcast to be sent next week, deadline to update will be 8th January 2015
  • Planning to start validation of the existing 3.4.1 sonars next week
  • perfSONAR data store configured in ITB; stress testing to start next week
  • Metrics area meeting to be held next week (http://doodle.com/ezrfh8eybu7iybxyqzrcbze9)

WLCG perfSONAR service status report on 2014-11-20 04:02:13.263575 =======

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.3.2 : 29
   3.4.1 : 107
   Unknown: 74
GOCDB registered total: 190
OIM registered total: 70
Unreachable instances (not monitored): 45
Incorrectly configured (failing >4 metrics): 70 

Report 06/11/2014

WLCG perfSONAR service status report on 2014-11-06 04:02:17.829838 =======

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.2.2 : 1
   3.3.1 : 2
   3.3.2 : 47
   3.4.1 : 82
   Unknown: 78
GOCDB registered total: 188
OIM registered total: 55
Unreachable instances (not monitored): 47
Incorrectly configured (failing >4 metrics): 78 

Report 16/10/2014

  • Update on WG presented at GDB last week (Details at agenda)
  • perfSONAR 3.4 was released on 7th of October; we recommend that ALL sites wait to upgrade until the re-install instructions are broadcast via WLCG and EGI
  • Performed internal security audit in collaboration with perfSONAR developers - summary to be provided in the re-install instructions
  • Metrics area meeting was canceled, doodle for the new one will be sent shortly
  • POODLE: SSLv3.0 vulnerability (CVE-2014-3566) announced yesterday - https://access.redhat.com/articles/1232123 - affecting perfSONARs as well. Patches from distributions are not available yet (16th Oct); the perfSONAR team provided their own fixes yesterday (perl-perfSONAR_PS-Toolkit-3.4-29.pSPS and perl-perfSONAR_PS-Toolkit-SystemEnvironment-3.4-29.pSPS). We recommend that all sites running 3.3 temporarily disable SSLv3. We recommend that ALL sites wait to upgrade to 3.4 until the re-install instructions are broadcast via WLCG and EGI.
  • perfSONAR operations meeting was held on Friday, Oct 3 at 3PM, minutes at https://indico.cern.ch/event/342995/
    • Highlights: Agreed to introduce several major changes in operations (introduce GGUS SU, security mailing list, setup infrastructure monitoring, introduce automated mesh configurations)
    • Next operations meeting will be held next week, please vote at http://doodle.com/qydib32fkv48er2r

WLCG perfSONAR service status report on 2014-10-16 04:03:54.594325 =======

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.2.2 : 1
   3.3.1 : 2
   3.3.2 : 66
   Unknown: 141
GOCDB registered total: 172
OIM registered total: 55
Unreachable instances (not monitored): 79
Incorrectly configured (failing >4 metrics): 109 

Report 02/10/2014

  • Details on the Shellshock vulnerabilities and their impact on perfSONAR available at https://twiki.cern.ch/twiki/bin/view/LCG/ShellShockperfSONAR
  • We recommend ALL sites that didn't patch bash before Friday Sep 26 to terminate their instances and wait until perfSONAR 3.4 is released
  • perfSONAR 3.4 to be released on Mon Oct 6, WLCG and EGI broadcasts will be sent with the installation instructions
  • perfSONAR operations meeting this Friday (Oct 3 at 3PM), agenda at https://indico.cern.ch/event/342995/

WLCG perfSONAR service status report on 2014-10-02 04:02:15.996763 =======

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.3.1 : 2
   3.3.2 : 96
   Unknown: 112
GOCDB registered total: 173
OIM registered total: 55
Unreachable instances (not monitored): 90
Incorrectly configured (failing >4 metrics): 111 

Report 18/09/2014

  • Kick-off meeting minutes and slides available at https://indico.cern.ch/event/336520/
  • The meeting had very good participation including experiments, ESNet Science Engagement Group (perfSONAR development team), Panda, PhEDEx, FTS, FAX as well as majority of the perfSONAR regional contacts. An initial overview of the current status in the network and transfer metrics was presented and a list of topics and tasks to work on in the short-term was proposed. Very good feedback was received and we have agreed on the topics to discuss at the follow up meetings.
  • Please check Twiki for updated task table
  • 5 sites received tickets on running an outdated version of perfSONAR

WLCG perfSONAR service status report on 2014-09-18 09:56:52.693187 =======

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.2.2 : 4
   3.3.1 : 3
   3.3.2 : 175
   Unknown: 28
GOCDB registered total: 172
OIM registered total: 53
Unreachable instances (not monitored): 7
Incorrectly configured (failing >4 metrics): 26 

Report 04/09/2014

  • Kick-off meeting will take place on Mon 8th of Sept at 3PM CEST
  • Early version of the WLCG perfSONAR configuration interface will be deployed to production next week.
  • Pythia Network Diagnosis Infrastructure (PuNDIT) project will be funded by NSF and starts at the beginning of September (led by Shawn). The project will use perfSONAR-PS data to identify and localize network problems using the Pythia algorithms. PuNDIT will collaborate with OSG and WLCG over its two year duration.
  • Sites with incorrect versions of perfSONAR will receive tickets at the beginning of next week (9 sites in total)

WLCG perfSONAR service status report on 2014-09-04 10:03:22.949799 =======

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.2.2 : 6
   3.3.1 : 3
   3.3.2 : 173
   Unknown: 28
GOCDB registered total: 170
OIM registered total: 53
Unreachable instances (not monitored): 8
Incorrectly configured (failing >4 metrics): 28 

Report 21/08/2014

  • Updated WG page with list of members, task tracking, coming events and reports (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics)
  • Kick-off meeting will take place on Mon 8th of Sept at 3PM CEST
  • On July 21st perfSONAR Toolkit 3.4rc2 became available for testing, version 3.4 is a major milestone for the WG as it enables access via REST API and introduces several important performance improvements, therefore deployment campaign will follow once we get a stable release
  • Work is progressing on the WLCG perfSONAR configuration interface (finalized design, work is ongoing on a prototype implementation)
  • OSG perfSONAR datastore plan has been agreed and testing of the store based on esmond is ongoing

WLCG perfSONAR service level report on 2014-08-20 16:59:32.876708 =======

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.2.2 : 6
   3.3.1 : 3
   3.3.2 : 174
   Unknown: 27
GOCDB registered total: 170
OIM registered total: 53
Unreachable instances (not monitored): 8
Incorrectly configured (failing >4 metrics): 30

-- MarianBabik - 19 May 2014

Topic revision: r159 - 2019-07-29 - MarianBabik
 