WLCG Network Throughput WG
Mandate
- Ensure sites and experiments can better understand and fix networking issues
Objectives
- Oversight of the perfSONAR network infrastructure
- Coordination of the WLCG network performance incidents
- Detection and follow up on issues seen by the perfSONAR network
Meetings
Regular meet-ups now take place only during LHCOPN/LHCONE and HEPiX workshops; monthly reports presented at WLCG operations coordination can be found below.
Members
Shawn McKee (chairperson), Marian Babik (co-chair), ATLAS (Alessandro di Girolamo), CMS (Nicolo Magini), LHCb (Stefan Roiser, Joel Closier), ALICE (Latchezar Betev, Costin Grigoras), FAX (Ilija Vukotic), FTS (Michail Salichos, Oliver Keeble), PanDA (Kaushik De), Rucio (Vincent Garonne), BelleII (Silvio Pardi)
perfSONAR contacts: US-ATLAS (Shawn McKee), US-CMS (Jorge Alberto Diaz Cruz), UK-ALL (Alessandra Forti, Duncan Rand), IT-ATLAS (Alessandro de Salvo), IT-CMS (Enrico Mazzoni), CA-ALL (TBD), FR-ALL (Frederique Chollet, Laurent Caillat, Frederic Schaer), TW-ALL (Eric Yen), ND-ALL (TBD), DE-ALL (Guenter Duckeck, Andreas Petzold, DE-KIT: Bruno Hoeft, Aurelie Reymund), ES-ALL (Fernando Lopez, Josep Flix), CERN (Stefan Stancu), LHCOPN/LHCONE (John Shade, ESNet: TBD), RU-ALL (Victor Kotlyar), ESnet Science Engagement group (Jason Zurawski), BelleII (Silvio Pardi), ASGC (Wen-Shui Chen, Felix Lee)
Contacts
Primary contact is via mailing list
wlcg-network-throughput-wg@cernNOSPAMPLEASE.ch. Previous mailing lists (wlcg-ops-coord-tf-perfsonar and wlcg-ops-coord-wg-metrics) are still active and are defined as aliases to the primary mailing list. The primary mailing list has two sub-groups:
wlcg-perfsonar-support@cernNOSPAMPLEASE.ch and
throughput-l@listsNOSPAMPLEASE.bnl.gov, which are used to organize and follow up on the corresponding European and North American throughput calls.
Network Throughput Support Unit
Network Performance Incidents Follow up Procedure
The main motivation for this procedure is to investigate network performance issues with the assistance of the perfSONAR team. The focus is on performance issues, and the primary objective is to confirm whether an observed transfer problem is network related. If it is confirmed to be a WAN issue, the next step is to work with the perfSONAR team to narrow it down to a particular network link and thus help identify who might be responsible for it. The full text of the procedure follows:
- The new GGUS support unit (WLCG Network Throughput; https://wiki.egi.eu/wiki/GGUS:WLCG_Network_Throughput) can be used to report incidents. The corresponding mailing list is wlcg-network-throughput at cern.ch; initial participation there is the same as for the WG mailing list (transfer systems, experiments, perfSONAR support, ESnet, LHCOPN/LHCONE).
- Experiments can report potential network performance incidents or degradations to the mailing list. The WLCG perfSONAR support unit will investigate and confirm whether the issue is network related. Once confirmed, it will notify the relevant sites and will try to assist in narrowing down the problem to particular link(s). Affected sites will be contacted and should open an incident with their network providers. Ongoing incidents will be tracked on the WG page.
- Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and, if necessary, escalate to their network provider while informing the wlcg-network-throughput mailing list. If the problem is confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging. For non-technical (policy) issues, or if the situation is unclear, sites should escalate to WLCG operations coordination.
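The first triage step of the procedure above (deciding whether a reported transfer problem looks network related) can be sketched roughly as follows. This is only an illustration, not the support unit's actual tooling: the `PathSample` type, the `triage` function, and both thresholds are hypothetical and would have to be tuned against real perfSONAR loss and throughput measurements.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical thresholds, chosen for illustration only; the WG page does not define them.
LOSS_THRESHOLD = 0.01        # average packet loss above 1% is treated as suspicious
THROUGHPUT_FRACTION = 0.5    # throughput below 50% of the baseline is treated as suspicious


@dataclass
class PathSample:
    """One measurement between a source and a destination (hypothetical structure)."""
    packet_loss: float       # fraction of packets lost, between 0 and 1
    throughput_mbps: float   # measured throughput in Mbit/s


def triage(samples, baseline_mbps):
    """Return a coarse verdict for a source/destination pair."""
    if not samples:
        return "insufficient data"
    avg_loss = mean(s.packet_loss for s in samples)
    avg_throughput = mean(s.throughput_mbps for s in samples)
    if avg_loss > LOSS_THRESHOLD or avg_throughput < THROUGHPUT_FRACTION * baseline_mbps:
        return "network suspected"
    return "network not suspected"
```

A "network suspected" verdict would correspond to the point in the procedure where the affected sites are notified and the problem is narrowed down to particular links; anything else points towards storage, transfer services, or other non-network causes.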
Network Performance Incidents
| Incident | Ticket | Comments |
| GSI KISTI | via mailing list | Resolved: LHCONE routing issue at KISTI |
| IEPSAS-Kosice | via mailing list | Resolved: very high loss and latency due to improper placement of perfSONARs on OpenStack |
| MWT2 LHCONE | via mailing list | Resolved: In collaboration with ESnet, LHCONE prefixes for MWT2 were re-published |
| EELA-UTFSM <- UK | GGUS:143770 | Resolved: Unable to reach UTFSM from UK non-LHCONE sites, reported to JANET/JISC |
| CBPF -> CNAF, PIC, IN2P3 | via mailing list | Resolved: CBPF reverted from LHCONE to regular GEANT path |
| AUSTRALIA-ATLAS -> PIC | GGUS:143894 | Resolved: PIC throughput has known limitation due to link capacity |
| RAL IPv6 consist. loss | GGUS:140447 | Resolved: External router upgrade/fix |
| UK sites | GGUS:143218 GGUS:143220 | Resolved: Router hosting the GEANT connection not fully distributing the affected prefixes to all of the JANET core |
| CERN inbound | OTG:0052301 | Resolved: All CERN IPv4 prefixes were leaked to LHCONE GEANT by TIFR |
| JINR inbound | GGUS:141954 | Resolved: packet loss seen btw Geneva/Moscow, trans. module had to be replaced in Frankfurt |
| EU sites to IHEP/CN | via mailing list | Resolved: Routing issue - ticket with GEANT was opened by IHEP, peerings were updated |
| UFL to IC | via mailing list | Resolved: transfer rates improved before root cause was found |
| US T2s/AMS to CERN | GGUS:139866 GGUS:139874 | Resolved: ESNet network incident impacting US to CERN connectivity (also impacted AMS) |
| SARA to CERN | GGUS:138472 | Resolved: MTU issue on IPv6 suspected, but was just packet loss in the end |
| RAL/SARA to IN2P3 | GGUS:137967 GGUS:137972 GGUS:137994 GGUS:139756 | Resolved: Packet loss on the link due to congestion, IN2P3 has a ticket with RENATER (resolved by upgrading) |
| IN2P3-CC to UTA_SWT2 | via mailing list | Resolved: Possible saturation on LHCONE at/close to IN2P3-CC |
| AGLT2 inbound | via mailing list | On-going: Narrowed down to ESNet -> AGLT2 segment |
| FNAL inbound | GGUS:137632 | Resolved: Bad link was identified by FNAL |
| IHEP-CN - JINR/IHEP-SU | GGUS:136606 GGUS:136332 | On-going: more efficient transit path is missing btw. concerned NRENs, to be followed up in Asia Forum/LHCOPN-LHCONE WS |
| DESY/FNAL | GGUS:135962 | Resolved: Tests didn't indicate any obvious network issue (* but not all relevant network aspects could be tested). |
| UFlorida - Kharkov | via mailing list | Resolved: MTU step-down issue - pmtu discovery ACL was fixed by UF |
| UNI-Freiburg | GGUS:135304 | Resolved: CERN prefixes missing in the routing announcements to SWITCH |
| DESY inbound | GGUS:134470 | Resolved: Network configuration tuned/changed at DESY |
| AGLT2/LHCONE | via mailing list | Resolved: Performance issue to LHCONE sites, narrowed down to US/ESNet segment (module issue) |
| NCP/Pakistan commissioning | via mailing list | On-going: Investigated in collaboration with GlobalNOC (report), proposed routing changes for TEIN3 |
| CYFRONET/RRC-KI | GGUS:131375 | Resolved: MTU step-down (Resolved by PSNC NREN) |
| BEgrid-ULB-VUB UKI-LT2-IC-HEP | GGUS:132286 | Resolved: IceCUBE flows overloading BEgrid-ULB-VUB networks |
| NDGF/BNL from multiple locations | GGUS:131975, GGUS:131981 | Resolved: Issue with FTS at RAL |
| RO-02-NIPNE to multiple locations | GGUS:128489 | Resolved: MTU step-down + load balancing suspected; NREN was contacted by NIPNE |
| PIC to PL Swierk | GGUS:130112 | Resolved: Unable to investigate as no pS at PL Swierk, but error suggesting a storage problem |
| CNAF/RALPP | GGUS:130112 | Resolved: Investigated and resolved as non-network issue |
| Oxford | GGUS:130032 | On-going: Significant issues seen during August (down to 50Mbs), perf improved afterwards but still not at levels seen last year |
| SARA/IC | GGUS:129964 | Resolved: Issue with firmware router at SARA network provider |
| NCP/Pakistan | via mailing list | Resolved: QoS issue, IPv6 performs fine |
| CBPF to CNAF, PIC and IN2P3 | via mailing list to LHCONE ops, GGUS:129561 from LHCb | Resolved: MTU step down issue within RNP |
| T0 to JINR | GGUS:129544 | Resolved: by JINR putting in place new Moscow - Dubna link and fixing asymmetries in routing |
| IN2P3 NIKHEF to UC | via mailing list | Resolved: Univ. of Chicago investigated (root cause unknown) |
| BNL ASGC | ESNet ticket ESNET-20170123-005 | Resolved: Issue opened by WG; Resolved by ESNet |
| IHEP EU | GGUS:125623 | Resolved: by NREN (site was not notified of the ongoing network issue) |
| UNL FNAL | via mailing list | Resolved: UNL investigated (root cause unknown) |
| CERN RRCKI | GGUS:124538 | Resolved: RRCKI re-routed from AMS to BUD, root cause for congested path RRCKI-AMS was not understood |
| MIT inbound throughput | via mailing list | Resolved; MIT opened ticket with Internet2 |
| EELA-UTFSM MWT2_UC | via mailing list | Resolved: gsiftp timeouts, non-network issue |
| McGill BU | GGUS:123285 | Resolved, gridftp timeouts, but re-appeared, network seems to perform well, likely an issue with storage |
| Victoria - Prague | via mailing list | Resolved; grid output retrieval failing; asymmetric paths and MTU step down issues |
| SARA consistent loss | GGUS:121687 | Resolved after SARA migrated to the new data centre |
| RAL consistent loss | GGUS:121687 | Resolved, RAL router upgraded |
| BNL RAL CERN | GGUS:121687 | Resolved, issue with RAL router |
| BNL SARA CERN | GGUS:120957 | Resolved, issue with ESNet router at CERN and saturated link CERN/SARA (was upgraded to 20Gbps) |
| ASGC CERN IJS | GGUS:119820 | Resolved, issue with router at ASGC and IJS firewall |
| CBPF | GGUS:120081 | Resolved: RNP stopped publishing to ESNet CBPF IPs |
| FNAL CERN | GGUS:119551 | Resolved: fixed by ESNet - faulty router interface in New York |
| PIC inbound | via mailing list | Resolved: 10 Gbps link WAN at PIC sharing LHCOPN,LHCONE was completely saturated causing input discards |
| BNL to PIC | via mailing list | Resolved: LHCOPN link CERN-PIC was flapping a lot due to an issue with the Geant fibre to Spain |
| MAINZ CA | via mailing list | Resolved: MAINZ uses a "commercial" network provider and Canadian sites only peer with R&E networks |
| OU inbound | via mailing list | Resolved: Narrowed down to a faulty switch on site |
| CA EU | GGUS:118748, GGUS:118730 | Resolved: Trans-atlantic channel instability, resolved by re-routing at Canarie |
perfSONAR 100G
Security Announcements
- Security: SL6 is no longer supported by the perfSONAR developers. Please upgrade to CC7 as soon as possible, or shut down the node if you are unable to upgrade. We also strongly recommend enabling auto-updates in all cases.
- Security: New SSL vulnerability dubbed Logjam (https://weakdh.org/sysadmin.html). WLCG perfSONAR hosts should NOT be vulnerable to this attack. The Apache configuration installed by the Toolkit disables the cipher suites in question by default.
- Security: CVE released 2nd of April 2015 for Cassandra, which is used by the perfSONAR measurement archive software, esmond. NO action is required to protect the perfSONAR Toolkit since the vulnerable ports are both disabled and firewalled.
Links
OSG/WLCG Networking Documentation (Deployment and Installation Guides)
OSG/WLCG Network Monitoring Platform and related projects
perfSONAR Stream Structure
perfSONAR Dashboards
perfSONAR Central Configuration
perfSONAR Infrastructure Monitoring
ATLAS Analytics Platform
OSG toolkit info page:
Meetings
- Regular meet-ups now only take place during LHCOPN/LHCONE and HEPiX workshops
- Past WG meetings can be found at https://indico.cern.ch/category/4372/
- 01/28/2015 perfSONAR operations meeting
- 11/26/2014 Metrics area meeting
- 10/22/2014 perfSONAR operations meeting
- 10/3/2014 perfSONAR operations meeting
- 9/8/2014 Network and Transfer Metrics WG Kick-off meeting
- 9/15/2014 LHCOPN and LHCONE joint Meeting: Ann Arbor (US) 15-16 of September agenda
Presentations
Reports
Report 05/11/2020
- Update on WG activities and plans will be presented at WLCG ops coordination (tentative Dec)
- perfSONAR infrastructure - 4.3.0 was released this week
- Release notes https://www.perfsonar.net/releasenotes-2020-11-02-4-3-0.html
- Release focused on python3 migration, but also contains important changes in PWA and new testing tools were added: ethr and s3 benchmark
- Bug was identified and reported to developers yesterday impacting around 36 nodes in OSG/WLCG (out of 166 that auto-updated); bug-fix release (4.3.1) is in the works
- WLCG/OSG Network Monitoring Platform
- Work on publishing directly from perfSONAR toolkits - testing started for USATLAS/USCMS sites
- AGLT2 had a major outage due to air conditioning failure this week, which impacted some of the central services (psconfig, psetf, psmad)
- EU project ARCHIVER plans to use perfSONAR to test cloud connectivity
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 03/09/2020
- perfSONAR infrastructure status - version 4.2.4 was released - please upgrade
- WG activities were presented yesterday at the OSG All hands meeting (https://indico.fnal.gov/event/22127/contributions/194783/)
- OSG/WLCG infrastructure
- Migration to the push model - direct publishing of results from toolkits to RabbitMQ - implementation is ready, plan is to migrate all US sites in the coming weeks
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 02/07/2020
- perfSONAR infrastructure status - version 4.2.4 was released - please upgrade
- 100 Gbps perfSONAR mesh was established with 7 participating sites (TRIUMF, CERN, BNL, KIT, IC, AGLT2, Prague)
- New LHCONE mesh established testing from sites to R&E perfSONAR endpoints (on LHCONE)
- OSG/WLCG infrastructure
- Discussing plan of migration to the push model - direct publishing of results from toolkits to RabbitMQ
- ESnet (router) traffic feed now available, working on its integration to our pipeline - prototype already working
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 07/05/2020
- perfSONAR infrastructure status - version 4.2.4 was released - please upgrade
- Update on the WG activities will be presented next week at the virtual LHCOPN/LHCONE workshop (https://indico.cern.ch/event/888924/)
- OSG/WLCG infrastructure
- New dashboards are now available providing high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
- Working on a new LHCONE mesh that will focus on testing from sites to R&E endpoints
- Meeting with perfSONAR developers this week on publishing measurements to message bus directly from perfSONAR toolkit - discussed different options and possible strategy going forward
- ESnet (router) traffic feed now available, working on its integration to our pipeline - prototype already working
- Also started working on integration of the OSG HTCondor jobs statistics (network related) - will be added to our pipeline and stream
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 02/04/2020
- perfSONAR infrastructure status - 4.2.3 and 4.2.4 versions were released
- OSG/WLCG infrastructure
- New dashboards are now available providing high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
- The aim is to make it easier to identify new issues that are not easy to spot with the experiments' data management systems (network instabilities that could impact network performance).
- Started identifying interesting cases showing up in the new dashboards, documenting them and following up
- ESnet (router) traffic feed now available, working on its integration to our pipeline
- Also started working on integration of the OSG HTCondor jobs statistics (network related) - will be added to our pipeline and stream
- 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 05/03/2020
- LHCOPN/LHCONE Asia (8-9th March) was re-scheduled to take place during HSF/WLCG workshop, but as the workshop is postponed there will be half-day virtual meeting instead (on 13th of May)
- perfSONAR infrastructure status - 4.2.3 version was released this week
- OSG/WLCG infrastructure
- New dashboards were created to provide high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
- Still work in progress but feedback is welcome
- Dashboards are already showing some of the results of the analytical studies that were performed as part of SAND project
- The aim is to make it easier to identify new issues that are not easy to spot with the experiments' data management systems (network instabilities that could impact network performance in non-deterministic ways).
- Looking into dublin-traceroute (extension of paris-traceroute) and possible integration in perfSONAR
- 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 30/01/2020
- CERN Networking Week took place 13-17th January (https://wiki.geant.org/display/SIGNGN/4th+SIG-NGN+Meeting)
- Feedback from LHCOPN/LHCONE workshop (https://indico.cern.ch/event/828520/)
- Importance of network monitoring was stressed by most of the experiments (covered many topics including perfSONAR up to requests for detailed packet telemetry)
- Focus on analytics, better insights into existing results would be beneficial for most of the experiments
- DOMA project had a dedicated slide on perfSONAR, highlighted it as a very useful diagnostic tool.
- DUNE is planning to establish perfSONAR mesh
- Several experiments have mentioned lack of available/used capacity monitoring
- Some experiments have mentioned missing API to access network LHCOPN/LHCONE topologies
- Next steps and follow up discussion will take place at the LHCOPN/LHCONE Asia (8-9th March)
- LHCOPN/LHCONE WS had also a dedicated session on the future of LHC networking
- A dedicated TF will be set up to work on packet tagging/pacing and network orchestration in close collaboration with the experiments
- perfSONAR infrastructure status - please ensure you're running the latest version 4.2.2-1.el7
- 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 12/12/2019
Report 14/11/2019
Report 16/05/2019
- Detailed status update was presented at HEPiX (https://indico.cern.ch/event/765497/contributions/3351215/)
- CHEP abstract to be submitted (https://docs.google.com/document/d/1O5PhgCmdwbYJpL7qHpPFxxMLaGh1aWbMWyO69pXk-H0/edit)
- perfSONAR infrastructure status - CC7/4.1 campaign
- All T1s updated and re-configured, except TRIUMF (waiting for hw) and RRC-KI (missing IPv6); we have started to follow up with T2s
- Overall we have 176 perfSONARs on 4.1 (137 on 4.1.6); status has significantly improved
- 4.2.0 release soon - will bring preemptive scheduling & gridftp testing
- WLCG/OSG network services were updated
- Issues with the psmad dashboard were fixed, dashboard now well populated (OPN, UK and FR meshes in very good shape; psmad/maddash)
- http://monit-grafana-open.cern.ch also now well populated, some issues with site mapping due to IPv6 fixed, others still remain (mostly due to too many sources/complex topology processing)
- New collector is now in production, re-written from scratch within SAND project, improved performance (lowered latency)
- Work is on-going in both SAND and IRIS-HEP to switch all perfSONAR to report measurements directly to the message bus (real-time measurements capability)
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
- 100 Gbps perfSONARs now at SARA, CERN, CSCS, BNL (80Gbps), KIT (in QA)
- perfSONAR now part of the cloud benchmark testing developed in OCRE project (https://github.com/cern-it-efp/OCRE-Testsuite/)
Report 07/03/2019
- perfSONAR infrastructure status - CC7/4.1 campaign ongoing
- perfSONAR 4.0 and perfSONARs on SL6 are no longer supported since Q4 2018 - please update ASAP
- New baseline version for perfSONAR is the latest release 4.1.6 (fixes important bug causing duplicate testing)
- WLCG/OSG network services were updated
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 07/02/2019
- perfSONAR infrastructure status - CC7/4.1 campaign ongoing
- perfSONAR 4.0 and perfSONARs on SL6 are no longer supported since Q4 2018 - please update ASAP
- We have started ticketing sites, starting with T1s and major T2s
- WG update will be presented at HEPiX in San Diego
- WLCG/OSG network services were updated
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 08/11/2018
- perfSONAR infrastructure status - CC7/4.1 campaign ongoing
- Sites were reminded to upgrade to CC7 and review their configuration (preferably by end of October)
- Still only around 50% of nodes are on CC7 as of today - we'll soon start contacting sites directly
- Some sites waiting for/deploying new hardware; e.g. SARA deployed 100Gbps perfSONAR (first in Europe), BNL deployed 2x40 Gbps perfSONAR
- WG update was presented at HEPiX and LHCOPN/LHCONE workshop
- WLCG/OSG network services working fine
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 13/09/2018
- perfSONAR infrastructure status
- perfSONAR 4.1 was released a few weeks ago - main new feature is an improved central/remote configuration
- WLCG broadcast was sent this week to remind sites to upgrade to CC7 and review their configuration (preferably by end of October)
- Around 50% of sonars are on CC7 as of today
- WG update will be presented at the upcoming HEPiX
- WLCG/OSG network services
- Central configuration service (meshconfig/psconfig) was updated to the version released in 4.1 (officially supported by perfSONAR team)
- psconfig.opensciencegrid.org is currently unreachable via IPv6 from non-LHCONE sites due to issue with routing, this is being followed up by the network team at MSU
- NSF funded projects: SAND and IRIS-HEP are starting, both will contribute in different ways to the OSG Network Area - more details will be provided in the HEPiX talk
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 07/06/2018
- perfSONAR infrastructure status
- perfSONAR 4.1 beta will be released in the coming weeks - main new feature is an improved central/remote configuration
- CC7 campaign had only modest progress recently - 86 instances on CC7 (from 81 in April, out of total 210)
- WLCG broadcast will be sent to remind sites to plan an upgrade to CC7 and review their configuration
- WG update was presented at HEPiX and will be presented at CHEP
- WLCG/OSG network services
- Following retirement of OSG GOC, all central services were migrated to AGLT2, which took considerable effort in planning and deployment
- Transition happened without downtime and was transparent to all sites
- The one exception is sites using the old OIM/myOSG central configuration URL, which was deprecated during the 3.5 update campaign (meshconfig URLs starting with myosg.grid.iu.edu/pfmesh...)
- Impacted sites are asked to update their meshconfig-agent.conf ASAP, following http://opensciencegrid.org/networking/perfsonar/installation/#installation
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
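Sites unsure whether they are affected by the deprecated central configuration URL mentioned in this report could check their agent configuration with something like the sketch below. It only looks for the `myosg.grid.iu.edu/pfmesh` prefix quoted above; the function name is hypothetical, and the file location shown in the comment is an assumption that may differ between installations.

```python
# Prefix of the deprecated OIM/myOSG meshconfig URLs, as quoted in the report.
DEPRECATED_PREFIX = "myosg.grid.iu.edu/pfmesh"


def find_deprecated_urls(config_text):
    """Return (line_number, line) pairs for active lines that reference the deprecated prefix.

    Lines starting with '#' are treated as comments and ignored.
    """
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        if DEPRECATED_PREFIX in stripped:
            hits.append((number, stripped))
    return hits


# Example usage (the path is an assumption, adjust it to the local installation):
# with open("/etc/perfsonar/meshconfig-agent.conf") as handle:
#     for number, line in find_deprecated_urls(handle.read()):
#         print(f"line {number}: {line}")
```

An empty result means no active line still points at the old service; any hit should be replaced following the installation guide linked above.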
Report 12/04/2018
- perfSONAR 4.0.2 and CC7 campaign - 210 instances updated to 4.0.2; 81 instances already on CC7
- WLCG broadcast will be sent to remind sites to plan an upgrade to CC7 and review the firewall port openings
- perfSONAR 4.1 release, planned in Q2 2018 will no longer ship SL6 packages
- Attended perfSONAR developers F2F meeting in Amsterdam and presented feedback from OSG/WLCG
- WG reports planned for upcoming HEPiX and CHEP
- Networking and perfSONAR were also major topics at the OSG-All Hands (https://indico.fnal.gov/event/15344/)
- 4 presentations were given on various topics related to the WG
- One of the outcomes was a proposal to create a dedicated site-based documentation showing all links relevant to a given site
- WLCG/OSG network services
- Outreach and other activities:
- GEANT has added several perfSONAR instances on LHCONE at their major network hubs (ams, gva, lon, par, fra) - both IPv4 and IPv6
- Advania was added to HNSciCloud test mesh
- MGHPCC (http://www.mghpcc.org/) plans to deploy up to 22 perfSONARs, currently in discussion how we can help
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 01/03/2018
- perfSONAR 4.0.2 and CC7 campaign - 190 instances updated to 4.0.2; 64 instances already on CC7
- WLCG broadcast will be sent to remind sites to plan an upgrade to CC7 and review the firewall port openings
- perfSONAR 4.1 release, planned in Q2 2018 will no longer ship SL6 packages
- WLCG/OSG network services
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
- LHCOPN/LHCONE Workshop will take place next week - update on WG activities will be presented (https://indico.cern.ch/event/681168/)
- perfSONAR developers F2F meeting will take place next week in Amsterdam - feedback from OSG/WLCG will be presented
Report 18/01/2018
- perfSONAR 4.0.2 - 190 instances updated out of which 53 are already on CC7
- WLCG broadcast will be re-sent next week to remind sites of the upcoming important dates and new documentation
- perfSONAR 4.1 release, planned in Q1 2018 will no longer ship SL6 packages
- EOL for SL6 support in Q3 2018
- All sites are encouraged to upgrade to CC7 as soon as possible
- WLCG/OSG network services
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 06/12/2017
- perfSONAR workshop held by JISC in November (slides) - WLCG WG activities mentioned
- perfSONAR 4.0.2 was released on November 28th
- WLCG broadcast will be sent this week to notify sites of the upcoming important dates and new documentation
- perfSONAR 4.1 release, planned in Q1 2018 will no longer ship SL6 packages
- EOL for SL6 support in Q3 2018
- All sites are encouraged to upgrade to CC7 as soon as possible
- WLCG/OSG network services
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
- HNSciCloud will use perfSONAR results for network performance evaluation of the providers
- HEPiX WG on SDN/NFV was established and will look into networking R&D topics - sites interested please subscribe via https://listserv.in2p3.fr/cgi-bin/wa?SUBED1=hepix-nfv-wg
Report 02/11/2017
- WG update was presented at HEPiX and LHCOPN/LHCONE workshop (co-located)
- perfSONAR 4.0.2 is planned to be released in November
- WLCG/OSG network services
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
- HNSciCloud meshes are being created (per provider), will enable tests between HNSciCloud sites and providers
Report 05/10/2017
- WG update will be presented at HEPiX and LHCOPN/LHCONE workshop (co-located)
- perfSONAR YouTube channel at https://www.youtube.com/channel/UCjK-P49pAKK9hUrrNbbe0Sg
- perfSONAR 4.0.1 auto-deployed to 197 instances (21 are already on centos7)
- Port 443/https is now used as a controller port for pscheduler and needs to be open on central firewalls
- Some sites suffer from an MA access issue after the upgrade, this is being followed up
- perfSONAR 4.0.2 is planned to be released in November
- Brings new SNMP plugin that can be used to retrieve local site router traffic
- WLCG/OSG network services
- New documentation is in preparation and will be hosted at https://opensciencegrid.github.io/networking/
- OSG collector handling multiple backends (Datastore, CERN ActiveMQ and GOC RabbitMQ) now in production
- GOC will distribute raw data to 3 different locations, FNAL for tape archive, Nebraska for long-term ES storage, Chicago for short-term ES storage
- Preparing new LHCOPN and perfSONAR dashboards in collaboration with CERN IT/CS and IT/MONIT
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
- HNSciCloud will create its own perfSONAR mesh to follow up on the network performance btw. providers and sites
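The report above notes that port 443/https is now used as the pscheduler controller port and must be open on central firewalls. A quick reachability check from a remote host might look like the following sketch; the function name is hypothetical and the hostname in the comment is a placeholder. It only verifies that a TCP connection can be established, not that pscheduler itself is answering.

```python
import socket


def is_port_open(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the name and tries each returned address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage (placeholder hostname, replace with a real perfSONAR node):
# print(is_port_open("perfsonar.example.org"))
```

Running the check from outside the site network is what matters here, since a test from inside the firewall says little about what remote pscheduler instances will see.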
Report 14/09/2017
- WG update will be presented at HEPiX and LHCOPN/LHCONE workshop (co-located)
- perfSONAR 4.0.1 was released and was auto-deployed to 187 instances (21 are already on centos7)
- WLCG/OSG network services
- New documentation is in preparation and will be hosted at https://opensciencegrid.github.io/networking/
- New central mesh configuration interface (MCA) and monitoring (ETF) in production (http://meshconfig.grid.iu.edu; https://psetf.grid.iu.edu/etf/check_mk/)
- OSG collector handling multiple backends (Datastore, CERN ActiveMQ and GOC RabbitMQ) now in production
- GOC will distribute raw data to 3 different locations, FNAL for tape archive, Nebraska for long-term ES storage, Chicago for short-term ES storage
- Central dashboard service (psmad.grid.iu.edu) suffers from a bug which prevents showing statuses correctly (as well as retrieve the graphs), ESNet is working on a fix
- Preparing new LHCOPN and perfSONAR dashboards in collaboration with CERN IT/CS and IT/MONIT
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 06/07/2017
- Detailed WG update presented as part of the network session at the WLCG workshop in Manchester
- perfSONAR 4.0 was released on 17th of April
- 194 nodes updated so far
- ES/Kibana dashboard showing perfSONAR infrastructure status in testing
- WLCG/OSG network services
- New central mesh configuration interface (MCA) in production (http://meshconfig.grid.iu.edu)
- New monitoring based on ETF in production (https://psetf.grid.iu.edu/etf/check_mk/)
- New OSG collector handling multiple backends (Datastore, CERN ActiveMQ and GOC RabbitMQ) in production
- New LHCOPN grafana dashboards done in collaboration with CERN IT/CS and IT/MONIT in testing
- Additional perfSONAR dashboards to be added soon
- Throughput call was held on Wed May 24th at 4pm CEST (https://indico.cern.ch/event/640627/) mainly focusing on review of new production services
Report 18/05/2017
- perfSONAR 4.0 was released on 17th of April
- 180 sites have updated so far
- Some sites reported issues with load after updating, under investigation
- WLCG/OSG network services
- New central mesh configuration interface (MCA) will be deployed to production next week - transition will be transparent to all sites
- MCA was developed by OSG and becomes part of perfSONAR.
- Monitoring based on ETF is planned to be deployed in ITB
- OSG collector will be updated to handle multiple backends (datastore, two message buses)
- LHCOPN grafana dashboards established in collaboration with CERN IT/CS and MONIT team (access restricted to CERN users, public access in the works)
- Next Throughput call will be on Wed May 24th at 4pm CEST (https://indico.cern.ch/event/640627/)
Report 06/04/2017
- LHCOPN/LHCONE workshop in BNL took place this week (https://indico.cern.ch/event/581520/)
- perfSONAR 4.0 to be released on 17th of April
- Sites on auto-updates will get it automatically - no action needed.
- Sites planning to update perfSONARs to CC7 are encouraged to wait until 4.1 is released.
- Minimal hardware requirements were shifted: Sites running perfSONARs with less than 4GB RAM and 2 core CPU with clock speed less than 2GHz are encouraged to keep running the old version (3.5.1)
- WLCG/OSG network services
- New central mesh configuration interface (MCA) will be deployed to production - transition will be transparent to all sites
- MCA was developed by OSG, but becomes part of perfSONAR
- Integrates perfSONAR lookup service with OIM/GOCDB services, so we can now easily add NREN perfSONARs into our meshes
- Monitoring was updated to cover new features released in 4.0 and is now based on ETF
- OSG collector was updated to collect additional perfSONAR metrics (such as TCP retransmits, path MTU, etc)
- LHCOPN traffic and LHCONE simulated link utilisation now available for subscriptions from the netmon brokers
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
- BNL/ASGC throughput improved by factor 10 - details reported at the LHCOPN/LHCONE workshop
Report 26/01/2017
Report 01/12/2016
- pre-GDB on NETWORKING will take place on 10th of January, preliminary agenda now available at https://indico.cern.ch/event/571501/
- If you plan to attend, register at https://indico.cern.ch/event/571501/ to help us with logistics
- Please let us know what you would like to see come out from this meeting. If there are additional topics you would like to see in the agenda or modifications to existing items, please let us know.
- Invitation was sent to all four experiments
- Next throughput meeting planned 14th of Dec
- Focus on perfSONAR RC validation
- perfSONAR team announced that they plan to release 4.0 RC3, which will push final release to next year
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Report 03/11/2016
- pre-GDB on NETWORKING will take place on 10th of January, participation of experiments and sites is crucial, if you plan to attend PLEASE register at https://indico.cern.ch/event/571501/
- WG results recently reported at various events:
- Throughput meeting was held on 27th Oct:
- Focused on the network analytics, see minutes for details
- perfSONAR 4.0 RC2 was released yesterday, we will intensify validation effort towards final release planned end of November, update campaign to follow once final release is out
- We are now using a new mailing list wlcg-network-throughput-wg@cernNOSPAMPLEASE.ch - joint mailing list for European and NA throughput meetings
- WLCG Network Throughput Support Unit: CERN - RRCKI followed up, see twiki for details.
Report 29/09/2016
- Network session at the WLCG workshop
- Q&A session planned, questions will be sent in advance, we encourage all to participate
- Inder Monga (Director of ESNet) will join the session
- LHCOPN/LHCONE workshop was held in Helsinki, Sept 19-20 (https://indico.cern.ch/event/527372/)
- GEANT reported peaks over 100GBps and growth of over 65% from Q2 2015 to Q2 2016
- ESNet reported that LHCONE traffic has increased 118% in the past year
- Positive feedback received on the LHC Network Evolution talk
- pre-GDB on networking focusing on the long-term network evolution planned on January 10th - save the date
- Throughput meetings were held on 15th Sept:
- Hendrik Borras (Univ. of Heidelberg) presented early results on the network telemetry based on perfSONAR
- perfSONAR 4.0 RC1 was released, RC2 planned in October with final release sometime in November
- We are now using a new mailing list wlcg-network-throughput-wg@cernNOSPAMPLEASE.ch - joint mailing list for European and NA throughput meetings
- WLCG Network Throughput Support Unit: New cases were reported on IPv6 and are being followed up, see twiki for details.
Report 01/09/2016
- Network session is planned at the WLCG workshop covering IPv6, LHCOPN/LHCONE status and LHC network evolution
- LHCOPN/LHCONE workshop will be held in Helsinki, Sept 19-20 (https://indico.cern.ch/event/527372/)
- pre-GDB on networking focusing on the long-term network evolution postponed to January
- Throughput meetings were held on July 27 and August 16:
- Mark Feit from Internet2 presented pScheduler (new test scheduler in perfSONAR 4.0)
- Xinran Wang and Ilija Vukotic from Univ. of Chicago presented their Network Analytics work
- OSG datastore and collector have been experiencing problems since the upgrade last week; the issue is being followed up by OSG
- Plan on migration to the new perfSONAR 4.0 configuration was drafted and will be followed up with OSG
- We are now using a new mailing list wlcg-network-throughput-wg@cernNOSPAMPLEASE.ch - joint mailing list for European and NA throughput meetings
- WLCG Network Throughput Support Unit: New cases were reported and are being followed up, see twiki for details.
Report 07/07/2016
- WG update presented at ATLAS TIM including discussion on the mid-long term network evolution
- pre-GDB on networking focusing on the mid-long term network evolution will be held in December
- North American Throughput meeting held on 22nd of June:
- Andy Lake presented new features planned in perfSONAR 4.0
- Next meeting end of July, main topic: pScheduler (replaces bwctl)
- WLCG Throughput meeting held 16th of June:
- Main topic was re-organization of the meshes, the proposal was agreed and implemented
- New experiment-based meshes are now in effect in the production dashboard (see twiki)
- Next meeting in Sept. (co-located with LHCOPN/LHCONE)
- WLCG Network Throughput Support: Several new cases were reported and are being followed, see twiki for details.
- perfSONAR 4.0 (formerly 3.6) RC to become available end of August, WLCG validation and deployment campaign will follow.
- Introduces several major changes such as new configuration management and interface as well as migration from BWCTL to pScheduler
Report 02/06/2016
- WLCG Network Throughput SU:
- ASGC connectivity - After numerous tests performed in collaboration with ASGC and ESNet (http://etf.cern.ch/perfsonar_asgc.txt), the root cause has been confirmed to be the local N7K router at ASGC. Once the perfSONARs were moved directly to the central router, the measured network performance improved by a factor of 10. Our recommendation is to re-wire all the existing data transfer nodes to bypass the local router as well as to tune the central router and data transfer nodes to improve their performance for long path transfers (200ms+).
- Two new tickets received related to packet loss observed at RAL and SARA
- Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
- North American Throughput meeting held on 1st of June:
- Shawn presented OSG network area roadmap, main focus will be on developing notification/alerting, support for higher-level services (analytics) and prepare for SDN
- Next meeting is on 22nd June - main topic will be perfSONAR 4.0
- WLCG Throughput meeting will be held on 16th of June - main topic is re-organization of the meshes
- perfSONAR 4.0 (formerly 3.6) RC expected end of June
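The tuning recommendation for long paths (200ms+) in the ASGC item above can be illustrated with the bandwidth-delay product. The sketch below is only a back-of-the-envelope aid: the 10 Gbit/s link speed and the 4 MiB window are hypothetical values chosen for illustration, not figures from the ASGC tests; only the 200 ms round-trip time comes from the text.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on single-stream TCP throughput for a given window size."""
    return window_bytes * 8 / rtt_s


# Assumed values for illustration only (not taken from the ASGC tests).
LINK_BPS = 10e9            # hypothetical 10 Gbit/s path
RTT_S = 0.200              # 200 ms round-trip time, as in "200ms+" above
WINDOW_BYTES = 4 * 2**20   # hypothetical 4 MiB maximum TCP window

if __name__ == "__main__":
    bdp_mb = bdp_bytes(LINK_BPS, RTT_S) / 1e6
    rate_mbit = window_limited_bps(WINDOW_BYTES, RTT_S) / 1e6
    print(f"Buffer needed to fill the path: {bdp_mb:.0f} MB")
    print(f"Rate achievable with the assumed window: {rate_mbit:.0f} Mbit/s")
```

Under these assumptions a single stream needs a window far larger than typical defaults, which is why host and router buffer tuning matters on such paths.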
Report 28/05/2016
Report 28/04/2016
- WLCG Network Throughput SU:
- Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
- North American Throughput meeting held on 6th April:
- Jason Zurawski from ESNet presented the art of debugging network issues with perfSONAR
- WLCG Throughput meeting held on 14th of April:
- Discussed design and limitations of the current WLCG bandwidth mesh, throughput tests between WLCG sites
- Followed up on the WLCG deployment/operations status
- WG was presented at HEPiX in DESY
- Added section with useful links to the WG homepage https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics
Report 07/04/2016
- WLCG Network Throughput SU:
- CBPF connectivity (https://ggus.eu/index.php?mode=ticket_info&ticket_id=120081) - resolved
- ASGC connectivity (https://ggus.eu/index.php?mode=ticket_info&ticket_id=119820) - ongoing
- Packet loss and high latency for certain packets (queuing issue?) reported by perfSONAR on ASGC to CERN, but not confirmed by the counters
- Narrowed down to the StarLight to ASGC segment, but unfortunately there are very few sonars in Asia with very limited peering, which will impact further investigation
- Throughput tests show peaks of 400Mbit/s (200Mbit/s usual) with frequent retransmissions occurring in bunches, we'll try to run tcpdump to understand the root cause
- Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
- Throughput meeting held on 6th April:
- Update on WG will be presented at HEPiX in DESY
- WG review will take place at the next WLCG ops coordination on 28th April
- Added section with useful links to the WG homepage https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics
Report 17/03/2016
- ICFA SCIC meeting was held at J-PARC in February, slides from the report (including WG contribution) can be found at http://icfa-scic.web.cern.ch/ICFA-SCIC/meetings.html
- LHCOPN/LHCONE Meeting held in Taipei (https://indico.cern.ch/event/461511/)
- WLCG Network Throughput SU: ASGC connectivity
- Packet loss and high latency for certain packets (queuing issue?) reported by perfSONAR on ASGC to CERN, but not confirmed by the counters
- Narrowed down to the StarLight to ASGC segment, but unfortunately there are very few sonars in Asia with very limited peering, which will impact further investigation
- Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
- Throughput meetings held on Feb 24th and March 9th:
- Soichi Hayashi presented the new configuration interface that will become part of perfSONAR 3.6
- Shawn presented the way we currently monitor the perfSONAR infrastructure, including OSG production services
- perfSONAR 3.5.1 released, 184 instances were auto-updated, only 13 instances on 3.4
Report 18/02/2016
- WG has contributed to the International Committee for Future Accelerators (ICFA) Annual networking report (https://cds.cern.ch/record/2130751)
- WLCG Network Throughput SU: BNL to PIC throughput degradation
- Root cause was instability of the GEANT Spain fiber channels
- Issue was reported by ATLAS and involved ESNet, LHCONE, perfSONAR and BNL
- WLCG Network Throughput SU: FNAL to CERN
- Issue at ESNet, resolved by LHCOPN ops
- Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
- Meeting held on LHCb DIRAC bridge on January 18th:
- Ongoing developments on adding additional graphs (latencies, throughput) and bug fixing, plan is to go production by Q3 2016
- Throughput meeting held on January 27th:
Report 21/01/2016
- WLCG Network Throughput SU: GGUS-118730 Throughput degradation between CA and EU
- Root cause was instability of the transatlantic link (WIX reported submarine shunt fault), which in turn impacted the GEANT-CANARIE link.
- perfSONAR network helped to identify the problematic segment and once Canarie was notified the issue was resolved by re-routing.
- Issue was reported by ATLAS, but many different people were involved (ATLAS, TRIUMF, perfSONAR support, LHCONE, Canarie, WIX).
- Multiple GGUS tickets were open, but only one was followed up, something to improve in the future.
- Experiments: Please check if everyone was notified of the on-going incident and let us know if we need to add additional contacts (wlcg-network-throughput mailing list)
- OSG perfSONAR production services: Storage failure (OASIS) at GOC has impacted the entire perfSONAR pipeline, initially just the datastore, but later on also collector and publisher. The issue was resolved yesterday and the systems are recovering now. We have proposed changes that would remove dependency on the shared storage.
Report 07/01/2016
- Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard), minor instability in the dashboard reported yesterday, being followed up by OSG
- Additional monitoring metrics will be added to psomd.grid.iu.edu to capture collector's efficiency and report on freshness of the metadata in the OSG Datastore (for each sonar).
- Proposed re-organization of the WG meetings, split into two areas, perfSONAR operations (throughput calls) and research/pilot projects
- perfSONAR operations - main scope would be to continue with perfSONAR support, follow up on the existing infrastructure while at the same time start looking into issues already shown by the existing tools and try to fix them at the source. As this scope is well aligned with the existing North American throughput calls, we could alternate the meetings and publish common notes.
- Research/pilot projects - will have separate on-demand meetings with notes published to WG mailing list
- F2F meeting once a year, co-located with GDB or other workshop/conference
- Pilot projects: LHCb DIRAC bridge available online
Report 19/11/2015
- perfSONAR collector, datastore, publisher and dashboard in production (stable operations)
- Additional monitoring metrics will be added to psomd.grid.iu.edu to capture collector's efficiency and report on freshness of the metadata in the OSG Datastore (for each sonar).
- perfSONAR 3.5: 205 sonars were updated, ALL sites are encouraged to enable auto-updates for perfSONAR
- Pilot projects: ATLAS Panda, perfSONAR stream now in ATLAS Network Analytics (https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ATLASAnalytics), several KIBANA dashboards available - Site link stats. Jorge and Ilija working on cost matrix using the round-trip time and packet loss in Mathis's formula to infer bandwidth (predictions based on this model will follow).
- Pilot projects: LHCb DIRAC bridge is now functional, processing perfSONAR stream and inserting packet loss metrics in DIRAC, includes mapping to LHCb sites. Henryk, Federico and Stefan are working on this.
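The cost-matrix item in this report refers to Mathis's formula, which bounds steady-state TCP throughput by (MSS / RTT) * (C / sqrt(p)), where p is the packet loss rate. A minimal sketch of that estimate follows. The constant C = sqrt(3/2), the 1460-byte MSS and the sample RTT and loss values are assumptions for illustration; the report does not state which parameters the ATLAS work actually uses.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float,
                          c: float = math.sqrt(1.5)) -> float:
    """Estimate the TCP throughput upper bound (bits/s) from RTT and packet loss.

    Uses the Mathis et al. model: rate <= (MSS / RTT) * (C / sqrt(p)).
    The default C = sqrt(3/2) is an assumption; other variants of the model
    use slightly different constants.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]; the model is undefined at zero loss")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Hypothetical link: 100 ms RTT, 0.01% packet loss, standard Ethernet MSS.
    estimate = mathis_throughput_bps(mss_bytes=1460, rtt_s=0.100, loss_rate=1e-4)
    print(f"Estimated upper bound: {estimate / 1e6:.1f} Mbit/s")
```

Because the bound scales with 1/sqrt(p), quadrupling the loss rate halves the estimate, which is why even small loss on long-RTT paths dominates achievable throughput.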
Report 05/11/2015
- perfSONAR collector, datastore, publisher and dashboard now in production (stable operations)
- perfSONAR 3.5: 205 sonars were updated, ALL sites are encouraged to enable auto-updates for perfSONAR
- Detailed report from the WG presented at GDB
- Meeting held yesterday, encouraging all mesh leaders to participate
- Started discussion on the network outage and at risk announcements from NRENs
- Pilot projects: ATLAS Panda, perfSONAR stream now in ATLAS Network Analytics (https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ATLASAnalytics), several KIBANA dashboards available (MWT2, FZK2). Jorge and Ilija working on cost matrix using the round-trip time and packet loss in Mathis's formula to infer bandwidth (predictions based on this model will follow).
Report 22/10/2015
- perfSONAR collector, datastore, publisher and dashboard now in production!
- psmad becomes the official dashboard for perfSONAR meshes
- perfSONAR 3.5: 183 sonars were updated, ALL sites are encouraged to enable auto-updates for perfSONAR.
- Detailed report from the WG presented at HEPiX/GDB, we will also present a status update at the November GDB
- ATLAS started processing perfSONAR stream to create a network “cost-matrix” for use by PANDA with additional use cases in scheduled transfers and dynamic data access
- LHCb also started processing perfSONAR stream and correlates it with the network and transfer metrics in DIRAC
- Next WG meetings will be on 4th of Nov and 2nd of Dec
Report 01/10/2015
- Meeting held yesterday, https://indico.cern.ch/event/400643/
- Publishing of the perfSONAR results using OSG production service planned for 13th of October (OSG production date)
- OSG dashboard (psmad.grid.iu.edu) will go production on the same date, already showing more recent results than maddash.aglt2.org, one issue to be fixed is to correctly show tests done in one-direction only
- WLCG-wide meshes campaign finalized with 94 sonars in latency testing, 115 sonars in traceroutes and 104 in throughput.
- Sonars that were not included in the WLCG-wide meshes were reported to the mesh leaders and will be followed up (currently they reside in the global meshes, once issues are fixed they'll be moved to WLCG meshes)
- Started re-creating project meshes, Belle II and Dual-stack (IPv4/IPv6 bandwidth), plans for other meshes to be discussed
- Once infrastructure is in production, we plan to focus on the integration projects, there are ongoing pilot projects for ATLAS and LHCb
- There is also interest in perfSONAR in the IT Analytics WG as well as from the network community Asia Tier Centre Forum (https://indico.cern.ch/event/395656/)
- perfSONAR 3.5 was released on Monday 28th Sept, 162 sonars were auto-updated, 68 still on 3.4, all sites are encouraged to enable auto-updates for perfSONAR
- Next WG meetings will be on 4th of Nov and 2nd of Dec
WLCG perfSONAR service status report on 2015-10-01 04:02:21.078035 =======
Active perfSONAR instances: 250
GOCDB registered total: 193
OIM registered total: 85
perfSONAR-PS versions deployed:
3.4.1 : 7
3.4.2 : 61
3.5.0 : 162
Unknown: 18
Incorrectly configured (failing >4 metrics): 5
Report 17/09/2015
- OSG perfSONAR datastore entered production on 14th of Sept providing storage and interface for all perfSONAR results.
- Publishing of the perfSONAR results using pre-production (ITB) services was successfully established, working to resolve issue with some event types not being published, production still pending SLA.
- WLCG-wide meshes campaign with latency testing ramped up to 81 sonars caused some instabilities of the sonars with 4GB RAM, therefore we have decreased the number of tests performed and this has improved the situation.
- Final version of the perfSONAR 3.5 is planned to be released on 28th of September and will be auto-deployed to all WLCG instances. There were no issues found in the testbed, but we plan to update couple of production instances in advance to check if everything is fine.
- ESNet and OSG have started developments on the perfSONAR configuration interface - open source project motivated by the existing version developed for WLCG. There has been also interest from GEANT and ESNet to collaborate on an open source project based on the existing proximity service.
- Follow up meeting was held to discuss findings of the FTS performance study led by Saul Youssef (Boston University), new optimization algorithm was proposed and discussed.
- Next WG meeting will be on 30th of Sept (https://indico.cern.ch/event/400643/)
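The load issue seen when the latency mesh was ramped up to 81 sonars follows from how a full mesh grows: every host tests against every other host, so the number of directed test pairs grows quadratically. A minimal sketch of that arithmetic, using host counts quoted in these reports; it assumes one test per directed pair, which may not match how the meshes are actually scheduled.

```python
def directed_pairs(n_hosts: int) -> int:
    """Number of source -> destination pairs in a full mesh of n_hosts."""
    return n_hosts * (n_hosts - 1)


def peers_per_host(n_hosts: int) -> int:
    """Number of remote hosts each sonar has to test against."""
    return n_hosts - 1


if __name__ == "__main__":
    # Host counts taken from the latency-mesh figures quoted in these reports.
    for n in (34, 70, 81, 94):
        print(f"{n:3d} hosts: {peers_per_host(n):3d} peers per host, "
              f"{directed_pairs(n):5d} directed pairs")
```

Going from 34 to 81 hosts raises the pair count from 1122 to 6480, roughly a sixfold increase, which is consistent with the need to reduce the number of tests on hosts with 4GB RAM.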
Report 03/09/2015
- Meeting held yesterday, 2nd of September https://indico.cern.ch/event/393102/
- OSG enabled publishing of the perfSONAR results to the netmon-test-mb.cern.ch from the ITB collector service today. Production setup is still pending SLA.
- OSG perfSONAR dashboard (psmad.grid.iu.edu) is already connected to the OSG datastore and shows up-to-date content.
- MadAlert - new project to analyse meshes and report infrastructure issues vs network problems, already reporting from psmad (MadAlert http://maddash.aglt2.org/madalert.html).
- perfSONAR operations status
- Latency mesh: 81 sonars (94% efficiency)
- Traceroute mesh: 112 sonars (90% efficiency)
- perfSONAR 3.5rc2 was released yesterday and will be auto-deployed to all testbed instances, one issue with Postgresql reported from UC instance
Report 20/08/2015
- Established production and validation ActiveMQ brokers at CERN (netmon-mb.cern.ch and netmon-test-mb.cern.ch), they will be used to broadcast data collected by perfSONARs to experiments.
- OSG will test-enable publishing of the perfSONAR results to the netmon-test-mb.cern.ch from the ITB collector service.
- Proximity service - developed mapping matrix that experiments could use to map storages to sonars and use it to process the perfSONAR stream. Currently tested by LHCb, which is developing a perfSONAR to DIRAC connector.
- New project to analyse meshes and report infrastructure issues vs network problems is being developed at AGLT2 (MadAlert http://maddash.aglt2.org/madalert.html). Plan is to continue to develop it targeting an eventual way to automate problem finding.
- perfSONAR operations status
- Progress made on the WLCG-wide meshes, latency mesh now with 70 sonars.
- Validation of the perfSONAR 3.5rc1 started, final release expected in October.
- ESNet is finalizing the development design document on the perfSONAR configuration interface - open source project motivated by the existing version developed for WLCG.
Report 30/07/2015
- Successfully tested publishing of the perfSONAR results to the message bus directly from the OSG collector. Discussing possible SLA to run this as a production service in collaboration with OSG.
- OSG datastore on track to go production at the end of July, this will be a service provided to the WLCG, it will store all the perfSONAR data and provide an API
- Started testing proximity service, which helps to map sonars to storages and thus enables integration of the network and transfer metrics.
- Review of the experiments use cases was presented/discussed at the last meeting, see slides for details (https://indico.cern.ch/event/393101/)
- FTS performance study update - see slides for details (https://indico.cern.ch/event/393101/), observations from the report so far:
- Peak transfer rates between Europe and North America are less asymmetric than they were last month (to be followed up)
- Almost all incoming to BNL uses TCP=1 (Alejandro confirmed this is how BNL is configured right now, the other FTS instances use auto-tuning)
- CMS T1s have better transfer rates compared to ATLAS and LHCb (to be followed up)
- CMS uses TCP=1 more often than ATLAS and LHCb for large files
- TCP stream=1 transfers time out about 2-3% of the time, however timeouts are concentrated at a few sites.
- Throughput dependence on TCP streams possibly understood (see http://egg.bu.edu/lhc/fts/docs/2015-05-26-status/results_so_far.pdf)
- perfSONAR operations status
- Agreed to establish WLCG-wide meshes for top 100 sites (based on the contributed storage and location). This will enable full mesh testing of latencies, traceroutes and throughput (ongoing).
- ESNet interested in the perfSONAR configuration interface developed for WLCG, development design document for an open-source project based on it is currently being discussed.
Report 02/07/2015
- perfSONAR status
- Agreed to establish WLCG-wide meshes for top 100 sites (based on the contributed storage and location). This will enable full mesh testing of latencies, traceroutes and throughput
- Working in collaboration with ESNet to narrow down on an issue affecting latency measurements for long distance testing (US to Europe, Europe to Asia, etc.). A fix has been released and will be auto-deployed to all sites.
- perfSONAR 3.5 RC is planned to be released next week. The following sites agreed to participate in the validation testbed: Nebraska, BNL, SWT2, AGLT2, MWT, TAMU, IEPSAS-Kosice
- perfSONAR support involved in debugging the network issues at RAL
- Successfully tested publishing perfSONAR results directly from the OSG collector (that populates OSG/esmond datastore).
- Started testing proximity service, which helps to map sonars to storages and thus enables integration of the network and transfer metrics.
- Next meeting will be on 8th of July (https://indico.cern.ch/event/393101/), planning a detailed update on OSG datastore and FTS performance study.
Report 18/06/2015
- perfSONAR status
- Proposed to establish WLCG-wide meshes for top 100 sites (based on their storage contribution and geographical location). This would enable full mesh testing of latencies, traceroutes and bandwidth.
- Potential bug was identified and submitted to ESNet affecting latency measurements for long distance testing (US to Europe, Europe to Asia, etc.).
- Currently evaluating the possibility to publish perfSONAR results directly from the OSG collector (that populates OSG/esmond datastore). Set of patches to extend the OSG collector were submitted for consideration.
- Next meeting will be on 8th of July (https://indico.cern.ch/event/393101/), planning a detailed update on OSG datastore and FTS performance study.
Report 04/06/2015
- perfSONAR status
- Detailed report from the WG was presented on Monday at the LHCOPN-LHCONE meeting - LBL Berkeley (US) (https://indico.cern.ch/event/376098/)
- Both LHCOPN and LHCONE meshes stable now, consistently delivering metrics. RAL shows signs of continuing network problems in both latency and bandwidth.
- Based on the positive experience in ramping up latency mesh, we plan to establish full WLCG meshes for all types of tests and use it as a baseline for other meshes
- In collaboration with ESNet, a bug was found in parsing tracepath results, causing significant reduction in efficiency of getting tracepath results. Plan is to revert back to traceroutes and only run low frequency tracepath tests until the issue is fixed.
- The old mesh configuration interface hosted from grid-deployment.web.cern.ch will be decommissioned on Monday (8th of June). The few sites that still have the old URLs configured have been notified.
- Network performance incidents process - new GGUS SU (WLCG Network Throughput) already available, more information at https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents
- Test deployed esmond2mq at CERN (developed in collaboration with LHCb), core functionality works fine, waiting for the OSG datastore to enter production in order to run it continuously
- Next meeting postponed to 10th of June (https://indico.cern.ch/event/382624/). Plan is to focus it on discussing full WLCG meshes proposal, proximity service and initial report from the FTS performance study.
- Very special thanks for major contributions to the WG and farewell to Soichi Hayashi (OSG) and Aaron Brown (Internet2).
WLCG perfSONAR service status report on 2015-06-04 04:02:22.794725 =======
Active perfSONAR instances: 240
Registered/monitored perfSONAR instances: 260
perfSONAR-PS versions deployed:
3.4.1 : 17
3.4.2 : 200
Unknown: 21
Incorrectly configured (failing >4 metrics): 23
Report 21/05/2015
- perfSONAR status
- Security: New SSL vulnerability dubbed Logjam: https://weakdh.org/sysadmin.html. WLCG perfSONAR hosts should NOT be vulnerable to this attack. The Apache configuration installed by the Toolkit disables the cipher suites in question by default.
- Network performance incidents process - new GGUS SU (WLCG Network Throughput) will become available on 24th of June.
- Next meeting 3rd of June (https://indico.cern.ch/event/382624/). Plan is to focus it on latency ramp up and proximity service.
Report 07/05/2015
- perfSONAR status
- Security: NDT 3.7.0.1 was released, fixing potential security issue in NDT. This shouldn't affect WLCG sites that followed our instructions, since they should have NDT/NPAD disabled. We encourage ALL sites to double check this and also to ensure they have auto-updates enabled. The latest perfSONAR Toolkit version that all sites should be running is 3.4.2-12.pSPS (Latest versions of all sub-components are Toolkit-3.4.2 (3.4.2-12.pSPS), BWCTL-1.5.4-1.el6, OWAMP-3.4-10.el6, NDT-3.7.0.1-2.el6, NPAD-1.5.6-3.el6, esmond-1.0-13.el6, Regular Testing Daemon-3.4.2-4.pSPS, iperf3-3.0.11-1.el6).
- All meshes migrated from iperf to iperf3 and from traceroute to tracepath. This should improve our bandwidth measurements and enable MTU path discovery.
- Very good progress in ramping up latency tests, currently with 34 sonars, we're able to consistently get results for all tested links.
- Network performance incidents process put in place as was agreed at the last meeting (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents)
- OSG/Datastore validation progressing well, resolved all performance issues and targeting July for production (progress already visible at http://psmad.grid.iu.edu/maddash-webui/).
- Publishing results to message bus progressing, development of the esmond2mq prototype has been finalized and we plan to enter pilot phase. Initial version of the proximity service (mapping sonars to storages) in testing.
- Last meeting held yesterday (https://indico.cern.ch/event/382623/) - focused on FTS performance
- Hassen Riahi (FTS dashboard) reported on FTS performance for WLCG during the first phase of production (3 months)
- Initial report on the FTS performance study presented by Saul Youssef (Boston University), common study for ATLAS, CMS and LHCb. Early results already provide valuable insights and also show how we could benefit from integrating FTS and perfSONAR. Agreed to follow up on a regular basis at the next meetings.
- Next meeting 3rd of June (https://indico.cern.ch/event/382624/). Plan is to focus it on latency ramp up and proximity service.
WLCG perfSONAR service status report on 2015-05-07 04:02:24.706444 =======
Active perfSONAR instances: 235
Registered/monitored perfSONAR instances: 259
perfSONAR-PS versions deployed:
3.4.1 : 23
3.4.2 : 183
Unknown: 25
Incorrectly configured (failing >4 metrics): 17
Report 02/04/2015
- perfSONAR status
- Security: CVE released today for cassandra, which is used by the perfSONAR measurement archive software, esmond. NO action required to protect perfSONAR Toolkit since vulnerable ports are both disabled and firewalled.
- perfSONAR 3.4.2 was released and auto-deployed to 163 sonars, there are 42 instances still on 3.4.1. We no longer have any active instances on older versions.
- We encourage ALL sites that are still on 3.4.1 to check status of their sonars (mainly disk space) and enable auto updates ASAP.
- Significant improvement observed in getting consistently all the needed metrics after this update. The plan is to resume validation in LHCOPN/LHCONE and continue with a ramp up to full mesh latency tests.
- Full mesh trace paths now at 80%
- Network performance incidents follow up (proposal):
- New mailing list and GGUS SU will be established to follow up, proposed name is wlcg-network-throughput, initial participation will be the same as for the WG mailing list (transfer systems, experiments, perfsonar support, esnet, lhcopn/lhcone).
- Experiments can report to the GGUS SU/mailing list potential network performance incidents/degradations, WLCG perfSONAR support unit will investigate and confirm if this is network related issue. Once confirmed, it will notify relevant sites and will try to assist in narrowing down the problem to particular link(s). Affected sites will be contacted and should open an incident with their network providers. Tracking of the ongoing incidents will be done on the WG page (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents).
- Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and if necessary escalate to their network provider while informing the wlcg-network-throughput mailing list. If confirmed to be WAN related, WLCG perfSONAR support unit can assist in further debugging of the problem. For the non-technical (policy) issues or if unclear, sites should escalate to the WLCG operations coordination.
WLCG perfSONAR service status report on 2015-04-02 04:02:22.925555 =======
Active perfSONAR instances: 233
Registered/monitored perfSONAR instances: 259
perfSONAR-PS versions deployed:
3.4.1 : 42
3.4.2 : 163
Unknown: 26
Incorrectly configured (failing >4 metrics): 27
Report 19/03/2015
- WG meeting was held on 18th of March (https://indico.cern.ch/event/379017/)
- perfSONAR status
- All sites should be running 3.4.1, final deadline was 16th of February, 5 sites received tickets (3 of them responded)
- Testing/evaluation of the 3.4.2rc candidate ongoing, additional issues were identified and fixed by the ESNet developers team.
- Plan is to follow up the testbed for next couple of days, if there are no issues reported, 3.4.2rc will get a green light (once released, this should propagate to all sites within 24 hours)
- Datastore (esmond) status
- Esmond testing is ongoing, gathering 100% of the meshes (some with missing data due to issues in 3.4.1)
- Network performance incidents follow up
- Procedure was proposed and is still under discussion within the WG.
- Integration projects
- Revised proposal for the experiment’s interface to perfSONAR, esmond2mq prototype was developed and tested, feedback will be reported to OSG and ESNet.
- Next meeting: 8th of April (https://indico.cern.ch/event/382622/)
Report 05/03/2015
- WG meeting was held on 18th of February (https://indico.cern.ch/event/372546/)
- All sites should be running 3.4.1, final deadline was 16th of February, 5 sites received tickets (2 of them responded)
- Follow up campaign to bring all perfSONARs to the correct configuration ongoing, started with LHCOPN/LHCONE instances, several issues found and reported
- Testbed established to evaluate/test 3.4.2rc (release candidate), which was released last week. Several issues fixed that were reported by us during LHCOPN/LHCONE configuration campaign. One new issue found and reported to the development team.
- New meshes: IPv6/IPv4 dual stack (led by Duncan Rand), Latin America (led by Renato Santana, Pedro Diniz)
- Testing and evaluation of the pilot instances for esmond/maddash ongoing (psds.grid.iu.edu, psmad.grid.iu.edu)
- Production instance of the infrastructure monitoring (psomd.grid.iu.edu) updated with new tests that check completeness/freshness of data in the local measurement archives (high level functional test)
- Integration of the network and transfer metrics: two pilot projects proposed in the last WG meeting
- LHCb pilot project to provide experiment agnostic prototype to access central datastore (esmond) and publish available metrics to messaging
- Extending ATLAS FTS performance study to CMS and LHCb
- Networking degradation between SARA and AGLT2 under investigation - to be followed up at the next WG meeting
- Original issue noted when many large file transfers SARA->AGLT2 failed. The cause was an FTS timeout, since files of 2-6 GB were moving at 10s to 100s of Kbytes/sec. Problem reported to this working group.
- perfSONAR regular tests between T2 and T1 had been paused, so manual perfSONAR tests were run; they showed poor performance (200-500 Kbytes/sec).
- Saul Youssef's examination of FTS logs indicated that a possibly problematic trans-Atlantic link was involved. Additional reports of poor performance between CERN EOS and MWT2 involved the same link.
- Recommended procedure (by LHCONE/LHCOPN working group) is to have either end-site contact their R&E network provider to open a ticket. AGLT2 contacted Internet2 and opened a ticket (ISSUE=2688 PROJ=144)
- Temporary debug mesh set up to test paths between SARA, CERN and AGLT2, MWT2. See https://maddash.aglt2.org/maddash-webui/index.cgi?dashboard=Debug%20Mesh%20(temp)
- Internet2 has opened a ticket with GEANT (TT#2015022734000453) and the issue is actively being pursued.
- Work underway getting suitable intermediate perfSONAR instances onto LHCONE to help localize the issue.
- Next WG meeting will be on 18th of March (https://indico.cern.ch/event/379017/)
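To put the SARA->AGLT2 transfer rates reported above into perspective, the following is a minimal sketch of the arithmetic, assuming decimal units (1 GB = 10^6 Kbytes). The function name transfer_hours is hypothetical and used only for illustration. The actual FTS timeout value is not stated in this report, so no timeout is assumed here.

```python
def transfer_hours(size_gb: float, rate_kbytes_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes at a sustained rate in Kbytes/sec."""
    size_kbytes = size_gb * 1_000_000  # assumption: decimal units, 1 GB = 10^6 Kbytes
    return size_kbytes / rate_kbytes_per_s / 3600


if __name__ == "__main__":
    # File sizes (2-6 GB) and rates (10s-100s of Kbytes/sec, 200-500 Kbytes/sec
    # in the manual perfSONAR tests) are taken from the report above.
    for size_gb in (2, 6):
        for rate in (10, 100, 500):
            hours = transfer_hours(size_gb, rate)
            print(f"{size_gb} GB at {rate} Kbytes/sec: {hours:.1f} h")
```

Even at the upper end of the observed range, a single file takes more than an hour to transfer, which is consistent with transfers failing on a timeout.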
WLCG perfSONAR service status report on 2015-03-05 04:02:24.416548 =======
Active perfSONAR instances: 225
Registered/monitored perfSONAR instances: 249
perfSONAR-PS versions deployed:
3.2.2 : 1
3.3.2 : 1
3.4.1 : 207
3.4.2 : 13
Unknown: 27
Incorrectly configured (failing >4 metrics): 31
Report 05/02/2015
- WG still waiting on input from ATLAS on use-cases/requirements for network metrics
- Meeting to discuss the use cases will be held on 18th of February (https://indico.cern.ch/event/372546/)
- 2nd broadcast was sent to remind sites to update to 3.4.1 - final deadline is 16th of February - sites that won't update by this date will receive tickets
- Production version of perfSONAR infrastructure monitoring available at http://pfomd.grid.iu.edu/ (you need to have your certificate loaded in the browser to access)
- Pilot versions of maddash and datastore (http://pfds.grid.iu.edu) available
- perfSONAR operations meeting was held last week - minutes available at https://indico.cern.ch/event/369420/
- Agreed to start full mesh latency testing starting with top-k sites and gradually moving to all sites
- Follow up campaign to bring all perfSONARs to the correct configuration
WLCG perfSONAR service status report on 2015-02-05 04:02:21.711952 =======
Active perfSONAR instances: 220
Registered/monitored perfSONAR instances: 241
perfSONAR-PS versions deployed:
3.2.2 : 1
3.3.2 : 4
3.4.1 : 185
Unknown: 51
Incorrectly configured (failing >4 metrics): 51
Report 04/12/2014
- Metrics area meeting held last week, minutes available at https://indico.cern.ch/event/354593/
- WG waiting on input from transfer systems and experiments on use-cases/requirements for network metrics
- Strawman planned for early next year
- Status of perfSONAR presented also at ATLAS jamboree yesterday
- Update campaign ongoing, hard deadline for all sites to update is 8th January 2015
- perfSONAR data store configured in ITB; stress testing ongoing
WLCG perfSONAR service status report on 2014-12-04 04:02:19.233227 =======
perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
3.3.2 : 26
3.4.1 : 118
Unknown: 66
GOCDB registered total: 190
OIM registered total: 80
Unreachable instances (not monitored): 42
Incorrectly configured (failing >4 metrics): 69
Report 20/11/2014
- 107 instances updated to 3.4.1 following the WLCG and EGI broadcasts sent with the new install/update instructions
- Second broadcast to be sent next week, deadline to update will be 8th January 2015
- Planning to start validation of the existing 3.4.1 sonars next week
- perfSONAR data store configured in ITB; stress testing to start next week
- Metrics area meeting to be held next week (http://doodle.com/ezrfh8eybu7iybxyqzrcbze9)
WLCG perfSONAR service status report on 2014-11-20 04:02:13.263575 =======
perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
3.3.2 : 29
3.4.1 : 107
Unknown: 74
GOCDB registered total: 190
OIM registered total: 70
Unreachable instances (not monitored): 45
Incorrectly configured (failing >4 metrics): 70
Report 06/11/2014
WLCG perfSONAR service status report on 2014-11-06 04:02:17.829838 =======
perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
3.2.2 : 1
3.3.1 : 2
3.3.2 : 47
3.4.1 : 82
Unknown: 78
GOCDB registered total: 188
OIM registered total: 55
Unreachable instances (not monitored): 47
Incorrectly configured (failing >4 metrics): 78
Report 16/10/2014
- Update on WG presented at GDB last week (details at agenda)
- perfSONAR 3.4 was released on 7th of October; we recommend that ALL sites wait to upgrade until the re-install instructions are broadcast via WLCG and EGI
- Performed internal security audit in collaboration with perfSONAR developers - summary to be provided in the re-install instructions
- Metrics area meeting was canceled, doodle for the new one will be sent shortly
- POODLE: SSLv3.0 vulnerability (CVE-2014-3566) announced yesterday (https://access.redhat.com/articles/1232123), affecting perfSONARs as well. Patches from distributions not available yet (16th Oct); the perfSONAR team provided their own fixes yesterday (perl-perfSONAR_PS-Toolkit-3.4-29.pSPS and perl-perfSONAR_PS-Toolkit-SystemEnvironment-3.4-29.pSPS). We recommend that all sites running 3.3 temporarily disable SSLv3. We recommend that ALL sites wait to upgrade to 3.4 until the re-install instructions are broadcast via WLCG and EGI.
- perfSONAR operations meeting was held on Friday, Oct 3 at 3PM; minutes at https://indico.cern.ch/event/342995/
- Highlights: Agreed to introduce several major changes in operations (introduce GGUS SU, security mailing list, setup infrastructure monitoring, introduce automated mesh configurations)
- Next operations meeting will be held next week, please vote at http://doodle.com/qydib32fkv48er2r
WLCG perfSONAR service status report on 2014-10-16 04:03:54.594325 =======
perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
3.2.2 : 1
3.3.1 : 2
3.3.2 : 66
Unknown: 141
GOCDB registered total: 172
OIM registered total: 55
Unreachable instances (not monitored): 79
Incorrectly configured (failing >4 metrics): 109
Report 02/10/2014
- Details on the Shellshock vulnerabilities and their impact on perfSONAR available at https://twiki.cern.ch/twiki/bin/view/LCG/ShellShockperfSONAR
- We recommend ALL sites that didn't patch bash before Friday Sep 26 to terminate their instances and wait until perfSONAR 3.4 is released
- perfSONAR 3.4 to be released on Mon Oct 6, WLCG and EGI broadcasts will be sent with the installation instructions
- perfSONAR operations meeting this Friday (Oct 3 at 3PM), agenda at https://indico.cern.ch/event/342995/
WLCG perfSONAR service status report on 2014-10-02 04:02:15.996763 =======
perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
3.3.1 : 2
3.3.2 : 96
Unknown: 112
GOCDB registered total: 173
OIM registered total: 55
Unreachable instances (not monitored): 90
Incorrectly configured (failing >4 metrics): 111
Report 18/09/2014
- Kick-off meeting minutes and slides available at https://indico.cern.ch/event/336520/
- The meeting had very good participation, including experiments, ESNet Science Engagement Group (perfSONAR development team), Panda, PhEDEx, FTS, FAX, as well as the majority of the perfSONAR regional contacts. An initial overview of the current status of the network and transfer metrics was presented, and a list of topics and tasks to work on in the short term was proposed. Very good feedback was received and we have agreed on the topics to discuss at the follow up meetings.
- Please check Twiki for updated task table
- 5 sites received tickets on running an outdated version of perfSONAR
- Follow up meetings:
- Metrics area meeting focusing on use cases and review of the transfer systems (T1.1, T1.2)
- Meetings focusing on perfSONAR operations (T2.1):
WLCG perfSONAR service status report on 2014-09-18 09:56:52.693187 =======
perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
3.2.2 : 4
3.3.1 : 3
3.3.2 : 175
Unknown: 28
GOCDB registered total: 172
OIM registered total: 53
Unreachable instances (not monitored): 7
Incorrectly configured (failing >4 metrics): 26
Report 04/09/2014
- Kick-off meeting will take place on Mon 8th of Sept at 3PM CEST
- Early version of the WLCG perfSONAR configuration interface will be deployed to production next week.
- Pythia Network Diagnosis Infrastructure (PuNDIT) project will be funded by NSF and starts at the beginning of September (led by Shawn). The project will use perfSONAR-PS data to identify and localize network problems using the Pythia algorithms. PuNDIT will collaborate with OSG and WLCG over its two year duration.
- Sites with incorrect versions of perfSONAR will receive tickets at the beginning of next week (9 sites in total)
WLCG perfSONAR service status report on 2014-09-04 10:03:22.949799 =======
perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
3.2.2 : 6
3.3.1 : 3
3.3.2 : 173
Unknown: 28
GOCDB registered total: 170
OIM registered total: 53
Unreachable instances (not monitored): 8
Incorrectly configured (failing >4 metrics): 28
Report 21/08/2014
- Updated WG page with list of members, task tracking, coming events and reports (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics)
- Kick-off meeting will take place on Mon 8th of Sept at 3PM CEST
- On July 21st perfSONAR Toolkit 3.4rc2 became available for testing. Version 3.4 is a major milestone for the WG, as it enables access via REST API and introduces several important performance improvements; a deployment campaign will therefore follow once we get a stable release
- Work is progressing on the WLCG perfSONAR configuration interface (finalized design, work is ongoing on a prototype implementation)
- OSG perfSONAR datastore plan has been agreed and testing of the store based on esmond is ongoing
WLCG perfSONAR service level report on 2014-08-20 16:59:32.876708=======
perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
3.2.2 : 6
3.3.1 : 3
3.3.2 : 174
Unknown: 27
GOCDB registered total: 170
OIM registered total: 53
Unreachable instances (not monitored): 8
Incorrectly configured (failing >4 metrics): 30
--
MarianBabik - 19 May 2014