WLCG Operations Coordination Minutes, September 17th 2015
Highlights
- All 4 experiments now have an agreed workflow with the T0 for tickets that should be handled by the experiment supporters but were accidentally assigned to the T0 service managers.
- A new FTS3 bug-fix release, 3.3.1, is now available.
- A Globus library issue is causing problems with FTS3 for sites running IPv6.
- No network problems were experienced on the transatlantic link despite 3 out of 4 cables being unavailable.
- T0 experts are investigating the slow WN performance reported by LHCb and others.
- A group of experts at CERN and in CMS is investigating ARGUS authentication problems affecting CMS VOBOXes.
- T1 & T2 sites: please observe the actions requested by ATLAS and CMS (also on the WLCG Operations portal).
Agenda
Attendance
- local: Maria Dimou (Minutes), Andrea Sciaba (chair), Maarten Litmaath, Andrea Manzi, Marian Babik, Christoph Wissing.
- remote: Antonio Maria Perez Calero Yzquierdo, Renaud Vernet, Ulf Tigerstedt, Rob Quick, Vincenzo Spinoso, Pepe Flix, Alessandra Doria, Vladimir Romanovskiy, Maite Barroso, Catherine Biscarat, Di Qing, Dave Mason, Jeremy Coles, Massimo Sgaravatto, Gareth Smith.
- apologies: Maria Alandes (Info Sys TF).
Operations News
NTR
Middleware News
- Baselines:
- Issues:
- T0 and T1 services
- CERN
- FTS3 will be upgraded to 3.3.1 on Monday 21st of September
- IN2P3
- dCache upgrade (2.10.40) on core servers next week (22/09/2015)
- JINR
- dCache updated to 2.10.40
- KIT
- All of dCache will be updated to the latest version in the 2.13 branch during GridKa's annual downtime, from 29 September until 1 October.
- NDGF
- IPv6 was enabled on doors and FAX-node.
- RRC-KI-T1
- dCache upgrade to 2.10.39 on pools and gridftp doors
Tier 0 News
Experiment support tickets that reach the T0:
ALICE:
ATLAS:
- Point to the KB where ATLAS computing support is explained, request the user to follow the channels specified there, and close the ticket. Summary of guidelines: no GGUS or SNOW; use workbooks, forums, and atlas.support@cern.ch
CMS:
- Reassign the ticket in the tool where it was first reported: if in GGUS, to the GGUS Support Unit VO support (CMS); if in SNOW, within SNOW.
- In the medium term it might be useful to evaluate a technical possibility that would allow CERN supporters to assign a ticket back to the GGUS TPM from SNOW.
LHCb:
- Reassign tickets in GGUS to the Support Unit VO support (LHCb). If this is not possible, reassign them in SNOW to the FE LHCb Computing Support Meyrin.
Tier 1 Feedback
- NDGF-T1 enabled IPv6 (dual stack) for SRM on 14.9.2015. We request that the FTS3 developers take https://its.cern.ch/jira/browse/DMC-681 seriously and fix the issue. As long as we are the only ones with IPv6 enabled, it should not be a problem, but there are already T2s with dual stack. Also, ARC did not support delayed passive mode with IPv6, causing all reads and writes to be proxied. This has already been fixed in ARC and a new release is coming soon.
Comments by Andrea M. and Andrea S. at the meeting: Alejandro, the FTS3 developer, is aware of the ticket, but it seems to be an issue with the Globus libraries. A reminder should be put in the JIRA ticket for the Globus developers. Thomas Hartmann (KIT) should be asked to check whether the latest Globus is deployed and whether the problem persists.
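As a quick way to narrow down whether a transfer failure involves the IPv6 path at all, a simple connectivity probe can help. The sketch below is a generic illustration only, not part of FTS3, Globus or ARC: it resolves a host to an IPv6 address and attempts a TCP connection to a given port. The hostname and port are placeholders to be replaced with the endpoint under test.

```python
import socket

def check_ipv6_tcp(host, port, timeout=10):
    """Resolve 'host' over IPv6 only and try to open a TCP connection to 'port'."""
    try:
        # Restrict resolution to AAAA records.
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return None, False           # no IPv6 address published for this host
    addr = infos[0][4]               # (address, port, flowinfo, scope_id)
    try:
        with socket.create_connection(addr[:2], timeout=timeout):
            return addr[0], True     # TCP handshake over IPv6 succeeded
    except OSError:
        return addr[0], False        # resolvable, but not reachable on this port

if __name__ == "__main__":
    # Placeholder endpoint: replace with the SRM or GridFTP door being tested.
    print(check_ipv6_tcp("srm.example.org", 8443))
```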
Tier 2 Feedback
Experiment Reports
ALICE
- high activity
- CERN
- team ticket GGUS:116095 about expired CRLs on myproxy.cern.ch
- IPv6 connectivity issue in the Wigner data center was fixed
- Accessing CASTOR for reading or writing raw data files:
- Various constructive meetings between ALICE experts and the CASTOR team.
- Short- and longer-term ideas were discussed.
- Reco jobs now download the raw data files instead of streaming them.
- The effect should become visible when more data is ready for processing.
- Further ideas involving EOS are being investigated.
- DAQ and CASTOR experts also retraced how a particular file ended up lost.
- Thanks for the good support!
ATLAS
- High rate of activity in both jobs and data transfer
- The transatlantic network link was at high risk last week due to 3 out of 4 fibres being cut.
Marian will follow up with the WLCG networking group. They will evaluate whether and how they can help by informing the experiment.
- A large data transfer campaign created >1M transferring files per FTS server, and this seemed to slow down the transfers.
- Procedures and communication for sites upgrading FTS are not great. Do we need a new method?
- At least two sites were found to be generating gridmap files from the old VOMS server. This caused problems for ATLAS users with new certificates. Please send a broadcast to all sites asking them to check and change the configuration if necessary.
During the discussion it was decided not to send yet another broadcast, given the very large number of broadcasts sent over the past several months. The affected sites were one OSG site and Glasgow. Regarding Glasgow pointing at the old VOMS server, Jeremy conveyed the site's report as follows:
A 'rogue' configuration management tool replaced the current VOMS configuration with the old one. This affected only our older disk servers, with the rest of the site pointing to the new VOMS2 locations. Glasgow moved over to the new servers a long time ago (in fact, we now use the RPM method of installation to ensure we get the correct certificates), and at present we are not sure which management tool replaced the new configuration (we are split between Puppet on our new systems and CFEngine on our old ones). If the investigation uncovers anything of interest to other sites, we will share it via this meeting.
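As a loose illustration of the kind of check a site could run locally, here is a minimal sketch that scans gridmap-generation configuration for references to decommissioned VOMS hosts. Both the file paths and the "old" hostnames below are assumptions chosen for illustration; a site would substitute the retired endpoints and configuration files relevant to its own setup (whether managed by Puppet, CFEngine or by hand).

```python
import pathlib

# Hypothetical decommissioned VOMS endpoints; substitute the real retired hostnames.
OLD_VOMS_HOSTS = ["voms.example.org", "lcg-voms.example.org"]

# Assumed locations where VOMS endpoints / gridmap generation may be configured.
CANDIDATE_FILES = ["/etc/edg-mkgridmap.conf", "/etc/vomses"]

def find_stale_voms_references():
    """Return (file, line number, line) for every reference to an old VOMS host."""
    hits = []
    for name in CANDIDATE_FILES:
        path = pathlib.Path(name)
        if not path.is_file():
            continue
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if any(host in line for host in OLD_VOMS_HOSTS):
                hits.append((name, lineno, line.strip()))
    return hits

if __name__ == "__main__":
    for name, lineno, line in find_stale_voms_references():
        print(f"{name}:{lineno}: {line}")
```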
CMS
- Some issues with DPM Storage Elements in the AAA xrootd data federation
- They seem to have appeared after some regional redirectors were updated to more recent xrootd versions (4.2.x)
- Investigations ongoing between AAA folks, DPM experts and xrootd developers
Andrea M. said that a new version of the DPM xrootd component is needed to fix this. The development is ongoing.
- Last two weeks with rather high slot utilization
- Reached (again) 120k parallel jobs in the Global Pool
- File transfer issues from P5 to Wigner are overcome
- CMS DAQ team observes much improved transfer quality after router exchange
- Operational issues
- Ongoing investigations of "UNKNOWN" SAM results for CEs at CERN, GGUS:116069
- No obvious issues with job submission for production or analysis though
- Some ARGUS authentication issues are being investigated with the help of IT experts, GGUS:116092
LHCb
- Data Processing
- Validation and data quality verification of the data are ongoing. All data is buffered on disk-resident areas -> no staging
- Operations
- Ongoing discussion with IT/PES about worker nodes which execute payloads significantly more slowly (e.g. GGUS:116023)
Maite commented on the last point that, for the past 2 weeks, the T0 batch team has had a lab set up to test and improve this. A sustainable fix for these kinds of issues depends on the latest OpenStack release (Kilo), which will be gradually deployed over the coming weeks throughout October. In the meantime the T0 experts will continue working with the 4 experiments on new cases as they are reported, applying measures from the current set of mitigations.
Ongoing Task Forces and Working Groups
gLExec Deployment TF
HTTP Deployment TF
Information System Evolution
- WLCG Information System Use Cases document presented at the MB
- The MB gave feedback, requesting work on several areas that need further discussion and agreement within the TF:
- Future use cases: the use cases document describes the current interactions with the IS. The TF should now investigate what is actually needed, so that we can better understand how the IS could evolve.
- Static vs dynamic: the MB would like to see a summary of the types of information actually needed by the experiments, probably a more elaborate version of what is already summarised in this twiki under Types of Information, focusing only on the future use cases.
- "Indicative pledges" per site in REBUS: the TF requested the MB to include "indicative pledges" per site in REBUS. The MB would like to understand why this information is needed and to have a concrete proposal on how it will be collected.
- Installed capacity: a better definition, and maybe also a better name, is needed for what is today called "installed capacity". The MB would also like to understand why this information is needed and how it will be collected.
- T3s and opportunistic resources: it would be good to understand how information is going to be collected from T3s and opportunistic resources.
- OSG, NDGF and EGI will present their plans to provide information about their resources in the future at the next TF meeting. GOCDB will also present the latest features.
IPv6 Validation and Deployment TF
Update on the status of IPv6 deployment in WLCG (from Bruno Hoeft)
Tier-1 sites:
| Site | LHCOPN IPv6 peering | LHCONE IPv6 peering | perfSONAR via IPv6 |
| ASGC | - | - | - |
| BNL | not on their priority list | | |
| CH-CERN | yes | yes | LHC[OPN/ONE] |
| DE-KIT | yes | yes | LHC[OPN/ONE] |
| FNAL | yes | yes | LHC[OPN/ONE], but not yet visible in the Dashboard |
| FR-CCIN2P3 | yes | yes | LHC[OPN/ONE], but not yet visible in the Dashboard |
| IT-INFN-CNAF | - | yes | LHCONE |
| NDGF | yes | yes | LHC[OPN/ONE] |
| ES-PIC | yes | yes | LHCOPN |
| KISTI | started, but no peering implemented | | |
| NL-T1 | no peering implemented | | |
| TRIUMF | IPv6 peering planned for the end of 2015 | | |
| RRC-KI-T1 | - | - | - |
Tier-2 sites:
| Site | LHCONE IPv6 peering | perfSONAR |
| DESY | yes | LHCONE |
| CEA SACLAY | yes | - |
| ARNES | yes | - |
| WISC-MADISON | yes | - |
| UK sites | QMUL peers with LHCONE, but not for IPv6 | |
| Prague FZU | IPv6 still working, but the previous contact person left | |
There are additional IPv6 perfSONAR servers at Tier-2 centres, but not via LHCONE.
Marian said that FNAL & IN2P3 can be added to the dashboard now. Also, there are multiple T2 sites that could be added to the table, but they are not in LHCONE. Regarding the Prague comment in the table, the explanation is that the responsible person left, but the site still appears on the dashboard, so we still have their perfSONAR data for now.
Machine/Job Features
Middleware Readiness WG
The WG met yesterday. Full minutes in MWReadinessMeetingNotes20150916. Summary:
- The new DPM version is being tested via the ATLAS workflow by the Edinburgh Volunteer site.
- Many new sites showed interest in participating in MW Readiness testing with CentOS7. It is useful to anticipate the MW behaviour ahead of new HW purchases. DPM validation on CentOS/SL7 is already ongoing at Glasgow.
- ATLAS and CMS are asked to declare whether the xrootd 4 monitoring plugin is important for them or not. As it is now, it doesn't work with dCache v2.13.8.
- Despite the fact that FTS3 runs at very few sites, we decided to test it for Readiness. In this context, ATLAS & CMS are now using the CERN FTS3 pilot in their transfer test workflows.
- PIC successfully tested dCache v.2.13.8 for CMS.
- CNAF has obtained Indigo-DataCloud effort to strengthen the ARGUS development team. The ARGUS collaboration will meet again in early October. The problems faced at CERN with a CMS VOBOX are being investigated in ticket GGUS:116092.
- The next MW Readiness WG vidyo meeting will take place on Wednesday 28 October at 4pm CET.
Renaud asked about the experiments' experience with WNs on CentOS7. The current situation is that it doesn't work yet: the MW packages used on the WN need to be available for the new OS, but that should happen fairly soon. The big work is the usual porting of the experiment SW, followed by careful validation of the physics outputs of reference tasks against the corresponding SL6 results; all of that will take a lot of time and probably will not make much progress this year. However, we should pick this up early next year.
Multicore Deployment
Network and Transfer Metrics WG
- The OSG perfSONAR datastore entered production on 14 September, providing storage and an interface for all perfSONAR results.
- Publishing of the perfSONAR results using pre-production (ITB) services was successfully established; work is ongoing to resolve an issue with some event types not being published, and production is still pending the SLA.
- The WLCG-wide mesh campaign, with latency testing ramped up to 81 sonars, caused some instabilities on the sonars with 4 GB of RAM; we have therefore decreased the number of tests performed, which has improved the situation.
- The final version of perfSONAR 3.5 is planned for release on 28 September and will be auto-deployed to all WLCG instances. No issues were found in the testbed, but we plan to update a couple of production instances in advance to check that everything is fine.
- ESNet and OSG have started development on the perfSONAR configuration interface, an open source project motivated by the existing version developed for WLCG. There has also been interest from GEANT and ESNet in collaborating on an open source project based on the existing proximity service.
- A follow-up meeting was held to discuss the findings of the FTS performance study led by Saul Youssef (Boston University); a new optimization algorithm was proposed and discussed.
- The next WG meeting will be on 30 September (https://indico.cern.ch/event/400643/)
RFC proxies
Squid Monitoring and HTTP Proxy Discovery TFs
- Alastair is making progress on the next deliverable (a flexible squid registration exception list), but it is not quite ready to be put into production
- We agreed to change the documentation for squid registration to make it clear that T3s that are not already registered in GOCDB do not have to register their squids to have them monitored; they can instead send an email and we will add an exception
Action list
| Creation date | Description | Responsible | Status | Comments |
| 2015-09-03 | Status of multi-core accounting | John Gordon | ONGOING | A presentation about the plans to provide multicore accounting data in the Accounting portal should be given at the next Ops Coord meeting on October 1st (https://indico.cern.ch/event/393617/), since this is a long-standing issue |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm; otherwise we will get hit early next year when the change finally comes in Globus 6.1. A broadcast message has been sent by EGI. The team will now start working on the monitoring script that will show the sites that haven't changed and open GGUS tickets to remind them (see the sketch after this table). |
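Since the remaining work concerns host certificates that do not satisfy the stricter host-name matching of the new Globus algorithm, one kind of test such a monitoring script might perform is to inspect the subjectAltName entries of each service certificate. The sketch below is an illustration under that assumption, not the actual script mentioned above: it fetches a service certificate over TLS and lists its DNS SAN entries, using the third-party 'cryptography' package; the hostname and port are placeholders, and the wildcard matching is deliberately simplified.

```python
import ssl
from cryptography import x509  # third-party 'cryptography' package

def san_dns_names(host, port=8443):
    """Fetch the service certificate over TLS and return its subjectAltName DNS entries."""
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())
    try:
        san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
    except x509.ExtensionNotFound:
        return []                      # no SAN at all: strict name matching will fail
    return san.value.get_values_for_type(x509.DNSName)

if __name__ == "__main__":
    # Placeholder endpoint; replace with the service host and TLS port to check.
    host = "se.example.org"
    names = san_dns_names(host)
    # Simplified match: exact name or a one-level wildcard entry.
    ok = any(n == host or (n.startswith("*.") and host.endswith(n[1:])) for n in names)
    print(f"{host}: SAN={names} -> {'looks OK' if ok else 'host certificate needs attention'}")
```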
Specific actions for experiments
Specific actions for sites
AOB
--
MariaDimou - 2015-09-14