WLCG Operations Coordination Minutes, December 3rd 2015
Highlights
- Please register for the workshop at https://indico.cern.ch/e/WLCG-Workshop-Lisbon-2016
- RedHat just provided a fix for openldap, to avoid current crashes affecting Top BDII and ARC-CE. Tests of this fix can now start.
- The Information Systems (InfoSys) TF Future Use Cases Document is now ready in the WLCG Document Repository.
- The French and Spanish sites can now have host certificates compliant with the Globus change in validation; Spain will use the Terena CA, France has introduced the feature in its CA
- MJF and Infosys TFs are harmonising their position about definitions for number of cores and HS06 values
- VOMS is not working with IPv6, probably due to a relatively recent change (GGUS:117987).
Agenda
Attendance
- local: Andrea Sciabà (chair), Maria Dimou (minutes), Maarten Litmaath, Andrea Manzi, Marc Slater, Jerôme Belleman, Maria Alandes.
- remote: Alessandra Doria, Michael Ernst, Christoph Wissing, Dave Mason, Vincenzo Spinoso, Peter Gronbech, Andreas Petzold, Andrew McNab, Di Qing, Massimo Sgaravatto, Ulf Bobson Severin Tigerstedt, Josep (Pepe) Flix, Julia Andreeva, Andrea Valassi, Gareth Smith, Antonio Yzquierdo, Jeremy Coles (part).
- apologies: Catherine Biscarat, Alessandro Di Girolamo, David Cameron
Operations News
Middleware News
- Issues:
- News from RedHat regarding the openldap crash affecting Top BDII and ARC-CE: today they provided us with a fix to test. We will report any news at the next ops or ops coord meetings.
- T0 and T1 services
- JINR
- Minor postgres upgrade to 9.4.5
- PIC
- Major dCache upgrade to v 2.13.13
Tier 0 News
- Working on adding more capacity to Condor cluster.
- LSF 9 software deployed on all worker nodes for both the ATLAS T0 instance and the main one.
DB News
Tier 1 Feedback
- CC-IN2P3 (France, C. Biscarat) - Globus host certificate validation change - our CA is now able to deliver compliant certificates, allowing aliases to be declared in the AltName. We are in the process of changing the certificate on the affected hosts (2 SRM servers).
- PIC (Spain, J. Flix) - Globus host certificate validation change - PIC is now using TCS (Terena) certificates, and we have already solved the issues for all of the hosts with aliases. We are in the process of migrating the remaining machines to TCS certificates, which will happen rather soon.
- PIC (Spain, J. Casals) - Fixed some issues in the Nagios-plugin for SAM3: https://github.com/jcasals/nagios-plugins-lcgsam
- NDGF-T1(Nordics, Tigerstedt) will upgrade dCache to 2.14.x on 14.12.2015, full day outage.
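The Globus validation change reported by CC-IN2P3 and PIC requires every alias of a host to be present in the AltName of its certificate. A minimal sketch of that check, assuming the subjectAltName DNS entries have already been extracted from the certificate. The host names are hypothetical, and the wildcard handling is a simplification, not the Globus implementation:

```python
def name_matches(pattern: str, hostname: str) -> bool:
    """Compare one dNSName entry with a hostname, case-insensitively.

    A leading "*." matches exactly one left-most label (a simplification).
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if pattern.startswith("*."):
        # Drop the left-most label of the hostname and compare the rest.
        _, _, rest = hostname.partition(".")
        return bool(rest) and rest == pattern[2:]
    return pattern == hostname


def uncovered_aliases(san_dns_names, aliases):
    """Return the aliases that no subjectAltName dNSName entry covers."""
    return [alias for alias in aliases
            if not any(name_matches(p, alias) for p in san_dns_names)]


# Hypothetical SRM host with one alias missing from its certificate.
san = ["srm01.example.org", "srm.example.org"]
print(uncovered_aliases(san, ["srm.example.org", "srm-atlas.example.org"]))
```

Any alias returned by `uncovered_aliases` would need to be added to the certificate request before the host can be reached under that name.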
Tier 2 Feedback
Experiments Reports
ALICE
- generally normal to high activity
- so far the heavy ion run has been smooth from the grid perspective!
- reco jobs run very successfully
- their RSS memory consumption has stayed below a maximum of ~2.5 GB
- we have to see what happens at the planned higher beam intensities
- this has allowed the use of normal queues at various T1
- at CERN the fraction of two-core jobs is being lowered in steps
- CERN: submission to HTCondor CE in production since yesterday evening
- CERN: TEAM ticket GGUS:118062 opened Monday evening. ALICE was severely impacted by an OpenStack issue:
- the standard build system could not be used to release analysis updates
- a local mini build system was put together for the most urgent cases
- thanks to the OpenStack team for solving the complex issue as fast as possible!
ATLAS
- Heavy Ion data taking: in general everything is OK, nothing particular to worry about. Tier-0 performance in terms of events/second reconstructed by the whole cluster is quite low (a few tens of Hz); huge I/O wait was observed on the Wigner nodes with spinning disks. Those nodes have now been configured by the ATLAS Tier-0 to run fewer jobs than in their standard batch configuration, and the performance has improved.
- Reprocessing test now ongoing: the plan is to launch a full reprocessing campaign on the 14th of December. The schedule is quite tight since there are still problems with the release; please stay tuned to the Tuesday ADC weekly meetings in the coming weeks.
CMS
- Heavy Ion Run is ongoing
- Permission problem in EOS: GGUS:118027
- Issue with mapping
- Fixed by EOS team during the weekend - Thanks!
- Some storage pools disappearing from the network: GGUS:118082, GGUS:118037
- Investigated by CERN storage and network teams
- Only seen by CMS?
- CMS Tier-0 workflows are driving some CERN OpenStack hardware to its limits: GGUS:118056
- Staging problem at KIT: GGUS:117910
- Led to too many queued transfers within the CMS transfer system
- Quite some overall performance degradation, also affecting transfers where KIT is not involved
- Situation is improving (at KIT and globally)
- Had a little ticketing campaign for DPM sites to move to DPM 1.8.10
- Earlier versions have issues with recent global/regional redirectors
LHCb
- Operations
- Currently processing pp reference run
- Finished 13TeV pp data processing
- Will be starting processing of Heavy Ion runs soon
- Significant MC generation incoming
- Issues
- Problems with users accessing files at IN2P3 were experienced last Tuesday afternoon. They were assumed to be due to CA issues, as both went away at the same time, but more likely the two were just fixed at the same time by coincidence (GGUS:118077)
- Problem with an RRCKI tape that was put offline, preventing access to certain files, is now solved
- Developments
- MC simulation workflows have been executed successfully on commercial clouds, on both DBCE (up to 600 simultaneous jobs running) and Azure (up to 1000 simultaneous jobs running, high rate of stalled jobs under investigation).
Ongoing Task Forces and Working Groups
gLExec Deployment TF
Machine/Job Features TF
- Ongoing discussions clarifying key/value pairs: some changes, some expanded definitions
- Attempting to be consistent with WLCG Information Systems Evolution TF
- 2nd draft of HSF technical note to record the communication procedure and the key/value pair definitions
- Deployed at several batch sites, and many VM-based installations (all the ones using Vac/Vcycle)
- DIRAC reading time limit information from MJF in LHCb pilot jobs and pilot VMs.
- Next steps to review experience with implementations and installations, and update in view of technical note discussions.
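As an illustration of how a pilot might read a time limit from MJF, here is a minimal sketch. It assumes the convention in which the `JOBFEATURES` environment variable points to a directory holding one file per key. The key name `wall_limit_secs` is also an assumption, since the key/value definitions were still being revised at the time of this meeting. This is not the DIRAC implementation.

```python
import os
from pathlib import Path
from typing import Optional


def read_feature(env_var: str, key: str) -> Optional[str]:
    """Return the value stored for `key`, or None if it is not published."""
    base = os.environ.get(env_var)
    if not base:
        return None  # MJF is not available on this node
    try:
        return (Path(base) / key).read_text().strip()
    except OSError:
        return None  # key absent or unreadable


def wall_limit_seconds(default: int) -> int:
    """Wall-clock limit from MJF, falling back to `default`."""
    value = read_feature("JOBFEATURES", "wall_limit_secs")  # assumed key name
    try:
        return int(value) if value is not None else default
    except ValueError:
        return default  # published value is not an integer
```

Falling back to a default keeps the pilot usable at sites where MJF is not deployed or a key is missing.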
HTTP Deployment TF
Information System Evolution
- The Future Use Cases Document is now ready in the WLCG Document Repository (PDF). There is general agreement that a central information system owned by WLCG is an interesting idea. For some VOs the requirement is stronger than for others, but all VOs agree that they would rely on a central information system that provides good quality information. Activities like WLCG Monitoring and Operations will definitely rely on such a tool. The WLCG Information System should:
- Cache information from heterogeneous resources by regularly collecting information from primary data sources for WLCG service discovery (Now GOCDB, OIM and BDII, but the list of primary resources can evolve in the future).
- Provide a consistent interface for all interested WLCG clients offering an intermediate layer between the sources of information maintained by EGI and OSG.
- Include grid and non grid resources, like HPC and Clouds and be flexible enough to be able to include new types of resources.
- Validate information before it gets published, applying corrective actions if necessary.
- Log information provenance, namely when, how and by whom information was provided.
- Starting to prepare a Roadmap to GLUE 2.0 so that VOs and WLCG clients start consuming GLUE 2.0 information and we can plan the decommissioning of GLUE 1.3 at some point.
- EGI presented their plans to move to GLUE 2.0. Main showstopper is GLUE 2 WMS that was never tested in production. EGI is now trying to understand its actual use.
- Waiting for OSG input about their plans to provide information to WLCG once they stop publishing in the BDII and whether we could expect information published in GLUE 2 after the implementation of the ClassAds to GLUE 2 translator.
- Ongoing discussion to agree on a better definition of the GLUE 2 attributes defining HS06 (GLUE2BenchmarkValue) and Logical CPUs (GLUE2ExecutionEnvironmentLogicalCPUs), so that sites clearly understand what they are expected to publish in these attributes.
Alessandra Doria (Napoli) expressed the sites' appreciation for the TF's definitions' dissemination via the lcg-rollout mailing list and the poll for feedback from the sites.
IPv6 Validation and Deployment TF
- The NAGIOS service set-up is still being tuned.
- VOMS still doesn't work with IPv6. There is a ticket to follow this up. No problem with voms-admin.
- The ARGUS - IPv6 status was discussed at the ARGUS collaboration meeting on December 2nd. Extract from the minutes: IPv6 support: not really tested but no problem expected as Java has a good IPv6 support and as ARGUS is binding to all interfaces/addresses. Sharing a lot of network code with VOMS Admin that is IPv6 compliant: the only known issues are with VOMS that is not using the same code.
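A first step when following up IPv6 readiness is to check which address families a service host name resolves to. A minimal sketch using only the Python standard library. The host name in the usage comment is hypothetical, and a successful lookup says nothing about whether the service itself (VOMS, for instance) handles IPv6 connections correctly:

```python
import ipaddress
import socket


def address_families(addresses):
    """Split textual IP addresses into IPv4 and IPv6 lists."""
    v4, v6 = [], []
    for text in addresses:
        ip = ipaddress.ip_address(text)
        (v6 if ip.version == 6 else v4).append(str(ip))
    return {"ipv4": v4, "ipv6": v6}


def resolve(host, port=443):
    """Resolve `host` and report the address families returned for it."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # info[4][0] is the address; drop any "%scope" suffix on link-local IPv6.
    unique = sorted({info[4][0].split("%")[0] for info in infos})
    return address_families(unique)


# Usage (needs network access): resolve("voms.example.org")
```

An empty `ipv6` list means the host is not published over IPv6 at all, so there is nothing to test at the service level yet.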
Middleware Readiness WG
The http://indico.cern.ch/e/MW-Readiness_14 meeting yesterday, Dec. 2nd, was virtual. Summary:
- New MW versions are now under test via the ATLAS workflow. The dCache v.2.10.44 and v.2.14.0 and HTCondor v.8.4.1 verifications are already completed.
- BDII v. 5.2.23 is the new MW product being verified: the verification is ongoing at Brunel on CentOS7 and completed at GRIF-IRFU.
- The ARC-CE v.5.0.3 verification is completed for CMS.
- The ARGUS Collaboration met twice since our last meeting: on Nov. 6th and Dec. 2nd.
- Please comment on the suggested date for the next meeting: Wednesday 20th January 2016 at 4pm CET. Objections with alternative dates should be sent to wlcg-ops-coord-wg-middleware@cern.ch
The full minutes have details per product and site. The newly verified top-BDII is now in UMD (done by EGI).
Multicore Deployment
Network and Transfer Metrics WG
RFC proxies
- CMS have switched test pilot factories to RFC proxies
Squid Monitoring and HTTP Proxy Discovery TFs
Action list
| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 15 GGUS tickets opened for SRM and Myproxy certificates not correct, 6 already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). 5-6 tickets still open at the 2015-12-03 meeting. All for sites which have no technical issues to proceed. |
| 2015-10-01 | Follow up on reporting of number of processors with PBS | John Gordon | CLOSED | Everyone uses the development instance. |
| 2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | ONGOING | A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting |
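Until a shared calendar exists, the check described in the last action could be automated once the downtime lists have been exported from GOCDB and OIM. A minimal sketch of the overlap detection only, assuming the downtimes are already available as (site, start, end) tuples. Fetching them is not shown, and the second entry in the sample data is invented for illustration:

```python
from datetime import datetime
from itertools import combinations


def concurrent_outages(downtimes):
    """Return pairs of sites whose OUTAGE intervals overlap in time."""
    clashes = []
    for a, b in combinations(downtimes, 2):
        (site_a, start_a, end_a), (site_b, start_b, end_b) = a, b
        # Two half-open intervals overlap when each starts before the other ends.
        if site_a != site_b and start_a < end_b and start_b < end_a:
            clashes.append((site_a, site_b))
    return clashes


sample = [
    # Full-day outage announced in these minutes.
    ("NDGF-T1", datetime(2015, 12, 14, 0, 0), datetime(2015, 12, 15, 0, 0)),
    # Invented entry, for illustration only.
    ("T1-EXAMPLE", datetime(2015, 12, 14, 8, 0), datetime(2015, 12, 14, 12, 0)),
]
print(concurrent_outages(sample))
```

Comparing every pair is quadratic in the number of downtimes, which is acceptable for the handful of Tier-1 entries expected in any given week.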
Specific actions for experiments
Specific actions for sites
AOB
--
MariaDimou - 2015-12-01