Full Report |
Highlights |
06 October 2014 |
Batch jobs not dispatched on Sep 29-30 |
SLC Kerberos domain joining failing on Sep 30 |
SVN http access not available on Oct 01 |
WMS decommissioning at CERN (SAM instances) |
CvmFS mounts inside CERN faulty on Sep 29-30 |
Termination of SLC5-based central lxplus (lxplus5) service |
Websites not resolved (DNS issue) on 26 Sep |
EOSATLAS unavailable on 30 Sep |
|
IT News for Experiments 6 October 2014
Batch jobs not dispatched on Sep 29-30
The batch service was severely affected by the CVMFS issue. The availability of CVMFS on each worker node is used as a scheduling criterion: if Lemon detects a CVMFS problem, the host stops accepting new jobs. The general CVMFS problem therefore disrupted job dispatching completely, effectively putting the whole farm into draining mode. When the issue was finally solved, many worker nodes were idle. The occasion was used to schedule a kernel upgrade on the nodes that had no payload. Unfortunately, setting the nodes into draining mode from the nodes themselves apparently did not work as expected. In addition, a large fraction of the worker nodes did not come back by themselves and had to be reset from Nova, which generated a large ticket of no-contact alarms.
See more at:
https://cern.service-now.com/service-portal/view-outage.do?n=OTG0014388
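The coupling between a node-level health probe and job dispatching described above can be illustrated with a minimal sketch. It is not the actual Lemon or batch scheduler logic; `WorkerNode`, `cvmfs_ok`, `dispatchable` and `farm_state` are hypothetical names chosen for illustration. It shows why a failure of a dependency shared by all nodes leaves the scheduler with no dispatch targets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerNode:
    name: str
    cvmfs_ok: bool  # assumed result of a monitoring probe (hypothetical field)


def accepts_new_jobs(node: WorkerNode) -> bool:
    # A node is eligible for new jobs only while its CVMFS probe is healthy.
    return node.cvmfs_ok


def dispatchable(nodes: List[WorkerNode]) -> List[str]:
    # Names of the nodes the scheduler may still send jobs to.
    return [n.name for n in nodes if accepts_new_jobs(n)]


def farm_state(nodes: List[WorkerNode]) -> str:
    eligible = dispatchable(nodes)
    if not eligible:
        # No node takes new work: the farm behaves as if it were draining.
        return "draining"
    if len(eligible) < len(nodes):
        return "degraded"
    return "normal"
```

With one unhealthy node out of two, `farm_state` returns `"degraded"`; when the probe fails on every node at once, it returns `"draining"` even though no operator asked for a drain.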
SLC Kerberos domain joining failing on Sep 30
Due to an expired certificate, SLC machines trying to join the Kerberos / AD domain were failing from 10:17 to 10:55 on Tuesday 30 Sept. The certificate was replaced and service went back to normal.
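As an illustration only, an expiry check of the following kind could flag a certificate before it lapses. This is a sketch under stated assumptions, not a description of the monitoring in place: the function names and the 30-day warning threshold are hypothetical, and the certificate's end-of-validity time is assumed to be already available as a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    # Remaining validity in days; negative once the certificate has expired.
    return (not_after - now) / timedelta(days=1)


def expiry_status(not_after: datetime, now: datetime, warn_days: int = 30) -> str:
    # warn_days is an assumed threshold, chosen here for illustration.
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        return "renew soon"
    return "ok"


if __name__ == "__main__":
    # Hypothetical dates, used only to exercise the function.
    now = datetime(2014, 9, 1, tzinfo=timezone.utc)
    not_after = datetime(2014, 9, 30, 10, 17, tzinfo=timezone.utc)
    print(expiry_status(not_after, now))
```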
SVN http access not available on Oct 01
HTTP access to SVN failed intermittently. The problem seemed to be caused by an unexpected change related to httpd, which required httpd to be restarted. SSH access continued to work normally.
See more at:
https://cern.service-now.com/service-portal/view-outage.do?n=OTG0014447
WMS decommissioning at CERN (SAM instances)
As announced previously, the Workload Management System (WMS) at CERN is being decommissioned.
Update: The WMS service has been decommissioned; CERN does not run any WMS service any longer.
CvmFS mounts inside CERN faulty on Sep 29-30
CVMFS mounts on lxplus and batch were hung and blocked, with serious knock-on effects on both services; batch was drained almost entirely. The analysis of what exactly happened is not yet complete, but the problems on batch and lxplus were most likely provoked by a remote Stratum 1 service that went up and down. (The root cause may well have been Squid servers at CERN with a filled-up file system.) This oscillation caused clients referring to that Stratum 1 to hang, and failing over to another Stratum 1 after timing out did not appear to work correctly. We tried to reconfigure the CERN clients not to use the Stratum 1 in question, but this reconfiguration in turn blocked on a number of client nodes, making the problem worse. The problems were resolved manually. We are establishing procedures to handle the problem more effectively and efficiently should it happen again. We are also in touch with the CVMFS developers to understand why the client behaved the way it did, and what can be done to protect against a recurrence.
See more at:
https://cern.service-now.com/service-portal/view-outage.do?n=OTG0014388
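The failover behaviour that was expected here (try one Stratum 1, give up after a timeout, move on to the next) can be sketched generically as follows. This is not the CVMFS client's implementation; `fetch_with_failover`, `AllServersFailed` and the injected `fetch` callable are hypothetical names, and the server list and timeout value are assumptions.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllServersFailed(Exception):
    """Raised when every server in the list timed out or refused the request."""


def fetch_with_failover(
    servers: Sequence[str],
    fetch: Callable[[str, float], bytes],
    timeout_s: float = 5.0,  # assumed per-server timeout
) -> Tuple[str, bytes]:
    # Try each server in order; a timeout or connection error on one server
    # moves on to the next instead of blocking the caller indefinitely.
    errors: Dict[str, Exception] = {}
    for server in servers:
        try:
            return server, fetch(server, timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            errors[server] = exc
    raise AllServersFailed(errors)
```

The sketch only works if `fetch` reliably raises once `timeout_s` has elapsed. A server that oscillates between up and down can accept a connection and then stall, which is one way a client may hang despite a nominal failover list.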
Termination of SLC5-based central lxplus (lxplus5) service
The SLC5-based central lxplus (lxplus5) service is scheduled to be stopped on Monday 13 October 2014.
Websites not resolved (DNS issue) on 26 Sep
Human error during an urgent software update to the DNS servers (to address the bash security issue) led to unavailability of the DNS service for the GPN and public internet. The technical network was not affected.
See more at:
https://cern.service-now.com/service-portal/view-outage.do?n=OTG0014371
EOSATLAS unavailable on 30 Sep
EOSATLAS was unavailable after a crash of the namespace. A stuck gdb process on the main daemon was preventing the automatic service restart. The service was restarted manually.
See more at:
https://cern.service-now.com/service-portal/view-outage.do?n=OTG0014401
--
MiroslavPotocky - 06 Oct 2014