Workflow Team - extraordinary meeting 17 Oct 2014

Attending: FNAL: Jen and Luis. Julian was unable to attend but provided the following report.

CERN power outage yesterday

Bottom line:

  • GlideIn, Factories, Frontier, etc. were restarted. CMSWeb is up and running (no info lost so far).
  • WMAgent machines weren't rebooted, but network unavailability crashed all components. Luis restarted everything.
  • We should check for failed jobs: network errors, xrootd errors, submission failures, and missing jobs.
    • Not a big impact though, only 10K jobs running at the time.
  • CERN is taking measures:
    • Alternate communication channels.
    • Server redundancy at Wigner.
    • Alternate documentation availability.
  • Arguably we were among the least affected. No serious or permanent damage found so far.

  • For T0: the jobs finished and the output files were being copied to EOS, but by the time they got there the machines were down due to the power outage.
    • Midweek global run 8
    • The agent marked the jobs as successful, but the files failed to transfer completely.
    • Discussing the possibility of using the recovery procedure to recreate the missing files, rather than attempting to figure out where the jobs ran and finding the files on one of hundreds of machines. Note: log files were deleted because the agent thought the jobs were successful.
    • Dirk is tracking down this failure mode and trying to figure out how we can avoid it in the future.
    • Dave is suggesting flipping switches in T0AST to get the jobs to rerun.

Time-line of the power cut

Detailed Time-line by Ivan Glushkov

In summary, thanks to Jose, James, Krista, Dirk, Luis (at CERN), Stefan and Alessandro we managed to get most of the services back on track. Please have a look at your service and let us know if we’ve missed something important. Here is my log file for the night [1].

        Kind regards,
        Ivan Glushkov

21:43 - Power in building 8 back on
        - Local networking - wifi and LAN - operational
        - Phones are not affected
        - ping to all nodes gives nothing
        - twiki is down - no documentation - we need a backup of our documentation somewhere outside CERN..
        - ggus is fine (it is not hosted at CERN)

        - trying to get in contact with Jose (the current CRC)
        - works - it is not one of our voboxes
        - LanDB works - at least I can get some info for some nodes already
        - Services status:
                - hypernews (vocms91 / vocms92) are up
                - lxplus / lxvoadm / aiadm - no connection (ping works, hangs on …)
        - (from the official IT Service Outage Report) After having received the green light from EL, sysadmins and operations are restarting the equipment powered by UPS 3.

22:06 - new power cut

22:17 - power back on
        - back to square one - nothing works
        - wifi / LAN / phone - ok
        - I just received the announcement I sent one hour ago

        - SLS woke up - many e-mails..

        - got in contact with Jose
        - got in contact with Alessandro - SDT nodes / services - ok
        - no access to lxplus / lxvoadm / aiadm yet

        - hypernews is back
        - cmsonline is back
        - things are slowly coming back here:

22:40 (from the official IT Service Outage Report)
        - The power is restored on all boxes. Sysadmins are checking them. The CC operator is calling the Service Managers on the list of Service Managers to be called in case of a major problem in the CC. (Was anyone called in CMS? Is this going to be the case during data taking also?)

        - cmslogbook and Stefan… lxvoadm / aiadm are not accessible. No login to the nodes from my laptop (the old CVI nodes)
        - Stefan reports that apache is back but not elogd

        - Luis and frontier - ping - ok, no login - he is not at CERN
        - insider tip - the times in these plots should not be zero or infinity:

        - twiki is back - I am able to see..
        - cmsweb
                - DAS: interface - ok, functionality - DBS3 error !!!
                - phedex: interface - ok, functionality - waiting for agents
                - dashboard: interface - ok, functionality - ok

        - phedex agents - up, phedex functionality - ok
        - popularity - interface - ok, functionality - ok
        - request manager - interface - ok, functionality - no access
        - request manager 2 - interface - ok, functionality - details - n/a
        - DQM
                - relval seems fine
                - offline - also
        - SiteDB down
        - hammercloud - interface - ok, functionality - ok
        - site configuration - down

        - cmsdoc - back online (restarted apache on vocms153 and vocms154)

        - siteconf - back online - did Diego do something?
        - SiteDB - also up - definitely Diego.. or maybe the central databases were down?

        - DBS is back (without action) - tested with a */*/* dataset query
        - login to lxplus / lxvoadm / aiadm is back (I am guessing that was an AFS problem)
        - cmslogbook: apache - ok, not able to restart elogd, Stefan is sleeping already - rebooting the node
        - cmslogbook: the reboot restarted elogd, and reloading apache brought it back. I am leaving the preprod instance for Stefan to bring back
        - docdb also works
        - Alison is educating me on glidein infrastructure survival techniques..
        - I just started and James kicked in - leaving it to him.
        - Dirk is reporting that glideinwms works and T0 replay is restarted. Jobs are running and finishing successfully. Even on AI.
        - looking at Andreas’ nodes:
                - vocms150 / vocms151 - do not work - something for Giulio
        - Krista confirms that the glidein infrastructure looks fine..
        - some of Andreas’ backends did not come up gracefully:
                - … - not working
                - … - not working
                - … - not working
                - … - ok
                - … - ok
                - … - ok
        - pinging all nodes. missing:
                - vocms143 (Glidein frontend, preprod) - on, no ping - rebooted and now the node is accessible
                - vocms149 (DB backend, production services LB) - on, no ping - rebooted and now the node is accessible
                - vocms158 (Glidein factory, preprod) - on, no ping - rebooted and now the node is accessible
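The manual sweep above can be approximated with a small script. This is a hedged sketch, not the tooling actually used that night: it checks TCP reachability (the SSH port by default) rather than ICMP ping, and the node names are simply taken from the list above.

```python
import socket

def node_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False

def find_missing(nodes, port=22):
    """Return the subset of nodes that do not answer, like the manual sweep."""
    return [n for n in nodes if not node_reachable(n, port)]

if __name__ == "__main__":
    # Nodes flagged as missing in the sweep above.
    for node in ["vocms143", "vocms149", "vocms158"]:
        state = "ok" if node_reachable(node) else "MISSING"
        print(f"{node}: {state}")
```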

        - all Puppet nodes had a Puppet run within the last 30 minutes, which means they are on and reporting
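The "Puppet run within the last 30 minutes" check can be scripted by reading each node's last_run_summary.yaml and comparing its last_run timestamp to the current time. A minimal sketch under assumptions: the default Puppet 3 state file (/var/lib/puppet/state/last_run_summary.yaml) with its usual "last_run: <epoch>" line; the path and key layout may differ per installation.

```python
import re
import time

def last_run_epoch(summary_text):
    """Extract the 'last_run' epoch seconds from last_run_summary.yaml text.

    Assumes the default 'last_run: <seconds>' line in the time section.
    """
    match = re.search(r"last_run:\s*(\d+)", summary_text)
    return int(match.group(1)) if match else None

def ran_within(summary_text, window_s=1800, now=None):
    """True if the node's last Puppet run finished within window_s seconds."""
    now = time.time() if now is None else now
    ts = last_run_epoch(summary_text)
    return ts is not None and (now - ts) <= window_s
```

Running this against each node's summary file (e.g. over SSH) reproduces the "on and reporting" check with a 1800-second window.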

        - xrootd SLS monitoring looks fine

-- JulianBadillo - 17 Oct 2014

Topic revision: r2 - 2014-10-17 - JenniferAdelmanMcCarthy