Ceph Block Store incident and associated service incidents



As noted in OTG:0054944, parts of the Ceph infrastructure went down at around 10:00 on Thursday 20 February 2020.

Ceph is used as the backing store for VMs providing attached disks and shared filesystems to many services; consequently, several CERN site and VO services were affected.


What was affected (Grid-facing services)

  • The CERN Batch Service was unavailable for job submission. Longer-running jobs eventually failed after losing contact with the batch schedds for too long. ( OTG:0054948 )

  • The CVMFS stratum-0 was unavailable for some hours ( OTG:005496 )

  • The WLCG monitoring infrastructure was degraded, recovering the backlog some hours later ( OTG:0054946 )

  • Multiple VO services, both local and those supporting the experiments' distributed production and analysis, were unavailable.

What was not affected (Grid-facing services)

  • The EOS Grid services were not affected

  • The BDII, MyProxy and VOMS services were not affected

Timeline of the incident


  • 10:10: lxbatch submissions blocked
  • 18:00: lxbatch job submissions open again
  • 19:30: lxbatch back to normal levels


  • In most cases, secondary service degradation was caused by a direct dependency on Ceph.

  • In a few cases (e.g. Indico file access) the effect was tertiary, due to affected AFS volumes hosted on VMs.

  • The services' dependencies on Ceph were reviewed by IT management at the C5 Service Meeting (PDF) and were generally found to be reasonable and a required part of service delivery.

Follow up



Ceph OTG:

Linked OTGs:

Topic revision: r1 - 2020-02-27 - GavinMcCance