WLCG Critical Services

The urgency and impact values on this page have been discussed and agreed between the experiments, WLCG operations and the IT/Tier-0 service managers.

Definitions

Impact
the amount of "damage" that a service unavailability causes to operations or people if no action is taken
Urgency
the delay between the start of the service unavailability and the time the full impact is reached
Functional service
a high level service corresponding to a particular function of the computing system, as defined in the WLCG MoU Annex 3
Specific service
a service contributing to one or more functional services

CERN functional services

Operations-related services
High bandwidth connectivity from detector area to computer centre
Recording and permanent storage in an MSS of raw and reconstructed data
Disk storage of reconstructed data
Distribution of raw and reconstructed data to Tier-1 sites in time with data acquisition
Prompt reconstruction, calibration and alignment
Storage and distribution of conditions data
Data analysis facility
Databases
VO management services
Tools and support services
Tools and services for application development (CVS, SVN, etc.)
Desktop services (email, web, Twiki, Indico, Vidyo, etc.)

Tier-1 functional services

Operations-related services
Raw and reconstructed data import from Tier-0
Simulated and processed data import from other WLCG centres
MSS archival storage of raw, reconstructed, processed and simulated data
Disk storage for data and temporary files
Provision of data access to other WLCG centres
Data analysis and reprocessing
Other experiment services
Network and data transfer services to Tier-0 and Tier-1 sites (high bandwidth) and to Tier-2 sites
Databases

Tier-2 functional services

Operations-related services
Disk storage for data and temporary files
Provision of data access to other WLCG centres
Data analysis
Simulation and data processing
Other experiment services
Network and data transfer services

Impact

Impact on operations

Level Definition
10 Most operations services stop
9 Some operations services stop
8 One operations service stops
7 Most operations services disrupted
6 Some operations services disrupted
5 One operations service disrupted
4 Some support services stop
3 One support service stops
2 Some support services disrupted
1 One support service disrupted

Impact on people

Level Definition
10 Whole VO affected
8 Users affected > 50%
5 10% < users affected < 50%
3 Users affected < 10%
1 A single user affected

The overall impact is taken as the maximum of the impact on operations and the impact on people.
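
As a minimal sketch of this rule, the Python snippet below maps a share of affected users to the "impact on people" level and then takes the larger of the two impact levels. The function names are hypothetical, and the handling of exactly 10% and 50% is an assumption, since the table leaves those boundaries undefined.

```python
def people_impact(fraction_affected, single_user=False, whole_vo=False):
    """Map the share of affected users (0.0 to 1.0) to an impact level.

    Assumption: the table does not say which level applies at exactly
    10% or 50%; this sketch assigns both boundaries to level 5.
    """
    if whole_vo:
        return 10
    if single_user:
        return 1
    if fraction_affected > 0.5:
        return 8
    if fraction_affected >= 0.1:
        return 5
    return 3


def overall_impact(operations_level, people_level):
    """Overall impact is the maximum of the two impact levels."""
    return max(operations_level, people_level)


# Example: one operations service stops (level 8) while 5% of users are affected (level 3).
print(overall_impact(8, people_impact(0.05)))  # prints 8
```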

Urgency

Level Time (hours)
10 0
9 0.5
8 1
7 2
6 4
5 6
4 12
3 24
2 48
1 72
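
The urgency table can be read as a lookup from level to hours. The sketch below, in Python, stores it as a dictionary and derives a level from an estimated delay until full impact. The function name `urgency_level` is hypothetical, and the choice to assign an in-between delay to the more urgent neighbouring level is an assumption, not something the table prescribes.

```python
# Urgency level -> hours until the full impact is reached (values from the table).
URGENCY_HOURS = {10: 0, 9: 0.5, 8: 1, 7: 2, 6: 4, 5: 6, 4: 12, 3: 24, 2: 48, 1: 72}


def urgency_level(delay_hours):
    """Return the urgency level for a given delay until full impact.

    Assumption: a delay that falls between two table entries is assigned
    the more urgent (higher) of the two neighbouring levels.
    """
    if delay_hours < 0:
        raise ValueError("delay_hours must be non-negative")
    # Keep every level whose time has already elapsed, then take the lowest one.
    return min(level for level, hours in URGENCY_HOURS.items() if hours <= delay_hours)


print(urgency_level(3))    # between 2 h (level 7) and 4 h (level 6): prints 7
print(urgency_level(100))  # beyond 72 h: prints 1
```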

Criticality of Tier-0/CERN services

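| Service | ALICE Urgency | ALICE Impact | ATLAS Urgency | ATLAS Impact | CMS Urgency | CMS Impact | LHCb Urgency | LHCb Impact |
| Px→Computer centre network | 6 | 10 | 4 | 10 | 3 | 10 | 10 | 10 |
| WLCG network (LHCOPN, GPN) | 6 | 10 | 7 | 9 | 7 | 9 | 7 | 10 |
| CERN Oracle online | 10 | 10 | 9 | 10 | 10 | 10 | - | - |
| CERN Oracle Tier-0 (inc. streaming) | 4 | 6 | 8 | 9 | 6 | 10 | 10 | 10 |
| DB-on-demand |  |  |  |  |  |  | 10 | 10 |
| Frontier and squid | - | - | 6 | 8 | 6 | 10 | - | - |
| CASTOR tape (CTA soon) | 4 | 6 | 7 | 8 | 2 | 8 | 2 | 8 |
| CASTOR disk (to disappear soon) | 4 | 6 | - | - | - | - | - | - |
| EOS | 5 | 10 | 7 | 10 | 6 | 8 | 6 | 8 |
| Batch service | 3 | 10 | 6 | 9 | 5 | 9 | 5 | 6 |
| CE | 3 | 10 | 6 | 8 | 3 | 3 | 5 | 6 |
| FTS | - | - | 7 | 8 | 4 | 6 | 5 | 9 |
| VOM(R)S | 3 | 10 | 4 | 10 | 4 | 10 | 7 | 10 |
| BDII | - | - | - | - | - | - | 1 | 1 |
| MyProxy | 3 | 10 | 3 | 3 | 4 | 9 | - | - |
| CVMFS Stratum0 | 4 | 9 | 4 | 9 | 4 | 6 | 4 | 9 |
| CVMFS Stratum1 | 3 | 5 | 3 | 5 | 4 | 6 | 4 | 10 |
| Monit | 1 | 3 | 5 | 8 | 3 | 5 | 1 | 1 |
| SAM | 3 | 3 | 3 | 3 | 5 | 3 | 4 | 2 |
| AI cloud services | 3 | 10 | 9 | 10 | 8 | 8 | 10 | 10 |
| LXPLUS | 3 | 5 | 5 | 5 | 8 | 6 | 10 | 6 |
| AFS | - | - | 6 | 8 | 6 | 9 | 4 | 4 |
| CAF | 6 | 10 | 8 | 9 | 8 | 8 | 1 | 1 |
| GitLab | 5 | 9 | 4 | 8 | 6 | 6 | 6 | 6 |
| JIRA/TRAC | 5 | 9 | 4 | 8 | 3 | 5 | 3 | 6 |
| Global xrootd redirector | - | - |  |  | 6 | 8 | 6 | 6 |
| Twiki | 3 | 3 | 7 | 9 | 6 | 6 | 6 | 6 |
| Mail and web services | 6 | 10 | 8 | 10 | 5 | 10 | 8 | 10 |
| Hypernews | - | - | - | - | 4 | 5 | - | - |
| Indico | 3 | 3 | 3 | 8 | 3 | 5 | 8 | 9 |
| Vidyo |  |  | 6 | 5 | 6 | 5 | 8 | 9 |
| SSO | 7 | 10 | 8 | 10 | 8 | 10 | 8 | 10 |
| Terminal servers | 3 | 2 | - | - | - | - | - | - |
| NICE AD servers | 3 | 2 | - | - | - | - | - | - |
| CRIC |  |  |  |  | 4 | 3 |  |  |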
 
[Figure: ALICE criticalities (ALICE_crit.png)]
[Figure: ATLAS criticalities (ATLAS_crit.png)]
[Figure: CMS criticalities (CMS_crit.png)]
[Figure: LHCb criticalities (LHCb_crit.png)]

Notes:
  • The Stratum0 entry includes the release nodes
  • The CAF, for CMS, consists of LSF queues
  • AI cloud services include OpenStack and Kubernetes
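
For anyone who wants to query the Tier-0/CERN criticality values programmatically, the following Python sketch shows one possible representation. Only three rows (SSO, EOS, BDII) are copied from the table, as an illustration; the dictionary layout, the function name `services_at_least` and the thresholds in the example call are hypothetical, and `None` is used here for a "-" entry.

```python
# (urgency, impact) per experiment; None stands for a "-" entry in the table.
# Only three rows are copied from the table, for illustration.
CRITICALITY = {
    "SSO":  {"ALICE": (7, 10), "ATLAS": (8, 10), "CMS": (8, 10), "LHCb": (8, 10)},
    "EOS":  {"ALICE": (5, 10), "ATLAS": (7, 10), "CMS": (6, 8),  "LHCb": (6, 8)},
    "BDII": {"ALICE": None,    "ATLAS": None,    "CMS": None,    "LHCb": (1, 1)},
}


def services_at_least(experiment, min_urgency, min_impact):
    """List services whose urgency and impact both reach the given thresholds."""
    selected = []
    for name, row in CRITICALITY.items():
        values = row[experiment]
        if values is None:
            continue  # the experiment gave no values for this service
        urgency, impact = values
        if urgency >= min_urgency and impact >= min_impact:
            selected.append(name)
    return sorted(selected)


print(services_at_least("ATLAS", 7, 10))  # prints ['EOS', 'SSO']
```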

Criticality of services which are not hosted by T0

| Service | ALICE Urgency | ALICE Impact | ATLAS Urgency | ATLAS Impact | CMS Urgency | CMS Impact | LHCb Urgency | LHCb Impact |
| GOCDB |  |  |  |  |  |  |  |  |
| GGUS |  |  |  |  |  |  |  |  |

Review of the critical services in October 2019

ALICE

  • Services which are obsolete and should be deleted from the table
  • Critical services missing in the table (with urgency and impact estimation)
  • Any changes required for urgency and impact for services which are in the table
  • Any suggestions regarding overall approach and definitions

New values for ALICE are proposed here

ATLAS

  • Services which are obsolete and should be deleted from the table
    • SVN, BDII
  • Critical services missing in the table (with urgency and impact estimation)
    • CRIC, Mattermost, estimates to come later
  • Any changes required for urgency and impact for services which are in the table
    • Changes: EOS
    • Added: GIT, Vidyo, SSO
    • Propose name changes (numbers same): CASTOR Tape/Disk -> CTA(Disk+Tape), Dashboard -> Monit
  • Any suggestions regarding overall approach and definitions
    • Perhaps the granularity is a bit too fine: for us there is little difference between a service requiring a two-, four- or six-hour reaction time.
    • We do believe it's healthy to periodically review the definitions and their values, based on the experience from the past 10 years

CMS

  • Services which are obsolete and should be deleted from the table
    • BDII, SVN
  • Critical services missing in the table (with urgency and impact estimation)
    • CRIC
  • Any changes required for urgency and impact for services which are in the table
    • Vidyo and SSO
  • Any suggestions regarding overall approach and definitions

LHCb

  • Services which are obsolete and should be deleted from the table
    • MyProxy
    • svn
  • Critical services missing in the table (with urgency and impact estimation)
    • SSO: set (urgency, impact) = (8,10)
  • Any changes required for urgency and impact for services which are in the table
    • global xrootd redirector: set (urgency,impact)=(6,6)
    • CVMFS stratum0: increase impact to 9
    • CVMFS stratum1: increase impact to 10
    • git: set (urgency,impact)=(6,6)
    • change "CASTOR" --> "CTA"
  • Any suggestions regarding overall approach and definitions

Previous versions

  • Criticalities during Run1 (link)
  • Criticalities during Run2 (link)
Topic revision: r26 - 2019-10-04 - StephanLammel
 