WLCG Critical Services
These numbers have been discussed and agreed between the experiments, WLCG operations and the IT/Tier-0 service managers.
Definitions
- Impact: the amount of "damage" that a service unavailability causes to operations or people if no action is taken
- Urgency: the delay between the start of the service unavailability and the time the full impact is reached
- Functional service: a high-level service corresponding to a particular function of the computing system, as defined in the WLCG MoU Annex 3
- Specific service: a service contributing to one or more functional services
CERN functional services
| Operations-related services |
| High bandwidth connectivity from detector area to computer centre |
| Recording and permanent storage in an MSS of raw and reconstructed data |
| Disk storage of reconstructed data |
| Distribution of raw and reconstructed data to Tier-1 sites in time with data acquisition |
| Prompt reconstruction, calibration and alignment |
| Storage and distribution of conditions data |
| Data analysis facility |
| Databases |
| VO management services |
| Tools and support services |
| Tools and services for application development (CVS, SVN, etc.) |
| Desktop services (email, web, Twiki, Indico, Vidyo, etc.) |
Tier-1 functional services
| Operations-related services |
| Raw and reconstructed data import from Tier-0 |
| Simulated and processed data import from other WLCG centres |
| MSS archival storage of raw, reconstructed, processed and simulated data |
| Disk storage for data and temporary files |
| Provision of data access to other WLCG centres |
| Data analysis and reprocessing |
| Other experiment services |
| Network and data transfer services to Tier-0 and Tier-1 sites (high bandwidth) and to Tier-2 sites |
| Databases |
Tier-2 functional services
| Operations-related services |
| Disk storage for data and temporary files |
| Provision of data access to other WLCG centres |
| Data analysis |
| Simulation and data processing |
| Other experiment services |
| Network and data transfer services |
Impact
Impact on operations
| Level | Definition |
| 10 | Most operations services stop |
| 9 | Some operations services stop |
| 8 | One operations service stops |
| 7 | Most operations services disrupted |
| 6 | Some operations services disrupted |
| 5 | One operations service disrupted |
| 4 | Some support services stop |
| 3 | One support service stops |
| 2 | Some support services disrupted |
| 1 | One support service disrupted |
Impact on people
| Level | Definition |
| 10 | Whole VO affected |
| 8 | Users affected > 50% |
| 5 | 10% < users affected < 50% |
| 3 | Users affected < 10% |
| 1 | A single user affected |
The overall impact is the maximum of the impact on operations and the impact on people.
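The rule for combining the two impact scales can be sketched in a few lines of Python. This is only an illustration: the function name `overall_impact` and the range check are assumptions, not part of any WLCG tool.

```python
def overall_impact(impact_on_operations: int, impact_on_people: int) -> int:
    """Return the overall impact: the larger of the two impact levels."""
    for name, level in (("operations", impact_on_operations),
                        ("people", impact_on_people)):
        # Both scales above run from 1 (lowest) to 10 (highest).
        if not 1 <= level <= 10:
            raise ValueError(f"impact on {name} must be between 1 and 10, got {level}")
    return max(impact_on_operations, impact_on_people)


# Example: one operations service stops (level 8) while fewer than
# 10% of the users are affected (level 3).
print(overall_impact(8, 3))  # 8
```

In this example the operations scale dominates, so the overall impact is 8 even though few users are affected.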
Urgency
Criticality of Tier-0/CERN services
| | ALICE | | ATLAS | | CMS | | LHCb | |
| Service | Urgency | Impact | Urgency | Impact | Urgency | Impact | Urgency | Impact |
| Px→Computer centre network | 6 | 10 | 4 | 10 | 3 | 10 | 10 | 10 |
| WLCG network (LHCOPN, GPN) | 6 | 10 | 7 | 9 | 7 | 9 | 7 | 10 |
| CERN Oracle online | 10 | 10 | 9 | 10 | 10 | 10 | - | - |
| CERN Oracle Tier-0 (inc. streaming) | 4 | 6 | 8 | 9 | 6 | 10 | 10 | 10 |
| DB-on-demand | | | | | | | 10 | 10 |
| Frontier and squid | - | - | 6 | 8 | 6 | 10 | - | - |
| CASTOR tape (CTA soon) | 4 | 6 | 7 | 8 | 2 | 8 | 2 | 8 |
| CASTOR disk (to disappear soon) | 4 | 6 | - | - | - | - | - | - |
| EOS | 5 | 10 | 7 | 10 | 6 | 8 | 6 | 8 |
| Ceph | - | - | 5 | 8 | 5 | 8 | - | - |
| Batch service | 3 | 10 | 6 | 9 | 5 | 9 | 5 | 6 |
| CE | 3 | 10 | 6 | 8 | 3 | 3 | 5 | 6 |
| FTS | - | - | 7 | 8 | 4 | 6 | 5 | 9 |
| VOMS | 3 | 10 | 4 | 10 | 4 | 10 | 7 | 10 |
| BDII | - | - | - | - | - | - | 1 | 1 |
| MyProxy | 3 | 10 | 3 | 3 | 4 | 9 | - | - |
| CVMFS Stratum-0 | 4 | 9 | 4 | 9 | 4 | 6 | 4 | 9 |
| CVMFS Stratum-1 | 3 | 5 | 3 | 5 | 4 | 6 | 4 | 10 |
| Monit | 1 | 3 | 5 | 8 | 3 | 5 | 1 | 1 |
| SAM | 3 | 3 | 3 | 3 | 5 | 3 | 4 | 2 |
| AI cloud services | 3 | 10 | 9 | 10 | 8 | 8 | 10 | 10 |
| Kubernetes | - | - | 6 | 7 | 6 | 7 | - | - |
| LXPLUS | 3 | 5 | 5 | 5 | 8 | 6 | 10 | 6 |
| AFS | - | - | 6 | 8 | 6 | 9 | 4 | 4 |
| CAF | 6 | 10 | 8 | 9 | 8 | 8 | 1 | 1 |
| GitLab | 5 | 9 | 4 | 8 | 6 | 6 | 6 | 6 |
| JIRA/TRAC | 5 | 9 | 4 | 8 | 3 | 5 | 3 | 6 |
| Global xrootd redirector | - | - | | | 6 | 8 | 6 | 6 |
| Twiki | 3 | 3 | 7 | 9 | 6 | 6 | 6 | 6 |
| Mail and web services | 6 | 10 | 8 | 10 | 5 | 10 | 8 | 10 |
| Hypernews | - | - | - | - | 4 | 5 | - | - |
| Indico | 3 | 3 | 3 | 8 | 3 | 5 | 8 | 9 |
| Vidyo | | | 6 | 5 | 6 | 5 | 8 | 9 |
| SSO | 7 | 10 | 8 | 10 | 8 | 10 | 8 | 10 |
| Terminal servers | 3 | 2 | - | - | - | - | - | - |
| NICE AD servers | 3 | 2 | - | - | - | - | - | - |
| CRIC | | | | | 4 | 3 | | |
Notes:
- The CVMFS Stratum-0 entry includes the release nodes
- For CMS, the CAF consists of LSF queues
- AI cloud services include OpenStack and Kubernetes
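For readers who want to query the criticality values programmatically, the following is a minimal sketch of one possible in-memory representation. The names `Rating`, `RATINGS`, `rated_by` and `services_with_impact_at_least` are hypothetical. The three rows are copied from the Tier-0/CERN table above, and treating a "-" or blank cell as "no rating given" (`None`) is an assumption, since the page does not define those cells.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Rating(NamedTuple):
    urgency: int
    impact: int


EXPERIMENTS = ("ALICE", "ATLAS", "CMS", "LHCb")

# A few rows copied from the Tier-0/CERN table, in EXPERIMENTS order.
# None stands for a "-" or blank cell (assumed to mean "no rating given").
RATINGS: Dict[str, Tuple[Optional[Rating], ...]] = {
    "EOS": (Rating(5, 10), Rating(7, 10), Rating(6, 8), Rating(6, 8)),
    "BDII": (None, None, None, Rating(1, 1)),
    "SSO": (Rating(7, 10), Rating(8, 10), Rating(8, 10), Rating(8, 10)),
}


def rated_by(service: str) -> List[str]:
    """List the experiments that gave a rating for the service."""
    return [exp for exp, rating in zip(EXPERIMENTS, RATINGS[service])
            if rating is not None]


def services_with_impact_at_least(threshold: int, experiment: str) -> List[str]:
    """List services whose impact for one experiment is >= threshold."""
    idx = EXPERIMENTS.index(experiment)
    return sorted(
        name for name, row in RATINGS.items()
        if row[idx] is not None and row[idx].impact >= threshold
    )


print(rated_by("BDII"))                            # ['LHCb']
print(services_with_impact_at_least(10, "ATLAS"))  # ['EOS', 'SSO']
```

Keeping urgency and impact as separate fields mirrors the table, which does not define a single combined criticality score.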
Criticality of services which are not hosted by T0
| | ALICE | | ATLAS | | CMS | | LHCb | |
| Service | Urgency | Impact | Urgency | Impact | Urgency | Impact | Urgency | Impact |
| GOCDB | | | | | | | | |
| GGUS | | | | | | | | |
| FTS | | | | | | | | |
| Stratum-1 | | | | | | | | |
Review of the critical services in October 2019
ALICE
- Services which are obsolete and should be deleted from the table
- Critical services missing in the table (with urgency and impact estimation)
- Any changes required for urgency and impact for services which are in the table
- Any suggestions regarding overall approach and definitions
New values for ALICE are proposed here.
ATLAS
- Services which are obsolete and should be deleted from the table
- Critical services missing in the table (with urgency and impact estimation)
  - CRIC, Mattermost, Kubernetes; estimates to come later
- Any changes required for urgency and impact for services which are in the table
  - Changes: EOS
  - Added: GIT, Vidyo, SSO
  - Proposed name changes (numbers unchanged): CASTOR Tape/Disk -> CTA (Disk+Tape), Dashboard -> Monit
- Any suggestions regarding overall approach and definitions
  - Perhaps the granularity is a bit too fine: for us there is little difference between a service requiring a two-, four- or six-hour reaction time.
  - We do believe it's healthy to periodically review the definitions and their values, based on the experience from the past 10 years.
CMS
- Services which are obsolete and should be deleted from the table
- Critical services missing in the table (with urgency and impact estimation)
- Any changes required for urgency and impact for services which are in the table
- Any suggestions regarding overall approach and definitions
LHCb
- Services which are obsolete and should be deleted from the table
- Critical services missing in the table (with urgency and impact estimation)
  - SSO: set (urgency, impact) = (8, 10)
- Any changes required for urgency and impact for services which are in the table
  - global xrootd redirector: set (urgency, impact) = (6, 6)
  - CVMFS Stratum-0: increase impact to 9
  - CVMFS Stratum-1: increase impact to 10
  - git: set (urgency, impact) = (6, 6)
  - change "CASTOR" --> "CTA"
- Any suggestions regarding overall approach and definitions
Previous versions
- Criticalities during Run 1 (link)
- Criticalities during Run 2 (link)