EGEE Site SLA metrics
DRAFT
This document proposes metrics that could be used in the EGEE RC SLA. Discussion and comments are welcome.
SLA Aims
To provide a formal description of the resources/services provided by Resource Centers (RCs) for:
- EGEE (SLA included in contracts between RCs and EGEE)
- Virtual Organizations
The SLA makes it possible to evaluate site operation in EGEE and to enforce the declared service level.
Metrics
The SLA will cover the following areas:
- Resources and performance (CPU, Storage)
- Connectivity
- Availability
- Software/Middleware (?)
- VO Support
- Support and expertise
- Data privacy and other
Below is a first draft of the metrics that could be used in each area:
Resources
- CPU architecture -- (x86, IA64, ...)
- CPU count -- total number of CPUs available for EGEE VOs
If multi-core CPUs are used, each core can have one job slot. The total number of job slots cannot be greater than the number of CPU cores.
This could be useful for tracking performance issues, such as memory bottlenecks.
- CPU performance (benchmark - which one? SPECint?)
How should multi-core CPUs be treated? If a site publishes an N-core CPU as N job slots, then the CPU performance should be the average of the values returned by N benchmarks run simultaneously on a single N-core CPU. CPU performance should be measured, not (easily) configurable by the site admin.
- Amount of memory installed on a single node (shared by all jobs running on the node)
- Amount of memory available to a single (non-MPI) job
- Cluster interconnection (important for MPI jobs)
- Type (ethernet, infiniband, etc.)
- Latency [ms]
- Bandwidth [Mbit/s]
- Storage
- Type (disk, tape, ...)
- Size [TB]
- Avg. access time [ms]
- Storage bandwidth [Mbit/s]
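The multi-core rule above (N simultaneous benchmark runs averaged into one per-slot value, and no more job slots than cores) can be sketched in Python. This is only an illustration: the function names and the sample scores are hypothetical, no particular benchmark is implied, and the scores are assumed to have been collected already.

```python
from statistics import mean


def per_slot_performance(simultaneous_scores):
    """Average the scores of N benchmark copies run at the same time
    on one N-core CPU; this would be the value published per job slot."""
    if not simultaneous_scores:
        raise ValueError("at least one benchmark score is required")
    return mean(simultaneous_scores)


def job_slots_valid(job_slots, cpu_cores):
    """Job slots must not exceed the number of CPU cores."""
    return 0 <= job_slots <= cpu_cores


# Hypothetical scores from 4 simultaneous runs on a 4-core CPU.
scores = [1000, 980, 1020, 1000]
print(per_slot_performance(scores))  # average of the four runs
print(job_slots_valid(4, 4))         # True
print(job_slots_valid(5, 4))         # False
```

Averaging simultaneous runs, rather than taking a single-core score, lets contention for shared resources such as memory show up in the published value.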
Resources can be checked using GSTAT. However, GSTAT uses the BDII, which can easily be altered by site admins...
GridICE with WN monitoring?
How to treat heterogeneity? (different CPUs, amount of RAM, interconnection...)
- Define each resource type
- Define minimum guaranteed resources
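One possible reading of the two options above is to describe each resource type separately and then derive the minimum guaranteed values from the weakest type. The sketch below assumes that reading; the `ResourceType` fields, the node group names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceType:
    """One homogeneous group of worker nodes (hypothetical fields)."""
    name: str
    cpu_cores: int            # total cores of this type
    memory_per_job_mb: int    # memory available to a single job
    slot_performance: float   # measured per-slot benchmark score


def minimum_guaranteed(types):
    """Summarise a heterogeneous site: total cores, plus the worst-case
    per-job memory and per-slot performance across all types."""
    if not types:
        raise ValueError("site must define at least one resource type")
    return {
        "cpu_cores": sum(t.cpu_cores for t in types),
        "memory_per_job_mb": min(t.memory_per_job_mb for t in types),
        "slot_performance": min(t.slot_performance for t in types),
    }


# Hypothetical site with two generations of worker nodes.
site = [
    ResourceType("old-nodes", cpu_cores=64, memory_per_job_mb=1024,
                 slot_performance=800.0),
    ResourceType("new-nodes", cpu_cores=128, memory_per_job_mb=2048,
                 slot_performance=1500.0),
]
print(minimum_guaranteed(site))
```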
Network
- Connectivity
- Site should provide enough connectivity (open ports on firewall) to allow correct execution of SAM test jobs
- outbound from WN (obligatory?)
- inbound to WN (optional)
- Network bandwidth
- Site uplink bandwidth
- Bandwidth between site and GEANT2?
- Connection quality
- Packet loss (Can we measure this?)
- Latency
- Packet reordering and MSS (is it relevant?)
The minimum acceptable inbound/outbound bandwidth should be relative to the CPU count.
How is this related to the SA2 Network SLA?
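A bandwidth requirement relative to CPU count could, for example, be a simple linear rule. The sketch below assumes linear scaling, which the draft does not specify; the per-CPU factor and the sample figures are hypothetical.

```python
def minimum_bandwidth_mbit(cpu_count, mbit_per_cpu):
    """Minimum acceptable bandwidth in Mbit/s, assuming it scales
    linearly with the number of CPUs (an assumption of this sketch)."""
    if cpu_count < 0 or mbit_per_cpu < 0:
        raise ValueError("inputs must be non-negative")
    return cpu_count * mbit_per_cpu


def bandwidth_sufficient(uplink_mbit, cpu_count, mbit_per_cpu):
    """True if the site uplink meets the minimum for its CPU count."""
    return uplink_mbit >= minimum_bandwidth_mbit(cpu_count, mbit_per_cpu)


# Hypothetical: 200 CPUs at 2 Mbit/s per CPU require 400 Mbit/s.
print(minimum_bandwidth_mbit(200, 2))       # 400
print(bandwidth_sufficient(1000, 200, 2))   # True
print(bandwidth_sufficient(100, 200, 2))    # False
```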
Availability
- Site availability (SAM: % of time when site was available -- all critical tests were OK) (excluding Downtimes?)
- Site declared downtime (% of time when site was in Downtime)
Is SAM accurate enough? Taken as a long-term average (month, year), it should be. However, the relevance of errors should be taken from site reports.
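The open question of whether declared downtime should be excluded changes the resulting number, as the following sketch shows. It is not how SAM computes availability: the function names and the hour counts are hypothetical, and the available hours are assumed not to overlap with declared downtime.

```python
def availability_percent(available_hours, total_hours,
                         downtime_hours=0.0, exclude_downtime=False):
    """Percentage of time the site passed all critical tests.

    With exclude_downtime=True, declared downtime is removed from the
    reference period, so it does not count against the site.
    """
    if exclude_downtime:
        reference = total_hours - downtime_hours
    else:
        reference = total_hours
    if reference <= 0:
        raise ValueError("reference period must be positive")
    return 100.0 * available_hours / reference


def declared_downtime_percent(downtime_hours, total_hours):
    """Percentage of the period spent in declared downtime."""
    if total_hours <= 0:
        raise ValueError("total period must be positive")
    return 100.0 * downtime_hours / total_hours


# Hypothetical 30-day month: 720 h in total, 36 h of declared
# downtime, and 648 h with all critical tests OK.
print(availability_percent(648, 720))                             # 90.0
print(availability_percent(648, 720, 36, exclude_downtime=True))  # higher
print(declared_downtime_percent(36, 720))                         # 5.0
```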
Software/Middleware
- Middleware flavor (gLite, LCG, ???)
- Time for update
Time to install the latest middleware patches and updates
- Time for new services deployment
This varies depending on the service.
- Core services provided by site
Should this SLA cover core services?
VO Support
ops and dteam (?)
- Time to configure a new VO
How long does it take to configure a new VO? (Days, not months)
- List of supported VOs
- Support for "catch-all" VOs
- Minimum number of supported non-obligatory VOs
Support and expertise - problem handling
- Ticket response time (taken from GUS)
- Time to solve a ticket
- Number of tickets and their severity (for monitoring only)
A site does not have any control over the tickets it receives, therefore this cannot be taken into consideration during site operation evaluation.
- Effectiveness in ticket solving (% of tickets solved)
- Site administrators/security officers
- Site Administrators and Security Officers FTEs
- Working hours
- Incident response procedures
- Reaction time
- Conformance to EGEE/ROC procedures
- Site should maintain up-to-date site information in GOC DB
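The ticket metrics listed above (response time, time to solve, % of tickets solved) could be derived from ticket timestamps roughly as follows. The `Ticket` record and its fields are hypothetical and do not reflect the actual GUS data format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    """Minimal ticket record (hypothetical fields)."""
    opened: datetime
    first_response: Optional[datetime] = None
    solved: Optional[datetime] = None


def _mean(deltas):
    """Mean of a list of timedeltas, or None if the list is empty."""
    return sum(deltas, timedelta()) / len(deltas) if deltas else None


def mean_response_time(tickets):
    """Mean time from opening to first response (answered tickets only)."""
    return _mean([t.first_response - t.opened
                  for t in tickets if t.first_response])


def mean_time_to_solve(tickets):
    """Mean time from opening to solution (solved tickets only)."""
    return _mean([t.solved - t.opened for t in tickets if t.solved])


def solved_percent(tickets):
    """Percentage of tickets that have been solved."""
    if not tickets:
        return None
    return 100.0 * sum(1 for t in tickets if t.solved) / len(tickets)


# Two hypothetical tickets: one solved after a day, one still open.
tickets = [
    Ticket(datetime(2007, 5, 1, 9), datetime(2007, 5, 1, 11),
           datetime(2007, 5, 2, 9)),
    Ticket(datetime(2007, 5, 3, 9), datetime(2007, 5, 3, 13)),
]
print(mean_response_time(tickets))  # 3:00:00
print(mean_time_to_solve(tickets))  # 1 day, 0:00:00
print(solved_percent(tickets))      # 50.0
```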
Others
Sites should configure their resources (storage elements, pool accounts, ...) to prevent any unauthorized data access.
-- Main.lskital - 23 May 2007