EGEE Site SLA metrics

DRAFT

This document proposes metrics that could be used in the EGEE RC SLA. Discussion and comments are welcome.

SLA Aims

To provide a formal description of the resources and services that Resource Centers (RCs) provide to:

  1. EGEE (SLA included in contracts between RCs and EGEE)
  2. Virtual Organizations

The SLA makes it possible to evaluate site operation in EGEE and to enforce the declared service level.

Metrics

The SLA will cover the following areas:

  • Resources and performance (CPU, Storage)
  • Connectivity
  • Availability
  • Software/Middleware (?)
  • VO Support
  • Support and expertise
  • Data privacy and other

Below is a first draft of the metrics that could be used in each area:

Resources

  • CPU architecture -- (x86, IA64, ...)

  • CPU count -- total number of CPUs available for EGEE VOs

If multi-core CPUs are used, each core can have one job slot. The total number of job slots cannot be greater than the number of CPU cores.
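The job slot limit above can be expressed as a simple check. This is a minimal sketch that assumes a homogeneous site (every CPU has the same number of cores); the function names are hypothetical and not part of any EGEE tool.

```python
def max_job_slots(cpu_count: int, cores_per_cpu: int) -> int:
    """Upper bound on job slots: at most one slot per CPU core."""
    if cpu_count < 0 or cores_per_cpu < 1:
        raise ValueError("cpu_count must be >= 0 and cores_per_cpu >= 1")
    return cpu_count * cores_per_cpu


def slots_within_limit(published_slots: int, cpu_count: int, cores_per_cpu: int) -> bool:
    """True if the published slot count does not exceed the number of cores."""
    return 0 <= published_slots <= max_job_slots(cpu_count, cores_per_cpu)
```

For example, a site with 100 quad-core CPUs could publish at most 400 job slots under this rule.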

  • Cores per CPU

This could be useful for tracking performance issues such as memory bottlenecks.

  • CPU performance (benchmark: which one? SPECint?)

How should multi-core CPUs be treated? If a site publishes an N-core CPU as N job slots, then the CPU performance should be the average of the values returned by N benchmarks run simultaneously on a single N-core CPU. CPU performance should be measured, not (easily) configurable by the site admin.
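The averaging rule for multi-core CPUs could look like the sketch below. It only covers the arithmetic: the scores are assumed to come from N copies of the chosen benchmark run at the same time on one N-core CPU, and the function name and score values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def per_slot_cpu_performance(simultaneous_scores: Sequence[float]) -> float:
    """Average score of N benchmark copies run concurrently on one N-core CPU.

    The result is the value published per job slot, so contention between
    cores (memory bandwidth, shared cache) is already reflected in it.
    """
    if not simultaneous_scores:
        raise ValueError("at least one benchmark score is required")
    return mean(simultaneous_scores)
```

Publishing this average rather than a single-core result avoids overstating performance when all job slots on a CPU are busy.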

  • RAM per node

Amount of memory installed on a single node (shared by all jobs running on the node).

  • RAM per job slot

Amount of memory available to a single (non-MPI) job.

  • Cluster interconnect (important for MPI jobs)
    • Type (Ethernet, InfiniBand, etc.)
    • Latency [ms]
    • Bandwidth [Mbit/s]

  • Storage
    • Type (disk, tape, ...)
    • Size [TB]
    • Avg. access time [ms]
    • Storage bandwidth [Mbit/s]

Resources can be checked using GSTAT. However, GSTAT relies on BDII, whose published values can easily be altered by site admins. Could GridICE with WN monitoring be used instead?

How should heterogeneity be treated (different CPUs, amounts of RAM, interconnects, ...)?

  • Define each resource type
  • Define minimum guaranteed resources
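One way to combine the two options above is to describe each resource type separately and derive the minimum guaranteed figures from them. The sketch below is only an illustration: the `NodeClass` structure, its field names, and the one-slot-per-core assumption are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class NodeClass:
    """One homogeneous group of worker nodes at a site."""
    name: str
    nodes: int
    cores_per_node: int
    ram_per_node_gb: float

    @property
    def job_slots(self) -> int:
        # Assumption: one job slot per core.
        return self.nodes * self.cores_per_node

    @property
    def ram_per_slot_gb(self) -> float:
        return self.ram_per_node_gb / self.cores_per_node


def minimum_guaranteed(classes: Sequence[NodeClass]) -> Dict[str, float]:
    """Site-wide guarantees: total slots, and the worst-case RAM per slot."""
    if not classes:
        raise ValueError("at least one node class is required")
    return {
        "job_slots": sum(c.job_slots for c in classes),
        "ram_per_slot_gb": min(c.ram_per_slot_gb for c in classes),
    }
```

With this approach the SLA would state the worst-case value for per-slot metrics, while the per-class descriptions remain available for VOs that need more detail.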

Network

  • Connectivity
    • The site should provide enough connectivity (open firewall ports) to allow correct execution of SAM test jobs
    • Outbound from WN (obligatory?)
    • Inbound to WN (optional)

  • Network bandwidth
    • Site uplink bandwidth
    • Bandwidth between site and GEANT2?
    • Connection quality
    • Packet loss (can we measure this?)
    • Latencies
    • Reordering and MSS (are they relevant?)

The minimum acceptable inbound/outbound bandwidth should be relative to the CPU count.

How is this related to the SA2 Network SLA?
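A bandwidth requirement that scales with CPU count could be checked as sketched below. The per-CPU ratio is a placeholder: this draft does not propose a value, so `mbit_per_cpu` is a hypothetical parameter that the SLA would have to fix.

```python
def min_bandwidth_mbit(cpu_count: int, mbit_per_cpu: float) -> float:
    """Minimum acceptable bandwidth in Mbit/s for a site with cpu_count CPUs."""
    if cpu_count < 0 or mbit_per_cpu < 0:
        raise ValueError("cpu_count and mbit_per_cpu must be non-negative")
    return cpu_count * mbit_per_cpu


def uplink_sufficient(uplink_mbit: float, cpu_count: int, mbit_per_cpu: float) -> bool:
    """True if the site uplink meets the minimum derived from its CPU count."""
    return uplink_mbit >= min_bandwidth_mbit(cpu_count, mbit_per_cpu)
```

The same check can be applied separately to inbound and outbound bandwidth if the two are given different ratios.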

Availability

  • Site availability (SAM: % of time when the site was available -- all critical tests were OK) (excluding downtimes?)
  • Site declared downtime (% of time when the site was in downtime)

Is SAM accurate enough? Taken as a long-term average (month, year), it should be. However, the relevance of individual errors should be taken from site reports.
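The two availability metrics, and the open question of whether declared downtimes should be excluded, can be illustrated with a small calculation. This is a sketch under assumptions: the input is a hypothetical list of equal-length sampling intervals, each recording whether all critical tests passed and whether the site was in declared downtime. It does not reflect how SAM actually stores or exports results.

```python
from typing import Optional, Sequence, Tuple

# Each sample: (all_critical_tests_ok, in_declared_downtime)
Sample = Tuple[bool, bool]


def availability_percent(samples: Sequence[Sample],
                         exclude_downtime: bool = False) -> Optional[float]:
    """% of intervals in which all critical tests were OK.

    With exclude_downtime=True, intervals in declared downtime are left
    out of both the numerator and the denominator.
    """
    considered = [ok for ok, down in samples if not (exclude_downtime and down)]
    if not considered:
        return None  # nothing to measure in this period
    return 100.0 * sum(considered) / len(considered)


def declared_downtime_percent(samples: Sequence[Sample]) -> Optional[float]:
    """% of intervals during which the site was in declared downtime."""
    if not samples:
        return None
    return 100.0 * sum(down for _, down in samples) / len(samples)
```

Computing both variants of availability over the same period shows how much of the shortfall is explained by declared downtime and how much by unscheduled failures.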

Software/Middleware

  • Middleware flavor (gLite, LCG, ???)
  • Time to update

Time to install the latest middleware patches and updates.

  • Time to deploy new services

This varies depending on the service.

  • Core services provided by the site

Should this SLA cover core services?

VO Support

  • Support of mandatory VOs

ops and dteam (?)

  • Time to configure a new VO

How long does it take to configure a new VO? (Days, not months)

  • Supported VOs

List of supported VOs.

  • Support for "catch-all" VOs

  • Minimum number of supported non-mandatory VOs

Support and expertise - problem handling

  • Ticket response time (taken from GUS)
  • Time to solve a ticket
  • Number of tickets and their severity (for monitoring only)

A site has no control over the tickets it receives; therefore this metric cannot be taken into consideration when evaluating site operation.

  • Effectiveness in ticket solving (% of tickets solved)

  • Site administrators/security officers
    • Site Administrators and Security Officers FTEs
    • Working hours

  • Incident response procedures
    • Reaction time
    • Conformance to EGEE/ROC procedures

  • The site should maintain up-to-date site information in the GOC DB
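The ticket metrics listed in this section (response time and effectiveness in ticket solving) could be computed as sketched below. The `Ticket` structure and its fields are hypothetical and do not correspond to the export format of any ticketing system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass
class Ticket:
    opened: datetime
    first_response: Optional[datetime] = None  # None: not yet answered
    solved: Optional[datetime] = None          # None: still open


def effectiveness_percent(tickets: Sequence[Ticket]) -> Optional[float]:
    """% of tickets that have been solved."""
    if not tickets:
        return None
    return 100.0 * sum(t.solved is not None for t in tickets) / len(tickets)


def mean_response_time(tickets: Sequence[Ticket]) -> Optional[timedelta]:
    """Average time from opening to first response, over answered tickets."""
    deltas = [t.first_response - t.opened
              for t in tickets if t.first_response is not None]
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)
```

Tickets that have not yet been answered are left out of the mean response time here; an SLA might instead count them against the site once a deadline has passed.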

Others

  • Data privacy
    • storage
    • pool accounts

Sites should configure their resources (storage elements, pool accounts, ...) to prevent any unauthorized data access.

-- Main.lskital - 23 May 2007

Topic revision: r2 - 2007-06-18 - unknown