(musing on the cost of a new IT service - nothing here is "official")

(actually, this might make a good template for a number of other services as well; try to keep things generic)

EOS cost breakdown

Suggest calculating a "daily" cost (with per-month and per-year sums):

  • Per-node costs (these could be calculated by some per-instance headnode that already has CDB access)
    • machine in warranty: hardware purchase price divided by warranty period (i.e. old machines are free; installation cost is booked as hardware cost)
    • electricity: consumption of box (from CDB) × some avg kWh cost. Have "idle" and "load" numbers in CDB (/hardware/power/idle). Can also book the cost of cooling based on this?
    • network operation costs (fixed amount per box?)
    • sysadmin team and operator (shared between all "active" CDB nodes in "D" and "E"? or count actual number of ITCM tickets/box * cost per intervention)
    • tag costs based on SMS state as 'production' and 'non-production', split that into "ITCM" (i.e. ongoing open normal ITCM ticket exists, such as a vendor call) and "other" (e.g. us forgetting to put machine back into production; or spare; or pre-production instances)
  • Per-service costs
    • personnel costs (as seen by CERN, incl pension/CHIS/benefits; privacy: do not use actual money but CERN-wide averages - split by career path might be useful compromise; use monthly booking from APT and MARS-style fixed-percentage ratios)
      • operations team
      • development team
      • overhead: pro-rata SL (unless already part of either team), GL, IT-DHO (might be really marginal - but these should appear somewhere)
    • materials: IT building+infrastructure, desktops etc. (assume for now that this is included in personnel costs - or have per-person cost)

Per-node costs are based on CDB "cluster". Daily counting should be good enough to give accurate numbers (can track box migrations, can track "stuck" machines - so can improve procedures). Purchase price should not be confidential (public records [FIXME CHECK]?). Might leave out some marginal costs.
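
A minimal sketch of the daily per-node calculation described above (hardware spread over the warranty period, electricity from idle/load power). The `Node` fields, the linear idle-to-load interpolation and all numbers are assumptions for illustration, not the actual CDB schema or real prices:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Node:
    # Hypothetical record; field names do not mirror the real CDB layout.
    name: str
    purchase_chf: float   # hardware purchase price, installation included
    warranty_start: date
    warranty_end: date
    idle_watts: float
    load_watts: float


def daily_hardware_cost(node: Node, day: date) -> float:
    """Purchase price spread evenly over the warranty; free afterwards."""
    if not (node.warranty_start <= day < node.warranty_end):
        return 0.0
    warranty_days = (node.warranty_end - node.warranty_start).days
    return node.purchase_chf / warranty_days


def daily_electricity_cost(node: Node, load_fraction: float,
                           chf_per_kwh: float) -> float:
    """Assumes power scales linearly between the idle and load numbers."""
    watts = node.idle_watts + load_fraction * (node.load_watts - node.idle_watts)
    return watts * 24 / 1000 * chf_per_kwh
```

Summing these per CDB cluster per day would give the monthly and yearly figures; cooling could be booked as a factor on top of the electricity cost.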

Per-service costs operate on longer timelines, need to capture people joining/leaving the service but not day-to-day split (impossible to get numbers anyway..). Privacy is an issue, but in principle assigned work and status (+career grade) are not secret ([FIXME CHECK]), and average salaries should not be secret either [FIXME CHECK].
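
A rough sketch of the privacy-friendly personnel booking: each person counts with the average cost of their career path times a fixed percentage assigned to the service. The career path names and the amounts are made-up placeholders, not CERN figures:

```python
# Placeholder averages per career path (CHF per month); NOT real numbers.
AVG_MONTHLY_COST_CHF = {"staff": 12000.0, "fellow": 8000.0}


def monthly_personnel_cost(assignments, avg_cost=AVG_MONTHLY_COST_CHF):
    """assignments: list of (career_path, fraction_of_time_on_service)."""
    return sum(avg_cost[path] * fraction for path, fraction in assignments)


# One staff member at 50% plus one full-time fellow.
eos_team = [("staff", 0.5), ("fellow", 1.0)]
print(monthly_personnel_cost(eos_team))
```

Only the career path and the percentage are needed per person, so no individual salary enters the calculation.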

"bracketing" the cost via overall budget

While the nitty-gritty breakdown inside the service may be hard to do, we have fairly reliable numbers on the overall IT department cost (APT) - ideally we'd "just" distribute these. However, the current split into budget codes ignores "free for all" internal services, so the drilldown is not ideal.

combined approach: top-down + bottom-up

With a (relatively) fixed budget ceiling for IT in general (from APT) and the need to justify service costs (through rather wide access) both for external as well as IT-internal services, we should get some frantic activity where budget holders find a way to
  • drill down into their service to find out where juicy chunks of their budget go
  • "internal services": scramble to find some other higher-level service that could get part of the cost booked on them (i.e. "sell"). Need some rules, i.e. receiving end needs to agree (internal "contract")
  • "external services": go look for commercial services and figure out whether current cost is > advertised prices (outsource!) or << (congratulations!, or more probably, missing some cost element).
This should quickly improve data quality: internal services distribute their costs, external ones absorb them (and get more expensive).
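
One possible way to model the "sell" step, as a sketch only: every internal provider books agreed shares of its total cost onto its consumers, and providers are settled in dependency order so that passed-on cost is passed on again. The function name, the contract layout and the service names in the test data are hypothetical:

```python
def distribute_internal_costs(own_cost, contracts):
    """own_cost: service -> cost.
    contracts: provider -> {consumer: agreed share in [0, 1]}.
    Returns service -> cost after all internal bookings."""
    total = dict(own_cost)
    pending = dict(contracts)
    while pending:
        # A provider can be settled once no pending provider still bills it.
        ready = [p for p in pending
                 if not any(p in shares
                            for q, shares in pending.items() if q != p)]
        if not ready:
            raise ValueError("circular cost contracts")
        for provider in ready:
            shares = pending.pop(provider)
            amount = total.get(provider, 0.0)
            for consumer, share in shares.items():
                total[consumer] = total.get(consumer, 0.0) + amount * share
            # Whatever was not sold stays with the provider.
            total[provider] = amount * (1.0 - sum(shares.values()))
    return total
```

The sum over all services is unchanged by the redistribution, which is the top-down check against the APT total.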

What to do with the data

  • ammo for budget fights :-) More seriously, allow outsiders to challenge the status quo ("why are so many machines not in production?", "why do you need so many people?"); these questions can be used constructively.
    • also allow to benchmark against outsourcing offers - CERN pays no VAT, gets special prices for electricity, and (per staff association/5-year comparison) the salaries are "low" - so we have every cost advantage imaginable. If industry can beat us (modulo vendor lock-in and loss-leader offers), we must be doing something wrong...
  • discover and quantify low-efficiency services, or procurement choices
    • would need some kind of drill-down system (or rather, keep intermediate numbers apart as much as possible; early aggregation kills drill-down)
    • would benefit from tie-in to other systems (e.g. hardware type reliability, individual hardware history, vendor responsiveness, ..)
  • new KPIs for EOS
    • cost per TB - requires distributing the per-service costs over the production instances/subclusters (e.g. split equally between production instances, no matter what the size)
    • cost per transfer (eeevil since will go up when service is idle - think l/100km..)
    • cost efficiency of service (ratio production cost vs non-production states)
  • planning: guesstimate cost for an additional instance
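
The two simpler KPIs could be computed roughly like this. A sketch: the equal split follows the suggestion above, while defining "efficiency" as the production share of total cost is an assumption, and the instance names in the usage example are invented:

```python
def cost_per_tb(instance_node_cost, instance_tb, service_cost):
    """Per-service cost is split equally between production instances,
    no matter what their size."""
    share = service_cost / len(instance_node_cost)
    return {name: (instance_node_cost[name] + share) / instance_tb[name]
            for name in instance_node_cost}


def cost_efficiency(production_cost, non_production_cost):
    """Assumed definition: fraction of node cost spent in 'production'."""
    total = production_cost + non_production_cost
    return production_cost / total if total else 0.0


print(cost_per_tb({"big": 900.0, "small": 100.0},
                  {"big": 100.0, "small": 10.0},
                  service_cost=200.0))
```

With an equal split the small instance looks twice as expensive per TB as the big one in this example, which is worth keeping in mind when comparing instances.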

Risk

  • Any kind of honest cost breakdown will expose inefficient services (i.e. make the owner look bad) - this is arguably the point.
  • will have to expose HR 'secrets' (salaries; dormant/extended-leave but booked people)
  • will flush out "secret" sources of income (e.g. external project money) - which will promptly get taken away, or be subjected to the ritual "you haven't used it by December, you won't ever need it" logic
  • may expose confidential contracts (but CERN is funded via tax money - would need to review our obligations to publish such data)

Notes

  • CERN-IT is a cost center (own budget), so no need to include CERN-wide infrastructure (roads, other departments) as "overhead"
    • however, the IT budget does not include electricity (which can be a major cost factor for external comparisons), may need to add it. The per-kWh price might again be confidential; consumption is at https://energy.cern.ch/ ("PS - Meyrin")
  • difference "cost" vs price:
    • cost is what you have paid for something. Can haggle about how to share that cost, but not about absolute value.
    • price is what you charge your customers (from 0 to more than cost = normal business)
  • graph cost vs usage: expect normal service to be in some-cost-some-usage quadrant, some-cost-no-usage=kill; some-usage-no-cost=encourage; no-cost-no-usage=zen
  • note the time dimension of cost: some inefficient procedures (purchasing) should generate higher-than-usual cost for all participating services, although the actual machines arrive in the service only later. Need to decide from which moment in time to book costs (budget allocation? DAI? delivery? payment?)
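
The cost-vs-usage quadrants from the list above, written out as a sketch. Where "some" cost or usage starts is an assumption, so the thresholds are parameters:

```python
def classify(cost, usage, cost_threshold=0.0, usage_threshold=0.0):
    """Map a service onto the cost-vs-usage quadrants."""
    has_cost = cost > cost_threshold
    has_usage = usage > usage_threshold
    if has_cost and has_usage:
        return "normal"
    if has_cost:
        return "kill"        # some cost, no usage
    if has_usage:
        return "encourage"   # some usage, no cost
    return "zen"             # no cost, no usage
```

In practice the thresholds would probably be set relative to the median service rather than to zero.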

CERN Links

  • APT follows the current IT structure/organigram,
    • breakdown detail is up to owner (bad example: IT-DSS:Managed Storage Operations = 47786, but no further breakdown into EOS/CASTOR)
    • (idea: link functions from MARS into APT? APT to service catalog?)
    • looks like the APT "workunits" and "budget codes" are up for reorganization - GLM 2012-11-12
  • IT-DHO (Alan/Christian) - contact to not miss IT-specific things
  • Procurement team (Olof) - have the per-machine purchase costs. need to know what we can publish
    • Attempts to get this information from them:
      • RQF0349216 - "RFE: interface to get per-node (per-cluster) hardware cost". opened in 2014, cancelled after 2 years of being unassigned
      • mail thread "various DSS hardware types, rough cost+end-of-warranty?" - Jan 2015 - got an Excel sheet, but marked as confidential.
      • AIRU (see below) as costing tool - mail thread, June 2016, petered out
      • RQF0884446 - "how to get the purchase cost of disk trays attached to a diskserver?". Nov 2017 (cannot.) "information already provided to your management" via "disk server working group", neither Massimo nor Alberto seem to have this?
      • mail to Olof+Eric, asking for cost summaries in quotable form - Mar 2018.
    • RQF0948687 - request for access to cost data in InforEAM - problem is that the data is not tagged in a way that allows splitting by service

  • CERN "internal audit" team can provide assistance - but at this low level? Contact later to make sure that what we provide is "compatible" with what they need
  • Service-DB: probably good source for technical costs
    • groups several clusters into a service, can use this to tie together machines, and also to book machine overhead (service-specific headnodes; "spare" but already-assigned capacity)
    • has "service managers"
  • ITIL service catalog: unclear structure, no direct link to Service-DB/machines and contains "internal" services - useless (but supposedly revamped in 2012). Ideally would be the common base for everything, i.e. tie into APT... Might at least use SNOW ticket count/ratio to distribute the service desk cost
    • RQF0172054 - '"cost management" with SNOW - used at CERN, more info?' - Nov 2012. "not used".
  • Bernd's technology overview contains cost estimates for CHF/GByte - CostEst and TechMarketPerf
    • Sep 2014: "Cloud computing, markets and costs" document (recv'd Nov 2015) "Cloud computing Sep2014 v12.pdf".
    • Updated in Mar 2018: "Cloud computing Mar2018 v4.pdf" and "Ingredients and plans for IT resources in Run 3_Mar2018_v3.pdf" (was sent to ITMM)

Tools and data sources

  • NEO4J is a nice graph database. Many of the dependency + costs diagrams look like graphs. Others are tables (machine type->costs); can these be joined? (no, but can be added on import)
  • can combine with "disaster recovery" dependency chain analysis ("circular is bad") - split out into ServiceDependencyDiscovery
    • need to split between "runtime", "setup = configuration", "monitoring" (or other auxiliary services) dependencies
      • could perhaps even auto-discover ("netstat" says you have a runtime dependency)
    • can combine with performance modeling, in particular networking (blocking factor - asked for info at RQF0899792)
  • LanDB is canonical info source for network structure. Changes rarely
  • PuppetDB or Foreman can tie a machine to its hostgroup, and might have other interesting data ("ec2_metadata->instance-type","productserialassettag"). Changes sometimes
  • ROGER has alarm state. Changes frequently
  • SNOW has services, people and machines ("CI")
  • Infor EAM has the CC hardware inventory (among other things)
    • "AIRU" (Asset Inconsistencies Reporting Utility) was mentioned as a cost-aware tool in ASDF June 2016
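
The "circular is bad" check from the dependency analysis above can be sketched without a graph database. This uses a plain depth-first search in Python rather than NEO4J, and the input layout (service -> list of services it depends on) is an assumption:

```python
def find_cycle(deps):
    """deps: service -> iterable of services it depends on.
    Returns one dependency cycle as a list, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited / on current path / done
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, ()):
            seen = state.get(dep, WHITE)
            if seen == GREY:
                # dep is already on the current path: found a loop
                return path[path.index(dep):] + [dep]
            if seen == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

Running this separately on the "runtime", "setup" and "monitoring" edges would show which kind of dependency closes a loop.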

External resources / Links

Topic revision: r11 - 2018-03-21 - JanIven
 