Summary of pre-GDB meeting on Cloud Issues, January 14, 2014 (CERN)

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=272783

Accounting

Introduction - J. Gordon

Most work happened inside the EGI Fed Cloud TF

  • Based on APEL infrastructure
  • UR 1.1 from OGF
  • SSM as transport mechanism

Demonstrated that we can collect accounting data, send them to a central DB and publish them into the accounting portal (a minimal record-building sketch follows this list)

  • From different cloud middleware
  • A prototype has been working for several months
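
To make the data flow concrete, here is a minimal sketch of how a per-VM record could be assembled before being handed to an SSM-style sender. The field names and the commented ssm_send() helper are illustrative only; the authoritative layout is the OGF UR 1.1 / APEL cloud usage record specification.

    # Minimal sketch only: field names are illustrative, the real layout is
    # defined by the OGF UR 1.1 / APEL cloud usage record specification.
    def build_cloud_usage_record(vm):
        record = {
            "VMUUID": vm["uuid"],               # which VM the record describes
            "SiteName": vm["site"],
            "GlobalUserName": vm["user_dn"],    # needed for per-user accounting
            "CpuDuration": vm["cpu_seconds"],   # CPU time consumed so far
            "WallDuration": vm["wall_seconds"], # wall-clock time, also collected
            "CpuCount": vm["cores"],
            "Status": vm["status"],             # e.g. started / completed
        }
        # serialise as simple "key: value" lines, one record per message
        return "\n".join(f"{k}: {v}" for k, v in record.items())

    # a transport wrapper would then queue the message for SSM, e.g.
    # ssm_send(build_cloud_usage_record(vm))   # ssm_send() is assumed, not a real API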

Merging grid and cloud accounting, if desirable, requires normalization of collected data

  • How to benchmark VMs?

Benchmarking is the difficult part if we continue to base accounting on CPU time

  • Impact of overcommitment on VM CPU power
  • Overcommitment is required to reach high efficiency in the usage of physical resources
  • The impact of overcommitment is even worse when moving to WC time (see the worked example after this list)
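
As a worked illustration of the last point (the numbers are invented for the example, not taken from the meeting):

    host_hs06 = 80.0            # illustrative host rating
    cores = 8
    overcommit_factor = 2       # 16 single-core VMs packed onto 8 cores
    vms = cores * overcommit_factor

    hs06_per_vm_busy = host_hs06 / vms     # ~5 HS06 when every VM is busy
    hs06_per_vm_alone = host_hs06 / cores  # ~10 HS06 when competing VMs are idle

    # CPU-time accounting only charges cycles actually consumed, so the variation
    # largely cancels out; wall-clock accounting charges the same elapsed time
    # whether the VM got 5 or 10 HS06, which is why overcommitment hurts more
    # with WC time.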

Discussion

A. Di Girolamo: what about the ability to publish Amazon usage records into APEL?

  • No consensus that the usage accounting of commercial providers has to go to APEL
  • J. Gordon: probably no technical showstopper if a VO wants to do it by parsing the Amazon bill

Benchmarking by VO test applications: may help a VO to compare different sites providing resources to the VO, but this does not allow comparing resources delivered to different VOs

  • VO-specific metrics

Machine/job features may help to publish a minimum CPU power for the VM, but how to push it into the accounting (cloud MW configuration used by accounting)?

  • If the user payload gets more power delivered (because of low usage by competing jobs), this will be "for free"
  • As part of providing the value in machine features, a small benchmark can be run, which can also give a spread (see the sketch after this list)
  • Agreement that this is the preferred method to publish VM CPU power, even though the technical details of how to push this information to the VM may require some work
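
A possible boot-time hook along these lines is sketched below. The quick-bench binary is hypothetical, and the hs06 file name under $MACHINEFEATURES follows the HEPiX machine/job features proposal; everything else is illustrative.

    import multiprocessing
    import os
    import statistics
    import subprocess

    MF_DIR = os.environ.get("MACHINEFEATURES", "/etc/machinefeatures")
    BENCHMARK = "/usr/local/bin/quick-bench"   # hypothetical fast HS06-like probe

    def measure_vm_power():
        cores = multiprocessing.cpu_count()
        # one benchmark instance per core, run concurrently to see contention
        procs = [subprocess.Popen([BENCHMARK], stdout=subprocess.PIPE, text=True)
                 for _ in range(cores)]
        scores = [float(p.communicate()[0].strip()) for p in procs]
        per_core = min(scores)                  # publish a conservative minimum
        spread = statistics.pstdev(scores)      # gives an idea of the spread
        with open(os.path.join(MF_DIR, "hs06"), "w") as f:
            f.write(f"{per_core * cores:.1f}\n")  # total HS06 of the VM
        return per_core, spread                 # +/-20% precision is enough here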

Probably try to stay with CPU time for accounting and let sites overcommit to overcome potential inefficiencies

  • Seems easier in the short term, in particular for pledges
  • Both CPU time and WC time are collected, as for grid jobs

HS06 remains the most accepted VO-independent metric

  • Clarify how many HS06 benchmark instances to run, based on the number of cores in the VM
  • A very precise measurement is not needed: +/-20% will be OK

Security - V. Brillault

From grid:

  • the most important asset: trust
  • the most important threat: misused identity

Thus the focus on traceability: containment, prevention of new occurrences

EGI policy based on VM endorsement

  • VM image endorsed by a limited number of endorsers
  • VM operator has root access to the VM
  • endorser/operator vs. VO/site
  • Traceability probably requires that endorser and operator are not the end user

Clouds: logging requirements remain the same, but some new logs are probably required (VM images instantiated...)

  • We cannot expect any cloud resource provider to save/keep images modified for forensics

User compartmentalization: one UID per end user

  • Required to be able to block a user rather than a VO
  • Prevent users from interacting with each other
  • Keeping a mapping between DNs and UIDs may reintroduce the grid mapping complexity, creating unnecessary accounts...: an alternative may be to create a new userid on the fly for every payload
    • Authorization to create the userid based on the DN (banning check) could be implemented by a small tool run before creating the userid (see the sketch below)
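
Such a tool could be as small as the sketch below; the ban-list path, the account naming scheme and the surrounding glue are assumptions for illustration only.

    import subprocess
    import sys
    import uuid

    BAN_LIST = "/etc/banned-dns.txt"    # illustrative: one banned DN per line

    def dn_is_banned(dn):
        try:
            with open(BAN_LIST) as f:
                banned = {line.strip() for line in f if line.strip()}
        except FileNotFoundError:
            return False                # no ban list deployed yet
        return dn in banned

    def create_payload_account(dn):
        if dn_is_banned(dn):
            raise PermissionError("banned DN: " + dn)
        user = "pl" + uuid.uuid4().hex[:8]          # fresh, one-off account name
        subprocess.run(["useradd", "--create-home", user], check=True)
        # log the DN <-> account association for traceability, then hand it back
        print(f"created {user} for {dn}", file=sys.stderr)
        return user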

Commercial providers have a different approach: any costs involved in forensics are charged to the user

  • Not applicable to community clouds: nothing charged to the user

Root access: if given to end users, traceability/security is difficult to guarantee

  • Will potentially break the endorsement trust
  • Can be given to trusted people in a VO: VM operators
  • But is it possible to prevent an end user from submitting their own VMs in a cloud world?

Vulnerability handling requires the ability to terminate a VM if long lifetimes are used

  • Machine/job features should allow doing this easily; if the VM is not stopped before the deadline, it will be killed by the site (see the sketch after this list)
  • In fact, VOs are in the best position to do the shutdown, as the VO is the entity that knows the image is vulnerable
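
A minimal VM-side check could look like this, assuming the HEPiX machine/job features layout where $MACHINEFEATURES/shutdowntime holds the Unix timestamp of the deadline; the one-hour drain margin is illustrative.

    import os
    import time

    DRAIN_MARGIN = 3600   # illustrative: stop taking new payloads 1h before the deadline

    def seconds_until_shutdown():
        mf_dir = os.environ.get("MACHINEFEATURES", "/etc/machinefeatures")
        path = os.path.join(mf_dir, "shutdowntime")
        if not os.path.exists(path):
            return None                      # no deadline advertised by the site
        with open(path) as f:
            return float(f.read().strip()) - time.time()

    remaining = seconds_until_shutdown()
    if remaining is not None and remaining < DRAIN_MARGIN:
        # VO-side graceful shutdown goes here: drain payloads, then power off,
        # otherwise the site will kill the VM at the deadline
        print("shutdown deadline approaching, draining")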

A questionnaire is currently being prepared by EGI SVG: it probably still requires some work before being distributed

Target Shares in Clouds

UVic Experience - R. Sobie

Operating a distributed production cloud for Atlas since 2012

  • Made of IaaS clouds from 3 continents: Nimbus and OpenStack
  • Static allocation of resources
    • Opportunistic use of some resources
  • Mostly for production jobs
  • Recently opened to Belle2

Some collaborative activities

  • Dynamic SW caching using Squids and Shoal: Shoal discovers, tracks and advertises dynamic Squids
  • CloudScheduler installed at CERN/Wigner
  • Multicore jobs for Atlas

CloudScheduler in charge of stopping VMs from one user if another one with higher priority requires resources

  • It is also in charge of provisioning VMs to execute jobs in the batch queue

Security: VM contextualized at instantiation time (e.g. credentials for Condor pool)

  • Currently logging and traceability are nearly non-existent...

Accounting not done currently

Discussion

Most VOs currently foresee cloud usage only for production jobs.

  • More controlled needs

Queuing: VOs with a pilot factory generally would like to avoid queuing at sites, but how to avoid static partitioning without queuing?

Claudio: we should not reinvent the wheel and should reuse the batch system queuing concept, but for requests only, not for managing jobs

  • Should be integrated in the cloud management software

Andrew: one could imagine a component monitoring the logs to determine VO pressure and adjusting resource usage or VO shares based on that, giving a VO under quota a chance to instantiate a VM

  • Resource usage integrated over a period of time to allow dynamic partitioning (see the sketch after this list)
  • Claudio: it would be better to keep track of rejected requests and start a VM on behalf of the VO (using the request credentials) whose request was rejected
    • Probably more intrusive for the cloud MW...
    • Caching credentials (that may expire) may prove painful
    • Several potential issues with the asynchronous start of VMs, including the ability for the VO to shut down the VM if it is no longer needed
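
A rough sketch of the pressure-based adjustment described above; the target shares, the log format and the selection rule are all invented for illustration.

    from collections import defaultdict

    TARGET_SHARES = {"atlas": 0.5, "cms": 0.3, "lhcb": 0.2}   # illustrative targets

    def pick_underserved_vo(usage_records):
        """usage_records: (vo, core_hours) pairs integrated over the recent window."""
        used = defaultdict(float)
        for vo, core_hours in usage_records:
            used[vo] += core_hours
        total = sum(used.values()) or 1.0
        # deficit = target share minus the share actually delivered in the window
        deficits = {vo: share - used.get(vo, 0.0) / total
                    for vo, share in TARGET_SHARES.items()}
        return max(deficits, key=deficits.get)   # this VO gets the next free slot

    # e.g. pick_underserved_vo([("atlas", 800), ("cms", 900)]) -> "lhcb"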

Philippe: another option would be to start VMs over quota with a short lifetime advertised by job features (see the sketch after this list).

  • Spot market with guaranteed minimum lifetime
  • Could be extended later
  • Would avoid the complexity of VM reclamation triggered by request pressure analysis
  • Tim: could be combined with some temporary overcommitment to allow requests to start when other resources are reclaimed
  • Must keep in mind that a cloud is intended to be large...
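
The spot-market variant could be expressed as a simple rule at instantiation time; the 24h / 2h lifetimes and the quota comparison are illustrative only.

    import time

    GUARANTEED_LIFETIME = 24 * 3600   # within quota: long-lived VM
    SPOT_LIFETIME = 2 * 3600          # over quota: short but guaranteed minimum

    def shutdowntime_for(vo_running_cores, vo_quota_cores):
        # value the site would advertise via the shutdowntime machine/job feature
        lifetime = (GUARANTEED_LIFETIME if vo_running_cores < vo_quota_cores
                    else SPOT_LIFETIME)
        return int(time.time() + lifetime)

    # pushing the deadline back later is enough to implement "could be extended later"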

Tim: combine static partitioning of cloud resources and a batch system for the fluctuating part

  • Will not work for normal sites with huge variation (10x is common)

Wrap-Up

Good and useful discussions on each topic: Michel will try to summarize each discussion as a separate thread on the mailing list

  • Try to make more active use of the mailing list

Plan a new meeting during Spring.
