This page summarizes what is already in place, and what is still needed, to let us run jobs on opportunistic resources.

This will be done mostly with Virtual Machines that have all the software and configuration pre-defined, so that they can connect back to our infrastructure and run CMS jobs. We would then only need to tell the cloud how many instances (slots) we want from it.

Clouds to be explored

  • TeraGrid
  • Amazon
    • We should write a formal proposal. It would help to mention some innovation that they would be helping to enable, rather than only asking for bare resources for our physics purposes.

Generic features

  • Virtual Machine image
    • CVMFS -- grid middleware + CMSSW
    • Condor execute node -- connecting back to our Central Manager
    • Storage interface (??) -- we need to analyze the options. We definitely don't want to FUSE-mount Hadoop there, so we need to tell the jobs what to do for stage-out. SRM looks like the best option.

Specific features

Each cloud will have very different tools for instantiating and terminating VMs, and therefore needs its own scripts to add and remove nodes. The framework around those scripts, on the other hand, can remain the same. This needs further discussion once we look at it more closely. The only conceptually very different cloud is BOINC.
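
The split between cloud-specific scripts and a common framework could be sketched as below. This is only an illustration under assumptions: the names CloudBackend, InMemoryBackend, NodePool and scale_to are hypothetical and do not come from any existing tool, and the in-memory backend stands in for whatever each cloud's own instantiation/termination commands turn out to be.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class CloudBackend(ABC):
    """Cloud-specific part (hypothetical interface): how VMs are started and stopped."""

    @abstractmethod
    def start_instances(self, count: int) -> list[str]:
        """Start `count` worker VMs and return their instance IDs."""

    @abstractmethod
    def stop_instances(self, instance_ids: list[str]) -> None:
        """Terminate the given worker VMs."""


class InMemoryBackend(CloudBackend):
    """Stand-in backend for illustration; a real one would call the cloud's own tools."""

    def __init__(self) -> None:
        self._next_id = 0
        self.running: set[str] = set()

    def start_instances(self, count: int) -> list[str]:
        ids = [f"vm-{self._next_id + i}" for i in range(count)]
        self._next_id += count
        self.running.update(ids)
        return ids

    def stop_instances(self, instance_ids: list[str]) -> None:
        self.running.difference_update(instance_ids)


class NodePool:
    """Cloud-independent part: keeps the number of worker VMs at a requested target."""

    def __init__(self, backend: CloudBackend) -> None:
        self.backend = backend
        self.instance_ids: list[str] = []

    def scale_to(self, target: int) -> None:
        if target < 0:
            raise ValueError("target must be non-negative")
        current = len(self.instance_ids)
        if target > current:
            # Ask the cloud for the missing instances.
            self.instance_ids += self.backend.start_instances(target - current)
        elif target < current:
            # Terminate the most recently added instances first.
            surplus = self.instance_ids[target:]
            self.backend.stop_instances(surplus)
            self.instance_ids = self.instance_ids[:target]
```

With this shape, supporting a new cloud would mean writing one more CloudBackend subclass, while the logic that decides how many slots we want stays untouched.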

Interesting discussions

  • Condor Central Manager
A question we want to ask ourselves is whether we want a Condor pool that is separate from our Tier-2 or one that is integrated into it. We could either make the "pilot VMs" connect back to the Tier-2 central manager, or use the FORK feature from Condor to migrate jobs to the cloud Condor pool according to a given expression.

Integrating Clouds into a CMS T2

Goal

To provide a distributable, reproducible way of integrating cloud computing resources (processing only, for now) into a CMS Tier-2 site. For that we will rely on generic cloud APIs such as OpenStack, EC2 or even DeltaCloud, which is compatible with many cloud APIs.

Once we have a development cluster that connects to the Tier-2 local batch system and provides job slots transparently, we could use it to develop a deployment model that enforces special ClassAds for the cloud worker nodes, together with a special queue in the Computing Element to match them, so that the resource can be used either transparently or not.

Initial support will be for Condor batch system clusters only.

The resulting scripts and configuration tweaks could be integrated into the official OSG software stack, so that they would come out of the box (OOTB) for other sites, which would only need to configure them minimally before use. We will need OSG support for that.

Technologies and Layers

A diagram of how the system is organized hierarchically can be found here

DeltaCloud

Main API to instantiate and destroy nodes. Should act on top of OpenStack and provide a generic interface for the admin tools (scripts) in the CE.
  • Part of "generic cloud interface"

OpenStack

Responsible for actually instantiating the worker node images. Will take commands from DeltaCloud. Could be replaced by Amazon EC2.
  • Part of "generic cloud interface"

Condor

Besides accepting the cloud nodes into the pool, we need to make sure that they are configured to advertise a ClassAd attribute unique to cloud nodes, for later matching.
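
As a rough sketch of what this configuration could look like, the helper below renders a Condor configuration fragment for the worker node image. The attribute name IsCloudNode and the function cloud_node_config are assumptions made for illustration; STARTD_ATTRS is the Condor setting that lists which configuration macros are published in the machine ClassAd.

```python
def cloud_node_config(attr_name: str = "IsCloudNode") -> str:
    """Return a Condor config fragment that advertises `attr_name = True`.

    `attr_name` is a hypothetical attribute name; any name not already
    used in the pool would serve.
    """
    if not attr_name.isidentifier():
        raise ValueError(f"not a valid attribute name: {attr_name!r}")
    return (
        # Define the attribute as a configuration macro ...
        f"{attr_name} = True\n"
        # ... and ask the startd to publish it in the machine ClassAd.
        f"STARTD_ATTRS = $(STARTD_ATTRS) {attr_name}\n"
    )


if __name__ == "__main__":
    print(cloud_node_config(), end="")
```

The printed fragment would go into the local Condor configuration baked into the VM image, so that only cloud nodes carry the attribute.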

GRAM Gatekeeper (OSGCE)

We need to make a copy of the Condor jobmanager and call it "jobmanager-cloud". The only difference is that the copy adds a special requirement to the jobs so that they match the cloud node ClassAd.

This way, a site can choose to use the cloud nodes transparently or to use them exclusively. By altering the original Condor.pm, it is also possible to run only on non-cloud nodes.
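
The three modes described above differ only in the extra clause appended to a job's requirements. The sketch below shows that logic in Python purely for illustration; the real jobmanager is Perl (Condor.pm), and the attribute name IsCloudNode, the mode names and both function names are assumptions. The ClassAd operators =?= and =!= are used so that nodes which do not define the attribute at all are treated as non-cloud nodes.

```python
def cloud_requirement(mode: str, attr_name: str = "IsCloudNode") -> str:
    """Return the extra requirements clause for the given (hypothetical) mode."""
    if mode == "transparent":
        # No extra clause: jobs may land on cloud and non-cloud nodes alike.
        return ""
    if mode == "cloud-only":
        # True only when the attribute is defined and equal to True.
        return f"(TARGET.{attr_name} =?= True)"
    if mode == "non-cloud-only":
        # True when the attribute is undefined or anything other than True.
        return f"(TARGET.{attr_name} =!= True)"
    raise ValueError(f"unknown mode: {mode!r}")


def add_requirement(existing: str, mode: str, attr_name: str = "IsCloudNode") -> str:
    """Combine a job's existing requirements with the cloud clause."""
    extra = cloud_requirement(mode, attr_name)
    if not extra:
        return existing
    if not existing.strip():
        return extra
    return f"({existing}) && {extra}"
```

In this picture, "jobmanager-cloud" would correspond to the cloud-only mode, the unmodified jobmanager to the transparent mode, and an altered Condor.pm to the non-cloud-only mode.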

Topic revision: r4 - 2020-08-21 - TWikiAdminUser
 