The goal of this page is to summarize what is already in place, and what still needs to be put in place, to enable us to run jobs on opportunistic resources.
This will be done mostly with Virtual Machines that have all the software and configuration pre-defined, so that they connect back to our infrastructure and can run CMS jobs. We would then only need to tell the cloud how many instances (slots) we want from it.
Clouds to be explored
- TeraGrid
- Amazon
- We should draft a formal proposal; it would help to mention some innovation that they would be helping to provide, not just bare resources for our physics purposes
Generic features
- Virtual Machine image
- CVMFS -- grid middleware + CMSSW
- Condor execute node -- connecting back to our Central Manager (see the contextualization sketch after this list)
- Storage interface (still an open question) -- need to analyze the options. We definitely do not want to FUSE-mount Hadoop there, so we need to tell the jobs what to do for stage-out. SRM looks like the best option.
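As a rough illustration of the Condor execute node item, a boot-time contextualization script inside the VM image could write the local Condor config so the startd reports back to our central manager. A minimal sketch in Python; the hostname and config path are placeholders, not our actual infrastructure:

```python
#!/usr/bin/env python
# Minimal contextualization sketch: runs at VM boot and writes the local
# Condor config so this node joins the Tier-2 pool as an execute-only node.
# The central manager hostname and config path are placeholder assumptions.

CENTRAL_MANAGER = "cm.tier2.example.edu"
LOCAL_CONFIG = "/etc/condor/condor_config.local"

CONFIG_TEMPLATE = """\
CONDOR_HOST = {cm}
# Execute node only: no schedd, negotiator or collector on the VM
DAEMON_LIST = MASTER, STARTD
# Let the pool's central manager manage this node
ALLOW_WRITE = $(ALLOW_WRITE), {cm}
"""

def main():
    with open(LOCAL_CONFIG, "w") as fh:
        fh.write(CONFIG_TEMPLATE.format(cm=CENTRAL_MANAGER))

if __name__ == "__main__":
    main()
```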
Specific features
Each cloud will have very different instantiation/termination tools for its VMs, and therefore needs specific scripts to add/remove nodes. The surrounding framework, however, can remain the same (see the sketch below). Further discussion once we look at each cloud more closely. The only conceptually very different cloud is BOINC.
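To illustrate how the framework stays the same while only the cloud-specific layer changes, here is a minimal sketch using Apache Libcloud as the generic layer (DeltaCloud plays the same role over REST); credentials, image and size IDs are placeholders:

```python
# Sketch of cloud-agnostic add/remove-node scripts. Only the Provider
# constant and credentials change per cloud; the framework stays the same.
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

def get_conn(provider=Provider.EC2):
    # Placeholder credentials; swap the provider constant per cloud.
    cls = get_driver(provider)
    return cls("ACCESS_KEY", "SECRET_KEY")

def add_nodes(conn, n, image_id="ami-00000000", size_id="m1.small"):
    """Instantiate n worker-node VMs from our pre-built image."""
    image = [i for i in conn.list_images() if i.id == image_id][0]
    size = [s for s in conn.list_sizes() if s.id == size_id][0]
    return [conn.create_node(name="cms-wn-%d" % i, image=image, size=size)
            for i in range(n)]

def remove_nodes(conn, name_prefix="cms-wn-"):
    """Terminate all worker-node VMs we started."""
    for node in conn.list_nodes():
        if node.name.startswith(name_prefix):
            conn.destroy_node(node)
```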
Interesting discussions
A question we want to ask ourselves is: do we want a Condor pool separate from our Tier-2, or integrated with it? We could either make the "pilot VMs" connect back to the Tier-2 central manager, or use Condor's flocking feature to migrate jobs to the cloud Condor pool according to a given expression (see the sketch below).
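For the separate-pool option, a minimal sketch of the flocking knobs, assuming placeholder hostnames; one snippet is appended to the local config on each side of the link:

```python
# Sketch of the flocking route between the Tier-2 schedd and a separate
# cloud pool (hostnames are placeholders).

# Appended on the Tier-2 schedd: allow jobs to flock to the cloud pool.
SCHEDD_SNIPPET = "FLOCK_TO = $(FLOCK_TO), cloud-cm.example.edu\n"

# Appended on the cloud pool's central manager: accept our schedd.
CLOUD_CM_SNIPPET = "FLOCK_FROM = schedd.tier2.example.edu\n"

def append_local_config(snippet, path="/etc/condor/condor_config.local"):
    with open(path, "a") as fh:
        fh.write(snippet)
```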
Integrating Clouds into a CMS T2
Goal
To provide a distributable, reproducible way of integrating cloud computing resources (processing, for now) into a CMS Tier-2 site. For that we will rely on generic cloud APIs like OpenStack, EC2, or even DeltaCloud, which is compatible with many cloud APIs.
Once we have a development cluster that connects to the Tier-2 local batch system, providing job slots in a transparent way, we can use it to develop a deployment model that enforces special ClassAds for the cloud worker nodes and a special queue in the Computing Element to match them, so the resource can be used transparently or not.
Initial support will be for Condor batch system clusters only.
The resulting scripts and configuration tweaks could be integrated into the official OSG software stack, so they would come OOTB for other sites, which would then only need to minimally configure and use them. We will need OSG support for that.
Technologies and Layers
A diagram of how the system will be organized hierarchically can be found here.
DeltaCloud
Main API to instantiate and destroy nodes. Should act on top of OpenStack and provide a generic interface for the admin tools (scripts) in the CE (a rough sketch of such a script follows below).
- Part of the "generic cloud interface"
OpenStack
Responsible for actually instantiating the worker-node images. Will take commands from DeltaCloud. Could be replaced by Amazon EC2.
- Part of the "generic cloud interface"
Condor
Besides receiving the cloud nodes, we need to make sure they are configured with a ClassAd attribute unique to cloud nodes, for later matching (see the sketch below).
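A minimal sketch of that tagging, assuming we name the attribute IsCloudNode (our own choice, not a Condor built-in); the contextualization step shown earlier would append this to the node's local config:

```python
# Tag cloud worker nodes with a custom startd attribute so jobs can match
# on it later. "IsCloudNode" is our own name, not a Condor built-in.
CLOUD_ATTR_SNIPPET = (
    "IsCloudNode = True\n"
    "STARTD_ATTRS = $(STARTD_ATTRS) IsCloudNode\n"
)

with open("/etc/condor/condor_config.local", "a") as fh:
    fh.write(CLOUD_ATTR_SNIPPET)

# After a condor_reconfig, the cloud slots can be listed with:
#   condor_status -constraint 'IsCloudNode =?= True'
```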
GRAM Gatekeeper (OSGCE)
We need to copy the Condor jobmanager into a new one called "jobmanager-cloud"; the only difference is that it adds a special requirement for jobs to match the cloud-node ClassAd.
This way it is optional to use the cloud nodes transparently or to use them exclusively. By altering the original Condor.pm it is also possible to run only on non-cloud nodes. A rough sketch of the requirement-injection logic is below.
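An illustrative sketch of the one-line behavioral difference (the real Condor.pm is Perl; this just shows the logic, with the IsCloudNode attribute from the Condor section above):

```python
def add_cloud_requirement(requirements, cloud_only=True):
    """Append the cloud-node clause to a job's Requirements expression.

    cloud_only=True  -> run exclusively on cloud nodes (jobmanager-cloud)
    cloud_only=False -> never run on cloud nodes (the altered Condor.pm)
    """
    # "=?=" is true only when the attribute is defined and True;
    # "=!=" is true when it is undefined or not True.
    clause = "(IsCloudNode =?= True)" if cloud_only else "(IsCloudNode =!= True)"
    if requirements:
        return "(%s) && %s" % (requirements, clause)
    return clause

# Example: jobmanager-cloud would turn
#   (Arch == "X86_64")
# into
#   ((Arch == "X86_64")) && (IsCloudNode =?= True)
```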