WN Working Group.

The worker node working group is a TCG-sponsored activity that aims to address the matching and utilization of worker node resources within the EGEE grid.

The mandate, list of members and mailing list are available here: Mandate

Steps to Reach Goals for Homogeneous subclusters

A number of steps must be completed to reach a state where we support rich configuration of Glue Clusters and SubClusters. All deployment dependencies are contained within a single site or gLite release, so there is no expectation that something must be deployed everywhere before we can proceed to the next step.

| Key | Item | Development Status | YAIM Status | Certification Status | Dependencies | Status 13th Oct 2009 |
| A | Resolving WN to a GlueSubCluster | Package glite-wn-info exists and can be configured to return this value. | PATCH:2114 is submitted but is in configuration status, i.e. YAIM needs to support a richer wn-list.conf file allowing per-WN properties. | Needs the updated YAIM before it can really begin. | None, could be deployed tomorrow. | This is now deployed. |
| B | Resolving a GlueSubCluster to an RTEPublisher | Steve Burke has a provider for this resolution, glite-info-provider-service-1.0.3-0.noarch.rpm. I reviewed it recently and gave some comments that were accepted. BUG:45313. | Trivial configuration, trivial addition to YAIM. New YAIM function attached on WNWorkingGroupInstallLog. | Submitted after updates, easy to certify. | Can be deployed now even on today's lcg-CE and would be worthwhile anyway before being moved with the eventual creation of the glite-CLUSTER node. Although it can be deployed, it is effectively not used until C is done. | Is now deployed. |
| C | Creation of per-SubCluster RTE Tags areas | No development. | Not submitted formally as a request to YAIM, but a trivial function that can be added to the current lcg-CE as well. Will do it. Essentially, for each SubCluster create /opt/glite/var/info/<SubCluster>/<VO> directories with sgm-like permissions. | Trivial to certify. | As mentioned, this can be done now on the lcg-CE, where we know the name of the single SubCluster. However, D should be deployed first so that any tags that end up in there are subsequently published as software tags in the SubCluster. | Is now deployed and working. |
| D | Infoprovider update to publish software tags per SubCluster | Requires an update to lcg-info-dynamic-software, BUG:45310. This is the next thing I will do. | No changes to YAIM. | Easy to certify, if a little fictitious. Nothing will actually be putting tags in here at the time of deployment. | No dependencies, can and should be deployed tomorrow. | Is now deployed and working. |
| E | lcg-ManageTags and lcg-tags need to support a --cluster option | Developers are expecting bugs from me with a request to support a "--subcluster" option. The publication details of the RTEPublisher must be finalised in B before they can have details of what to query. | No YAIM work. | Certification has to check not only that --subcluster works but mainly that it is backwards compatible, assuming that a lack of RTEPublisher for a SubCluster signifies that the SubCluster is the hostname. | As mentioned, the RTEPublisher publishing must be finalised as per B. Once done and developed, it can be deployed without delay. The deployment is irrelevant in the sense that sgm users will just carry on using --cehost and not --subcluster in the first instance of deployment. | lcg-tags on SL5 is now deployed and working. |
| F | YAIM supporting free-style Clusters and SubClusters | There is now a good attempt at this within YAIM, not in the main development branch. This is documented well at YAIM Cluster, and I reviewed the installation recently at WNWorkingGroupInstallLog. It's a very good first attempt and all of the complexity needed is present. The items highlighted in the install log are all small fixes. Essentially another round of the development process is required. | This is all YAIM. | Another development round needed. | This could in principle be deployed tomorrow, but this is unlikely to happen; much of the above can and should be deployed first. C should be done first though. | |
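Step C above describes creating one tag directory per SubCluster and VO. A minimal sketch of that layout follows. It is not the YAIM function itself: the function name, the example SubCluster and VO names, and the reading of "sgm like permissions" as group-writable mode 775 are all assumptions made for illustration, and a temporary directory stands in for /opt/glite/var/info.

```python
import os
import tempfile

# Assumption: "sgm like permissions" is read here as group-writable (mode 775).
# The real deployment may use a different mode and ownership.
SGM_LIKE_MODE = 0o775


def create_tag_areas(base, subclusters, vos, mode=SGM_LIKE_MODE):
    """Create <base>/<SubCluster>/<VO> for every pair and return the paths."""
    created = []
    for subcluster in subclusters:
        for vo in vos:
            path = os.path.join(base, subcluster, vo)
            os.makedirs(path, exist_ok=True)
            # chmod explicitly: the mode given to makedirs is filtered by umask.
            os.chmod(path, mode)
            created.append(path)
    return created


if __name__ == "__main__":
    # A temporary directory stands in for /opt/glite/var/info.
    with tempfile.TemporaryDirectory() as base:
        for path in create_tag_areas(base, ["subcluster-a"], ["vo-one", "vo-two"]):
            print(path)
```

Setting group ownership to the VO's sgm group is left out, since the group names depend on the site.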

Comments

A number of comments have already been addressed to the group that might be considered for inclusion within discussions and outputs.
  • CPU Numbers - The working group will most likely touch on describing heterogeneous batch farms with multiple GlueSubClusters. Consequently, publishing reliable CPU numbers per GlueSubCluster, for use by e.g. gstat, becomes a sensible objective. Long lcg-rollout thread.
  • Passing wall clock time for jobs. It is very likely that the group will consider passing memory and/or disk requirements to the LRMS. Particularly for sites supporting MPI jobs, it is vital that all jobs are also submitted with a wall clock time to allow for backfill. This is not an objective of the group, since different WNs do not generally support different wall clock times, but it is related to argument passing and so can be considered.
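As an illustration of the argument passing mentioned above, the following sketch turns a job's wall clock and memory requirements into the `-l` resource list that Torque's qsub accepts. The function names are hypothetical, and how a CE would actually obtain these values from the job description is not covered here.

```python
def format_walltime(seconds):
    """Render seconds as HH:MM:SS, the form Torque accepts for walltime."""
    if seconds <= 0:
        raise ValueError("wall clock time must be positive")
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def qsub_resource_args(walltime_s, memory_mb=None):
    """Build the '-l' arguments for qsub from the job requirements."""
    resources = [f"walltime={format_walltime(walltime_s)}"]
    if memory_mb is not None:
        resources.append(f"mem={int(memory_mb)}mb")
    return ["-l", ",".join(resources)]


if __name__ == "__main__":
    # prints ['-l', 'walltime=02:00:00,mem=2048mb']
    print(qsub_resource_args(7200, memory_mb=2048))
```

Always emitting a walltime, even when no memory limit is requested, reflects the point that the scheduler needs it for backfill.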

Strategy

  • VOs to produce a list of constraints related to WN capacity by which they wish to describe their jobs, e.g. memory, disk space, anything else?
  • Produce an outline of what can be achieved today. By today we are talking about Glue 1.3 Schema, WMS 3.1 and the LCG CE.
    • We can consider from this if a short term solution is worth implementing given the anticipated constraints of the lcg-CE. Any such solution would likely result in recommendations to the YAIM team for such a deployment.
    • Some sites, notably RAL, already run with a configuration in which matching different worker node resources within the same site is possible but far from optimal.
  • Run within the PPS a CREAM CE, which is expected to be available, at the very earliest, as a pre-pre-release at the end of October 2007.
    • This will be configured as a CE, a Torque batch system and two batch workers with different hardware configurations.
    • Information publishing of this CREAM CE can be tweaked by hand to establish the publishing of this heterogeneous cluster.
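To make the matching problem concrete, here is a rough sketch of selecting SubClusters that satisfy a job's constraints, using the two-worker test setup described above as the example. The class, field names and hardware figures are invented for illustration. In Glue 1.3 the memory figure would correspond to an attribute such as GlueHostMainMemoryRAMSize, and the actual matching is done by the WMS, not by code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCluster:
    """Hypothetical summary of what one SubCluster publishes."""
    name: str
    memory_mb: int
    disk_gb: int


def matching_subclusters(subclusters, min_memory_mb=0, min_disk_gb=0):
    """Return the names of SubClusters that meet every stated constraint."""
    return [
        sc.name
        for sc in subclusters
        if sc.memory_mb >= min_memory_mb and sc.disk_gb >= min_disk_gb
    ]


if __name__ == "__main__":
    # Two batch workers with different hardware, published as two SubClusters.
    workers = [
        SubCluster("small-wn", memory_mb=2048, disk_gb=20),
        SubCluster("large-wn", memory_mb=16384, disk_gb=200),
    ]
    # prints ['large-wn']
    print(matching_subclusters(workers, min_memory_mb=4096))
```

If the site published only one SubCluster, a job asking for 4096 MB could not be told apart from one asking for 1024 MB, which is the limitation the strategy sets out to remove.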

Software Tags

The software tags need to be addressed with respect to running multiple SubClusters on the same physical host.
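Step E in the table above requires backwards compatibility: when no RTEPublisher is published for a SubCluster, the SubCluster name is taken to be the hostname. A minimal sketch of that fallback follows. The mapping, function name and host names are hypothetical, and this is not how lcg-tags is implemented.

```python
def tag_host_for(subcluster, rte_publishers):
    """Return (host, legacy) for the host holding a SubCluster's tags.

    rte_publishers maps SubCluster name -> publisher host (hypothetical
    structure). legacy is True when the fallback was used.
    """
    host = rte_publishers.get(subcluster)
    if host is None:
        # No RTEPublisher published: treat the SubCluster name as the hostname.
        return subcluster, True
    return host, False


if __name__ == "__main__":
    # Two SubClusters served by the same physical host, plus one legacy CE.
    publishers = {
        "subcluster-a": "cluster.example.org",
        "subcluster-b": "cluster.example.org",
    }
    for name in ("subcluster-a", "subcluster-b", "ce.example.org"):
        print(name, tag_host_for(name, publishers))
```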

Job Signals

The batch systems send various signals on job termination.
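From the job's side, handling such a signal might look like the sketch below. This page does not say which signals each batch system sends, so treating SIGTERM (and SIGINT) as the termination warning is an assumption, and the class and function names are illustrative only.

```python
import signal


class TerminationRecorder:
    """Remembers the first termination signal delivered to the job."""

    def __init__(self):
        self.signum = None

    def __call__(self, signum, frame):
        # Keep the handler tiny: only record the signal, act on it later.
        if self.signum is None:
            self.signum = signum

    @property
    def terminating(self):
        return self.signum is not None


def install(recorder, signals=(signal.SIGTERM, signal.SIGINT)):
    """Register the recorder for each signal (must run in the main thread)."""
    for sig in signals:
        signal.signal(sig, recorder)
    return recorder


if __name__ == "__main__":
    recorder = install(TerminationRecorder())
    # A job's main loop would check recorder.terminating between work units
    # and save its state before the batch system kills it outright.
    print("terminating:", recorder.terminating)
```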

Test Rig

A test installation is being set up within the PPS.

See: WNWorkingGroupInstallLog

Presentations

Relevant Documents

Meetings.

-- SteveTraylen - 25 Sep 2007

Topic revision: r18 - 2009-10-13 - SteveTraylen
 