Ideas regarding co-scheduling

Introduction

This page was updated following a meeting at CERN on 16/05/2007, where the entire CERN ETICS team was present.

The objective of this note is to capture a possible component-level use-case and implementation of the co-scheduling feature of ETICS. The overall goal is for ETICS users to be able to execute tests on several resources (i.e. machines) at the same time, following a user-defined deployment. In other words, the idea is to be able to dynamically deploy a test-bed on generic resources prior to the execution of the test-suite, gather the test results and release the resources.

This note doesn't address the user interface; it focuses on the server-side activities between the B&T WS and Metronome, as well as on what happens as part of the jobs' execution.

QUESTIONS

Here are a few questions that we asked ourselves, and some agreed answers:

  1. Do we want to express dependencies between nodes and/or services across the deployment? For the moment, no. This is the responsibility of the user, and we provide set and get (with an option for a blocking read) methods on a deployment info system for them to code against in their TestCommand (and/or inside scripts called by it); see the sketch after this list.
  2. Are we supporting cross-site testing?
  3. How do we consolidate the reports from several nodes?
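
To make question 1 concrete, here is a minimal sketch of what the set / get methods could look like to a user. The InfoSystemClient class is hypothetical; the actual interface and transport are not defined in this note, and the in-memory dictionary below merely stands in for the shared info system:

import time

class InfoSystemClient:
    """Hypothetical client for the deployment info system."""
    def __init__(self):
        self.values = {}                  # stand-in for the shared store

    def set(self, key, value):
        self.values[key] = value          # publish a value visible to all nodes

    def get(self, key, blocking=False, poll=5):
        # With blocking=True, wait until some node has set the key.
        while blocking and key not in self.values:
            time.sleep(poll)
        return self.values.get(key)

info = InfoSystemClient()
info.set('host_0', 'lxb001.cern.ch')           # e.g. done by node 0
ws_host = info.get('host_0', blocking=True)    # e.g. done by node 1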

Representation of the deployment model

The current data model doesn't include concepts like deployment, node or service and their possible inter-dependencies. These would make sense in order to fully express a deployment.

Having said that, considering time constraints, we have explored the possibility of providing a simple deployment model using existing elements of the current data model.

Here is what we need to express: for a multi-node test, the user defines a deployment. This deployment contains one or several nodes. Each node, in turn, contains one or several configurations. For example:

deployment
   - node1
      - configuration1
      - configuration2
   - node3
      - configuration1
      - configuration3
   - node4
      - configuration4
      - configuration5
   - node5
      - configuration3

We are proposing to implement this using existing elements (an illustration follows the list):

  • deployment = subsystem configuration
  • node = component configuration (children of above subsystem configuration)
  • configuration = configuration (set as dependencies)
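
Applied to the example above, the mapping would look roughly like this (names are illustrative):

subsystem configuration "deployment"
   component configuration "node1"
      dependencies: configuration1, configuration2
   component configuration "node3"
      dependencies: configuration1, configuration3
   component configuration "node4"
      dependencies: configuration4, configuration5
   component configuration "node5"
      dependencies: configuration3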

We also need to upgrade the TestCommand type to possibly include targets such as:

  • clean *
  • init
  • deploy
  • test
  • publish
  • finalise *

where * means that these targets must be called explicitly. All of them come with pre/post hooks; a sketch of a possible runner follows.
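
As an illustration only (the actual runner is not specified in this note), the upgraded TestCommand could be driven roughly like this, with optional pre/post hooks wrapped around each target:

import subprocess

TARGETS = ['init', 'deploy', 'test', 'publish']   # run automatically, in order
EXPLICIT = ['clean', 'finalise']                  # only run when explicitly called

def run_target(name, commands):
    # Run the optional pre hook, the target itself, then the post hook.
    for phase in ('pre-' + name, name, 'post-' + name):
        cmd = commands.get(phase)
        if cmd:
            subprocess.check_call(cmd, shell=True)

commands = {'init': 'echo init', 'pre-test': 'echo set up', 'test': 'echo run tests'}
for target in TARGETS:
    run_target(target, commands)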

Taking this subsystem configuration structure as the input to a test, let's look at a possible use-case.

USECASE: Deploy and execute a multi-node test-suite

CHARACTERISTIC INFORMATION

Goal: Execute a test-suite following a multi-node deployment model

Scope: Server and execution engine side

Preconditions: A deployment model is provided. All the required platforms are available. All packages have been built for the required platforms.

Success End Condition: Test report is generated and available to the submit node, all used resources are released.

Failed End Condition: An error report is provided back to the user, including the likely cause of the error and suggestions if applicable.

Primary Actor: build and test WS

Trigger: This use-case is triggered by the submit method on the B&T WS


MAIN SUCCESS SCENARIO

  1. B&T WS crafts a multi-node submit file (based on the node info in the deployment model) - see the section SAMPLE NMI SUBMIT FILE below for an example
  2. B&T WS submits to Metronome
  3. Metronome match-makes and allocates the required resources based on the submit file info
  4. All nodes register their hostname to the info system:
    • host_i = '<ip address or hostname>' (where i is the index of the node allocated by Metronome)
  5. The following steps are executed on each individual node (see the sketch after this list)
  6. Execute all targets (e.g. init, deploy, test, publish)
  7. Wait for the finalise value to be set (i.e. a blocking read of the value set by another node)
  8. Execute the finalise target
  9. Exit the job
  10. End of the part of the execution focused on an individual node
  11. Metronome releases the resource
  12. Metronome gathers the reports.tar.gz back to the submit node
  13. post_all script consolidates the reports from all the nodes
  14. post_all registers the artefacts with the repository
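
Putting steps 4 to 9 together, the per-node flow could be sketched as follows, reusing the hypothetical InfoSystemClient and run_target from the earlier sketches:

import socket

def run_node(info, node_index, commands):
    # Step 4: register this node's hostname in the info system.
    info.set('host_%d' % node_index, socket.getfqdn())
    # Step 6: execute the automatic targets.
    for target in ('init', 'deploy', 'test', 'publish'):
        run_target(target, commands)
    # Step 7: blocking read until some node sets the finalise flag.
    info.get('finalise', blocking=True)
    # Step 8: execute the finalise target; the job then exits (step 9).
    run_target('finalise', commands)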

EXTENSIONS

  a. At any point in the execution of the TestCommand, a script or a target can set / get values to provide user-defined synchronisation between the different nodes. This is entirely the responsibility of the user.
  b. One node must set the finalise flag in the info system, in order to trigger the finalise target on each node and the clean-up of the resources by Metronome. One suggestion is for node0 to do it automatically when its TestCommand returns.
  c. Between the execution of each target, check whether the abort or finalise flag has been set by another node. If it has, abort or finalise by executing the finalise target (see the sketch after this list).
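
Extension c could be folded into the same per-node loop, again as a sketch reusing the hypothetical names from the sketches above:

# Check the abort / finalise flags between targets (non-blocking reads).
for target in ('init', 'deploy', 'test', 'publish'):
    if info.get('abort') or info.get('finalise'):
        break                       # stop running further targets
    run_target(target, commands)
run_target('finalise', commands)    # finalise runs in either case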

NOTES

  • The "info system" could be implemented using classAdds or a simple WS with setter and getter methods.
  • If during job execution a step fails, the default behaviour is to abort the job execution. This can be overridden if the --continueonerror command-line flag is set
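
For the "simple WS" option, here is a minimal sketch using the Python standard library's XML-RPC server. The names, the port and the run-id "boxing" scheme are illustrative, not a committed design (see the constraints below):

from SimpleXMLRPCServer import SimpleXMLRPCServer   # xmlrpc.server in Python 3

values = {}

def set_value(run_id, key, value):
    # "Box" the value per test run, so jobs cannot see other runs' values.
    values[(run_id, key)] = value
    return True

def get_value(run_id, key):
    return values.get((run_id, key), '')

server = SimpleXMLRPCServer(('0.0.0.0', 8000))
server.register_function(set_value)
server.register_function(get_value)
server.serve_forever()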

CONSTRAINTS REQUIREMENTS

  • The info system must "box" the information for each test run (in other words, a job cannot set or get a value of a job that is not part of the current test)
  • Metronome should have a timeout to abort a test execution, in order to avoid deadlocks, race conditions, or never-returning tasks blocking resources.
  • The implementation of the set / get to/from the info system (e.g. used for synchronisation) checks for the existence of the abort flag (by performing a non-blocking read on that flag) and triggers an abort (i.e. calls finalise and exits the job).

-- MebSter - 16 May 2007

SAMPLE NMI SUBMIT FILE

Here's the multi-node submit file GuillermoDiezAndinoSancho wrote last summer with MarianZUREK. This submit file successfully executed in parallel on 2 nodes.

We might want to add to the submit file the total number of nodes required, which might help the different jobs find each other. For example:
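
(node_count below is a hypothetical attribute, not an existing submit-file keyword:)

# hypothetical: total number of co-scheduled nodes in this test
node_count = 2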

project = System scheduling
component = WS-UTILS
description = testing WS and UTILS
version = 0.1
inputs = /tmp/scripts.scp,/tmp/scripts.scp
platforms = (x86_slc_3,x86_slc_3)

register_artefact = false

#Job 0
project_0 = org.etics
component_0 = org.etics.build-system.webservice
version_0 = org.etics.build-system.webservice.HEAD
checkout_options_0 = --project-config=org.etics.HEAD
submit_options_0 =

service_name_0 = WS
depends_0 = UTILS
remote_declare_args_0 = 0
remote_declare_0 = remote_declare
remote_task_0 = remote_task
remote_post_0 = remote_post
post_all_0 = post_all
post_all_args_0 = x86_slc_3

#Job 1
project_1 = org.etics
component_1 = org.etics.build-system.java-utils
version_1 = org.etics.build-system.java-utils.HEAD
checkout_options_1 = --project-config=org.etics.HEAD
submit_options_1 =

service_name_1 = UTILS
depends_1 =
remote_declare_args_1 = 1
remote_declare_1 = remote_declare
remote_task_1 = remote_task
remote_post_1 = remote_post
post_all_1 = post_all
post_all_args_1 = x86_slc_3

platform_pre_0 = noop.sh
platform_post_0 = noop.sh

platform_pre_1 = noop.sh
platform_post_1 = noop.sh

run_type = build
notify = Guillermo.Sancho@cern.ch

# EVENTUALLY ENABLE but disabling for early testing so as not to have to
# login all the time and remove root-expiry file.
# run_as_root = true

# try to match based on sudo_enabled
append_requirements = sudo_enabled==TRUE

reimage_delay = 60m

remote_declare FILE

This is the remote_declare task script, which works in conjunction with the submit file shown above.

This task generates the task.sh file; an example of the generated file is shown after the script.

#!/usr/bin/env python
import os
import sys

# Generate the task.sh file that Metronome will execute on this node.
tasklist = open('task.sh', 'w')
taskIndex = sys.argv[1]
tasklist.write('taskIndex=%s\n' % taskIndex)

# Fetch and run the ETICS workspace setup, then check out the component.
tasklist.write('wget "http://eticssoft.web.cern.ch/eticssoft/repository/org.etics/client/0.3.1/noarch/etics-workspace-setup.py" -O etics-workspace-setup\n')
tasklist.write('python etics-workspace-setup\n')
tasklist.write('python etics/src/etics-get-project $NMI_project_%s\n' % taskIndex)
tasklist.write('python etics/src/etics-checkout $NMI_checkout_options_%s -c $NMI_version_%s $NMI_component_%s\n' % (taskIndex, taskIndex, taskIndex))

# Source the NMI helper functions (util.sh)
tasklist.write('source util.sh\n')
# Configure Chirp
tasklist.write('setup_chirp\n')

# Wait for the dependencies (services published by other nodes) to be
# resolved. Guard against NMI_depends_<i> being unset, not just empty.
if os.getenv('NMI_depends_%s' % taskIndex):
  dependencies = os.environ['NMI_depends_%s' % taskIndex]
  dependencyList = dependencies.split(';')
  for i in dependencyList:
    tasklist.write('discover_services nodeid %s 35\n' % i)
    tasklist.write('echo "************************************************"\n')
    tasklist.write('echo "Printing requirement _NMI_HOST_%s"\n' % i)
    tasklist.write('echo $_NMI_HOST_%s\n' % i)
    tasklist.write('echo "************************ENV***********************"\n')
    tasklist.write('env|sort\n')
    tasklist.write('echo "************************************************"\n')

# Run a build or a test depending on the run type (build by default).
if os.getenv('NMI_run_type', 'build') == 'build':
   cmd = 'python etics/src/etics-build $NMI_submit_options_%s $NMI_component_%s\n' % (taskIndex, taskIndex)
else:
   cmd = 'python etics/src/etics-test $NMI_submit_options_%s $NMI_component_%s\n' % (taskIndex, taskIndex)
tasklist.write(cmd)

# Publish this node's service so that dependent nodes can discover it.
tasklist.write('SRV_PREFIX=$NMI_service_name_%s\n' % taskIndex)
tasklist.write('RESULT=OK_result\n')
tasklist.write('publish_service ${SRV_PREFIX} ${RESULT}\n')
tasklist.close()
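
For illustration, with taskIndex 0 and the submit file above (where job 0 depends on the UTILS service), the generated task.sh would contain roughly:

taskIndex=0
wget "http://eticssoft.web.cern.ch/eticssoft/repository/org.etics/client/0.3.1/noarch/etics-workspace-setup.py" -O etics-workspace-setup
python etics-workspace-setup
python etics/src/etics-get-project $NMI_project_0
python etics/src/etics-checkout $NMI_checkout_options_0 -c $NMI_version_0 $NMI_component_0
source util.sh
setup_chirp
discover_services nodeid UTILS 35
echo "************************************************"
echo "Printing requirement _NMI_HOST_UTILS"
echo $_NMI_HOST_UTILS
echo "************************ENV***********************"
env|sort
echo "************************************************"
python etics/src/etics-build $NMI_submit_options_0 $NMI_component_0
SRV_PREFIX=$NMI_service_name_0
RESULT=OK_result
publish_service ${SRV_PREFIX} ${RESULT}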

-- GuillermoDiezAndinoSancho - 15 May 2007

CoSchedulingIdeasOriginal

-- MebSter - 16 May 2007
