lcg-info-dynamic-scheduler

From /opt/lcg/share/doc/lcg-info-dynamic-scheduler/lcg-info-dynamic-scheduler.txt:

FOR THE IMPATIENT

Just installing the RPM should result in a working system.

You may want to look at the "vomap" section below if your unix group names are not the same as your VO names.

OK THE REAL DOCS

This is an implementation of per-VO estimated response times and free slots for the LCG 'generic information provider' framework.

HOW IT IS SUPPOSED TO WORK

There are two parts to the system.

One part (the lcg-info-dynamic-scheduler program) contains the algorithm that computes the response times. This part knows nothing about the details of the underlying batch system, so that the estimated times are as independent as possible of the particular LRMS.

The second part is the LRMS specific part. This gathers information from the LRMS and writes it out in a LRMS-independent format. There are two of these critters; one for the LRMS state, and one for the scheduling policy.

A typical deployment will consist of the generic part, and two LRMS- and/or site-specific plugins for this second part. These plugins need to write their information in a specific format that is the same for all LRMSes/schedulers. For the LRMS state information, see the file lrmsinfo-generic.txt; for the free slot information, see the file vomaxjobs-generic.txt.

WORKING EXAMPLES

See the doc directory, where a working set of files is included. If you use YAIM, it sets these up for your site by default.

DEPLOYMENT

The RPM building takes care of making sure that the scripts can find the support modules (e.g. lrms.py, EstTT.py). If you decide to move these modules somewhere else (relocate the RPM) then you will need to adjust the paths in the script.

Aside from that, the dynamic-scheduler plugin needs a configuration file, which is described below. Finally, one needs to drop the ERT plugin (lcg-info-dynamic-scheduler-wrapper) into the GIP plugin directory, typically

/opt/lcg/var/gip/plugin

CONFIG FILE

The format of the config file is like a Windows INI file; for the exact format, see the documentation for the ConfigParser module in the Python standard library. Concisely it looks like this:

[Main]
static_ldif_file : value   # required
vomap :                    # required
  unixgroup1 : vo1
  unixgroup2 : vo2
  [ ... ]

[LRMS]
lrms_backend_cmd : value  # required

[Scheduler]
vo_max_jobs_cmd : value # optional
cycle_time  : value   # optional
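
As an illustration, here is a minimal Python sketch of reading such a file with the standard ConfigParser module; the path passed to read() is only an assumption for the example and should be replaced by whatever config file your wrapper actually uses.

# Minimal sketch (not the actual plugin code): read the INI-style config
# file with the standard ConfigParser module.  The path below is only an
# example, not necessarily where your installation keeps the file.
import ConfigParser                       # 'configparser' on Python 3

cfg = ConfigParser.ConfigParser()
cfg.read('/opt/lcg/etc/lcg-info-dynamic-scheduler.conf')   # assumed path

static_ldif_file = cfg.get('Main', 'static_ldif_file')     # required
lrms_backend_cmd = cfg.get('LRMS', 'lrms_backend_cmd')     # required

if cfg.has_option('Scheduler', 'cycle_time'):              # optional
    cycle_time = cfg.getint('Scheduler', 'cycle_time')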

Here is an explanation of each of the options.

[Main] static_ldif_file

Set to the full pathname of the static LDIF file used by the GIP system. It is used to find out which CEs are in the system, and which VOs are supported by which CE. You can find this in a normal deployment under

/opt/lcg/var/gip/ldif/static-file-CE.ldif

[Main] vomap

On many systems, access to the LRMS queues is controlled via unix group names, with a single unix group associated with the set of pool accounts belonging to a particular VO. Sometimes the unix group names are not the same as the VO names, "VO name" here being the thing that appears in the GlueCEAccessControlBaseRule lines. The vomap is how information coming from the LRMS backends is modified so that the ERT and free-slot computations can be done using the VO names found in the static LDIF file, rather than having to make site-specific customizations to either the LRMS backend plugins or the ERT scripts.

The format looks like

vomap:
   atlsgm : atlas
   biome  : biomed
   LSF_lhcb : lhcb

The leading spaces are significant: they mark each line as a continuation of the 'vomap' parameter. The left-hand side of each line is the unix group name; the right-hand side is the VO name found in the GlueCEAccessControlBaseRule field. If a unix group name and VO name are the same, there is no need to include that pair in the vomap construct. NOTE: I have yet to test what happens if vomap is empty.

New in release 2.0 : the vomap can also map unix groups to VOMS FQANs, like

aliceprd : /VO=alice/GROUP=/alice/ROLE=production

which maps the unix group 'aliceprd' to the listed VOMS FQAN. If there is a VOView for this FQAN in the static_ldif_file, the relevant info will be printed for jobs belonging to this unix group.
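
To make the continuation-line format concrete, here is a small sketch (not taken from the actual code) of how the raw 'vomap' value returned by ConfigParser could be turned into a dictionary mapping unix group to VO name or FQAN; splitting on the first ':' only is an assumption that matches the examples above.

# Sketch: turn the multi-line 'vomap' value into a {group: vo_or_fqan} dict.
# ConfigParser returns the continuation lines as a single string with
# embedded newlines.
def parse_vomap(raw):
    vomap = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        group, vo = [part.strip() for part in line.split(':', 1)]
        vomap[group] = vo
    return vomap

# parse_vomap("\natlsgm : atlas\naliceprd : /VO=alice/GROUP=/alice/ROLE=production")
# -> {'atlsgm': 'atlas', 'aliceprd': '/VO=alice/GROUP=/alice/ROLE=production'}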

[LRMS] lrms_backend_cmd

Set this to a string that will run the command producing the queue state information (the first of the two LRMS-specific plugins described above). This could either be a real command like

/path/to/cmd/lrmsinfo-pbs -c /other/path/to/config_file_if_needed

or it could simply write the contents of some file to standard output if you choose to generate the queue state information by a mechanism other than the GIP subsystem:

cat /place/where/you/find/text_file_containing_the_generic_lrms_state.info
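
Either way, the scheduler just runs the configured command string and reads its standard output. Here is a rough sketch of that pattern (the helper name run_backend is made up for illustration; the real script may invoke the command differently):

# Sketch: run a configured backend command and capture its output.
# Illustrative only; not the invocation used by the real script.
import subprocess

def run_backend(cmd):
    # cmd is the string from lrms_backend_cmd (or vo_max_jobs_cmd)
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    output, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("backend command failed: %s" % cmd)
    return output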

[Scheduler] vo_max_jobs_cmd

This is the same kind of critter as [LRMS] lrms_backend_cmd, except that it provides, for each unix group known to the LRMS, the maximum number of job slots that group can occupy. Mapping from group to VO is handled by the vomap parameter. If a VO has no cap on the number of jobs, it can be left out. The command itself is optional; if it is absent, the free slots are limited only by the number of free CPUs (or are set to zero when jobs for a VO are waiting, in which case there are obviously no free slots at the moment). Please take note of this request for enhancement: https://savannah.cern.ch/bugs/?23586

[Scheduler] cycle_time

There are various pieces of information that only make sense modulo the scheduler cycle time. For example, if the ERT prediction is 'zero', it is reset to half the scheduler cycle, since a job is almost never run immediately unless it happens to have been submitted just before the scheduling cycle starts. Set this parameter to the length of your scheduling cycle. It is optional. NOTE: I have yet to test what happens if you leave this out.
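
For illustration only, the 'reset to half the cycle' rule described above amounts to something like the following sketch (the function and variable names are made up, not taken from the actual code):

# Sketch of the adjustment described above: a zero ERT prediction is
# bumped to half the scheduling cycle, since a freshly submitted job
# normally has to wait for the next scheduling pass.
def adjust_ert(ert, cycle_time):
    if ert == 0:
        return cycle_time / 2.0
    return ert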

LIMITATIONS

The program assumes that the structure of the static LDIF file is always such that each block (corresponding to a given DN) is separated from the next one by a blank line. If this is not the case, the parser will fail.

The program assumes that dn's it needs to parse are constructed like

dn: GlueCEUniqueID=lxb...

where the line begins in the first column, and there is exactly one space between the colon and the 'G' character.

For the purposes of VOView scheduling information, the plugin only looks at the last AccessControlBaseRule in each VOView. The assumption here is that the only reason that two ACBRs would be present in a single VOView is if they mapped to the same underlying group in the batch system, in which case it does not matter which one is used. The last one happens to be easier for parsing purposes.
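
To summarise the parsing assumptions above, here is a small illustrative sketch (not the parser actually used) of the block and dn conventions the program relies on:

# Sketch of the LDIF parsing assumptions: blocks are separated by blank
# lines, and interesting blocks start with a line of the exact form
# "dn: GlueCEUniqueID=...".  Not the real parser.
def ldif_blocks(path):
    block = []
    for line in open(path):
        line = line.rstrip('\n')
        if line.strip() == '':          # blank line terminates a block
            if block:
                yield block
            block = []
        else:
            block.append(line)
    if block:
        yield block

def ce_dns(path):
    for block in ldif_blocks(path):
        if block and block[0].startswith('dn: GlueCEUniqueID='):
            yield block[0]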

lrmsinfo-xyz

This is a program that, for batch system xyz, outputs the queue state in a prescribed format.

From /opt/lcg/share/doc/lcg-info-dynamic-scheduler/lrmsinfo-generic.txt:

The output of the LRMS-specific part needs to contain a snapshot of the state of the LRMS. This state should be as faithful as possible; 'massaging' of the state should be left to higher-level programs such as the ERT system (which handles mapping of unix group names to VO names). Placing the massaging at a higher level and keeping the LRMS-specific part pristine has two main benefits:

1) the massaging is uniform across LRMS types, so one can at least hope that there won't be some LRMS bias in the estimates

2) if the LRMS tool reports the real information, it might well be useful for some purpose besides predicting ERTs.

The required format of this file is described below.

EXAMPLE FILE

nactive     240
nfree       191
now         1119073982
schedCycle  120
{'queue': 'atlas', 'start': 1119073982.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 345600.0, 'qtime': 1119073781.0, 'jobid': '612049.tbn20.nikhef.nl'}
{'queue': 'qlong', 'start': 1119060911.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, 'qtime': 1119060774.0, 'jobid': '612043.tbn20.nikhef.nl'}
{'queue': 'atlas', 'start': 1119060910.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 345600.0, 'qtime': 1119060759.0, 'jobid': '612039.tbn20.nikhef.nl'}
{'queue': 'qlong', 'start': 1119136200.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, 'qtime': 1119135972.0, 'jobid': '612176.tbn20.nikhef.nl'}
{'queue': 'dzero', 'start': 1119268211.0, 'state': 'running', 'group': 'dzero', 'user': 'dzero004', 'maxwalltime': 345600.0, 'qtime': 1119268047.0, 'jobid': '612241.tbn20.nikhef.nl'}

The structure enclosed in '{}' characters is repeated on one line for each job currently either executing or waiting in the queue. Here are some explanations of the semantics of the values:

nactive is the number of job slots that are actually capable of running jobs at the snapshot time (call the snapshot time t0 for brevity). By 'actually capable of running jobs' I mean the maximum number of jobs that could be running on the system at t0. So nactive counts all job slots, empty or occupied, but does not count the job slots on CPUs that are 'down' or 'offline'. It is not the theoretical maximum number of job slots in your farm (unless ALL your WNs are working); it is the number that are 'up'.

nfree is the number of these active job slots that do not have an assigned job at t0. They can potentially accept a new job at t0 (or at least at the start of the next scheduling cycle).

Note these numbers don't have anything to do with VOs (unless each node happens to be exclusively assigned to a single VO). They are aggregates of all job slots that are being controlled by a single LRMS.

'now' is a timestamp in seconds of when the queue was inspected. The only constraint here is that 'now' has to be in the same units, and have the same zero reference, as all the times in the per-job lines (like 'qtime' or 'start'). In the PBS version provided, 'now' is in local time seconds, meaning seconds since midnight Jan 1st 1970 local time. Again, as long as the units are seconds and all times have the same reference point, the actual reference point does not matter.

'schedCycle' is the cycle time of your batch scheduler; how often does it start a new scheduling pass? As of this writing at NIKHEF it is 120 seconds, meaning a new scheduling attempt is started every 120 seconds.

Each line thereafter reports the info for a single job.

{'queue': 'qlong', 'start': 1119060911.0, 'state': 'running', 'cpucount': 1, 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, 'qtime': 1119060774.0, 'jobid': '612043.tbn20.nikhef.nl'}

This has the structure { 'key1' : 'attr1', 'key2' : 'attr2' } and is written in this particular format because it is the string representation of a Python 'dictionary' (same as a Perl 'hash'), making the input parsing for the other part very easy. The order of the various keys is irrelevant; you could write {'key2' : 'attr2', 'key1' : 'attr1' } if you wanted.
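
As an illustration of why the parsing is easy, the following sketch (not the actual consumer code) reads the header values and the per-job dictionaries from such output; it uses eval on each job line because the line is a Python dictionary literal (on newer Pythons, ast.literal_eval is a safer alternative).

# Sketch: parse the generic LRMS state format shown above.  Header lines
# are "key value"; every line starting with '{' is a Python dict literal
# describing one job.  Illustrative only.
def parse_lrms_state(lines):
    headers = {}
    jobs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('{'):
            jobs.append(eval(line))          # one dict per job line
        else:
            key, value = line.split(None, 1)
            headers[key] = value             # e.g. headers['nactive'] = '240'
    return headers, jobs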

Not all the fields are required, but they should be consistent. All jobs should have a 'qtime', since they must have entered the queue at some point. If a job is in state 'running' it must have a 'start' time; if it is 'queued' then 'start' should be absent.

Here is a bit of explanation of the various fields:

In the example above, the local PBS jobid is 612043.tbn20.nikhef.nl; this just has to be a unique string (no two jobs should have the same string).

'qtime' is the timestamp of when the job entered the queue, with the same reference point as 'now'; now - qtime tells you how long it has been since the job entered the queue (was submitted). 'maxwalltime' is the maximum amount of real time (in seconds) that the execution of a job in this queue may take. 'user' and 'group' are the pool account ids under which the job runs.

'cpucount' is how many CPUs are assigned to this job.

'state' can be either 'queued', 'running', 'pending', or 'done'. 'pending' means it is in the queue but has been placed on 'hold'.

'start' is the timestamp of when the job actually started to execute; again, it needs to be measured in the same coordinates as 'now'. Finally, 'queue' gives the name of the queue in which this job is running (like 'qlong').
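
Putting the timestamps together, here is a small illustrative calculation (not from the real code) of how long a job has been waiting and, if running, how long it has been executing:

# Sketch: derive waiting and running times for one job record, using the
# 'now' value from the header.  Purely illustrative.
def job_ages(job, now):
    waited = job.get('start', now) - job['qtime']   # time spent queued
    running = None
    if job['state'] == 'running':
        running = now - job['start']                # time spent executing
    return waited, running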

-- DVanDok - 11 Dec 2008
