Introduction

The WLCG VOs need to run multicore (SMP) jobs at all Tier-1 and Tier-2 sites.

Maui handles multicore (mcore) scheduling through backfilling. Maui offers several backfilling options that should be tuned to the requirements of each site. However, this approach has two problems. First, Maui only prioritizes jobs; it does not prioritize the "holes" that must open up on a WN before an mcore job can start. Second, and more importantly, backfilling depends strongly on the job wallclock limit, and the walltime of the jobs from our VOs is difficult to guess.

For this reason, Jeff Templon from NIKHEF wrote a Python script that dynamically changes the number of WNs dedicated to mcore or single core jobs by changing the properties of the nodes and queues. The mcore queues only point to WNs that have the mcore property.

The mcfloat script

Installation

The installation is done by Puppet. The script consists of the mcfloat executable plus 3 extra modules. Our installation lives in the /usr/local/bin/mcfloat directory. There is also a mcfloat.py link in /usr/local/bin that points to the mcfloat executable.

   # ls /usr/local/bin/mcfloat
   mcfloat  mcfloat.mod.py  mcfloat.orig  torqueAttMappers.py  torqueAttMappers.pyc  torqueJobs.py  torqueJobs.pyc  torque_utils.py  torque_utils.pyc

The script runs from cron every 10 minutes and writes its debug output to /var/log/mcfloat/mcfloat.log. These logs are sent to the accounting01 machine.

 */10 * * * * /usr/local/bin/mcfloat/mcfloat -L debug -l /var/log/mcfloat/mcfloat.log

Modifications

We introduced several modifications to Jeff Templon's original script to adapt it to our site.

MCQUEUE

We have three queues, one per VO.

 MCQUEUE  = 'mcore_sl6'
 MCQUEUE2 = 'mcore_sl6_atlas'
 MCQUEUE3 = 'mcore_sl6_at2'

Thus, we extended the queue test so that it checks whether any of the 3 queues has mcore jobs waiting. The check only establishes that queued mcore jobs exist, not how many there are.

    mcjoblist = list()
    waitingJobs = False
    for j in jlist:
        if (j['queue'] == MCQUEUE) or (j['queue'] == MCQUEUE2) or (j['queue'] == MCQUEUE3):
            newj = mapatts(j,usedkeys)
            mcjoblist.append(newj)
            if newj['job_state'] == 'queued':
                waitingJobs = True
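
The same test can also be written as a membership check against a tuple of queue names, which stays readable if another queue is added. The following is only a sketch: the `MCQUEUES` tuple, the helper function, the sample job dictionaries and the `long_sl6` queue name are invented for illustration, and it assumes the job attributes have already been mapped (the `mapatts` step is left out).

```python
# Queue names as defined above; MCQUEUES is a hypothetical helper tuple.
MCQUEUE = 'mcore_sl6'
MCQUEUE2 = 'mcore_sl6_atlas'
MCQUEUE3 = 'mcore_sl6_at2'
MCQUEUES = (MCQUEUE, MCQUEUE2, MCQUEUE3)


def has_waiting_mcore_jobs(jlist):
    """Return True if any job in one of the mcore queues is still queued."""
    return any(j['queue'] in MCQUEUES and j['job_state'] == 'queued'
               for j in jlist)


# Invented sample data: one running mcore job and one queued single core job.
sample_jobs = [
    {'queue': 'mcore_sl6_atlas', 'job_state': 'running'},
    {'queue': 'long_sl6', 'job_state': 'queued'},
]
print(has_waiting_mcore_jobs(sample_jobs))
```

With this sample data no mcore job is waiting, so the function reports that nothing is queued for the mcore pool.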

MAXDRAIN and MAXFREE

They set the maximum number of nodes in draining state (that is, nodes with the mcore property that still have free slots) and the maximum number of free slots that we tolerate.

 MAXDRAIN = 16            # max num of nodes allowed to drain
 MAXFREE  = 105            # max num of free slots to tolerate

As of 07/08/2014, these limits correspond to 16 WNs (4.9% of the whole farm) and 105 slots (2.4% of the whole farm).
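
As an illustration of how two limits of this kind can gate the draining of further nodes, here is a minimal sketch. The function name and the rule itself (drain another node only while both limits are respected) are assumptions made for this example; the exact condition used inside mcfloat is not reproduced here.

```python
MAXDRAIN = 16    # max num of nodes allowed to drain
MAXFREE = 105    # max num of free slots to tolerate


def may_drain_another_node(draining_nodes, draining_slots):
    """Assumed rule: start draining one more node only below both limits."""
    return draining_nodes < MAXDRAIN and draining_slots < MAXFREE


print(may_drain_another_node(10, 60))    # below both limits
print(may_drain_another_node(16, 60))    # node limit reached
print(may_drain_another_node(10, 105))   # free slot limit reached
```

Only the first call allows more draining; in the other two cases one of the limits has already been reached.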

CANDIDATE_NODES

These are the nodes that can be moved between the mcore and single core queues. We modify the list from time to time.

 CANDIDATE_NODES = [ 'td%d.pic.es' % (n) for n in range(600,718)]
 CANDIDATE_NODES2 = [ 'td%d.pic.es' % (n) for n in range(503,505)]
 CANDIDATE_NODES3 = [ 'td%d.pic.es' % (n) for n in range(720,736)]
 CANDIDATE_NODES4 = [ 'td0%d.pic.es' % (n) for n in range(70,86)]
 CANDIDATE_NODES.extend(CANDIDATE_NODES2)
 CANDIDATE_NODES.extend(CANDIDATE_NODES3)
 CANDIDATE_NODES.extend(CANDIDATE_NODES4)

As of 07/08/2014, the list contains 152 WNs (46.8% of the whole farm) with 2608 slots (59.3% of the whole farm).
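
The node count can be checked directly from the ranges. Note that the upper bound of `range` is exclusive, so `range(600,718)` covers td600 to td717. The sketch below rebuilds the list in a compact form (written for Python 3, with generator expressions instead of the intermediate lists) and counts the entries.

```python
CANDIDATE_NODES = ['td%d.pic.es' % n for n in range(600, 718)]
CANDIDATE_NODES.extend('td%d.pic.es' % n for n in range(503, 505))
CANDIDATE_NODES.extend('td%d.pic.es' % n for n in range(720, 736))
CANDIDATE_NODES.extend('td0%d.pic.es' % n for n in range(70, 86))

# 118 + 2 + 16 + 16 nodes
print(len(CANDIDATE_NODES))
print(CANDIDATE_NODES[0], CANDIDATE_NODES[-1])
```

The total is 152 nodes, which matches the number of WNs quoted above.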

DEBUG INFORMATION

We added some extra debug information into the logs.

 
 logging.debug("Starting script...") 

 logging.debug("Candidate nodes: %d. Total slots: %s" % (len(CANDIDATE_NODES) , total_mcore_slots))
 logging.debug("Draining nodes: %d" % (draining_nodes))
 logging.debug("Draining slots: %d" % (draining_slots))
 logging.debug("Dedicated WN: %d" % (len(mcdedicated)))
 logging.debug("Nodes underpopulated: %d" % (len(nodes_consistently_underpopulated)))
 logging.debug("Nodes with too few jobs: %d" % (len(nodes_with_too_few_jobs)))

The total_mcore_slots variable was added by us.

 total_mcore_slots = 0

 for node in wnodes:
     if node.name in CANDIDATE_NODES and node.state.count('offline') == 0:
         mcnodes.append(node)
         total_mcore_slots += node.numCpu

NODES UNDERPOPULATED

This is the most important modification we made to the original script.

The script uses two concepts, the ''nodes_with_too_few_jobs'' and the ''nodes_underpopulated''. The first ones, nodes_with_too_few_jobs, are the WNs that are dedicated (mcore property) and have more than 7 free slots. When the script runs, it stores these WNs in a list. In the next run, 10 minutes later, the WNs that still have too few jobs are considered underpopulated. The original script treats the ''nodes_underpopulated'' like this:

 if undcount > 0:
     logging.info("nodes consistently underpopulated:")
     for n in nodes_consistently_underpopulated:
         logging.info("     " + n.name)
     if undcount > 1:
         remcount = undcount / 2  # number to remove
         logging.info("going to remove " + repr(remcount) + " from mc pool")
     else:
         remcount = 1

Thus, if more than one node is underpopulated, half of them are removed from the mcore pool; if there is exactly one, that WN is removed. Basically, if a dedicated WN is not filled up with jobs within 10 minutes (10 Maui scheduling iterations in our case), the script moves it back to single core.

We modified this so that WNs are only returned to the single core queues when more than 10 WNs are underpopulated. As a result, underpopulated nodes persist longer, but once there are more than 10 of them, which means that too few mcore jobs are queued, the script quickly moves them to single core. It moves all the underpopulated nodes, not half of them as the original script does.
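
The two-run detection described above can be sketched as follows. This is a simplified illustration, not the code of mcfloat: the function names and the sample node lists are invented, and how the real script keeps the list between two cron runs is not shown here.

```python
TOO_FEW_THRESHOLD = 7   # a dedicated WN with more free slots than this has too few jobs


def has_too_few_jobs(free_slots):
    """A dedicated WN has too few jobs when more than 7 slots are free."""
    return free_slots > TOO_FEW_THRESHOLD


def consistently_underpopulated(previous_too_few, current_too_few):
    """Nodes flagged in the previous run that are still flagged in this run."""
    previous = set(previous_too_few)
    return [name for name in current_too_few if name in previous]


# Invented example: two consecutive runs, 10 minutes apart.
run_1 = ['td600.pic.es', 'td601.pic.es', 'td602.pic.es']
run_2 = ['td601.pic.es', 'td602.pic.es', 'td610.pic.es']

print(consistently_underpopulated(run_1, run_2))
```

Here td601 and td602 are flagged in both runs and count as underpopulated. td600 filled up in the meantime, and td610 was flagged for the first time, so it gets one more run to fill up.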

 if undcount > 0:
     logging.info("nodes consistently underpopulated:")
     for n in nodes_consistently_underpopulated:
         logging.info("     " + n.name)
     if undcount > 10:
         remcount = undcount  # number to remove
         logging.info("going to remove " + repr(remcount) + " from mc pool")
 
         # find out how many running mc jobs each node has
 
         waiting, node_tuples = getmcjobinfo(nodes_consistently_underpopulated)
 
         for node_t in node_tuples[:remcount]:
                 nname, nnode, running_mc, unused_slots = node_t
                 logging.info("dumped %d empty slots, %d mc jobs on %s" % \
                         (unused_slots, running_mc, nname) )
                 if not opts.noopt : remove_from_mc_pool(nnode)
                 nodes_with_too_few_jobs.remove(nnode)
                 nodes_consistently_underpopulated.remove(nnode)
                 removed_a_node = True

This modification is debatable. We may go back to the original behaviour in the future; our idea is to compare the two behaviours and choose the best one.
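
To make the difference between the two policies concrete, the sketch below computes how many nodes each one would remove for a given number of underpopulated nodes. The function names are invented for this comparison. The sketch is written for Python 3, so it uses `//` to reproduce the integer division that `undcount / 2` performs in the Python 2 script.

```python
def remcount_original(undcount):
    """Original policy: remove half (rounded down), or the single node."""
    if undcount <= 0:
        return 0
    if undcount > 1:
        return undcount // 2
    return 1


def remcount_modified(undcount, threshold=10):
    """Modified policy: remove nothing up to the threshold, then everything."""
    if undcount > threshold:
        return undcount
    return 0


for undcount in (1, 4, 10, 11, 20):
    print(undcount, remcount_original(undcount), remcount_modified(undcount))
```

Up to 10 underpopulated nodes the modified policy removes nothing while the original removes up to 5. From 11 nodes on, the modified policy removes all of them at once, while the original still removes only half.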

Nagios tests

There is a Nagios check called check_mcfloat.sh. It never returns CRITICAL, only WARNING.

  • Check that the script has run in the last 10 minutes. If it has not, the cause is probably a change in the code.

  • Check the number of dedicated WNs. 10 or fewer is a really low number of dedicated WNs, so the warning is a reminder to verify that this agrees with the number of mcore jobs queued.

  • Check the number of underpopulated nodes. 10 or more means that the farm is running out of mcore jobs. This check also helps to verify our modification of the script.

#!/bin/bash
# Nagios check for mcfloat: exits 0 (OK) or 1 (WARNING), never CRITICAL.

LOG=/var/log/mcfloat/mcfloat.log

# Time of the last "Starting script" line (second log field, up to the comma)
TIME=$(grep "Starting script" "$LOG" | tail -1 | awk '{print $2}' | cut -d "," -f1)
RUN=$(date --date "$TIME" +%s)
DATE=$(date +%s)
LAST_RUN=$((DATE - RUN))

# Last values reported by the debug lines added to mcfloat
DEDICATED_NODES=$(grep "Dedicated WN" "$LOG" | tail -1 | awk '{print $6}')
UNDERPOPULATED_NODES=$(grep "Nodes underpopulated" "$LOG" | tail -1 | awk '{print $6}')

if [ "$LAST_RUN" -gt 600 ]; then
        echo "[WARNING]: mcfloat script has not run in the last 10 minutes"
        exit 1
fi

if [ "$DEDICATED_NODES" -le 10 ]; then
        echo "[WARNING]: there are few dedicated WN: $DEDICATED_NODES, ensure that mcore queues are empty"
        exit 1
elif [ "$UNDERPOPULATED_NODES" -ge 10 ]; then
        echo "[WARNING]: there are too many nodes underpopulated: $UNDERPOPULATED_NODES"
        exit 1
else
        echo "[OK]: mcfloat provides $DEDICATED_NODES WN to mcore queues"
        exit 0
fi
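
The decision logic of the check can also be expressed as a small Python function, which makes the thresholds easy to try out with different values. This is only a sketch: the function names are invented, the messages are shortened, and the sample log lines assume a "date time LEVEL message" layout, which is consistent with the awk field numbers used in the check but is not confirmed here.

```python
def field(line, awk_index):
    """Return whitespace-separated field number awk_index (1-based, like awk)."""
    return line.split()[awk_index - 1]


def mcfloat_status(seconds_since_last_run, dedicated, underpopulated):
    """Return (exit_code, message) with the same thresholds as the shell check."""
    if seconds_since_last_run > 600:
        return 1, "[WARNING]: mcfloat has not run in the last 10 minutes"
    if dedicated <= 10:
        return 1, "[WARNING]: there are few dedicated WN: %d" % dedicated
    if underpopulated >= 10:
        return 1, "[WARNING]: too many nodes underpopulated: %d" % underpopulated
    return 0, "[OK]: mcfloat provides %d WN to mcore queues" % dedicated


# Invented log lines in the assumed "date time LEVEL message" layout.
dedicated_line = "2014-09-04 10:00:02,311 DEBUG Dedicated WN: 42"
underpop_line = "2014-09-04 10:00:02,312 DEBUG Nodes underpopulated: 3"

dedicated = int(field(dedicated_line, 6))
underpopulated = int(field(underpop_line, 6))
print(mcfloat_status(120, dedicated, underpopulated))
```
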
-- CarlosAcostaSilva - 04 Sep 2014
Topic revision: r3 - 2014-09-09 - CarlosAcostaSilva
 