Introduction
WLCG Tier-1 and Tier-2 VOs require all sites to be able to run multicore (SMP) jobs.
Maui handles mcore scheduling using backfilling. There are several backfilling options in Maui that should be tuned according to your site requirements. However, this approach has two problems: Maui only prioritizes the jobs, not the "holes" created on the WNs to make room for mcore jobs, and, more importantly, backfilling depends strongly on the job wallclock, and it is difficult to guess the walltime of the jobs from our VOs.
To work around this, Jeff Templon from NIKHEF wrote a Python script that dynamically changes the number of WNs dedicated to mcore or single-core jobs by playing with the properties of the nodes and queues, so that mcore queues only point to WNs with the mcore property.
The mcfloat script
Installation
The installation is done with Puppet. The script consists of the mcfloat script plus 3 extra modules, installed in the /usr/local/bin/mcfloat directory. There is also an mcfloat.py link to the mcfloat executable in /usr/local/bin.
# ls /usr/local/bin/mcfloat
mcfloat mcfloat.mod.py mcfloat.orig torqueAttMappers.py torqueAttMappers.pyc torqueJobs.py torqueJobs.pyc torque_utils.py torque_utils.pyc
The script runs from cron every 10 minutes and its debug output is stored in the /var/log/mcfloat/mcfloat.log file. These logs are sent to the accounting01 machine.
*/10 * * * * /usr/local/bin/mcfloat/mcfloat -L debug -l /var/log/mcfloat/mcfloat.log
Modifications
Several modifications were made to Jeff Templon's original script to adapt it to our site.
MCQUEUE
We have three queues, one per VO.
MCQUEUE = 'mcore_sl6'
MCQUEUE2 = 'mcore_sl6_atlas'
MCQUEUE3 = 'mcore_sl6_at2'
Thus, we must add an extra check for whether any of the 3 queues has mcore jobs waiting (we only check that queued mcore jobs exist, not how many there are).
mcjoblist = list()
waitingJobs = False
for j in jlist:
    if (j['queue'] == MCQUEUE) or (j['queue'] == MCQUEUE2) or (j['queue'] == MCQUEUE3):
        newj = mapatts(j,usedkeys)
        mcjoblist.append(newj)
        if newj['job_state'] == 'queued':
            waitingJobs = True
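A minimal, self-contained illustration of the same check. The toy job list and the mapatts stub below are simplified stand-ins for the real Torque data structures, not the script's actual code:

```python
MCQUEUE, MCQUEUE2, MCQUEUE3 = 'mcore_sl6', 'mcore_sl6_atlas', 'mcore_sl6_at2'

# Toy job list; the real jlist comes from the torqueJobs module (qstat output).
jlist = [
    {'queue': 'short_sl6',      'job_state': 'queued'},
    {'queue': 'mcore_sl6_atlas', 'job_state': 'running'},
    {'queue': 'mcore_sl6_at2',   'job_state': 'queued'},
]

def mapatts(j, usedkeys=None):
    # Stand-in for torqueAttMappers.mapatts: here it just copies the job dict.
    return dict(j)

mcjoblist = []
waitingJobs = False
for j in jlist:
    if j['queue'] in (MCQUEUE, MCQUEUE2, MCQUEUE3):
        newj = mapatts(j)
        mcjoblist.append(newj)
        if newj['job_state'] == 'queued':
            waitingJobs = True

print(len(mcjoblist), waitingJobs)
```

Only the two mcore jobs are collected, and the single queued one is enough to set waitingJobs.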
MAXDRAIN and MAXFREE
These control the maximum number of nodes in draining state (that is, with the mcore property and free slots) and the maximum number of free slots.
MAXDRAIN = 16 # max num of nodes allowed to drain
MAXFREE = 105 # max num of free slots to tolerate
As of 07/08/2014, that is 16 WNs (4.9% of the whole farm) and 105 slots (2.4% of the whole farm).
CANDIDATE_NODES
These are the nodes eligible to be moved between the mcore and single-core queues. We modify the candidate nodes from time to time.
CANDIDATE_NODES = [ 'td%d.pic.es' % (n) for n in range(600,718)]
CANDIDATE_NODES2 = [ 'td%d.pic.es' % (n) for n in range(503,505)]
CANDIDATE_NODES3 = [ 'td%d.pic.es' % (n) for n in range(720,736)]
CANDIDATE_NODES4 = [ 'td0%d.pic.es' % (n) for n in range(70,86)]
CANDIDATE_NODES.extend(CANDIDATE_NODES2)
CANDIDATE_NODES.extend(CANDIDATE_NODES3)
CANDIDATE_NODES.extend(CANDIDATE_NODES4)
As of 07/08/2014, that is 152 WNs (46.8% of the whole farm) and 2608 slots (59.3% of the whole farm).
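The 152-node figure follows directly from the four ranges above (remember that Python's range excludes its upper bound), as this quick check shows:

```python
# Same four ranges as in the script configuration.
CANDIDATE_NODES = ['td%d.pic.es' % n for n in range(600, 718)]    # 118 nodes
CANDIDATE_NODES += ['td%d.pic.es' % n for n in range(503, 505)]   # 2 nodes
CANDIDATE_NODES += ['td%d.pic.es' % n for n in range(720, 736)]   # 16 nodes
CANDIDATE_NODES += ['td0%d.pic.es' % n for n in range(70, 86)]    # 16 nodes

print(len(CANDIDATE_NODES))  # 152, the figure quoted above
```

The slot count (2608) cannot be derived from the node names alone, since it depends on the core count of each WN.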
DEBUG INFORMATION
We added some extra debug information into the logs.
logging.debug("Starting script...")
logging.debug("Candidate nodes: %d. Total slots: %s" % (len(CANDIDATE_NODES) , total_mcore_slots))
logging.debug("Draining nodes: %d" % (draining_nodes))
logging.debug("Draining slots: %d" % (draining_slots))
logging.debug("Dedicated WN: %d" % (len(mcdedicated)))
logging.debug("Nodes underpopulated: %d" % (len(nodes_consistently_underpopulated)))
logging.debug("Nodes with too few jobs: %d" % (len(nodes_with_too_few_jobs)))
The total_mcore_slots variable was added by us.
total_mcore_slots = 0
for node in wnodes:
    if node.name in CANDIDATE_NODES and node.state.count('offline') == 0:
        mcnodes.append(node)
        total_mcore_slots += node.numCpu
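The effect of the two filters (candidate membership and the offline check) can be seen with toy node objects. The FakeNode class below is a stand-in for the script's real node type, which exposes name, state, and numCpu:

```python
class FakeNode:
    # Minimal stand-in for the node objects the script gets from Torque.
    def __init__(self, name, state, numCpu):
        self.name, self.state, self.numCpu = name, state, numCpu

CANDIDATE_NODES = ['td600.pic.es', 'td601.pic.es', 'td602.pic.es']
wnodes = [
    FakeNode('td600.pic.es', 'free', 8),
    FakeNode('td601.pic.es', 'offline', 8),        # skipped: offline
    FakeNode('td999.pic.es', 'free', 8),           # skipped: not a candidate
    FakeNode('td602.pic.es', 'job-exclusive', 32),
]

mcnodes, total_mcore_slots = [], 0
for node in wnodes:
    if node.name in CANDIDATE_NODES and node.state.count('offline') == 0:
        mcnodes.append(node)
        total_mcore_slots += node.numCpu

print(len(mcnodes), total_mcore_slots)  # 2 40
```

Offline candidates are excluded from the slot total, so total_mcore_slots reflects only usable capacity.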
NODES UNDERPOPULATED
This is the most important modification to the original script.
The script uses two concepts: ''nodes_with_too_few_jobs'' and ''nodes_underpopulated''. The nodes_with_too_few_jobs are the WNs that are dedicated (mcore property) and have more than 7 free slots. When the script runs, it stores these WNs in a list and, 10 minutes later, on the next run, the WNs that still have too few jobs are considered underpopulated. The original script handles the ''nodes_underpopulated'' like this:
if undcount > 0:
    logging.info("nodes consistently underpopulated:")
    for n in nodes_consistently_underpopulated:
        logging.info(" " + n.name)
    if undcount > 1:
        remcount = undcount / 2 # number to remove
        logging.info("going to remove " + repr(remcount) + " from mc pool")
    else:
        remcount = 1
Thus, if more than one node is underpopulated, half of them are removed from the mcore pool; if there is exactly one, that WN is removed. Basically, if within 10 minutes (10 Maui scheduling cycles, in our case) a dedicated WN is not filled up with jobs, the script moves it back to single core.
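In isolation, the original removal rule reduces to the following sketch (a simplified stand-alone function, not the script's actual code; note that the division is integer division, so half of an odd count rounds down):

```python
def remcount_original(undcount):
    # Original rule: remove half (integer division) of the underpopulated
    # nodes if there is more than one, otherwise remove the single node.
    # Callers only reach this point when undcount > 0.
    if undcount > 1:
        return undcount // 2
    return 1

print([remcount_original(n) for n in (1, 2, 5, 16)])  # [1, 1, 2, 8]
```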
We modified this so that WNs are only restored to the single-core queues when there are more than 10 underpopulated WNs. The nodes_underpopulated therefore persist longer, but once there are more than 10 of them, which means there are too few mcore jobs in the queue, the script quickly moves them all to single core (all of the underpopulated nodes, not half of them as in the original script).
if undcount > 0:
    logging.info("nodes consistently underpopulated:")
    for n in nodes_consistently_underpopulated:
        logging.info(" " + n.name)
    if undcount > 10:
        remcount = undcount # number to remove
        logging.info("going to remove " + repr(remcount) + " from mc pool")
        # find out how many running mc jobs each node has
        waiting, node_tuples = getmcjobinfo(nodes_consistently_underpopulated)
        for node_t in node_tuples[:remcount]:
            nname, nnode, running_mc, unused_slots = node_t
            logging.info("dumped %d empty slots, %d mc jobs on %s" % \
                         (unused_slots, running_mc, nname) )
            if not opts.noopt : remove_from_mc_pool(nnode)
            nodes_with_too_few_jobs.remove(nnode)
            nodes_consistently_underpopulated.remove(nnode)
            removed_a_node = True
This modification is debatable, and we may go back to the original behaviour in the future; the idea is to compare the two behaviours and choose the better one.
Nagios tests
There is a Nagios check called check_mcfloat.sh. It produces no CRITICAL messages, only WARNINGs. It performs the following checks:
- Check that the script has run within the last 10 minutes. If not, it is probably due to changes in the code.
- Check the number of dedicated WNs. If it is less than 10, the number of dedicated WNs is really low; verify that this agrees with the number of queued mcore jobs.
- Check the number of underpopulated nodes. If it is more than 10, the farm is being drained of mcore jobs. This also serves to monitor our script modification.
#!/bin/bash
LOG=/var/log/mcfloat/mcfloat.log
TIME=$(grep "Starting script" $LOG | tail -1 | awk '{print $2}' | cut -d "," -f1)
RUN=$(date --date "$TIME" +%s)
DATE=$(date +%s)
let LAST_RUN=$DATE-$RUN
DEDICATED_NODES=$(grep "Dedicated WN" $LOG | tail -1 | awk '{print $6}')
UNDERPOPULATED_NODES=$(grep "Nodes underpopulated" $LOG | tail -1 | awk '{print $6}')
if [ $LAST_RUN -gt 600 ]; then
    echo "[WARNING]: mcfloat script has not run in the last 10 minutes"
    exit 1
fi
if [ $DEDICATED_NODES -le 10 ]; then
    echo "[WARNING]: there are few dedicated WN: $DEDICATED_NODES, ensure that mcore queues are empty"
    exit 1
elif [ $UNDERPOPULATED_NODES -ge 10 ]; then
    echo "[WARNING]: there are too many underpopulated nodes: $UNDERPOPULATED_NODES"
    exit 1
else
    echo "[OK]: mcfloat provides $DEDICATED_NODES WN to mcore queues"
    exit 0
fi
--
CarlosAcostaSilva - 04 Sep 2014