Yoda is the Event Service implementation on HPC. The overall architecture is shown below.
Yoda is composed of several parts:
- pilotRunJobHpcEvent - the frontend part that downloads jobs and event ranges from Panda, stages outputs out to the objectstore and updates event status in Panda.
- YodaDroid - the HPC MPI job that runs the events.
- EventServerJobManager - the main part in Droid that manages AthenaMP, TokenExtractor and the Yampl messaging. It injects events into AthenaMP for processing and retrieves the outputs.
Yoda code
The live updates are in GitLab.
The stable updates are in GitHub, normally with little delay.
git remote -v
origin ssh://git@gitlab.cern.ch:7999/wguan/PandaPilot.git (fetch)
origin ssh://git@gitlab.cern.ch:7999/wguan/PandaPilot.git (push)
upstream ssh://git@github.com/PanDAWMS/pilot.git (fetch)
upstream ssh://git@github.com/PanDAWMS/pilot.git (push)
How pilot runs
How Yoda runs (we describe the non-backfill mode here):
- catchall defines nodes, walltime, corespernode, time_per_event
- pilot calculates the total CPU hours.
- pilot estimates the total number of events as total_cpu_hours / time_per_event. Normally pilot cannot finish all of these events.
- pilot gets multiple jobs until the total number of events is reached or there are no more jobs. If there are no more jobs, pilot shrinks the number of ranks it needs. For example, if pilot is configured to run on 700 nodes but only gets one job, it will shrink to one or two nodes.
- pilot prepares all jobs (stage-in, setup, ...) and event ranges and sends them to Yoda.
- When Yoda is running, pilot updates all PanDA jobs from 'starting' to 'running'. At the same time it recalculates the corecount as corecount = total_cpu_cores / total_panda_jobs, because Yoda uses its own internal scheduler to distribute events and it is not easy for pilot to know the exact corecount and running time of an individual job. It can therefore happen that pilot sets a PanDA job to 'running' although it never actually runs in Yoda. A sketch of this sizing arithmetic follows this list.
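A minimal sketch of the sizing arithmetic above. All variable names and the numeric values are illustrative only; the real pilot code uses its own names and reads these values from catchall and the PanDA jobs.

# Illustrative sketch of the pilot's event/corecount estimate (not the real code).
nodes = 700                  # from catchall
walltime_hours = 2.0         # from catchall
cores_per_node = 24          # from catchall (e.g. Edison)
time_per_event_hours = 0.1   # from catchall (CPU hours per event)

total_cpu_hours = nodes * cores_per_node * walltime_hours
estimated_events = int(total_cpu_hours / time_per_event_hours)

# Pilot keeps pulling PanDA jobs until it has enough events or PanDA runs out.
jobs = [{"id": 1, "nEvents": 1000}]          # suppose only one job was available
collected_events = sum(j["nEvents"] for j in jobs)

if collected_events < estimated_events:
    # Shrink the requested ranks to roughly match the work actually collected.
    needed_cpu_hours = collected_events * time_per_event_hours
    nodes = max(1, int(needed_cpu_hours / (cores_per_node * walltime_hours)))

# Once Yoda starts, the corecount reported to PanDA is an average over jobs.
total_cpu_cores = nodes * cores_per_node
corecount = total_cpu_cores // len(jobs)
print(nodes, corecount)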
Yoda Accounting
- Accounting info in Droid for every rank: when a PanDA job is running in a Droid, the Droid records the following information for that job.
setupTime: from Droid init to the first 'ready for event' message
runningTime: from the end of setup until AthenaMP finishes
totalTime: from Droid init until the Droid finishes
stageoutTime: totalTime - setupTime - runningTime
cores: the number of CPU cores AthenaMP used. At NERSC it is 24, except on rank 0: rank 0 runs both Yoda and Droid and reserves one CPU core for Yoda, so the Droid on rank 0 uses one core less than the other ranks.
cpuConsumptionTime: getCPUTime from the AthenaMP job report; if that fails, os.times() is used. This is CPU time, similar to 'time usercommand' in a shell: user time plus system time, which is much smaller than walltime * cores.
queuedEvents: total number of events injected into this Droid for the current panda job.
processedEvents: total number of events that finished processing.
avgTimePerEvent: totalTime * cores / processedEvents, so it includes setup, running and stage-out time (see the sketch after this list).
startTime: more than one rank can run a given PanDA job. When the job is pulled by a rank for the first time, Yoda marks that moment as the job's startTime.
endTime: when the PanDA job is reported as finished or failed by the last rank, Yoda marks that moment as the endTime.
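A small sketch of the derived timing quantities defined above. The timestamps and counts are dummy values; only the formulas follow the text.

# Sketch of the derived Droid accounting quantities (dummy numbers).
init_time     = 0.0      # Droid init
first_ready   = 300.0    # first 'ready for event' message from AthenaMP
athenamp_done = 7500.0   # AthenaMP finished
droid_done    = 7800.0   # Droid finished

setupTime    = first_ready - init_time
runningTime  = athenamp_done - first_ready
totalTime    = droid_done - init_time
stageoutTime = totalTime - setupTime - runningTime

cores = 24               # 23 on rank 0, which reserves one core for Yoda
processedEvents = 600

# Includes setup, running and stage-out time, as noted above.
avgTimePerEvent = totalTime * cores / processedEvents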
If the job is killed by the cluster system (normally a SIGTERM is sent first and, if the job is still running 30 seconds later, a SIGKILL is sent),
Yoda catches the SIGTERM and sets the endTime of all unfinished jobs to the time of the kill signal.
In the signal handler, Yoda also force-dumps the accounting info to the shared file system, as sketched below.
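A hypothetical sketch of this SIGTERM handling. The `accounting` structure and the dump path are made up for illustration; only the behaviour (mark unfinished jobs' endTime, dump to the shared file system) comes from the text above.

import json
import signal
import time

accounting = {"jobs": {}}   # filled in while Droid/Yoda run (illustrative)

def on_sigterm(signum, frame):
    # Set endTime of all unfinished jobs to the time of the kill signal.
    now = time.time()
    for job in accounting["jobs"].values():
        if "endTime" not in job:
            job["endTime"] = now
    # Force-dump the accounting info to the shared file system (path is a placeholder).
    with open("/path/to/shared/fs/yoda_accounting.json", "w") as f:
        json.dump(accounting, f)

signal.signal(signal.SIGTERM, on_sigterm)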
- Accounting in pilot: pilot scans the shared file system for Yoda's accounting info and then reports it to PanDA. It fills the fields below (a sketch of this mapping follows the list).
jobMetrics: all Yoda accounting info goes here.
startTime: the startTime from the Yoda accounting. Only when pilot finds a startTime for a PanDA job is that job reported as 'running'.
Even if Yoda itself is 'running', a PanDA job without a startTime in Yoda's report is still reported as 'starting'.
endTime: the endTime from the Yoda accounting. The PanDA job is then reported as 'transferring' (other steps in pilot will update it to other states).
timeSetup: avgYodaSetupTime
timeExe: avgYodaRunningTime
timeStageOut: avgYodaStageoutTime
nEvents: totalQueuedEvents
nEventsW: totalProcessedEvents
cpuConsumption: cpuConsumptionTime from yoda's report
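A sketch of how the pilot might translate Yoda's per-job accounting record into the fields it reports to PanDA. The `yoda` dictionary, its dummy values and the key=value formatting of jobMetrics are assumptions; the field names themselves come from the lists above.

# Illustrative mapping from a Yoda accounting record to the PanDA report.
yoda = {
    "startTime": 1458756100.0,
    "endTime": 1458763300.0,
    "avgYodaSetupTime": 300.0,
    "avgYodaRunningTime": 7200.0,
    "avgYodaStageoutTime": 120.0,
    "totalQueuedEvents": 650,
    "totalProcessedEvents": 600,
    "cpuConsumptionTime": 150000,
}

report = {
    # jobMetrics carries all Yoda accounting info (formatting here is assumed).
    "jobMetrics": " ".join("%s=%s" % kv for kv in sorted(yoda.items())),
    "startTime": yoda["startTime"],          # job flips 'starting' -> 'running'
    "endTime": yoda["endTime"],              # job flips to 'transferring'
    "timeSetup": yoda["avgYodaSetupTime"],
    "timeExe": yoda["avgYodaRunningTime"],
    "timeStageOut": yoda["avgYodaStageoutTime"],
    "nEvents": yoda["totalQueuedEvents"],
    "nEventsW": yoda["totalProcessedEvents"],
    "cpuConsumption": yoda["cpuConsumptionTime"],
}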
How Yoda runs
Example script to run the pilot on Edison.
You need a valid ATLAS proxy.
#!/bin/bash
# Work directory for this pilot instance
cd /scratch2/scratchdirs/wguan/Edison/hpcf/edison01/wguan-pilot-dev-HPC_merge/1458756089.95
# ATLAS grid environment and software area
export OSG_GRID=/global/project/projectdirs/atlas/pilot/grid_env
export VO_ATLAS_SW_DIR=/project/projectdirs/atlas
source $OSG_GRID/setup.sh
# Copy tools used for stage-in/stage-out, and the Rucio account
export COPYTOOL=gfal-copy
export COPYTOOLIN=gfal-copy
export RUCIO_ACCOUNT=wguan
# Download and unpack the pilot tarball
rm -f pilot.tar.gz
wget http://wguan-wisc.web.cern.ch/wguan-wisc/wguan-pilot-dev-HPC_merge.tar.gz -O pilot.tar.gz
tar xzf pilot.tar.gz
# Load the user environment (needed for cron jobs, see below)
source /etc/profile
source /global/homes/w/wguan/.bashrc
source /global/homes/w/wguan/.bash_profile
# Start the pilot for the NERSC_Edison site/queue
python pilot.py -s NERSC_Edison -h NERSC_Edison -w https://pandaserver.cern.ch -p 25443 -d /scratch2/scratchdirs/wguan/Edison/hpcf/edison01/wguan-pilot-dev-HPC_merge/1458756089.95 -N 50 -Q premium
You need to replace this part with your own settings (the default .bashrc and .bash_profile are OK). An interactive run does not need this part, but for a cron job you need it to load the batch modules; otherwise the SLURM commands will not be found.
source /etc/profile
source /global/homes/w/wguan/.bashrc
source /global/homes/w/wguan/.bash_profile
This is the part of the pilot that starts HPC ES. After the pilot gets a job from Panda, it sets up the environment and prepares the job files and commands. It then uses HPCManager to getHPCResource (free cores in backfill mode, or the default resource defined in schedconfig in normal mode), submit the HPC job and poll it. HPCManager is the interface between the pilot and the HPC system. It is currently implemented for PBS/Torque clusters and can be extended. A rough sketch of this interface follows.
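A hypothetical sketch of the HPCManager role described above. The class and method signatures are illustrative; only getHPCResource, job submission and polling are named in the text, and the actual pilot API may differ.

# Sketch of an HPCManager-like interface between the pilot and the batch system.
class HPCManagerSketch(object):
    def getHPCResource(self, backfill=False):
        """Return (nodes, walltime): free backfill slots if backfill=True,
        otherwise the default resource defined in schedconfig/catchall."""
        raise NotImplementedError

    def submit(self, job_script):
        """Submit the Yoda-Droid MPI job to the local batch system
        (e.g. PBS/Torque) and return a batch job id."""
        raise NotImplementedError

    def poll(self, batch_id):
        """Return the batch state of the submitted job
        (queued / running / finished / failed)."""
        raise NotImplementedError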
Yoda-Droid
Yoda-Droid is the HPC MPI job.
- Yoda is the part running on MPI rank 0. It manages the job and event tables centrally and uses the MPI interface to distribute jobs and events to the Droids. Outputs received from the Droids through the MPI interface are recorded in the event table and dumped to the pilot periodically.
- Droid is the part running on the MPI ranks greater than 0. It gets a job from Yoda and starts an EventServerJobManager to run it. When the EventServerJobManager is ready (AthenaMP is set up), the Droid gets event ranges from Yoda and injects them into the ESJobManager. It then polls the ESJobManager, waits for outputs and sends them to Yoda. A minimal sketch of this message flow follows.
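A minimal sketch of the Yoda/Droid exchange over MPI, assuming mpi4py. The message tags and payloads are made up for illustration; the real Yoda-Droid protocol and data structures differ.

# Run with e.g. `mpirun -n 2 python sketch.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Yoda (rank 0): serve one job request from every Droid rank.
    for _ in range(comm.Get_size() - 1):
        status = MPI.Status()
        request = comm.recv(source=MPI.ANY_SOURCE, status=status)
        if request == "GET_JOB":
            comm.send({"PandaID": 12345, "eventRanges": ["range-1", "range-2"]},
                      dest=status.Get_source())
else:
    # Droid (rank > 0): ask Yoda for a job, then process its event ranges.
    comm.send("GET_JOB", dest=0)
    job = comm.recv(source=0)
    # ... start EventServerJobManager for `job`, inject event ranges,
    # poll for outputs and send them back to rank 0.
    print("rank %d got job %s" % (rank, job["PandaID"]))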
EventServerJobManager
ESJobManager is the main part in Droid that manages AthenaMP, the TokenExtractor and the Yampl messaging thread. AthenaMP and TokenExtractor are components of Athena. Clients talk to AthenaMP over a Yampl messaging channel, so ESJobManager is the part that handles the messages on this Yampl channel, as sketched below.
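A hypothetical sketch of the ESJobManager message handling on the Yampl channel. `YamplLikeChannel` is a stand-in (the real Yampl Python binding has its own API), and the message strings, apart from the 'ready for event' request mentioned above, are illustrative.

# Stand-in for a Yampl channel; not the actual Yampl binding.
class YamplLikeChannel(object):
    def recv(self):
        """Return the next message from AthenaMP, or None if there is none."""
    def send(self, msg):
        """Send a message (e.g. an event range) to AthenaMP."""

def handle_messages(channel, event_ranges):
    msg = channel.recv()
    if msg is None:
        return
    if msg.startswith("ready for event"):
        # An AthenaMP worker asks for work: inject the next event range.
        if event_ranges:
            channel.send(event_ranges.pop(0))
        else:
            channel.send("No more events")
    else:
        # Otherwise treat it as an output report from AthenaMP and hand it
        # to Droid so the output can be sent back to Yoda.
        pass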
How to Run Yoda ES pilot
If you are interested in running Yoda ES jobs, you can follow these steps:
--
WenGuan - 2015-02-24