HpcYoda

Yoda is the Event Service implementation on HPC. The overall architecture is shown below.

yoda-structure.png

Yoda is composed of several parts:

  • pilotRunJobHpcEvent - the frontend part that downloads jobs and event ranges from PanDA, stages out outputs to the objectstore and updates event status to PanDA.
  • YodaDroid - the HPC MPI job that processes the events.
  • EventServerJobManager - the main component in Droid that manages AthenaMP, the TokenExtractor and the Yampl messaging thread. It injects events into AthenaMP for processing and retrieves the outputs.

Yoda code

Live development happens in GitLab; stable updates are pushed to GitHub, normally with little delay.
      git remote -v
      origin  ssh://git@gitlab.cern.ch:7999/wguan/PandaPilot.git (fetch)
      origin  ssh://git@gitlab.cern.ch:7999/wguan/PandaPilot.git (push)
      upstream        ssh://git@github.com/PanDAWMS/pilot.git (fetch)
      upstream        ssh://git@github.com/PanDAWMS/pilot.git (push)
    

How pilot runs

How Yoda runs (non-backfill mode is described here).
  • The queue's catchall field defines nodes, walltime, corespernode and time_per_event.
  • The pilot calculates the total CPU hours from these values.
  • The pilot estimates the total number of events as totalcpuhours/time_per_event. Normally the pilot cannot finish all events.
  • The pilot fetches jobs until the estimated total number of events is reached or no more jobs are available. If there are no more jobs, the pilot shrinks the number of requested ranks; for example, if the pilot is configured to run on 700 nodes but only gets one job, it will shrink to one or two nodes.
  • The pilot prepares all jobs (stage-in, setup, ...) and event ranges and sends them to Yoda.
  • When Yoda is running, the pilot updates all PanDA jobs from 'starting' to 'running'. At the same time it recalculates the core count as corecount = totalcpucores/total_panda_jobs. Because Yoda schedules events internally with its own policy, the pilot cannot easily determine the exact core count or running time of a single job; the pilot may set a PanDA job to running even though it never actually runs in Yoda. A rough sketch of this bookkeeping is shown below.
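The following is a minimal, illustrative sketch of the estimation and shrinking logic described above. The function and field names (plan_allocation, get_jobs, job['nEvents'], ...) are assumptions for illustration only, not the actual pilot code.

        # Sketch: estimate how many events fit into the allocation and shrink ranks if needed.
        def plan_allocation(nodes, walltime_hours, corespernode, time_per_event_hours, get_jobs):
            total_cpu_hours = nodes * walltime_hours * corespernode
            target_events = int(total_cpu_hours / time_per_event_hours)

            jobs, collected_events = [], 0
            while collected_events < target_events:
                job = get_jobs()                  # ask PanDA for another job (hypothetical helper)
                if job is None:                   # no more jobs available
                    break
                jobs.append(job)
                collected_events += job['nEvents']

            # Shrink the requested ranks to what the collected events actually need.
            events_per_node = corespernode * int(walltime_hours / time_per_event_hours)
            needed_nodes = max(1, -(-collected_events // events_per_node))   # ceiling division
            nodes = min(nodes, needed_nodes)

            # Core count reported per PanDA job once Yoda is running.
            corecount = (nodes * corespernode) // max(1, len(jobs))
            return nodes, corecount, jobs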

Yoda Accounting.

  • Accounting info in Droid for every rank: when one PanDA job is running in one Droid, the Droid records the information below for this PanDA job.
               setupTime: from init to the first 'ready for event' message
               runningTime: from the end of setup until AthenaMP finishes
               totalTime: from init until the Droid finishes
               stageoutTime: totalTime - setupTime - runningTime
               cores: CPU cores used by AthenaMP. At NERSC this is 24, except on rank 0: rank 0 runs both Yoda and Droid and reserves one CPU core for Yoda, so the Droid on rank 0 uses one core less than the other ranks.
               cpuConsumptionTime: getCPUTime from the AthenaMP job report; if that fails, os.times() is used. This is CPU time (user time plus system time, like 'time usercommand' in a shell), which is much smaller than walltime * cores.
               queuedEvents: total events injected into this Droid for the current PanDA job.
               processedEvents: total events finished processing.
               avgTimePerEvent: totalTime * cores / processedEvents, so it includes setup, running and stageout time.
        

  • Accounting in Yoda as collected jobMetrics: one PanDA job can be scheduled to more than one Droid (rank). Yoda does a collective accounting over all ranks and dumps the accounting info to a JSON file on the shared file system every minute.
               avgYodaSetupTime
               avgYodaRunningTime
               avgYodaTotalTime
               avgYodaStageoutTime
    
               maxYodaSetupTime
               maxYodaRunningTime
               maxYodaStageoutTime
               maxYodaTotalTime
    
               minYodaSetupTime
               minYodaRunningTime
               minYodaStageoutTime
               minYodaTotalTime
    
               cores: total cores of all ranks
               cpuConsumptionTime:  total cpuConsumptionTime
               totalQueuedEvents
               totalProcessedEvents
               avgTimePerEvent
        

           startTime: more than one rank can run a given PanDA job. When the PanDA job is pulled by a rank for the first time, Yoda marks that moment as the job's startTime.
           endTime: when the PanDA job is reported as finished or failed by the last rank, Yoda marks that moment as the endTime.
                          If the job is killed by the cluster system (normally a SIGTERM is sent first and, if the job is still running after about 30 seconds, a SIGKILL follows),
                          Yoda catches the SIGTERM and sets the endTime of all unfinished jobs to the time of the kill signal.
                          On the kill signal, Yoda also forces a dump of the accounting info to the shared file system.
    

  • Accounting in pilot: the pilot scans the shared file system for Yoda's accounting info and reports it to PanDA (a minimal sketch of this mapping follows the field list below).

           jobMetrics: all Yoda accounting info goes here.
           startTime: the startTime from the Yoda accounting. Only when the pilot finds a startTime for a PanDA job is that job reported as running;
                            even if Yoda itself is 'running', a PanDA job without a startTime in Yoda's report is still reported as 'starting'.
           endTime: the endTime from the Yoda accounting; the PanDA job is then reported as 'transferring' (other steps in the pilot will update it to other states).
           timeSetup: avgYodaSetupTime
           timeExe: avgYodaRunningTime
           timeStageOut: avgYodaStageoutTime
           nEvents: totalQueuedEvents
           nEventsW: totalProcessedEvents
           cpuConsumption: cpuConsumptionTime from Yoda's report
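Below is a minimal sketch of how this pilot-side mapping could look. The JSON file naming, its layout and the helper name are assumptions for illustration only; the actual accounting dump produced by Yoda may differ.

        import json, glob

        def collect_yoda_accounting(share_dir):
            """Scan the shared file system for Yoda accounting dumps (hypothetical layout)."""
            reports = {}
            for path in glob.glob(share_dir + '/*_accounting.json'):   # assumed naming scheme
                with open(path) as f:
                    acc = json.load(f)
                for panda_job_id, metrics in acc.items():
                    reports[panda_job_id] = {
                        'jobMetrics': metrics,
                        'startTime': metrics.get('startTime'),
                        'endTime': metrics.get('endTime'),
                        'timeSetup': metrics.get('avgYodaSetupTime'),
                        'timeExe': metrics.get('avgYodaRunningTime'),
                        'timeStageOut': metrics.get('avgYodaStageoutTime'),
                        'nEvents': metrics.get('totalQueuedEvents'),
                        'nEventsW': metrics.get('totalProcessedEvents'),
                        'cpuConsumption': metrics.get('cpuConsumptionTime'),
                    }
            return reports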
    

How Yoda runs.

  • Because the AthenaMP setup time is a large fraction of an Event Service job, Yoda tries to run with as few setups as possible. It estimates how many ranks are needed to finish one job (adding one or two extra nodes for big jobs with many events), so that the setup for one PanDA job is not repeated too many times:
                # events one AthenaMP worker can process in the walltime remaining after setup
                self.__eventsPerWorker = (int(walltime) - int(initialtime_m))/time_per_event_m
                # events one node (rank) can process with ATHENA_PROC_NUMBER workers
                eventsPerNode = int(self.__ATHENA_PROC_NUMBER) * (int(self.__eventsPerWorker))
                if jobId in eventRanges:
                    # ceiling division: ranks needed to process all event ranges of this job
                    job['neededRanks'] = len(eventRanges[jobId]) / eventsPerNode + (len(eventRanges[jobId]) % eventsPerNode + eventsPerNode - 1)/eventsPerNode
                    totalNeededRanks += job['neededRanks']
                self.__jobs[jobId] = job
        
  • Yoda also sorts jobs by the number of events they have: the fewer events a job has, the higher its priority to run.
  • Droid gets one job from Yoda and sends the job payload to EventServerJobManager to run.
  • EventServerJobManager sets up the Yampl messaging thread, AthenaMP and the TokenExtractor (if needed for older AthenaMP releases).
  • When one rank finishes a job, it gets another PanDA job from Yoda to run. A schematic sketch of this scheduling loop is shown below.
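The following is a schematic sketch of the scheduling behaviour just described (fewest-events-first ordering, idle ranks pulling the next job). The class and method names are illustrative assumptions, not the actual Yoda classes.

        # Sketch: fewest-events-first job queue from which idle Droid ranks pull work.
        class JobScheduler(object):
            def __init__(self, jobs, event_ranges):
                # jobs with fewer event ranges get higher priority (run first)
                self.pending = sorted(jobs, key=lambda jid: len(event_ranges.get(jid, [])))

            def get_job_for_rank(self, rank):
                """Called when a Droid rank is idle and asks Yoda for work."""
                if self.pending:
                    return self.pending.pop(0)
                return None   # no more jobs: the rank can shut down

        # usage: 'job2' (one event range) is handed out before 'job1' (two event ranges)
        scheduler = JobScheduler(jobs=['job1', 'job2'], event_ranges={'job1': [1, 2], 'job2': [1]})
        next_job = scheduler.get_job_for_rank(rank=1)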

Example script to run pilot on Edison.

You need to have an ATLAS proxy.
            #!/bin/bash
            cd /scratch2/scratchdirs/wguan/Edison/hpcf/edison01/wguan-pilot-dev-HPC_merge/1458756089.95
            export OSG_GRID=/global/project/projectdirs/atlas/pilot/grid_env
            export VO_ATLAS_SW_DIR=/project/projectdirs/atlas
            source $OSG_GRID/setup.sh
            export COPYTOOL=gfal-copy
            export COPYTOOLIN=gfal-copy
            export RUCIO_ACCOUNT=wguan
            rm -f pilot.tar.gz
            wget http://wguan-wisc.web.cern.ch/wguan-wisc/wguan-pilot-dev-HPC_merge.tar.gz -O pilot.tar.gz
            tar xzf pilot.tar.gz
            source /etc/profile
            source /global/homes/w/wguan/.bashrc
            source /global/homes/w/wguan/.bash_profile
            python pilot.py -s NERSC_Edison -h NERSC_Edison  -w https://pandaserver.cern.ch -p 25443 -d /scratch2/scratchdirs/wguan/Edison/hpcf/edison01/wguan-pilot-dev-HPC_merge/1458756089.95 -N 50 -Q premium 
    

You need to replace this part with your own settings (the default .bashrc and .bash_profile are fine). An interactive run doesn't need this part, but for a cron job you need it to load the batch modules; otherwise the SLURM commands will not be found.

            source /etc/profile
            source /global/homes/w/wguan/.bashrc
            source /global/homes/w/wguan/.bash_profile
    

RunJobHpcEvent

It's the part of the pilot that starts HPC ES. After the pilot gets a job from PanDA, it sets up the environment and prepares the job files and commands. It then uses HPCManager to getHPCResource (free cores for backfill mode, the default resource defined in schedconfig for normal mode), submit the HPC job and poll it. HPCManager is the interface between the pilot and the HPC system; it is currently implemented for PBS/Torque clusters and can be extended. A rough sketch of such a plugin interface is shown below.

RunJobHpcEvent.png
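This sketch only illustrates the shape of such a backend interface; the base-class layout and the PBS submission snippet are assumptions, not the actual HPCManager code.

            # Sketch: minimal HPC backend interface as described above.
            class HPCManagerBase(object):
                def getHPCResource(self):
                    """Return available resources: free cores for backfill mode,
                    or the defaults from schedconfig for normal mode."""
                    raise NotImplementedError

                def submitJob(self, script_path):
                    """Submit the Yoda-Droid MPI job and return a batch job id."""
                    raise NotImplementedError

                def pollJob(self, job_id):
                    """Return the batch state of the submitted job (queued/running/done)."""
                    raise NotImplementedError

            class PBSManager(HPCManagerBase):
                def submitJob(self, script_path):
                    import subprocess
                    # 'qsub' is the standard PBS/Torque submission command
                    return subprocess.check_output(['qsub', script_path]).strip()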

Yoda-Droid

Yoda-Droid is the HPC MPI job.

  • Yoda is the part running on MPI rank 0. It manages the job and event tables centrally and uses the MPI interface to distribute jobs and events to the Droids. Outputs received from the Droids through the MPI interface are updated in the event table and dumped to the pilot periodically.
  • Droid is the part running on MPI ranks greater than 0. It gets a job from Yoda and starts EventServerJobManager to run it. When EventServerJobManager is ready (AthenaMP is set up), Droid gets event ranges from Yoda and injects them into ESJobManager. Droid then polls ESJobManager, waits for the outputs and sends them back to Yoda. A minimal sketch of this message flow is shown below.

HPC-Yoda.png
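The following is a minimal mpi4py-style sketch of the Yoda/Droid message flow. The message tags, the placeholder job table and the single request/reply round trip are illustrative assumptions, not the actual Yoda-Droid protocol.

        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        REQUEST_JOB, SEND_JOB, SEND_OUTPUT = 1, 2, 3       # illustrative message tags

        if comm.Get_rank() == 0:
            # Yoda (rank 0): hand out jobs and collect outputs (termination handling omitted).
            jobs = [{'jobId': 1}, {'jobId': 2}]             # placeholder job table
            status = MPI.Status()
            while True:
                msg = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
                if status.Get_tag() == REQUEST_JOB:
                    job = jobs.pop(0) if jobs else None
                    comm.send(job, dest=status.Get_source(), tag=SEND_JOB)
                elif status.Get_tag() == SEND_OUTPUT:
                    print('output from rank %d: %s' % (status.Get_source(), msg))
        else:
            # Droid (rank > 0): pull a job, run it through ESJobManager, return outputs.
            comm.send(None, dest=0, tag=REQUEST_JOB)
            job = comm.recv(source=0, tag=SEND_JOB)
            if job is not None:
                output = 'outputs of job %s' % job['jobId'] # stand-in for real ESJobManager outputs
                comm.send(output, dest=0, tag=SEND_OUTPUT)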

EventServerJobManager

The main component in Droid that manages AthenaMP, the TokenExtractor and the Yampl messaging thread. AthenaMP and the TokenExtractor are components of Athena. Clients talk to AthenaMP through a Yampl messaging channel, so ESJobManager is the part that handles the messages on that channel. A schematic sketch of the message loop is shown below.

EventServerJobManager.png
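This sketch only illustrates the kind of message loop ESJobManager runs on the Yampl channel. The 'channel' object, its send/receive methods and the message strings are placeholders; the real Yampl bindings and the exact AthenaMP message format may differ.

            # Sketch: ESJobManager-style message loop on the Yampl channel.
            # 'channel' stands in for the real Yampl socket; the message strings
            # below are placeholders, not the exact AthenaMP protocol.
            def message_loop(channel, event_ranges, outputs):
                while True:
                    msg = channel.receive()
                    if msg.startswith('ready for event'):
                        if event_ranges:
                            channel.send(event_ranges.pop(0))   # inject the next event range
                        else:
                            channel.send('No more events')      # tell AthenaMP to wind down
                            break
                    elif msg.startswith('ERR'):
                        outputs.append(('error', msg))          # record an error report
                    else:
                        outputs.append(msg)                     # assume an output file is reported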

How to Run Yoda ES pilot

If you are interested in running Yoda ES jobs, you can follow these steps:

-- WenGuan - 2015-02-24

Topic attachments
  • EventServerJobManager.png - 16.3 K - 2015-02-24
  • HPC-Yoda.png - 22.8 K - 2015-02-24
  • RunJobHpcEvent.png - 19.3 K - 2015-02-24
  • yoda-structure.png - 90.4 K - 2015-02-24