EventService introduction
- An Event Service (ES) job is similar to a normal panda job; the difference is in the panda job payload. When the pilot gets a panda job whose payload includes "('eventService', 'True')", it starts an ES process to handle it.
- For an ES job, the input files can be staged in to the local working directory or read directly (the same as for a normal panda job).
- The event objects and logs are sent to an object store while the ES job is running.
- When the pilot finishes all work, it transfers its logs to a grid storage element (as for a normal panda job, these log files can be accessed from the panda monitor).
- EventService=1: Normal Event Service jobs. Currently these are only simulation jobs. They can run on sites with jobseed=all or jobseed=es. Each Event Service job is created to process an event chunk with fixed inputs; it cannot process events outside that chunk.
- EventService=2: Event Service merge jobs. They merge the outputs of eventservice=1, eventservice=4 or eventservice=5 jobs. They normally run on sites with jobseed=all or jobseed=std, and they do not run on opportunistic resources.
- EventService=3: Clone jobs. These are not Event Service jobs. For a normal job (generation, simulation, reconstruction or any other type), panda can create multiple clones and schedule them to different sites. When a site starts to run one copy, panda closes all the other clones, so there is only one running instance per job; this shortens the time a job waits for a CPU. Because panda uses the Event Service machinery to manage the lock between clones, they are assigned an eventservice number, but they are really not Event Service jobs.
- EventService=4: Jumbo jobs. They normally run on HPC. A jumbo job is designed to be able to process all events of a task.
- EventService=5: Cojumbo jobs. When a task has jumbo enabled, it can generate jumbo jobs and schedule them to HPC; at the same time, panda also creates cojumbo jobs and schedules them to grid sites. A cojumbo job is almost the same as an eventservice=1 job (both can only process a fixed event chunk). The only difference is that for eventservice=5 the task is shared with jumbo jobs on HPC: for eventservice=1 all consumers are eventservice=1 jobs, while for eventservice=5 there are consumers with eventservice=5 and eventservice=4.
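The eventService values above can be summarized as a small lookup table. This is only an illustrative sketch; the table and function names are hypothetical, not part of the PanDA API.

```python
# Illustrative summary of the eventService attribute values described above.
# Names here are hypothetical, not a real PanDA data structure.
EVENT_SERVICE_TYPES = {
    1: "ES job (simulation; processes a fixed event chunk)",
    2: "ES merge job (merges outputs of types 1, 4 and 5)",
    3: "Clone job (not a real ES job; only reuses ES locking)",
    4: "Jumbo job (can process all events of a task, usually on HPC)",
    5: "Cojumbo job (like type 1, but the task also runs jumbo jobs)",
}

def describe_event_service(value):
    """Return a short description for an eventService attribute value."""
    return EVENT_SERVICE_TYPES.get(value, "not an Event Service job")
```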
EventService workflow
- When creating an ES task, panda creates ES jobs with nEventsPerJob events each. An ES job has a special parameter nEsConsumers, which defines the number of pilots that can simultaneously consume events from the same job. For example, if nEsConsumers=10, up to 10 pilots process events of the same panda job.
- The pilot gets a job (getJob) from panda.
- The pilot gets event ranges (getEventRanges) for this job from panda (by default n events at a time, where n equals corecount). At the same time, other pilots may be fetching different event ranges of the same job.
- The pilot starts AthenaMP to handle these events.
- In parallel, the pilot sends finished events to an object store.
- The pilot gets more event ranges to run until all events in the job are finished.
- When all events in one panda job are finished, panda creates an ES merge job to merge all events from the object store. The merge job can run at the same site or at a different one. For example, if the ES jobs run on a preemptable cluster queue and we do not want the merge job to be preempted, we can run the merge job on a different queue.
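The pilot-side part of the workflow above can be sketched as a simple loop. All function and attribute names here are hypothetical stand-ins for the real pilot/PanDA server interactions, not the actual pilot code.

```python
# Minimal sketch of the pilot-side ES loop described above.
# `server`, `object_store` and `process_event` stand in for the real
# PanDA server client, event object store and AthenaMP invocation.
def run_event_service_job(server, object_store, process_event, corecount):
    job = server.get_job()                      # pilot asks panda for a job
    if job.get("eventService") != "True":
        return                                  # not an ES job; handle normally
    while True:
        # fetch up to `corecount` event ranges at a time (the default)
        ranges = server.get_event_ranges(job["jobId"], n=corecount)
        if not ranges:
            break                               # all events in the job are done
        for event_range in ranges:
            result = process_event(event_range)            # AthenaMP does the work
            object_store.upload(result)                    # send finished event object
            server.update_event_range(event_range["id"], status="finished")
```

Other pilots can run the same loop against the same panda job at the same time; the server hands each of them disjoint event ranges.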
Define Objectstore in AGIS
Event Service on HPC
Event Service on Amazon
This is a powerful queue that uses Amazon computing resources. Its storage system is the Amazon S3 object store; all files are stored in S3, not only event objects but also input ROOT files. First, DDM/Rucio transfers the input files to Amazon S3 (currently through a Bestman-S3 service). Then the pilot stages in the input files directly through the S3 interface to run ES jobs.
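Staging in through the S3 interface means locating each input file by its object key. As an illustration, Rucio's common deterministic path convention derives the key from an md5 of "scope:name"; the sketch below shows that convention only as an assumption, not the exact layout used on the Amazon queue.

```python
import hashlib

def deterministic_path(scope, name):
    """Rucio-style deterministic object path.

    Assumption for illustration: the common convention
    md5("scope:name") -> scope/xx/yy/name, where xx and yy are the
    first two md5 hex-digit pairs. A given object store may differ.
    """
    digest = hashlib.md5(f"{scope}:{name}".encode()).hexdigest()
    return f"{scope}/{digest[0:2]}/{digest[2:4]}/{name}"
```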
Event Service on Grid
How to set up an ES panda queue
Defining a panda queue for the Event Service is similar to defining an MCORE queue. Here we will only specify differences.
- (Required)jobseed:
- jobseed=es is used by panda to schedule only ES jobs to this queue.
- jobseed=std is used by panda to schedule only non-ES jobs to this queue.
- jobseed=all is used by panda to schedule both ES and non-ES jobs to this queue.
- jobseed=eshigh is used by panda to schedule only high-priority ES jobs to this queue.
- (Required)corecount:
- It is used by the pilot to set ATHENA_PROC_NUMBER, which defines the number of worker processes AthenaMP starts. It can be 1.
- (Optional)not_es_to_zip:
- By default the pilot tars/zips and uploads events at every zip_time_gap interval. If you don't want the tar/zip function, set not_es_to_zip in catchall; otherwise you don't need to set it.
- (Required)pledgedcpu: -1:
- pledgedcpu=-1 marks the queue as opportunistic (this affects panda scheduling; for example, panda will not schedule ES merge jobs to it).
- (Optional)zip_time_gap:
- This parameter sets the time gap between two tar/zip runs. Its unit is seconds, so to set it to 20 minutes, use zip_time_gap=1200.
- By default it is 10 minutes if pledgedcpu is -1, and 2 hours otherwise.
- (Optional)maxtime:
- Maximum time of the job. Ten minutes before maxtime, the pilot automatically sends a 'SOFTKILL' to tell AthenaMP to finish the job, upload the produced files, etc.
- (Required)ES output storage:
- The pilot sends finished event objects to it. It can be an object store or a normal grid storage, and it is defined under 'Associated DDM Storages' with activity 'es_events'.
- Required: Under 'Associated DDM Storages', attach a storage for activity 'es_events'.
- Optional: Under 'Associated DDM Storages', you can also attach another storage for activity 'es_failover', which will be used if the first storage fails; it is not required.
- Optional: You do not need to attach storages to the 'ES READ' activity. 'ES READ' is only used when you want a different 'Associated Pilot Copytools' than 'rucio' to stage in from the object store for ES merge jobs, so normally you don't need to do anything with it.
- Available Objectstore storage: CERN-PROD_ES, MWT2_ES, OSIRIS_ES, RAL-LCG2_ES, UKI-NORTHGRID-LANCS-HEP_ES
- (Required)ES logs:
- ES logs will be written both to 'Pilot Log Special(ES write logs)' and to 'Pilot Logs' ('Pilot Logs' is the activity for normal jobs' logs), so a duplicate copy is kept.
- Required: Under 'Associated DDM Storages', attach an object storage for activity 'Pilot Log Special(ES write logs)', such as CERN-PROD_LOGS(objectstore) or MWT2_LOGS(objectstore).
- (Required)Releases: to pass the release validation, make sure you have:
- releases = ANY
- validatedreleases = True
- ignore swreleases = True
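As a rough summary of the parameters above, an ES queue configuration can be sketched as a plain dict. Field names follow the text and the values are illustrative; this is not an actual AGIS schema.

```python
# Illustrative ES queue settings based on the parameters described above.
ES_QUEUE = {
    "jobseed": "es",          # only ES jobs scheduled to this queue
    "corecount": 8,           # sets ATHENA_PROC_NUMBER for AthenaMP
    "pledgedcpu": -1,         # opportunistic: no ES merge jobs scheduled here
    "zip_time_gap": 1200,     # seconds between tar/zip uploads (20 minutes)
    "maxtime": 86400,         # pilot sends SOFTKILL 10 minutes before this
    "releases": "ANY",
    "validatedreleases": True,
}

def default_zip_time_gap(pledgedcpu):
    """Default tar/zip interval in seconds: 10 min if opportunistic, else 2 h."""
    return 600 if pledgedcpu == -1 else 7200
```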
To Test The Site
Event Service Support
If you have any questions, please send an email to
atlas-comp-event-service@cern.ch
--
WenGuan - 2019-03
--
WenGuan - 2015-07-08
Topic revision: r4 - 2019-07-29 - WenGuan