An Event Service (ES) job is similar to a normal panda job. The difference is in the panda job payload. When the pilot gets a panda job which includes "('eventService', 'True')" in the payload, it starts an ES process to handle it.
For an ES job, the input files can be staged in to the local working directory or read directly (the same as for a normal panda job).
The event objects and logs are sent to an object store while the ES job is running.
When the pilot finishes all work, it transfers its logs to a grid storage element (as with a normal panda job, these log files can be accessed from the panda monitor).
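As a rough illustration of the payload check described above, the sketch below assumes the job payload reaches the pilot as a URL-encoded key=value string; this is an assumption, not something this page states. The function name is_event_service_job and the other payload fields in the example are hypothetical.

```python
from urllib.parse import parse_qsl


def is_event_service_job(payload: str) -> bool:
    """Return True if the decoded payload contains ('eventService', 'True').

    Assumption: the payload is a URL-encoded string of key=value pairs.
    """
    pairs = parse_qsl(payload, keep_blank_values=True)
    return ("eventService", "True") in pairs


# Hypothetical payload; only the eventService field comes from the text above.
example = "PandaID=1234&eventService=True&transformation=Sim_tf.py"
print(is_event_service_job(example))  # prints True
```

A payload without the eventService field, or with a value other than True, is treated as a normal panda job in this sketch.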
DetailEventServiceIntroduction gives a detailed introduction. Please read it first if you are not familiar with Event Service.
Pure ES task: For this task, Panda will only generate ES jobs with EventService=1.
Panda generates ES jobs with EventService=1
Pilot gets ES jobs to run
When a bunch of events has finished, for example 1000 events, panda generates ESMerge jobs with EventService=2.
Pilot gets ESMerge jobs to run. When an ESMerge job finishes, that bunch of events is finished.
JumboEnabled ES task: For this task, Panda will generate jobs with both EventService=4 and EventService=5.
Normally JumboEnabled ES tasks have a lot of events. At first, Panda generates ESJumbo jobs with EventService=4. These jobs are scheduled to panda queues with useJumboJobs defined in the AGIS catchall. For HPC, Harvester is installed to run these jumbo jobs.
At the same time, Panda generates a lot of cojumbo ES jobs and schedules them to the Grid. These cojumbo jobs may share the same input events with Jumbo jobs. If the events are already prefetched by HPC, the cojumbo jobs are set to waiting status. A cojumbo job is similar to an EventService=1 job, except that its input events are shared with Jumbo jobs. So from the pilot's point of view, they are the same.
When a Jumbo job finishes, Harvester updates the status of its events. Any remaining events can be handled by cojumbo jobs.
When a task has few events left, it is not efficient to use HPC to process them, so Panda automatically disables Jumbo. After that, the task only generates cojumbo jobs to process the remaining events.
Panda Queues:
useJumboJobs in catchall(AGIS): Panda will schedule jumbo jobs with EventService=4 to it.
jobseed=es or jobseed=all: Panda will schedule normal es jobs with EventService=1 and cojumbo jobs with EventService=5 to it.
jobseed=eshigh: At the end of a task, panda increases its priority to speed up the processing of the tail. These high priority jobs are scheduled to eshigh panda queues. We added jobseed=eshigh to many MCore queues. Normally these panda queues only run simu, reco and other non-ES jobs. But when there are high priority tails of ES tasks, panda schedules them to these panda queues.
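The queue rules above can be summarised in a small sketch. It is only an illustration, not Panda's brokerage code: the enum, the function can_schedule and the high_priority_tail flag are hypothetical names, the catchall is assumed to be a free-text string that is checked by substring, and tail jobs are assumed to be ES or cojumbo jobs. ESMerge placement is not covered here.

```python
from enum import IntEnum


class EventServiceType(IntEnum):
    """EventService values used on this page."""
    ES = 1        # normal ES job
    ES_MERGE = 2  # ESMerge job
    JUMBO = 4     # jumbo job
    COJUMBO = 5   # cojumbo job


def can_schedule(job_type, catchall, jobseed, high_priority_tail=False):
    """Return True if a job of job_type may go to a queue with these settings.

    Assumptions: catchall is the queue's AGIS catchall text, jobseed is the
    queue's jobseed value, and high_priority_tail marks jobs from the
    high priority tail of an ES task.
    """
    if job_type == EventServiceType.JUMBO:
        return "useJumboJobs" in catchall
    if job_type in (EventServiceType.ES, EventServiceType.COJUMBO):
        if jobseed in ("es", "all"):
            return True
        # eshigh queues only take ES work from high priority tails.
        return jobseed == "eshigh" and high_priority_tail
    return False  # ESMerge placement is outside this sketch


print(can_schedule(EventServiceType.JUMBO, "useJumboJobs", "std"))   # prints True
print(can_schedule(EventServiceType.COJUMBO, "", "eshigh"))           # prints False
print(can_schedule(EventServiceType.COJUMBO, "", "eshigh", True))     # prints True
```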
ESMerge:
Jobs with EventService=1, EventService=4 and EventService=5 generate a lot of event level files, or a tar file with many event level files. These files are stored in an objectstore for Grid and on some datadisk for HPC. We need to merge them into ROOT files.
Currently ESMerge jobs are generated by panda from a bunch of contiguous events. For example, suppose events 1 to 999 and 1001 to 3000 are finished and we defined 1000 events per job. Two ESMerge jobs will be generated, with events "1001 to 2000" and "2001 to 3000". For events "1 to 999", panda will wait until event 1000 is finished.
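The example above can be reproduced with a short sketch. It assumes that merge chunks are aligned to fixed boundaries starting at event 1 (which is consistent with panda waiting for event 1000 in the example); the function name ready_merge_chunks is hypothetical, and a shorter final chunk at the end of a task is not handled.

```python
def ready_merge_chunks(finished_ranges, events_per_job):
    """Return the (first, last) chunks whose events are all finished.

    finished_ranges: list of inclusive (first, last) event ranges.
    Assumption: chunks are [1..N], [N+1..2N], ... with N = events_per_job.
    """
    finished = set()
    for first, last in finished_ranges:
        finished.update(range(first, last + 1))
    if not finished:
        return []

    chunks = []
    start = 1
    highest = max(finished)
    while start <= highest:
        end = start + events_per_job - 1
        # A chunk is ready only when every event in it is finished.
        if all(event in finished for event in range(start, end + 1)):
            chunks.append((start, end))
        start = end + 1
    return chunks


print(ready_merge_chunks([(1, 999), (1001, 3000)], 1000))
# prints [(1001, 2000), (2001, 3000)]
```

Once event 1000 is also reported as finished, the same call returns the chunk (1, 1000) as well.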
Currently ESMerge failure is one issue for ES, especially when there are some ES premerge files on datadisk (objectstore is more scalable and gives fewer errors for remote reading). So for an ESMerge job, panda checks the storage of the input premerge files. If there are datadisks, panda schedules the ESMerge job to panda queues associated with one of those datadisks. If there are no datadisks, panda schedules the ESMerge job close to the input objectstore.
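The placement rule for ESMerge jobs can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual brokerage: the function choose_esmerge_queues, the (name, kind) storage tuples and the queues_by_storage mapping are hypothetical, and "close to the objectstore" is modelled as a precomputed list of nearby queues.

```python
def choose_esmerge_queues(input_storages, queues_by_storage):
    """Return candidate panda queues for an ESMerge job.

    input_storages: list of (storage_name, kind) for the premerge files,
        where kind is "datadisk" or "objectstore".
    queues_by_storage: hypothetical mapping from a storage name to the
        queues associated with it (or close to it).
    """
    datadisks = [name for name, kind in input_storages if kind == "datadisk"]
    if datadisks:
        # Premerge files on datadisk take precedence over objectstore inputs.
        candidates = datadisks
    else:
        candidates = [name for name, kind in input_storages if kind == "objectstore"]

    queues = []
    for storage in candidates:
        for queue in queues_by_storage.get(storage, []):
            if queue not in queues:
                queues.append(queue)
    return queues


# All storage and queue names below are made up for the example.
mapping = {"SITE_A_DATADISK": ["SITE_A_MCORE"], "SITE_B_OS": ["SITE_B", "SITE_C"]}
inputs = [("SITE_B_OS", "objectstore"), ("SITE_A_DATADISK", "datadisk")]
print(choose_esmerge_queues(inputs, mapping))  # prints ['SITE_A_MCORE']
```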
In this view, you will see the number of running jobs (check whether there are enough running jobs) and the number of failed ES jobs (if there are too many failed jobs, click the failed jobs to see what the error is).
Long running tasks: tasks that were started a long time ago. Check whether there is a tail problem (a few jobs left for a long time).
Many failures: In this view, you can find a column 'Ninputfiles | finished | failed'. If the 'failed' value for a task is too big, click the task to see what happened.
In this view, you will be able to see how many events are left (waiting for processing). If there are not enough events left, you need to let the manager know.
It also shows the allocated slots.
In the task age plot, you will be able to find the tasks which have been running for a long time (check whether they have a tail problem).
For HPC, different sites have different errors. It is best to contact the HPC manager.
ES on Grid.
yampl cannot be found: If you find "RunJobEvent caught exception: No module named yampl" in the pilot log, it means there is a configuration problem at the site. Yampl is required for AthenaMP to communicate with the pilot.
This problem used to be one of the main causes of long-lasting ES failures. It is less important now (most cases are already fixed), but at some sites part of the worker nodes sometimes still have this problem.
No objectstore attached: If a panda queue has too many failures, it is good to check in AGIS whether there is an objectstore attached to it.
objectstore problem: If there are a lot of errors from different sites on the ddm dashboard with activity "Event Service Download" and "EventService upload", it is good to check whether the objectstore works.
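For the yampl case above, a quick way to see which worker nodes are affected is to search the pilot logs for the error string. The sketch below is only an illustration: the function names and the logs_by_node mapping (worker node name to pilot log text) are hypothetical, and how the logs are collected is left out.

```python
# Error string quoted from the pilot log, as given above.
YAMPL_ERROR = "RunJobEvent caught exception: No module named yampl"


def has_yampl_error(log_text: str) -> bool:
    """Return True if the pilot log text contains the yampl import error."""
    return YAMPL_ERROR in log_text


def nodes_with_yampl_error(logs_by_node):
    """Return the sorted worker node names whose pilot log shows the error.

    logs_by_node: hypothetical mapping from worker node name to log text.
    """
    return sorted(node for node, text in logs_by_node.items() if has_yampl_error(text))


# Made-up node names and log lines for the example.
logs = {
    "wn001": "pilot start\nRunJobEvent caught exception: No module named yampl\n",
    "wn002": "pilot start\nfinished OK\n",
}
print(nodes_with_yampl_error(logs))  # prints ['wn001']
```

Since only part of the worker nodes at a site may be affected, grouping by node helps when reporting the problem to the site.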