To Test The Site

To test a site, please send an email to atlas-comp-event-service@cernSPAMNOTNOSPAMPLEASE.ch asking for a test ES task to be created for your site. Currently, test tasks are submitted from aipanda007, and production tasks are submitted through prodsys2.

  • Creating a test task on aipanda007 requires special permission:
            ssh atlpan@aipanda007
        
or
        ssh aipanda007
        sudo su - atlpan    
    
Then
        cd /home/atlpan/esTest
        cat README
    
Currently we test sites with the task 'python submitTaskWithES_21_*.py Arizona_ES US'. Check 'prodSourceLabel' before you send the test task: jobs with 'prodSourceLabel=test' can be picked up by the APF pilot, while 'prodSourceLabel=ptest' is only for pilot development and requires special pilots to be sent. So normally please use 'prodSourceLabel=test'.
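
The prodSourceLabel check described above can be sketched in Python. This is only an illustration: the layout of the submission scripts in /home/atlpan/esTest is not shown on this page, so the regular expression assumes the label is set with an '=' or ':' assignment, and the names find_prod_source_labels, is_site_test_label and taskParamMap are hypothetical.

```python
import re

# Assumed pattern: prodSourceLabel followed by '=' or ':' and an optionally
# quoted value, e.g. taskParamMap['prodSourceLabel'] = 'test'.
# Adjust it to the layout of the real submission script.
LABEL_PATTERN = re.compile(r"""prodSourceLabel['"\]]*\s*[=:]\s*['"]?(\w+)""")


def find_prod_source_labels(script_text):
    """Return every prodSourceLabel value assigned in the given script text."""
    return LABEL_PATTERN.findall(script_text)


def is_site_test_label(label):
    """Only 'test' is picked up by the APF pilot; 'ptest' needs special pilots."""
    return label == "test"


if __name__ == "__main__":
    # Hypothetical script fragment used only to exercise the helpers.
    example = "taskParamMap['prodSourceLabel'] = 'ptest'\n"
    for label in find_prod_source_labels(example):
        if not is_site_test_label(label):
            print("prodSourceLabel=%s: use 'test' for normal site tests" % label)
```

Running the helper over the text of the submission script before sending the task gives a quick warning if the label was left at 'ptest'.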

To Monitor EventService Jobs

On the top menu of the BigPanda monitor, select 'Tasks' --> 'Event Service' to find all recent Event Service tasks.

To Debug The Site

  • in pilotlog.txt, make sure the yampl server is alive, because a dead yampl server is the most frequent cause of errors. yampl is specific to ES and is not used in normal jobs. Below is an example.
            24 Oct 07:34:46|PilotYamplSe| Created a Yampl server socket
            24 Oct 07:34:46|RunJobEvent.| The message server is alive
        
  • in pilotlog.txt, check that event ranges are being retrieved. If you cannot find lines like the following, there is a problem starting EventService.
            24 Oct 07:37:06|RunJobEvent.| Received new message: Ready for events
            24 Oct 07:37:06|RunJobEvent.| AthenaMP is ready for events
            24 Oct 07:37:07|RunJobEvent.| Waiting for a new message
            24 Oct 07:37:11|EventRanges.| Downloading new event ranges for jobId=2649098377 and jobsetID=2647803559
            24 Oct 07:37:11|pUtil.py    | Will not send attemptNr for cmd=getEventRanges
            24 Oct 07:37:11|pUtil.py    | toServer: cmd = getEventRanges
            24 Oct 07:37:11|pUtil.py    | toServer: len(data) = 2
            24 Oct 07:37:11|pUtil.py    | data = {'pandaID': '2649098377', 'jobsetID': '2647803559'}
            24 Oct 07:37:11|pUtil.py    | Executing command: curl --silent --show-error --connect-timeout 100 --max-time 120 --compressed --capath /etc/grid-security/certificates --cert /var/lib/condor/execute/dir_3552598/prodProxy --cacert /var/lib/condor/execute/dir_3552598/prodProxy --key /var/lib/condor/execute/dir_3552598/prodProxy --config /var/lib/condor/execute/dir_3552598/Panda_Pilot_3563815_1445672080/PandaJob_2649098377_1445672084/curl.config https://pandaserver.cern.ch:25443/server/panda/getEventRanges
            24 Oct 07:37:12|pUtil.py    | Elapsed seconds: 1
            24 Oct 07:37:12|pUtil.py    | Dispatcher response: [('eventRanges', '[{"eventRangeID": "6760980-2649098377-4428289941-124-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 124, "startEvent": 124, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}, {"eventRangeID": "6760980-2649098377-4428289941-125-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 125, "startEvent": 125, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}, {"eventRangeID": "6760980-2649098377-4428289941-126-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 126, "startEvent": 126, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}, {"eventRangeID": "6760980-2649098377-4428289941-127-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 127, "startEvent": 127, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}, {"eventRangeID": "6760980-2649098377-4428289941-128-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 128, "startEvent": 128, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}, {"eventRangeID": "6760980-2649098377-4428289941-129-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 129, "startEvent": 129, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}, {"eventRangeID": "6760980-2649098377-4428289941-130-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 130, "startEvent": 130, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}, {"eventRangeID": "6760980-2649098377-4428289941-131-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 131, "startEvent": 131, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}, {"eventRangeID": "6760980-2649098377-4428289941-132-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 132, "startEvent": 132, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}, {"eventRangeID": "6760980-2649098377-4428289941-133-9", "LFN": "EVNT.01416937._000004.pool.root.1", "lastEvent": 133, "startEvent": 133, "scope": "valid1", "GUID": "79C2145F-76A7-164E-9CF2-44F67272C8B4"}]'), ('StatusCode', '0')]
        
  • in pilotlog.txt, check whether there are outputs. If there is no output, check whether there are ERR_* messages. 'ERR_*' is a return value from AthenaMP: it is returned when event ranges that cannot be found by tokenExtractor are injected into AthenaMP, or when bad events cause AthenaMP to crash.
            24 Oct 07:40:14|RunJobEvent.| Received new message: /var/lib/condor/execute/dir_3552598/Panda_Pilot_3563815_1445672080/PandaJob_2649098377_1445672084/athenaMP-workers-EVNTtoHITS-sim/worker_0/panda.jeditest.HITS.c4c67d2b-05e5-4112-af88-f93fd7b41c55.000037.HITS.pool.root.1.6760980-2649098377-4428289941-124-9,6760980-2649098377-4428289941-124-9,CPU:177,WALL:177
        

  • check whether there are any tokenExtractor errors. Below are the log file names.
            tokenextractor_stdout.txt 
            tokenextractor_stderr.txt
        
  • check Athena logs
            log.EVNTtoHITS
            athena_stdout.txt
        
  • check AthenaMP logs
            /athenaMP-workers-EVNTtoHITS-sim/worker_0/AthenaMP.log
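
The pilotlog.txt checks listed above can be automated with a small script. The following is a minimal sketch, not part of the pilot: the marker strings are copied from the log excerpts on this page, the function and key names are hypothetical, and it assumes that each log entry (including the 'Dispatcher response' entry) sits on a single line and that ERR_* values show up literally as 'ERR_' in the log.

```python
import ast
import json

# Marker strings taken from the log excerpts above; the check names are made up.
CHECKS = {
    "yampl_server_alive": "The message server is alive",
    "athenamp_ready": "AthenaMP is ready for events",
    "event_ranges_requested": "Downloading new event ranges",
}
MESSAGE_PREFIX = "Received new message: "


def scan_pilot_log(lines):
    """Report which markers were seen, plus output messages and ERR_ lines."""
    lines = [line.rstrip("\n") for line in lines]
    report = {name: any(marker in line for line in lines)
              for name, marker in CHECKS.items()}
    messages = [line.split(MESSAGE_PREFIX, 1)[1].strip()
                for line in lines if MESSAGE_PREFIX in line]
    # An output message ends with ",<eventRangeID>,CPU:<n>,WALL:<n>".
    report["outputs"] = [m for m in messages if ",CPU:" in m and ",WALL:" in m]
    report["errors"] = [line for line in lines if "ERR_" in line]
    return report


def parse_output_message(message):
    """Split an output message into file path, event range ID, CPU and wall time."""
    path, range_id, cpu, wall = message.rsplit(",", 3)
    return {"file": path, "eventRangeID": range_id,
            "cpu": int(cpu.split(":", 1)[1]), "wall": int(wall.split(":", 1)[1])}


def parse_dispatcher_response(line):
    """Return the list of event range dicts from a 'Dispatcher response:' line."""
    payload = line.split("Dispatcher response: ", 1)[1]
    fields = dict(ast.literal_eval(payload))  # list of (key, value) tuples
    return json.loads(fields["eventRanges"])


if __name__ == "__main__":
    with open("pilotlog.txt") as log_file:
        result = scan_pilot_log(log_file)
    for name in CHECKS:
        print("%-24s %s" % (name, "ok" if result[name] else "NOT FOUND"))
    print("outputs: %d, ERR_ lines: %d" % (len(result["outputs"]), len(result["errors"])))
```

If 'yampl_server_alive' or 'event_ranges_requested' is reported as not found, or if there are no outputs but some ERR_ lines, continue with the tokenExtractor, Athena and AthenaMP logs named above.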
        

To Run Pilot

  • RunPilot: Normally the pilot should be submitted by APF. Yoda is already included in the latest pilot. If you are interested in running your own test pilot, here are some instructions.

Available ES panda queues (old)

  • NERSC_Edison is a panda queue which is running Event Service on HPC. It's using HpcYoda.
  • BNL_EC2E1_MCORE is a panda queue which is running Event Service on Amazon.
  • SFU-LCG2_ES is a panda queue which is running Event Service on a preemptable cluster queue.
  • IN2P3-LPSC_CLOUD_MCORE is a panda queue which is running Event Service on a local Cloud.
  • CERN_P1 is being tested. More queues will be available.
  • CONNECT_ES is a preemptable, opportunistic single-core resource.
  • CONNECT_ES_MCORE is a limited, opportunistic MCORE resource.
  • SLAC_ES is a preemptable queue with a one-hour walltime limit.
  • Arizona_ES is the Arizona ES queue.

-- WenGuan - 2019-04-01

Topic revision: r1 - 2019-04-01 - WenGuan