Activity Reports 2018 - Paul Nilsson
October
Main activities (Pilot 1)
- Released minor versions 73.4 and 73.5
- Important updates: Objectstore stage-in verification, ES merge fix, replica priorities update, cleanup of lingering orphaned processes (some processes were found hanging even after the pilot had ended in looping jobs); added safer handling of the stdout_tail variable to prevent a lost heartbeat failure that had been discovered
Main activities (Pilot 2)
- Managed to run user defined container with full analysis code for the first time (end of main development for user container support)
Other activities
- Participated in ATLAS Software and Computing Week, gave one presentation about Pilots
September
Main activities (Pilot 1)
- New Pilot 1 version in development (to be released in October), currently with minor changes
Main activities (Pilot 2)
- Investigated how to improve the pilot timing dictionary handling; requirement for HPCs with limited I/O - switched to use an internal timing dictionary instead of writing to/reading from file. Verified by Danila Oleynik on Titan
- Implemented max lifetime: the Pilot aborts ten minutes before running out of time, as defined by the schedconfig parameter for the given PQ, and fails the job properly (tar ball creation of log, server update)
- Added handling and identification of several more error codes
- Pilot 2 reached end of main development stage at the end of September (commissioning stage to begin); main developments (features) are now in place, some minor developments remain (such as more error code handling)
- Improved tar ball creation of log (previously did not have sub-directory structure)
- Tested new runcontainer script; managed to run user defined container with simple HelloWorld job to completion for the first time; installed runcontainer script on PanDA server
- Created more test queues for Pilot 1 and 2 (CERN-PROD_UCORE_1 and _2)
- Refactoring of many functions and components to streamline the Pilot
- Created new Google doc to track progress on error code implementation
- Fixed problem with exceptions during stage-in; pilot can now wrap up correctly
- Worked on (and finished) payload output interpretation function; Pilot can now identify special errors in payload output, extract number of events from jobReport, metadata or Athena summary files, etc
- Approved pull request from D. Oleynik about HPC Pilot workflow which is now part of main Pilot 2 code base
- Now testing containers on nine production queues
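The max lifetime check listed above can be sketched as follows. This is a minimal illustration and not the Pilot 2 code: the function name should_abort, the maxtime argument (standing in for the schedconfig value of the PQ) and the ABORT_MARGIN constant are assumptions made for the example.

```python
import time

ABORT_MARGIN = 10 * 60  # seconds; the ten-minute margin mentioned in the report


def should_abort(start_time, maxtime, now=None, margin=ABORT_MARGIN):
    """Return True once the pilot has entered the last `margin` seconds.

    start_time: pilot start time (seconds since the epoch)
    maxtime:    maximum allowed lifetime in seconds (hypothetical stand-in
                for the schedconfig parameter of the given PQ)
    now:        current time; defaults to time.time()
    """
    if now is None:
        now = time.time()
    return (now - start_time) >= (maxtime - margin)
```

A monitoring loop would call should_abort() periodically and, when it returns True, start the normal failure path (log tar ball creation and final server update).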
Other activities
- Chaired three Pilot Development meetings this month
- Presentations in biweekly Containers meeting
- Worked on document to describe direct access in the Pilots
- Wrote CHEP 2018 proceedings paper about Pilot 2 (submitted to ATLAS Speakers Committee for internal review); paper registration prepared in the SAGA system used by the CHEP organizers (paper will be submitted before end of October)
- Participated in special meeting at CERN to discuss pre-emptible queues; we requested that the CERN batch system send a special kill signal (hopefully SIGUSR2, since the pilot does not use it yet) whenever it is about to pre-empt the resource, and/or make the job-ad, which contains an explanation of the eviction, available for the pilot to parse
- Created script for making self-extractable binaries, which can be used with e.g. the new runcontainer script
August
Main activities (Pilot 1)
- Released minor versions 73.2 and 73.3
- It was realized that both the pilot and the rucio server called list_replicas(), which is unnecessary and has increased the load on the rucio servers: with the ongoing migration to rucio as sitemover, rucio calls list_replicas() for each input file download. A quick fix is to use the --pfn option with rucio download, which prevents rucio from also calling list_replicas().
Main activities (Pilot 2)
- Added and implemented new error code 1316, RUCIOSERVICEUNAVAILABLE: "Rucio: Service unavailable"
- Started testing real user defined containers with prun. Decided to replace runGen with a new transform-like script called runcontainer, which is to be used for user defined containers
- Implemented direct access support for both user analysis and production jobs
- Created initial code for LCF HPC plug-ins to be used with the HPC Pilot workflow
- Proper signal handling now in place; all essential control modules (job, payload and data components) now correctly handle signals, abort the job and update the server
- Pilot 2 error code wiki page up to date
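The signal handling pattern mentioned above (a handler that tells the job, payload and data threads to stop) can be sketched like this. It is a simplified illustration under assumptions: the names graceful_stop, received, interrupt and install_handlers are hypothetical and not taken from the Pilot 2 source, and only SIGINT and SIGTERM are registered by default.

```python
import signal
import threading

# Shared flag that worker threads (job, payload, data) would poll regularly.
graceful_stop = threading.Event()
# Signals seen so far, kept for later reporting (e.g. in the server update).
received = []


def interrupt(signum, frame):
    """Signal handler: record the signal and ask all threads to wind down."""
    received.append(signum)
    graceful_stop.set()


def install_handlers(signals=(signal.SIGINT, signal.SIGTERM)):
    """Register the handler; must be called from the main thread."""
    for sig in signals:
        signal.signal(sig, interrupt)
```

Worker loops would then check graceful_stop.is_set() between steps and abort the job cleanly when it becomes true. Additional signals can be passed to install_handlers() on platforms that define them.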
Other activities
- Worked on google doc about list_replicas, which contains full documentation of how stage-in works in both Pilot 1 and 2 (all steps)
- Worked on google doc about job description, which defines the Pilot 1 and 2 job objects (all fields)
- Revised HC templates for both Pilot 1 and 2 testing (incl. creation of new templates and removal of deleted queues)
- Chaired two Pilot Development meetings this month
- Presentations in biweekly Containers meeting
- Presented "Pilot and Python" in TCB meeting at CERN
July
Main activities (Pilot 1)
- Released major version 73.0 and minor version 73.1
- Important updates: on-the-fly measurements of CPU consumption time, protocol updates for Google tests, using killpg instead of kill, to include child processes in time-outs, allowing ES merge jobs to select closest inputs. In debug mode, the pilot now scans for the latest updated payload log file and sends its tail with each heartbeat
Main activities (Pilot 2)
- Adding support for the server commands that are sent with the backchannel to the pilot (tobekilled, debug and debugoff, softkill will be added later when the ES code is ready)
- Full support for running user jobs; code updates include creation of PFC, special test jobs (run initially on ANALY_MANC then on ANALY_MANC_TEST_SL7; starting with simple HelloWorld jobs and then derivation jobs)
- Implemented verification of output file sizes, tested at BNL
- Implementing process monitoring
- Implemented new version of stage-out function for LSM
- Sorted out guids generation/reading in Pilot 2; related to additional output files
- Wrote new adler32 and md5 checksum functions; added checksum verification for lsm-get
- Continued containers testing of production jobs; initial support for user defined containers with the introduction of --containerImage
- Added more signals to signal handling
- prodSourceLabel=rc_test2 added to PanDA server which allows for Pilot 2 RC testing using HC jobs; ran first RC test jobs on ANALY_GOEGRID; created test queues ANALY_GOEGRID_1 and _2 for Pilot 1 and 2 comparison
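As an illustration of the checksum functions mentioned in the list above, the following sketch computes adler32 and md5 checksums of a file in chunks, using only the Python standard library. The function names and the chunk size are assumptions for the example; the actual Pilot 2 functions may differ.

```python
import hashlib
import zlib


def calculate_adler32(path, chunk_size=1024 * 1024):
    """Return the adler32 checksum of a file as an 8-character hex string."""
    value = 1  # adler32 start value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    # Mask to 32 bits so the result is unsigned on all Python versions.
    return "%08x" % (value & 0xFFFFFFFF)


def calculate_md5(path, chunk_size=1024 * 1024):
    """Return the md5 checksum of a file as a hex string."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Verification after a transfer then amounts to comparing the locally calculated value with the checksum stored in the catalog for the file. Reading in chunks keeps memory usage flat for large input files.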
Other activities
- Continued work on Pilot 3 (Pilot 2 tested against Python 3; any discovered problems have been back ported to Pilot 2)
- WFMS presentation about Pilot 2 status
- Presentations in biweekly Containers meeting
- Pilot 2 poster displayed at CHEP 2018 (I was not present)
May - June (from Quarterly report)
Main activities (Pilot 1)
- Development and testing of five pilot versions in this period; four released, one pending final testing.
- Instant CPU consumption time measurement (incl. recursive process id lookup and processing of corresponding process stat files) + reporting [development, released in July].
- Simplification of PoolFileCatalog (removal of LFN tag to allow for derivation TRF to run properly when many input files).
- Service updates for container testing (main testing done with Pilot 2, but some features must be added to Pilot 1 as well; removal of new pilot commands that arrive with job parameters, and update to the default image).
- Testing updates and changes to Event Service code, including Prefetcher work (in collaboration with W. Guan, N. Magini).
- Final implementation of Zip mapping; Archive_tf can now be used as intended, in combination with metadata collected by the pilot.
- Filtering of metadata to avoid garbage data from irrelevant error messages.
- Support for https protocol in direct access, needed for Google tests.
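The CPU consumption measurement mentioned above (recursive process id lookup plus processing of process stat files) can be sketched as follows. This is a simplified Linux-only illustration, not the pilot code; all function names are hypothetical. It assumes the standard /proc/<pid>/stat layout, in which the parent pid, utime and stime are fields 4, 14 and 15.

```python
import os


def parse_stat(stat_line):
    """Return (ppid, utime_ticks, stime_ticks) from a /proc/<pid>/stat line."""
    # The command name is in parentheses and may contain spaces, so split
    # only the part after the last closing parenthesis (field 3 onwards).
    rest = stat_line[stat_line.rfind(")") + 2:].split()
    return int(rest[1]), int(rest[11]), int(rest[12])


def find_descendants(pid, parents):
    """Recursively collect pid and all its descendants (parents: pid -> ppid)."""
    pids = [pid]
    for child, parent in parents.items():
        if parent == pid:
            pids.extend(find_descendants(child, parents))
    return pids


def cpu_consumption_time(pid, stats, ticks_per_second=100):
    """Sum user+system CPU time (seconds) for pid and its descendants.

    stats maps pid -> content of /proc/<pid>/stat.
    """
    parsed = dict((p, parse_stat(line)) for p, line in stats.items())
    parents = dict((p, values[0]) for p, values in parsed.items())
    ticks = sum(parsed[p][1] + parsed[p][2] for p in find_descendants(pid, parents))
    return ticks / float(ticks_per_second)


def read_proc_stats():
    """Read the stat files of all current processes (Linux only)."""
    stats = {}
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open("/proc/%s/stat" % entry) as f:
                    stats[int(entry)] = f.read()
            except (IOError, OSError):
                pass  # the process ended between listdir() and open()
    return stats
```

On a real node the call would be cpu_consumption_time(payload_pid, read_proc_stats(), os.sysconf("SC_CLK_TCK")), where payload_pid is the process id of the payload.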
Main activities (Pilot 2)
- Instant CPU consumption time measurement + reporting (simpler implementation as compared to Pilot 1).
- Proper exception handling for threads (exceptions in threads are caught by responsible process and propagated upwards).
- Pilot+Job monitoring now complete - now verifying pilot lifetime, proxy, disk usage (work dir size, payload stdout size, local disk space, output file sizes), looping job, number of running processes and internal threads, utility command monitoring, memory usage and CPU consumption (incl. leak rate).
- Implementation of job metrics, pilot timing (much improved compared to Pilot 1; can measure any step) and final server update.
- Definition (and continued implementation) of Pilot APIs; Data API and Service API (which in turn contains Benchmark, Memory Monitor and Analytics APIs).
- Defined and implemented Analytics package (using Analytics API), capable of fitting linear data (used in combination with Memory Monitoring and reporting of memory leaks).
- Implementation of user analysis workflow (setup for user code); both standard job and release-less 'generic' jobs.
- First version of HPC Pilot workflow being finalised and tested at Titan (in collaboration with D. Oleynik).
- Integration of Information Service component completed (all components can use relevant info retrieved from AGIS).
- Discussed and planned for remaining development for Nordugrid (in collaboration with D. Cameron and A. Filipcic)
- Pilot 2 is now running container tests on 8+ sites; user job tests on 1 site.
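The linear fitting used for memory leak reporting, as mentioned in the list above, can be illustrated with an ordinary least-squares fit. The sketch below is not the Analytics package itself; the function name linear_fit and the example units are assumptions. The slope of memory usage versus time would correspond to the leak rate.

```python
def linear_fit(x, y):
    """Least-squares fit of y = slope * x + intercept.

    x, y: sequences of equal length (at least two points, x not all equal).
    Returns (slope, intercept).
    """
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / float(n)
    mean_y = sum(y) / float(n)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For example, with time stamps in seconds and memory readings in kB, the returned slope is a leak rate in kB/s.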
Other activities
- Defined Pilot 3 project; started to update Pilot 2 for Python 3. Started discussion about using Python 3 in ATLAS (collaborators incl. TRF and software people).
- Updated outdated HC templates, cleared out many inactive sites and added new PQs resulting in more HC jobs running.
- Participated in ATLAS Computing & Software Week, in Hamburg, Germany; gave presentation about Pilot 1, 2 and 3 projects.
- Created/updated several technical documents; Direct Access and the Pilots, Pilot 2 usages of AGIS/schedconfig queuedata fields, Information for PanDA Pilot developers, Pilot 2 API documentation.
- Coordinated Pilot related development, chaired four Pilot Development meetings at CERN in this period.
- Created PanDA test job submission cron, used to send test jobs to multiple sites for container testing (using Pilot 2).
- Participated in container meetings; kept Pilot 1+2 updated for container testing, incl. new developments.
- All new major Pilot releases presented in ADC Weekly meetings.
- Presented Pilot updates, plans, etc in three WFMS meetings in this period.
Near term plans
- Begin HC testing of Pilot 2 which is pending finalisation of a major update in the Data API (mainly done in June).
- Continue with event service integration and testing, incl. Prefetcher (in collaboration with W. Guan and N. Magini; delayed because of Data API delay).
- Continue with HPC workflow in Pilot 2 (in collaboration with D. Oleynik).
- Support for direct access.
- Pilot 2 running first real ATLAS production jobs.
April
Pilot 1
- Released pilot version 72.8.
- Updated ZIP_MAP code for upcoming dummy trf. Current Archive_tf script is not usable (it does not create file metadata and removes input files before pilot can create metadata - pilot will instead do everything).
- Simplification of PoolFileCatalog, removal of logical tag and correction to pilot code that relied on logical tag (incl. ES code).
- More updates to Memory Monitoring.
- Improved subprocess killing by adding session id to Popen() which allows for process group killing should it be necessary.
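The process group approach described above can be sketched as follows. This is a minimal POSIX-only example and not the pilot code: the function names are hypothetical, and using preexec_fn=os.setsid is one assumed way of giving the subprocess its own session id.

```python
import os
import signal
import subprocess


def start_payload(cmd):
    """Start cmd in a new session so that it leads its own process group."""
    return subprocess.Popen(cmd, shell=True, preexec_fn=os.setsid)


def kill_process_group(proc, sig=signal.SIGKILL):
    """Send sig to the whole process group of proc, including children."""
    try:
        os.killpg(os.getpgid(proc.pid), sig)
    except OSError:
        pass  # the process group is already gone
    proc.wait()  # reap the process to avoid leaving a zombie
```

Because the subprocess is the leader of its own group, os.killpg() also reaches any children it has spawned, which a plain kill of the parent pid would leave behind. In Python 3 the same effect can be obtained with Popen(..., start_new_session=True).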
Pilot 2
- Participated in container meetings. Tested Pilot 2 with containers at multiple sites.
- Added proxy verification to job monitoring.
- Added proper exception handling for threads.
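One common way to implement exception handling for threads, as listed above, is to let each thread store its exception in a shared queue that the responsible caller inspects. The sketch below uses Python 3 and hypothetical names (ExcThread, run_and_propagate); it shows the general technique and is not the Pilot 2 implementation.

```python
import queue
import threading


class ExcThread(threading.Thread):
    """Thread that stores any exception raised by its target in a bucket."""

    def __init__(self, bucket, target, args=()):
        threading.Thread.__init__(self)
        self.bucket = bucket
        self._func = target
        self._func_args = args

    def run(self):
        try:
            self._func(*self._func_args)
        except Exception as exc:
            # Hand the exception over to whoever owns the bucket.
            self.bucket.put(exc)


def run_and_propagate(target, args=()):
    """Run target in a thread and re-raise its exception in the caller."""
    bucket = queue.Queue()
    thread = ExcThread(bucket, target, args)
    thread.start()
    thread.join()
    if not bucket.empty():
        raise bucket.get()
```

In a long-running pilot the caller would poll the bucket while the threads are alive instead of joining immediately, so that a failure in one component can abort the others.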
Other
- Presented pilot update in WFMS meeting. Outlined road to completion for Pilot 2 project with time line.
- Pilot Development meeting, April 13.
March
Pilot 1
- Released pilot versions 72.6 and 72.7.
- Further updates to Memory Monitoring (added platform for default memory monitor setup).
Pilot 2
- Participated in container meetings. Tested Pilot 2 with containers at multiple sites. Also tested ALRB containerisation as an alternative to explicit singularity exec.
- Tested new BNL queue, BNL_PROD_CONTR_TEST, in connection with Pilot 2 and containers.
- Added dedicated HPC workflow. Migrated bulk of HPC Pilot developed by D. Oleynik into Pilot 2.
- Added option for setting SCORE/MCORE, needed by Harvester.
- Now supporting utility commands; added Memory and Network (prel. code) monitoring, which can be executed individually or together with other commands in various stages (before, with or after payload).
- Added job monitoring that lives and dies with the job.
- Improved thread handling (code simplification and started testing exception handling for threads).
- Updated code for new InfoSys component, the interface to AGIS/Schedconfig.
- Added initial support for Nordugrid (some code, incl. special XML generation is still missing).
- Migration of code to use new Job class.
Other
- Made ACAT 2017 Pilot 2 proceedings corrections as suggested by referee.
- Participated in Sites Jamboree.
- Pilot Development meeting, March 29.
February
Pilot 1
- Released pilot versions 72.3, 72.4 and 72.5.
- Prepared code for upcoming PoolFileCatalog simplification (in April release), incl. testing.
- Added support for Harvester job request and kill worker files. Implemented Harvester mode in pilot (through identification of special environment variables).
- Switched to using fixed release (21.0.22) for setting up MemoryMonitor due to several recently discovered problems with the tool.
- Added error handling for trf error code 146 (user tarball cannot be downloaded from PanDA server).
Pilot 2
- Added support for Harvester job request file.
- Added more container support (support for new logic).
- Merged ES update with my own branch, sorted out code conflicts.
- Started testing pilot 2 on Manchester queue (container tests).
- Created wiki explaining schedconfig fields used by Pilot 2.
Other
January
Pilot 1
- Released pilot version 72.2.
Pilot 2
- Added error code dictionary. Prepared new wiki for Pilot 2 error codes and explanations.
- Reworked queue implementation.
- Presented Pilot 2 container plans in Container meeting.
- Expanded on container support.
Other
--
PaulNilsson - 2018-01-17