THIS IS A DRAFT OF CORRECTIONS FOR THE HLT PART OF
AtlasTDAQLargeScaleTests2006TestPlan
HLT common for both L2 and EF (Haimo, Serge, Andrea N.)
Goals of HLT (L2 & EF) LST tests 2006
The general goal is to measure the response time of the Level2 subsystem during the Configure, PrepareForRun/Start and Stop transitions of the TDAQ RunControl. This should be done with as many trigger algorithms as possible, and with a realistic way of accessing the involved DataBase(s). The DataBase (and possibly associated file) access is, besides the interaction with RunControl, the one major aspect of running the HLT algorithms in Level2 that represents possible contention for resources across the Level2 farm. Most of the following in fact applies equally well to the Event Filter, and we indicate this by mentioning both L2PU and EFPT where appropriate (hence the following does not need to be repeated in the Event Filter Tests section).
The information to be collected should be cast as much as possible in the form of histograms and summary values collected from the
InformationService (IS). Having to parse the application logfiles for such information should be avoided.
Aim of the Event Filter tests & REMARK
The EF tests follow the same main lines as the L2 tests, since L2 and EF are very similar within the HLT; the L2 plan above can therefore be considered common to the whole HLT.
One small remark: since the L2 section above contains detailed descriptions of particular instrumentations, the EF may propose a slightly different projection onto the common HLT = L2 + EF tests below. This is subject to common agreement; we should eventually merge the common issues under a chapter "HLT" and keep only L2-specific and EF-specific items under the "L2" and "EF" paragraphs.
Prerequisites for HLT-common (must have before LST)
Components needed for the tests:
- a functionally verified HltImage which provides the entirety of the ATLAS TDAQ, Offline and HLT software
- a set of data files that can be preloaded to the ROS(E) and L2SV applications
- a suite of programs/scripts (PartitionRunner) that automates
- the generation of partitions
- the running of the tests
- the collection of results
- a suite of programs or scripts that combines the collected results and condenses/reduces the information for the preparation of a final summary report
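As an illustration of the condensing/reducing step, per-node transition durations could be boiled down to a few summary values per partition. This is a minimal sketch only; the function name and input format are assumptions, not part of the actual PartitionRunner suite:

```python
import statistics

def summarize_durations(durations_ms):
    """Condense a list of per-node transition durations (ms) into summary values."""
    return {
        "n": len(durations_ms),
        "min": min(durations_ms),
        "max": max(durations_ms),
        "mean": statistics.mean(durations_ms),
        "stdev": statistics.stdev(durations_ms) if len(durations_ms) > 1 else 0.0,
    }

# Example: Configure durations collected from four hypothetical L2PU nodes
summary = summarize_durations([120.0, 135.5, 128.2, 140.1])
```

The same shape of summary (count, min, max, mean, spread) would apply to each transition and each distribution listed in the measurements section below.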
- have HLT release built & tested with Athena 12.0.x
- have PESA slices (JO/datafiles) integrated & tested & included into HLT image
- have DB proxy/caching mechanisms provided, functionality tested with EF on Lab.32
- have detailed statistics/timings instrumented in EF code
- have additional measurements on DB-connection performance instrumented in DB-interface libs
- have EF-alone / L2+EF partitions, generated automatically by PartitionMaker, tested on a small scale in Lab.32 to prove the usability of PartitionMaker with (L2+)EF+algo
- have PartitionRunner / FarmTools tested and ready for use
Detailed measurements / instrumentation
Tests to be prepared and information to be collected:
- in general: all RunControl command arrival times on all L2PU (EFPT) applications
- Configure transition:
- overall transition start time and duration for the entire partition
- distribution of individual transition start times for each L2PU (EFPT)
- distribution of transition durations across all L2PUs (EFPTs)
- DataBase (DB) connection setup times
- DB access:
- latency distribution
- data volume distribution
- distribution of the times at which DB accesses are issued (when are we hitting the DB?)
- HLT trigger algorithm configuration (Offline 12.0.X only?):
- overall time taken (per algorithm, or per subdetector?)
- computing time (per algorithm, or per subdetector?)
- data volume accessed after configuring the algorithm
- due to reading information provided in external files (e.g. FieldMap etc.)
- usage of shared libraries:
- compile complete list of shared libs used
- total time used to open shared libs (after initial preload)
- measure these times depending on whether the shared libs reside on local disks or on rack-level File Servers
- clarify whether this comparison will be possible in LST06
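The DB-access latency and data-volume distributions above could be collected with a thin timing wrapper around each access. A hedged sketch follows; the function names are illustrative and not part of the actual DB-interface libraries:

```python
import time

def timed_db_access(query_fn, latencies, volumes):
    """Perform one DB access, recording wall-clock latency (s) and a crude data-volume estimate."""
    t0 = time.perf_counter()
    result = query_fn()
    latencies.append(time.perf_counter() - t0)
    volumes.append(len(repr(result)))  # crude size estimate, for the sketch only
    return result

# Example: a stand-in query returning a hypothetical FieldMap record
latencies, volumes = [], []
row = timed_db_access(lambda: {"FieldMap": "v1"}, latencies, volumes)
```

In the real instrumentation the recorded values would be filled into histograms published via IS rather than kept in plain lists.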
- PrepareForRun / Start transition: (note that only the PrepareForRun transition is non-dummy in HLT; Start does nothing)
- overall transition start time and duration for the entire partition
- distribution of individual transition start times for each L2PU (EFPT)
- distribution of transition durations across all L2PUs (EFPTs)
- how fast does Level2 start up, i.e. how fast are queues getting filled?
- collect event arrival times after Start for the first N events
- in input queue of each L2PU. N is O(1000).
- implement histogram or persistent output
- combine these into a turn-on curve for Level2
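Combining the first-N arrival times into a turn-on curve could be done by binning the timestamps (measured from the Start command at t=0) and taking the cumulative count. A minimal sketch, with the bin width chosen arbitrarily:

```python
def turn_on_curve(arrival_times, bin_width):
    """Cumulative number of events arrived per time bin since Start (t=0)."""
    if not arrival_times:
        return []
    n_bins = int(max(arrival_times) // bin_width) + 1
    counts = [0] * n_bins
    for t in arrival_times:
        counts[int(t // bin_width)] += 1
    # The running sum of per-bin counts is the turn-on curve.
    cumulative, total = [], 0
    for c in counts:
        total += c
        cumulative.append(total)
    return cumulative

# Example with four hypothetical arrival times (seconds after Start)
curve = turn_on_curve([0.1, 0.2, 1.5, 2.7], bin_width=1.0)
```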
- while Running:
- fill histograms of per-event quantities (TO BE EXPANDED)
- Stop transition:
- overall transition start time and duration for the entire partition
- distribution of individual transition start times for each L2PU (EFPT)
- distribution of transition durations across all L2PUs (EFPTs)
- how fast does Level2 stop, i.e. how fast are input and processing queues emptied?
- collect event finish times and queue size distribution for the last N events
- using a FIFO queue of size N
- implement histogram or persistent output
- combine these into a turn-off curve for Level2
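The FIFO queue of size N mentioned above maps naturally onto a bounded deque: as new events finish, the oldest entries fall out, so at Stop the buffer holds exactly the last N events. A sketch, with the class name assumed:

```python
from collections import deque

class LastNEvents:
    """Keep finish times and queue sizes of the last N processed events."""
    def __init__(self, n):
        self.finish_times = deque(maxlen=n)  # oldest entries are dropped automatically
        self.queue_sizes = deque(maxlen=n)

    def record(self, finish_time, queue_size):
        self.finish_times.append(finish_time)
        self.queue_sizes.append(queue_size)

# Example: five events pass through a buffer that keeps only the last three
buf = LastNEvents(3)
for t, q in [(1.0, 5), (2.0, 4), (3.0, 3), (4.0, 1), (5.0, 0)]:
    buf.record(t, q)
```

At the Stop transition the buffer contents would be written out as a histogram or persistent record and combined into the turn-off curve.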
Plans for the running phase of the tests (programme of tests)
- General PESA/non-DB related:
- simple validation of (L2+)EF+algo partition on available L.S. (100, 400, 800, 1200 nodes)
- do simple measurements of TDAQ transition times & detailed performance instrumentation (non-DB related) before moving to the DB-related tests
- DB-related:
- run partition on given scale (100, 400, ...) for given mechanism "A" ("B", "C", ...) of DB-replication and do detailed DB-connection measurements (with instrumented lib)
- repeat for varied parameters of given mechanism "A" (B, C, ...) of DB-replication
- levels of replication: 1 server only (no proxies), 1 proxy per rack, a proxy on each node
- periods of validity/updates of data in proxy/cache
- repeat the two preceding steps (run & vary parameters) for various mechanisms of DB-replication:
- A = MySQL DB-proxy (the one by the SLAC group)
- B = similar proxy for Oracle ?
- C = MySQL server on each HLT node ??
- D = SQLite files on disk
- E = still compare to FroNtier (to ensure that BOTH performance and ROBUSTNESS are important on L.S.)
- instrument a random time-shift inside the PT/Athena algorithm for the connection to the DB (each PT/algo makes its real connection randomly shifted within a given interval); test with 1 central server and check whether this, without proxies, can be equivalent to using proxies
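The random time-shift could be prototyped as a small wrapper that sleeps for a uniform random interval before making the real connection. A sketch under the assumption of a generic connect callable; all names are illustrative:

```python
import random
import time

def delayed_connect(connect_fn, max_shift_s, rng=None):
    """Sleep for a uniform random shift in [0, max_shift_s] before the real DB connection."""
    rng = rng or random.Random()
    shift = rng.uniform(0.0, max_shift_s)
    time.sleep(shift)  # spreads connection attempts out over the interval
    return connect_fn(), shift

# Example: a stand-in connect function and a very short shift interval
conn, shift = delayed_connect(lambda: "connected", 0.01)
```

Spreading the N simultaneous connections uniformly over an interval turns a burst of N requests into an average rate of N/interval, which is the effect the comparison against proxies would quantify.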
Level2-specific issues (Haimo)
Manpower (who):
17. July 2006:
- first demo of running Level2 with multiple algorithms and per-event histograms (WW,HZ)
01. September 2006:
- HLT image available with at least TDAQ-01-04-01, Athena 11.0.6, HLT-03-00-04 (Jiri M.)
- Test histograms implemented either in HLT release or as external patch (WW, HZ)
- automation test suite ready for first medium scale trials (AdA, HZ)
01. October 2006 (or actual LST start date minus 4 weeks):
- new HLT image with TDAQ-01-06-00, Athena 12.0.X (JM, WW)
- to be reviewed
Plans for the running phase of the tests
none yet
Event Filter specific issues (Serge, Andrea)
Planning and schedule for the test preparation
Milestones / Manpower
now - 15 August 2006:
- integrate HLT(EF) and DB-proxy systems (Serge + Amedeo)
15 August - 1 September 2006:
- implement timing instrumentations/histos in EF code for detailed performance (Serge + Andrea N.)
1 September - 15 September 2006:
- implement & test a dummy-Calib. Athena algorithm to measure runtime DB read/write, to check whether performance will limit functionality (Serge + Unknown).
15 September - 1 October 2006:
- Medium-scale / Lab.32 tests of partitions with PESA algorithms & DB-proxies, using PartitionMaker & FarmTools, to examine readiness for the LST.
EF-specific tests programme:
- EF-Calibr. related studies
- try a special dummy-test EF-Calibr. (Athena) algorithm which can
- be used for Phys./Calibr. complex routing in EF, and also
- make a direct connection to an external DB, reading/writing SOME type of data (calibr., lumi, metadata, etc.) which does not go via the DataFlow path
- try various loads of this algorithm (amount / frequency of reading/writing)
--
SergeSushkov - 17 Jul 2006