Investigation of T0 Reconstruction Jobs

ATLAS Reconstruction, Openstack VM

In this example the data was stored in the Wigner Computing Centre. The command that was run was:

time Reco_tf.py --AMI=f603 --autoConfiguration=everything --conditionsTag=CONDBR2-BLKPA-2015-09 --skipEvents=0 --maxEvents=1 --DBRelease=current --outputAODFile AOD.hc_20063983.AOD.pool.root.CERN-PROD_MCORE.1447857534.5ee5 --inputBSFile root://eosatlas.cern.ch//eos/atlas/atlaseosscratchdisk/user/grzehorz/data15_13TeV.00276731.physics_Main.daq.RAW._lb0233._SFO-5._0001.data > output.txt &

with variations in skipEvents and maxEvents. The input dataset is 2.6 GB in size, containing 3025 events, leading to an average event size of ~ 860 kB.

The monitoring examples below show the results of the above command, whereas the profile of the Job looked similar in each test-iteration. All tests were run on machines, where similar Jobs were run before, so the CVMFS cache was already filled and being used.

TCPdump

The network traffic was captured using tcpdump and plotted with wireshark, showing for the 'baseline' without any Job running on the machine:

baseline.png

With a Job running, processing 1 event:

graph_1evt_lala.png

And with a Job processing 50 events:

50evts.png

The difference between the 1 event and the 50 event graphs is clearly the ~50 little bumps in the 50 event plot, indicating that the individual events are read (streamed) one after the other as input during Job execution.

Filtering out the background from the Jobs results in the following:

filter_background4.png

The black curve is the tcpdump of everything, the blue curve has the filters applied. Zooming out gives a better overview:

filter_background5.png

Here it is easier to see how the blue curve follows the black one at the peaks, but is lower in total. The third zoom factor shows no difference between the two:

filter_background6.png

Further investigation showed that the 'little bumps' come from a different source, than all the rest, corroborating the idea that they are the physics events being read. Separating this in Wireshark through filters gives:

filter_everything_but_events.png

The red line is all the traffic, the black line only traffic from one address. Looking this up in he 'conversations' tab shows the black curve originating from one source (using port 1095 (xroot)) and the remaining traffic to be coming from many different sources, that all use the same port (3128) and are called the same (caproxylbp.cern.ch):

conversations_100evts.png

This conversations tab was sorted by Bytes (highest item, most traffic) and lalalalala.cern.ch was the machine the test jobs were run on. In this case 100 events were processed. The top connection is the one with the black bumps from the previous plot and the ~84 MB divided by the number of events corresponds to 840 kB per event, which is roughly what would be expected as data input (see beginning).

Furthermore, zooming in on a 'bump' shows the broadness of ~0.6 s, slightly varying with the height.

1_evt_broadnesspng1.png

Ganglia

The machine the tests ran on is monitored by Ganglia. Below the CPU and the network (Bytes received) usage are shown for one Job. The peak in the network activity roughly corresponds to the drop in the CPU usage that can be observed in the first plot.

correlate_cpu_network.png

correlate_cpu_network1.png

This overall monitoring was used to see coarse profiles and correlations. In order to get a better resolution, sysstat (sar) was used.

SAR

The command that was run in order to monitor multiple aspects of a Job can be found here: sar1.sh.

As mentioned above, there looks to be a correlation between network activity and CPU usage. A deeper look shows:

cpu_user_network.png

The conclusion is that in the beginning the CPU is waiting for input and therefore idle, while the data is being received over the network. This is being confirmed to some degree by the next plot:

cpu_io_network.png

This doesn't account for all the IOwait time, especially looking towards the end of the Job. Another explanation could be found in the disk IO activity:

cpu_user_disk.png

In this plot no read from disk can be observed. What is happening towards the end of the Job is the following: The Job is transforming RAW data to AOD. This is not done in a single step, but it first transforms RAW to ESD and afterwards ESD to AOD. The high IOwait times towards the end happen when the second step of the Job starts. The above plot also doesn't explain very well, why the CPU time is low towards the end.

The new second step in the process leads to a lot of faults in the Paging Activity (see below). The black curve of major faults follows the trend of the red curve, only on a much smaller scale. It indicates that memory has to be loaded or reassigned when the new process starts.

paging_activity_100_evts.png

This can also be seen in the 'free memory' plot below. After the first part of the Job is finished it looks like the memory is completely freed and has to be reloaded for the second part of the Job.

memory_free_100evt.png

kbcommit.png

What can also be seen is that the CPU does not become as idle as it looks towards the end, from the user point of view, but that the system uses some part of it.

cpu_all_100_evts.png

In accordance with the rise in system CPU time, there is an increased amount in more context switches, see below.

context_switches_100_evts.png

The last plot shows the Block Wait time. In particular the red curve gives the average wait time, that is higher in the expected region. In accordance with that the pink utilisation curve goes up to 100% for that period.

block_wait_100_evts.png

CVMFS cache

In the beginning it was mentioned that all these Jobs were executed with an already used CVMFS cache.

The cache was cleared with sudo cvmfs_config wipecache. 'df' shows afterwards

Filesystem 1K-blocks Used Available Use% Mounted on
aufs 19554904 10292532 8245988 56% /
tmpfs 4088892 12 4088880 1% /dev/shm
/dev/vda3 62730876 17552216 41969000 30% /test
cvmfs2 4096000 125175 3970826 4% /cvmfs/cvmfs-config.cern.ch
cvmfs2 4096000 125175 3970826 4% /cvmfs/atlas.cern.ch
cvmfs2 4096000 125175 3970826 4% /cvmfs/atlas-condb.cern.ch
cvmfs2 4096000 125175 3970826 4% /cvmfs/sft.cern.ch
cvmfs2 4096000 125175 3970826 4% /cvmfs/atlas-nightlies.cern.ch

There remains some data in the cache, which wasn't erased.

A cleared cache before Job execution influences the network traffic in the following way:

io_cacheclear_split.png

As in a wireshark plot before, the black curve depicts only the physics data, whereas the red curve describes the whole traffic. Compared to the traffic from before with a cache, the traffic before the actual reconstruction process is increased, due to the fetching of data that is not available from the cache.

'df' after Job execution shows:

Filesystem 1K-blocks Used Available Use% Mounted on
aufs 19554904 11495188 7043332 63% /
tmpfs 4088892 12 4088880 1% /dev/shm
/dev/vda3 62730876 17894404 41626812 31% /test
cvmfs2 4096000 1317537 2778464 33% /cvmfs/cvmfs-config.cern.ch
cvmfs2 4096000 1317537 2778464 33% /cvmfs/atlas.cern.ch
cvmfs2 4096000 1317537 2778464 33% /cvmfs/atlas-condb.cern.ch
cvmfs2 4096000 1317537 2778464 33% /cvmfs/sft.cern.ch
cvmfs2 4096000 1317537 2778464 33% /cvmfs/atlas-nightlies.cern.ch

Profile

All in all the whole reconstruction Job can be split in to several processing steps that, put together, explain the observed behaviour:

  1. High network activity, low CPU, memory activity, no disk write
  2. Medium network (bumps), high CPU activity, no memory activity, disk write
  3. No network, medium CPU activity, memory activity, no disk write
  4. Medium network, low CPU activity, no disk write
  5. No network, medium CPU activity (high/low/high),disk write

These 5 steps translate to the following Job processes:

  1. Initialisation or startup phase
  2. Reconstruction (RAW to ESD)
  3. Initialisation/startup second step
  4. Reconstruction (ESD to AOD)
  5. Finalisation or cleanup phase
Whereas in the 1. step the environment is setup, by fetching all the necessary data. Mostly the code is already in CVMFS and only the detector geometry and calibrations matching to the particular dataset have to be fetched remotely.

In the 2. step the data is read event by event and reconstructed from RAW to AOD data. In the above plots this was shown for 100 events, where it was already dominating the consumed time. Typically it is better (less time consuming) to reconstruct even more events per Job, in order to minimise the time spent on the startup (and cleanup) phase, because their time doesn't vary with the number of events.

The 3. step looks similar to the 1. step but because of its short duration it is harder to separate. It is also starting up a new reconstruction process

The 4. step is then reconstructing ESD to AOD data, with the difference that it is not reading as much remote data.

In the 5. step the results of the 4. step are stored to disk, Job statistics/reports are generated and the Job exits.

In total steps 3+4 are shorter by more than a factor of 2 than the 1.+2. step.

-- GerhardFerdinandRzehorz - 2016-01-21

Topic attachments
I Attachment History Action Size Date Who Comment
Unix shell scriptsh sar1.sh r1 manage 0.5 K 2016-01-28 - 12:18 GerhardFerdinandRzehorz  
Edit | Attach | Watch | Print version | History: r13 < r12 < r11 < r10 < r9 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r13 - 2016-02-04 - GerhardFerdinandRzehorz
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    Main All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback