DB12 dependency on Python Version and OS
Open questions (that may still be answered)
- Manfred: Is it possible to enable batch queues on the various test environments, to compare with the performance of ALICE/ATLAS/CMS/LHCb applications? Are user jobs also boosted when using newer Python releases?
- Remark (Manfred): At GridKa the static benchmark scores are available to users through the MJF interface. Benchmark jobs should be submitted to the default ARCs; there are currently no dedicated batch queues configured for submitting benchmark jobs.
- Manfred: As far as I know, the experiments use private Python installations from their CVMFS areas. Which Python release is used by production jobs? (Remark (Manfred): GridKa does not look inside user jobs and is therefore not aware of the Python release used within them. This question should be answered by the experiments.)
DB12.py running in different containers (D. Giordano)
Done on a physical server (Haswell) at CERN; specs are reported below (see "Specs of the Bare metal server").
Configurations tested
Docker containers were used to run DB12 with different Python versions. In addition, the performance of DB12 was measured directly with the Python 2.7.5 installed on this server.
Versions tested:
| Type | OS / image | Python version |
| Container | cern/slc6-base | Python 2.6.6 |
| Server | CC7 | Python 2.7.5 |
| Container | cern/cc7-base | Python 2.7.5 |
| Container | python:2.7 (from Docker Hub) | Python 2.7.13 |
| Container | python:3 (from Docker Hub) | Python 3.6.0 |
GCC versions
| OS / image | libpython | GCC version |
| cern/slc6-base | /usr/lib64/libpython2.6.so.1.0 | GCC 4.4.7 20120313 (Red Hat 4.4.7-17) |
| cern/cc7-base | /lib64/libpython2.7.so.1.0 | GCC 4.8.5 20150623 (Red Hat 4.8.5-11) |
| python:2.7 | /usr/local/lib/libpython2.7.so.1.0 | GCC (Debian 4.9.2-10) 4.9.2 |
| python:3 | /usr/local/lib/libpython3.6m.so.1.0 | GCC (Debian 4.9.2-10) 4.9.2 |
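For reference, the Python version and the compiler used to build the interpreter can be confirmed inside each container with a short script. The snippet below is only illustrative (it was not part of the study) and the file name check_python_build.py is hypothetical:

# check_python_build.py -- illustrative helper, not part of the original study.
# Prints the interpreter version and the compiler it was built with.
from __future__ import print_function  # so the same script runs on Python 2.6/2.7/3.x
import platform
import sys

print("Python version:", platform.python_version())
print("Built with    :", platform.python_compiler())
print("Full banner   :", sys.version.replace("\n", " "))

Run inside a container (e.g. docker run --rm -v $PWD:/w -w /w $IMAGE python check_python_build.py) it reports strings consistent with the two tables above.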
DB12 running approach
The DB12 version available on GitHub was used:
https://github.com/DIRACGrid/DB12/blob/master/DIRACbenchmark.py
run with the flag --extra-iteration (DB12 was modified to run 2 extra iterations, to make sure that all benchmarked processes finish while the machine is still fully loaded).
NB: a few parts of the Python script had to be fixed to make it work with Python 3.6.0.
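For context, the quantity being timed is essentially a tight loop of Gaussian random-number generation in pure Python; the following is a simplified sketch of such a single-copy kernel (illustrative only; the real DIRACbenchmark.py has its own calibration constants, iteration counts and multi-process driver):

# Simplified, illustrative DB12-like single-copy kernel (not the upstream code).
from __future__ import print_function
import os
import random

def single_db12_like(iterations=1, n=1000 * 1000):
    """Return a DB12-like score: work done per unit of CPU time."""
    start = os.times()
    for _ in range(iterations):
        total = 0.0
        for _ in range(n):
            # Gaussian random numbers dominate the cost of the loop
            total += random.normalvariate(10, 1)
    end = os.times()
    cpu_time = (end[0] + end[1]) - (start[0] + start[1])
    return iterations * n / cpu_time / 1.0e6  # arbitrary normalisation for this sketch

if __name__ == "__main__":
    print(single_db12_like())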
Example of command line
- docker run -it --rm --name my-running-script -v /root/DB12:/usr/src/myapp -w /usr/src/myapp $IMAGE python DIRACbenchmark.py --iterations=1 --extra-iteration
- with $IMAGE being one of cern/slc6-base, cern/cc7-base, python:2.7, python:3
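The same command can be repeated over all four images with a small wrapper; the script below is only a sketch (the file name run_db12_containers.py is hypothetical, paths as in the example above):

# run_db12_containers.py -- illustrative wrapper around the docker command above.
from __future__ import print_function
import subprocess

IMAGES = ["cern/slc6-base", "cern/cc7-base", "python:2.7", "python:3"]

for image in IMAGES:
    cmd = [
        "docker", "run", "--rm",
        "-v", "/root/DB12:/usr/src/myapp", "-w", "/usr/src/myapp",
        image,
        "python", "DIRACbenchmark.py", "--iterations=1", "--extra-iteration",
    ]
    output = subprocess.check_output(cmd)  # raw DB12 output for this image
    print(image, output.decode().strip())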
Results
Results are reported in the attached table.
- Python 2.7.5 is ~10% faster than Python 2.6.6.
- Python 2.7.13 is even 18% faster, but this could also depend on the different OS of that container.
The study was performed by benchmarking the whole node (32 processes) and half of it (16 processes).
The ratio between the DB12 values for 16 and 32 processes shows the usual lack of gain from hyper-threading for the Python version: the values are comparable within a few percent.
Measurements with 16 processes are less stable: there always seem to be some processes that are slower than the average.
This can also be noticed by comparing the average value with the median:
- Example: python DIRACbenchmark.py --iterations=1 --extra-iteration 16
- DB12 output: (16, 244.2583650301175, 15.266147814382343, 14.986596160813637, 16.622340425531913)
- Per-process scores: 8.78117316473 9.52743902439 12.0714630613 14.409221902 16.5453342158 16.5672630881 16.5782493369 16.6223404255 16.6223404255 16.6333998669 16.6333998669 16.6444740346 16.6444740346 16.655562958 16.655562958 16.6666666667
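Summarising the per-process scores quoted above makes the effect explicit (a short check, using only values already shown):

# Per-process DB12 scores of the 16-copy run quoted above.
from __future__ import print_function

scores = [8.78117316473, 9.52743902439, 12.0714630613, 14.409221902,
          16.5453342158, 16.5672630881, 16.5782493369, 16.6223404255,
          16.6223404255, 16.6333998669, 16.6333998669, 16.6444740346,
          16.6444740346, 16.655562958, 16.655562958, 16.6666666667]

scores.sort()
mean = sum(scores) / len(scores)
median = (scores[len(scores) // 2 - 1] + scores[len(scores) // 2]) / 2.0
print("mean   = %.2f" % mean)    # ~15.27, pulled down by the few slow copies
print("median = %.2f" % median)  # ~16.62, close to the bulk of the copies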
DB12 C++ (D. Giordano)
A DB12 implementation in C/C++, used to decouple the random-number generation from the Python interpreter.
Results
Results are reported in the attached table.
The measurements were done on the same physical server at CERN used for the Python study; containers were used to run on different OSes (SLC6, CC7, Debian 8).
- No major score change across configurations:
- +3% on CC7 (it was 9% for the Python version)
- +5% on Debian (it was 18% for Python 2.7.13)
- The ratio DB12 (32 copies) / DB12 (16 copies) is ~1.5, i.e. a ~50% gain with HT on
Specs of the Bare metal server
| Architecture: | x86_64 |
| CPU op-mode(s): | 32-bit, 64-bit |
| Byte Order: | Little Endian |
| CPU(s): | 32 |
| On-line CPU(s) list: | 0-31 |
| Thread(s) per core: | 2 |
| Core(s) per socket: | 8 |
| Socket(s): | 2 |
| NUMA node(s): | 2 |
| Vendor ID: | GenuineIntel |
| CPU family: | 6 |
| Model: | 63 |
| Model name: | Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz |
| Stepping: | 2 |
| CPU MHz: | 1631.906 |
| BogoMIPS: | 4793.86 |
| Virtualization: | VT-x |
| L1d cache: | 32K |
| L1i cache: | 32K |
| L2 cache: | 256K |
| L3 cache: | 20480K |
| NUMA node0 CPU(s): | 0-7,16-23 |
| NUMA node1 CPU(s): | 8-15,24-31 |
DB12np.py: a Python version based on NumPy (from V. Innocente)
Vincenzo has slightly modified the original DIRACbenchmark.py code to adopt NumPy and profit from its optimized (C-based) implementation.
* Repository:
https://github.com/VinInn/pyTools/blob/master/DB12np.py
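The essential idea, sketched below (this is not necessarily the exact code in the repository), is to replace the per-sample random.normalvariate() loop with a single vectorised numpy.random.normal() call, so that the Gaussian generation runs in compiled code:

# Illustrative numpy-based DB12-like kernel (see DB12np.py for the real code).
import os
import numpy as np

def single_db12_np_like(iterations=1, n=1000 * 1000):
    start = os.times()
    for _ in range(iterations):
        # One vectorised call replaces n Python-level normalvariate() calls;
        # the work moves into numpy's mtrand (rk_gauss/rk_random) and libm.
        total = np.random.normal(10.0, 1.0, n).sum()
    end = os.times()
    cpu_time = (end[0] + end[1]) - (start[0] + start[1])
    return iterations * n / cpu_time / 1.0e6  # same arbitrary normalisation as the earlier sketch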
Comparison of perf profile
A comparison of the perf profiles for the three implementations is reported here.
- perf record was used to sample the fraction of CPU time spent in each shared object.
- The data file was then analysed with perf report.
- The percentage (Overhead) attributed to each distinct shared object was computed by summing over its different symbols (a parsing sketch is given after the excerpt below),
- e.g. the three entries of mtrand.so are summed together.
# Overhead Command Shared Object Symbol
# ........ ....... ................... ..............................................
#
23.15% python libm-2.17.so [.] __ieee754_log_avx
13.19% python mtrand.so [.] rk_random
12.89% python mtrand.so [.] rk_gauss
5.83% python mtrand.so [.] rk_double
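Summing the Overhead column per shared object can be done directly on the plain-text perf report output; the snippet below is a hypothetical parsing sketch (assuming a report saved with perf report --stdio > report.txt in the layout shown above):

# Sketch: sum the perf-report "Overhead" percentage per shared object.
from __future__ import print_function
from collections import defaultdict

totals = defaultdict(float)
with open("report.txt") as report:
    for line in report:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip headers and comment lines
        fields = line.split()
        # fields: Overhead%, Command, Shared Object, "[.]", Symbol
        overhead, shared_object = float(fields[0].rstrip("%")), fields[2]
        totals[shared_object] += overhead

for so, pct in sorted(totals.items(), key=lambda kv: -kv[1]):
    print("%6.2f%%  %s" % (pct, so))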
- Results are reported in the attached table.
- It is evident that:
- the DB12 numpy and C++ versions are dominated by calls to libm-2.17.so, where the log function (__ieee754_log_avx) resides;
- on the contrary, the original DIRACbenchmark (DB12 standard in the table) is dominated by calls to libpython2.7 (86% of the time), which in turn is dominated by PyEval_EvalFrameEx (for about half of the time).
Comparison of DB12 flavors on grid nodes at GridKa
| Hardware model | Benchmark copies | Ratio (copies / physical cores) | HS06 | DB12 | DB12-cpp | DB12-np |
| E5-2630v4 | 20 | 1.0 | 333 | 276 | 338 | 372 |
| | 32 | 1.6 | 390 | 290 | 436 | 445 |
| | 40 | 2.0 | 416 | 289 | 500 | 498 |
| E5-2630v3 | 16 | 1.0 | 278 | 241 | 272 | 303 |
| | 24 | 1.5 | 328 | 246 | 335 | 356 |
| | 32 | 2.0 | 352 | 230 | 401 | 392 |
| E5-2660v3 | 20 | 1.0 | 374 | 334 | 380 | 409 |
| | 32 | 1.6 | 447 | 330 | 494 | 501 |
| | 40 | 2.0 | 467 | 329 | 567 | 551 |
| E5-2665 | 16 | 1.0 | 261 | 155 | 200 | 214 |
| | 24 | 1.5 | 305 | 173 | 290 | 289 |
| | 32 | 2.0 | 322 | 194 | 348 | 334 |
| E5630 | 8 | 1.0 | 112 | 67 | 123 | 112 |
| | 12 | 1.5 | 132 | 73 | 146 | 129 |
| | 16 | 2.0 | | 81 | 165 | 143 |
| 6168 (2 sockets) | 24 | 1.0 | 193 | 147 | 179 | 206 |
| 6174 (4 sockets) | 48 | 1.0 | 430 | 347 | 412 | 490 |
| 6376 (4 sockets) | 32 | 0.5 | 331 | 223 | 302 | 319 |
| | 64 | 1.0 | 499 | 330 | 548 | 537 |
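For example, reading the E5-2630v3 rows above, the gain from doubling the number of copies beyond the physical core count (32 vs 16 copies) differs markedly between the flavours; a quick check with values copied from the table:

# Hyper-threading gain on the E5-2630v3 rows above: 32 copies vs 16 copies.
from __future__ import print_function

scores_16 = {"HS06": 278, "DB12": 241, "DB12-cpp": 272, "DB12-np": 303}
scores_32 = {"HS06": 352, "DB12": 230, "DB12-cpp": 401, "DB12-np": 392}

for name in ("HS06", "DB12", "DB12-cpp", "DB12-np"):
    gain = float(scores_32[name]) / scores_16[name]
    print("%-8s 32/16 copies = %.2f" % (name, gain))
# DB12 (pure Python) stays roughly flat (~0.95), while DB12-cpp gains ~47%
# and DB12-np ~29%, consistent with the HT observation made for the C++ study.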
Commands used to run the DB12 benchmarks:
| Benchmark | Command sequence | Comments |
| DB12 | /usr/sbin/DIRACbenchmark.py -i 10 --extra-iteration $n_copies | Python script from the mjf-db12 package |
| DB12-cpp | DB12.exe -n $n_copies | |
| DB12-np | ~/benchmarks/dirac/DB12np.py -i 10 --extra-iteration $n_copies | |
Note: there was a typo (transposed digits) in the first release of the table: the correct DB12 score of the E5-2630v3 system running 24 benchmark copies in parallel is 246, not 264 (fixed 2017-04-25); there was also a typo in the HS06 score of the E5-2660v3 running 40 copies (fixed 2017-04-26).
--
DomenicoGiordano - 2017-03-24