-- MichalHusejko - 18 Jun 2014

Lattice QCD - MPI analysis

Description

Using the mpiP profiling library we investigated the performance bottlenecks of the OpenQCD application. The library shows what percentage of run time is spent in communication versus computation, and additionally breaks down the MPI time into the percentage spent in each MPI call.
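
For reference, a minimal sketch of an MPI program profiled with mpiP. mpiP intercepts MPI calls through the standard PMPI interface, so no source changes are needed; the program is linked against the library (typically something like -lmpiP plus its unwind/BFD dependencies, depending on the installation) or run with libmpiP.so preloaded. The program below is only an illustration, not part of OpenQCD.

/* Minimal MPI program; when linked with mpiP, a report with the
 * percentage of time spent in MPI (per call and per call site)
 * is written at MPI_Finalize. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Any MPI traffic here is accounted to "MPI time" by mpiP. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();   /* mpiP report is emitted here */
    return 0;
}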

Tests description

Tests were performed using:

  • 128 cores (11 nodes, with only 8 cores used on the 11th node)
  • 192 cores (16 nodes)
  • 216 cores (18 nodes)
  • 384 cores (32 nodes)
  • 432 cores (36 nodes)
We also performed additional tests to investigate the scaling of the MPI_Bcast operation. These were run on a similar range of node counts.

The measurements were performed using the OSU Micro-Benchmarks.

We measured the latency of MPI_Bcast (i.e. how long a single MPI_Bcast operation takes) and its bandwidth (the average number of MB/s transferred per process). In the latter case we ran tests with both 1 MPI process per computing node and 12 MPI processes per computing node.
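
For context, a simplified sketch of the kind of timing loop used to measure MPI_Bcast latency. This is not the actual OSU benchmark code: the real benchmark adds warm-up iterations, sweeps over message sizes and averages across ranks; the message size and iteration count below are placeholders.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE   16      /* small message size in bytes (placeholder) */
#define ITERATIONS 1000    /* placeholder iteration count               */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_SIZE);
    double t_start, t_end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* synchronize before timing */
    t_start = MPI_Wtime();
    for (int i = 0; i < ITERATIONS; i++)
        MPI_Bcast(buf, MSG_SIZE, MPI_CHAR, 0, MPI_COMM_WORLD);
    t_end = MPI_Wtime();

    if (rank == 0)
        printf("avg MPI_Bcast latency: %f us\n",
               (t_end - t_start) * 1e6 / ITERATIONS);

    free(buf);
    MPI_Finalize();
    return 0;
}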

Results

We observe that with a growing number of processes, more and more time is spent in communication between processes: the fraction grows from 47% for 128 processes to 73% for 432 processes.

The main contributors to MPI time are MPI_Wait and MPI_Bcast, with MPI_Wait accounting for a smaller share of the MPI time and MPI_Bcast for a larger one as the number of processes grows.
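
The MPI_Wait time is consistent with the usual halo-exchange pattern in lattice codes, where boundary data is exchanged with non-blocking sends and receives that are completed with MPI_Wait before the next computation step. Below is a generic sketch of that pattern, not OpenQCD's actual communication code; the buffer names, neighbour ranks and sizes are placeholders.

#include <mpi.h>

/* Generic halo exchange with one neighbour in each direction. */
void halo_exchange(double *send_buf, double *recv_buf, int count,
                   int up, int down, MPI_Comm comm)
{
    MPI_Request req[2];

    /* Post receive and send without blocking, so overlap with
     * local computation is possible in principle. */
    MPI_Irecv(recv_buf, count, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Isend(send_buf, count, MPI_DOUBLE, down, 0, comm, &req[1]);

    /* ... local computation on the interior could go here ... */

    /* Time spent here shows up as MPI_Wait in the mpiP profile. */
    MPI_Wait(&req[0], MPI_STATUS_IGNORE);
    MPI_Wait(&req[1], MPI_STATUS_IGNORE);
}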

We investigated the size of the messages sent during MPI_Bcast and found that most of the MPI_Bcast time is spent broadcasting small messages. Below is a sample report from the 432-core run.

---------------------------------------------------------------------------
@--- Aggregate Collective Time -------------------
---------------------------------------------------------------------------
Call                 MPI Time %             Comm Size             Data Size
Bcast                      10.2        256 -      511          8 -       15
Bcast                      8.79        256 -      511        256 -      511
Bcast                      4.52        256 -      511         32 -       63
Allreduce                  3.65        256 -      511          0 -        7
Bcast                      2.52        256 -      511        128 -      255
Bcast                      2.48        256 -      511         16 -       31
Bcast                      1.22        256 -      511         64 -      127
Bcast                     0.478        256 -      511          0 -        7

---------------------------------------------------------------------------
@--- Aggregate Collective Time (sorted by data size) --------------
---------------------------------------------------------------------------
Call                 MPI Time %             Comm Size             Data Size
Bcast                     0.478        256 -      511          0 -        7
Bcast                      10.2        256 -      511          8 -       15
Bcast                      2.48        256 -      511         16 -       31
Bcast                      4.52        256 -      511         32 -       63
Bcast                      1.22        256 -      511         64 -      127
Bcast                      2.52        256 -      511        128 -      255
Bcast                      8.79        256 -      511        256 -      511
Allreduce                  3.65        256 -      511          0 -        7

From the MPI_Bcast tests we conclude that for N MPI processes the latency of a small-message MPI_Bcast grows like O(c * log N), where c is the latency of a single network link. This behaviour is expected: internally (in MVAPICH), MPI_Bcast for small messages is implemented with a k-nomial tree algorithm. Since the log N scaling matches the theoretical lower bound for this problem, the easiest possible improvement (without changing the application code) is to make the single-link latency as small as possible.
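
To illustrate why the latency grows with log N, below is a sketch of a binomial-tree broadcast built from point-to-point messages. It is a simplified stand-in for the k-nomial algorithm used inside MVAPICH, assuming root rank 0. In each round the set of ranks that already hold the data doubles, so roughly log2(N) rounds are needed, and each round costs about one link latency c.

#include <mpi.h>

/* Simplified binomial-tree broadcast from rank 0.
 * About log2(size) rounds -- and hence log2(size) link latencies --
 * are needed in total. */
void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int step = 1; step < size; step *= 2) {
        if (rank < step) {
            /* This rank already has the data: forward it one step out. */
            int dst = rank + step;
            if (dst < size)
                MPI_Send(buf, count, type, dst, 0, comm);
        } else if (rank < 2 * step) {
            /* This rank receives the data in this round. */
            MPI_Recv(buf, count, type, rank - step, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}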
