Apr 16, 2010

Parallel code performance study

I have a parallel code using LETKF for data assimilation with shallow water equation as the model problem. It runs on Intel blades Nehalem cpu 1.5GHz, 8 cores sharing 4MB L3 cache, and ~24 GB memory, 10G Ether Network.

A test case: MPICH2-1.3, 1024x1024 size problem, 16 processes on 8 nodes, 2 processes per node, binding to core 0, 1.
Some profiling tools are used.
----------------------------------------
1. Timing with MPI_Wtime()
LETKF time at timestep 48 = 53.640148 comm = 0.001367
LETKF time at timestep 98 = 53.694284 comm = 0.002735
LETKF time at timestep 148 = 53.624503 comm = 0.003942
Finished analysis at 481.756978 with 16 processes
Total LETKF iter Time taken = 160.958935 :: Average LETKF Time taken = 53.652978
Total comm Time for LETKF taken = 0.003942 :: Average comm time taken = 0.001314
Total LETKF time taken = 160.962877 :: Average LETKF time taken = 53.654292
Total Model Time taken = 78.915314 :: Average Model Time taken = 0.526102
Total IO Time taken = 230.198382 :: Average IO Time taken = 1.534656
----------------------------------------
2. Timing with gprof.
To do that, compile and link with -g -pg. Run the code; a file named 'gmon.out' is generated; post-process it with 'gprof <my program name> gmon.out'

Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls  Ts/call  Ts/call  name   
 42.72    104.10   104.10          MPIDI_CH3I_Progress
 27.01    169.91    65.81           single_time_step
 18.78    215.66    45.75           MPID_nem_tcp_connpoll
  4.60    226.87    11.21            MPID_nem_vc_terminate
  2.15    232.10     5.23             MPID_nem_network_poll
  2.13    237.29     5.19             MPID_nem_tcp_sm_init
  1.00    239.72     2.43             point_associates
  0.41    240.73     1.01             matrix_sqrt_invert_once
  0.36    241.61     0.88             mat2fvec
  0.20    242.09     0.48             vector_scale
  0.16    242.49     0.40             eigen_decomposition
  0.13    242.80     0.31             matrix_transpose
  0.10    243.04     0.24             matrix_invert
  0.06    243.19     0.15             initial_vector
  0.05    243.31     0.12             letkf_mod
 
 %         the percentage of the total running time of the
time       program used by this function.

cumulative a running sum of the number of seconds accounted
 seconds   for by this function and those listed above it.

 self      the number of seconds accounted for by this
seconds    function alone.  This is the major sort for this
           listing.

calls      the number of times this function was invoked, if
           this function is profiled, else blank.

 self      the average number of milliseconds spent in this
ms/call    function per call, if this function is profiled,
       else blank.

 total     the average number of milliseconds spent in this
ms/call    function and its descendents per call, if this
       function is profiled, else blank.

name       the name of the function.  This is the minor sort
           for this listing. The index shows the location of
       the function in the gprof listing. If the index is
       in parenthesis it shows where it would appear in
       the gprof listing if it were to be printed.
----------------------------------------------
3. MPI performance monitor with IPM
which profile is stored in an xml, and can be converted into html for graphic view
##IPMv0.982#####################################################################
#
# command : unknown (running)
# host    : intel01                        mpi_tasks : 16 on 8 nodes
# start   : 04/16/10/09:02:57              wallclock : 479.626000 sec
# stop    : 04/16/10/09:10:57              %comm     : 62.54
# gbytes  : 0.00000e+00 total              gflop/sec : -3.33593e-02 total
#
#                           [total]         <avg>           min           max
# wallclock                  7672.59       479.537       479.517       479.626
# user                       6970.61       435.663       323.966       452.923
# system                      556.12       34.7575          9.68         47.97
# mpi                        4799.27       299.955       23.8315        348.25
# %comm                                    62.5393       4.96982       72.6102
# gflop/sec               -0.0333593   -0.00208496   -0.00208496   -0.00208496
# gbytes                           0             0             0             0
#                            [time]       [calls]        <%mpi>      <%wall>
# MPI_Barrier                4196.35          2400         87.44        54.68
# MPI_Sendrecv               560.118        821280         11.67         7.30
# MPI_Gather                 42.7768          2400          0.89         0.56
###############################################################################
# region : ipm_noregion        [ntasks] = 16
#
#                           [total]         <avg>           min           max
# entries                         32             2             2             2
# wallclock              5.93667e+07   3.71042e+06        384240    4.8597e+06
# user                       6970.61       435.663        323.97        452.92
# system                     556.131       34.7582        9.6795        47.969
# mpi                        4799.27       299.954        23.831        348.25
# %comm                                 0.00808412    0.00615256     0.0896497
#                            [time]       [calls]        <%mpi>      <%wall>
# MPI_Barrier                4196.35          2400         87.44         0.01
# MPI_Sendrecv               560.118        821280         11.67         0.00
# MPI_Gather                 42.7768          2400          0.89         0.00




------------------------------------------------------------------
4. CPU, memory, process instrument with an unknow lib procmon.a

System load avg. = 1.99
utime + stime= 478.955
involuntary context switches =25741
voluntary context switches=62
CMD                         THCNT   PID   TID %CPU %MEM     TIME    SZ PSR
./main 1022 1022 tests/         1  5995  5995 99.3  2.8 00:07:57 1087732 0
*** READING /proc/5995/statm ***
total_program_size, memorykb, shared_pages, code_pages, data_pages, lib_pages, dirty_pages
284696 253338 1054 306 0 271934 0
-----------------------------------------------------------
5. Cache Instrumentation with PinTool
Not sure about if this works well for parallel code.

ITLB:
   Load Hits:         3017757
   Load Misses:          1699
   Load Accesses:     3019456
   Load Miss Rate:      0.06%

  Store Hits:               0
  Store Misses:             0
  Store Accesses:           0
  Store Miss Rate:       nan%

  Total Hits:         3017757
  Total Misses:          1699
  Total Accesses:     3019456
  Total Miss Rate:      0.06%
  Flushes:                  0
  Stat Resets:              0

DTLB:
   Load Hits:         1213480
   Load Misses:          1020
   Load Accesses:     1214500
   Load Miss Rate:      0.08%

  Store Hits:               0
  Store Misses:             0
  Store Accesses:           0
  Store Miss Rate:       nan%

  Total Hits:         1213480
  Total Misses:          1020
  Total Accesses:     1214500
  Total Miss Rate:      0.08%
  Flushes:                  0
  Stat Resets:              0

L1 Instruction Cache:
   Load Hits:         3016101
   Load Misses:          3355
   Load Accesses:     3019456
   Load Miss Rate:      0.11%

  Store Hits:               0
  Store Misses:             0
  Store Accesses:           0
  Store Miss Rate:       nan%

  Total Hits:         3016101
  Total Misses:          3355
  Total Accesses:     3019456
  Total Miss Rate:      0.11%
  Flushes:                  0
  Stat Resets:              0

L1 Data Cache:
   Load Hits:          692391
   Load Misses:          9721
   Load Accesses:      702112
   Load Miss Rate:      1.38%

  Store Hits:          482670
  Store Misses:         29718
  Store Accesses:      512388
  Store Miss Rate:      5.80%

  Total Hits:         1175061
  Total Misses:         39439
  Total Accesses:     1214500
  Total Miss Rate:      3.25%
  Flushes:                  0
  Stat Resets:              0

L2 Unified Cache:
   Load Hits:            9292
   Load Misses:          3784
   Load Accesses:       13076
   Load Miss Rate:     28.94%

  Store Hits:           28007
  Store Misses:          1711
  Store Accesses:       29718
  Store Miss Rate:      5.76%

  Total Hits:           37299
  Total Misses:          5495
  Total Accesses:       42794
  Total Miss Rate:     12.84%
  Flushes:                  0
  Stat Resets:              0

L3 Unified Cache:
   Load Hits:             315
   Load Misses:          3469
   Load Accesses:        3784
   Load Miss Rate:     91.68%

  Store Hits:               0
  Store Misses:          1711
  Store Accesses:        1711
  Store Miss Rate:    100.00%

  Total Hits:             315
  Total Misses:          5180
  Total Accesses:        5495
  Total Miss Rate:     94.27%
  Flushes:                  0
  Stat Resets:              0