A test case: MPICH2-1.3, 1024x1024 size problem, 16 processes on 8 nodes, 2 processes per node, binding to core 0, 1.
Some profiling tools are used.
----------------------------------------
1. Timing with MPI_Wtime()
LETKF time at timestep 48 = 53.640148 comm = 0.001367LETKF time at timestep 98 = 53.694284 comm = 0.002735
LETKF time at timestep 148 = 53.624503 comm = 0.003942
Finished analysis at 481.756978 with 16 processes
Total LETKF iter Time taken = 160.958935 :: Average LETKF Time taken = 53.652978
Total comm Time for LETKF taken = 0.003942 :: Average comm time taken = 0.001314
Total LETKF time taken = 160.962877 :: Average LETKF time taken = 53.654292
Total Model Time taken = 78.915314 :: Average Model Time taken = 0.526102
Total IO Time taken = 230.198382 :: Average IO Time taken = 1.534656
----------------------------------------
2. Timing with gprof.
To do that, compile and link with -g -pg. Run the code; a file named 'gmon.out' is generated; post-process it with 'gprof <my program name> gmon.out'Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
42.72 104.10 104.10 MPIDI_CH3I_Progress
27.01 169.91 65.81 single_time_step
18.78 215.66 45.75 MPID_nem_tcp_connpoll
4.60 226.87 11.21 MPID_nem_vc_terminate
2.15 232.10 5.23 MPID_nem_network_poll
2.13 237.29 5.19 MPID_nem_tcp_sm_init
1.00 239.72 2.43 point_associates
0.41 240.73 1.01 matrix_sqrt_invert_once
0.36 241.61 0.88 mat2fvec
0.20 242.09 0.48 vector_scale
0.16 242.49 0.40 eigen_decomposition
0.13 242.80 0.31 matrix_transpose
0.10 243.04 0.24 matrix_invert
0.06 243.19 0.15 initial_vector
0.05 243.31 0.12 letkf_mod
% the percentage of the total running time of the
time program used by this function.
cumulative a running sum of the number of seconds accounted
seconds for by this function and those listed above it.
self the number of seconds accounted for by this
seconds function alone. This is the major sort for this
listing.
calls the number of times this function was invoked, if
this function is profiled, else blank.
self the average number of milliseconds spent in this
ms/call function per call, if this function is profiled,
else blank.
total the average number of milliseconds spent in this
ms/call function and its descendents per call, if this
function is profiled, else blank.
name the name of the function. This is the minor sort
for this listing. The index shows the location of
the function in the gprof listing. If the index is
in parenthesis it shows where it would appear in
the gprof listing if it were to be printed.
----------------------------------------------
3. MPI performance monitor with IPM
which profile is stored in an xml, and can be converted into html for graphic view##IPMv0.982#####################################################################
#
# command : unknown (running)
# host : intel01 mpi_tasks : 16 on 8 nodes
# start : 04/16/10/09:02:57 wallclock : 479.626000 sec
# stop : 04/16/10/09:10:57 %comm : 62.54
# gbytes : 0.00000e+00 total gflop/sec : -3.33593e-02 total
#
# [total] <avg> min max
# wallclock 7672.59 479.537 479.517 479.626
# user 6970.61 435.663 323.966 452.923
# system 556.12 34.7575 9.68 47.97
# mpi 4799.27 299.955 23.8315 348.25
# %comm 62.5393 4.96982 72.6102
# gflop/sec -0.0333593 -0.00208496 -0.00208496 -0.00208496
# gbytes 0 0 0 0
# [time] [calls] <%mpi> <%wall>
# MPI_Barrier 4196.35 2400 87.44 54.68
# MPI_Sendrecv 560.118 821280 11.67 7.30
# MPI_Gather 42.7768 2400 0.89 0.56
###############################################################################
# region : ipm_noregion [ntasks] = 16
#
# [total] <avg> min max
# entries 32 2 2 2
# wallclock 5.93667e+07 3.71042e+06 384240 4.8597e+06
# user 6970.61 435.663 323.97 452.92
# system 556.131 34.7582 9.6795 47.969
# mpi 4799.27 299.954 23.831 348.25
# %comm 0.00808412 0.00615256 0.0896497
# [time] [calls] <%mpi> <%wall>
# MPI_Barrier 4196.35 2400 87.44 0.01
# MPI_Sendrecv 560.118 821280 11.67 0.00
# MPI_Gather 42.7768 2400 0.89 0.00
------------------------------------------------------------------
4. CPU, memory, process instrument with an unknow lib procmon.a
System load avg. = 1.99
utime + stime= 478.955
involuntary context switches =25741
voluntary context switches=62
CMD THCNT PID TID %CPU %MEM TIME SZ PSR
./main 1022 1022 tests/ 1 5995 5995 99.3 2.8 00:07:57 1087732 0
*** READING /proc/5995/statm ***
total_program_size, memorykb, shared_pages, code_pages, data_pages, lib_pages, dirty_pages
284696 253338 1054 306 0 271934 0
-----------------------------------------------------------
5. Cache Instrumentation with PinTool
Not sure about if this works well for parallel code.ITLB:
Load Hits: 3017757
Load Misses: 1699
Load Accesses: 3019456
Load Miss Rate: 0.06%
Store Hits: 0
Store Misses: 0
Store Accesses: 0
Store Miss Rate: nan%
Total Hits: 3017757
Total Misses: 1699
Total Accesses: 3019456
Total Miss Rate: 0.06%
Flushes: 0
Stat Resets: 0
DTLB:
Load Hits: 1213480
Load Misses: 1020
Load Accesses: 1214500
Load Miss Rate: 0.08%
Store Hits: 0
Store Misses: 0
Store Accesses: 0
Store Miss Rate: nan%
Total Hits: 1213480
Total Misses: 1020
Total Accesses: 1214500
Total Miss Rate: 0.08%
Flushes: 0
Stat Resets: 0
L1 Instruction Cache:
Load Hits: 3016101
Load Misses: 3355
Load Accesses: 3019456
Load Miss Rate: 0.11%
Store Hits: 0
Store Misses: 0
Store Accesses: 0
Store Miss Rate: nan%
Total Hits: 3016101
Total Misses: 3355
Total Accesses: 3019456
Total Miss Rate: 0.11%
Flushes: 0
Stat Resets: 0
L1 Data Cache:
Load Hits: 692391
Load Misses: 9721
Load Accesses: 702112
Load Miss Rate: 1.38%
Store Hits: 482670
Store Misses: 29718
Store Accesses: 512388
Store Miss Rate: 5.80%
Total Hits: 1175061
Total Misses: 39439
Total Accesses: 1214500
Total Miss Rate: 3.25%
Flushes: 0
Stat Resets: 0
L2 Unified Cache:
Load Hits: 9292
Load Misses: 3784
Load Accesses: 13076
Load Miss Rate: 28.94%
Store Hits: 28007
Store Misses: 1711
Store Accesses: 29718
Store Miss Rate: 5.76%
Total Hits: 37299
Total Misses: 5495
Total Accesses: 42794
Total Miss Rate: 12.84%
Flushes: 0
Stat Resets: 0
L3 Unified Cache:
Load Hits: 315
Load Misses: 3469
Load Accesses: 3784
Load Miss Rate: 91.68%
Store Hits: 0
Store Misses: 1711
Store Accesses: 1711
Store Miss Rate: 100.00%
Total Hits: 315
Total Misses: 5180
Total Accesses: 5495
Total Miss Rate: 94.27%
Flushes: 0
Stat Resets: 0