Feb 7, 2010

mpdtrace and using multiple nodes to run MPI

On a cluster where you may need to launch mpd manually, here is what to do. You may need this when you launch an executable on multiple nodes but only the local node is used, or when you see an MPI connection or communication error. That is the time to check whether all the nodes listed in the hosts file can communicate with each other. A typical error looks like: mpiexec: unable to start all procs; may have invalid machine names remaining specified hosts.

The following works for MPICH2, using mpiexec or mpirun. That is the setup I tested.

On the node where you launch the program, type
>> mpdtrace -l
It prints one line per mpd in the ring, in the form <node name>_<port> (IP):
blade50_51094 (IP **)
blade49_56382 (IP **)
blade47_35763 (IP **)
blade48_49526 (IP **)
blade51_53029 (IP **) 
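
The node name and port in each line are the two values you need later to attach another mpd to this ring. Here is a minimal sketch for pulling them out of the first line. The function name ring_entry is my own, and it assumes the output format shown above (the port is the number after the last underscore).

```shell
#!/bin/sh
# ring_entry (hypothetical helper): read `mpdtrace -l` output on stdin and
# print "host port" for the first mpd listed.
ring_entry() {
    awk 'NR == 1 {
        name = $1
        port = name
        sub(/^.*_/, "", port)       # keep what follows the last underscore
        sub(/_[0-9]+$/, "", name)   # drop the trailing _<port>
        print name, port
    }'
}
```

You would use it as: mpdtrace -l | ring_entry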

If not all nodes are listed, for example blade46 is in the hosts file and is up but does not appear in the output, ssh to blade46 and type
>>mpd -h blade50 -p 51094 &
If you want to start an additional mpd on that node, add -n:
>>mpd -h blade50 -p 51094 -n &
blade46 can then be used.
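
Comparing the hosts file against the mpdtrace output by eye gets tedious with many nodes, so here is a rough sketch that lists the hosts missing from the ring. The function name missing_hosts is my own. It assumes the hosts file has one host per line, optionally followed by ":ncpus", with "#" starting a comment, and that the names in the hosts file match the names mpdtrace prints (short name vs. fully qualified name would count as a mismatch).

```shell
#!/bin/sh
# missing_hosts HOSTS_FILE (hypothetical helper)
# Reads `mpdtrace -l` output on stdin and prints every host named in
# HOSTS_FILE that does not appear in the ring.
missing_hosts() {
    hosts_file=$1
    # Ring members: the first field is <host>_<port>; drop the _<port> suffix.
    ring=$(awk '{ sub(/_[0-9]+$/, "", $1); print $1 }')
    # Hosts file: strip comments, skip blank lines, drop any ":ncpus" suffix.
    awk '{ sub(/#.*/, ""); if (NF) { split($1, a, ":"); print a[1] } }' "$hosts_file" |
    while read -r host; do
        # Print the host only if no ring member matches it exactly.
        printf '%s\n' "$ring" | grep -Fqx -- "$host" || printf '%s\n' "$host"
    done
}
```

You would use it as: mpdtrace -l | missing_hosts mpd.hosts (mpd.hosts being whatever your hosts file is called). Each host it prints is a candidate for the ssh and mpd -h ... -p ... step above.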

To clean up the mpd daemons, use mpdcleanup.

In addition, if you want m consecutive ranks to run on the same node, start the mpd with --ncpus=m.
For example:
mpd --ncpus=2 &
or
mpd --ncpus=2 -h blade50 -p 51094 &
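
To see what this does to rank placement, here is a toy model of the rule as I understand it from the MPICH2 description quoted below: on the first pass around the ring each mpd starts up to ncpus consecutive ranks, and after that each mpd starts one rank per pass. This is only a sketch, not mpd's actual code. The function name place_ranks is my own, and it assumes every mpd was started with the same --ncpus value.

```shell
#!/bin/sh
# place_ranks NCPUS NPROCS HOST... (hypothetical helper)
# Prints "rank host" lines for a simplified model of --ncpus placement.
place_ranks() {
    ncpus=$1
    nprocs=$2
    shift 2
    [ "$#" -gt 0 ] || return 1   # need at least one host in the ring
    rank=0
    # First pass: each mpd starts up to ncpus consecutive ranks.
    for host in "$@"; do
        i=0
        while [ "$i" -lt "$ncpus" ] && [ "$rank" -lt "$nprocs" ]; do
            printf '%d %s\n' "$rank" "$host"
            rank=$((rank + 1))
            i=$((i + 1))
        done
    done
    # Later passes: one rank per mpd until all processes are placed.
    while [ "$rank" -lt "$nprocs" ]; do
        for host in "$@"; do
            [ "$rank" -lt "$nprocs" ] || break
            printf '%d %s\n' "$rank" "$host"
            rank=$((rank + 1))
        done
    done
}
```

For example, place_ranks 2 6 blade50 blade49 puts ranks 0 and 1 on blade50, ranks 2 and 3 on blade49, then rank 4 on blade50 and rank 5 on blade49.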


QUOTE"
If an mpd is started with the --ncpus option, then when it is its turn to start a process, it will start several application processes rather than just one before handing off the task of starting more processes to the next mpd in the ring. For example, if the mpd is started with
mpd --ncpus=4
then it will start as many as four application processes, with consecutive ranks, when it is its turn to start processes. This option is for use in clusters of SMP's, when the user would like consecutive ranks to appear on the same machine. (In the default case, the same number of processes might well run on the machine, but their ranks would be different.) (A feature of the --ncpus=[n] argument is that it has the above effect only until all of the mpd's have started n processes at a time once; afterwards each mpd starts one process at a time. This is in order to balance the number of processes per machine to the extent possible.)
"END of QUOTE

1 comment:

Anonymous said...

I got the same error, and your solution worked. Thanks!