Nimbus comes pre-loaded with a slightly modified version of MPICH that knows about beomap and bproc. These modifications allow MPI applications to run conveniently on a Nimbus cluster without the need for wrapper applications such as mpirun.
The number of processes spawned, and where they run, is controlled through environment variables that are read by the beomap system. In brief, they are as follows; a small example program appears after the list:
NP
NP is the simplest way to run an MPI application. Beomap will select NP available CPUs in the system and dispatch child processes to those nodes. The root process will be run on the master unless otherwise specified by the other variables below.
ALL_CPUS=1
This will utilize all available CPUs in the cluster.
NO_LOCAL=1
Setting NO_LOCAL to any value modifies the behavior of NP so that it does not dispatch the root process to the master node. This is a useful option on a busy system where more than one or two jobs may run on a subset of the CPUs; otherwise, root processes on the master become a bottleneck as they contend for limited time on its CPUs. If the application does no real work in the root process, but only coordinates the other processes, this option may be unnecessary.
BEOWULF_JOB_MAP
BEOWULF_JOB_MAP may be used INSTEAD of NP in order to have fine-grained control over which CPUs are used and which rank runs on which CPU. BEOWULF_JOB_MAP should be set to a colon (:) separated list of node numbers to run the job on; ranks are assigned in the order listed. Thus, BEOWULF_JOB_MAP="-1:0:0:1:2" will run the root rank on the master, ranks 1 and 2 on node 0, rank 3 on node 1, and rank 4 on node 2. NOTE that unless otherwise specified, the job will run immediately, regardless of how busy those CPUs are.
NO_OVER=1
NO_OVER specifies that the job should not overschedule; that is, only idle CPUs will be scheduled. If it is not possible to meet the requirements specified by NP or BEOWULF_JOB_MAP, the job will not be run. This variable is normally used by a modified version of the at daemon to hold a job in the queue until it can be scheduled as requested.
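As an illustration, here is a minimal program (placement_demo.c, a hypothetical name used only for this example) that simply reports where each rank landed. The invocations in the header comment assume a Bourne-style shell and the variable behavior documented above; exact results depend on which CPUs are free at the time.

    /*
     * placement_demo.c -- report where each rank of the job is running.
     *
     * Example invocations (Bourne-style shell):
     *   NP=4 ./placement_demo                         root on the master, 3 more ranks on idle CPUs
     *   NO_LOCAL=1 NP=4 ./placement_demo              all 4 ranks on compute nodes
     *   ALL_CPUS=1 ./placement_demo                   one rank per available CPU
     *   BEOWULF_JOB_MAP="-1:0:0:1:2" ./placement_demo explicit placement, as described above
     */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                 /* beomap/bproc dispatch happens here */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        printf("rank %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }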
Should something go wrong and you need to kill the job, all of the processes are visible with the ps command on the master. You can then use the kill command as if the processes were local. This is in sharp contrast to first-generation Beowulf clusters, where you would need to rsh around the cluster killing off runaway processes.
You may use the mpicc wrapper just as you would with unmodified MPICH. Otherwise, all that is necessary is to link your application with bproc and MPI by adding -lbproc -lmpi to your command line; that is essentially all mpicc does for you anyway. For example, cc -o myapp myapp.c -lbproc -lmpi should produce the same result as mpicc -o myapp myapp.c.
For maximum compatibility, the bproc-modified MPICH is meant to meet the normal assumptions of an MPI program. It also offers a few features that can be quite useful, though caution is advised to avoid breaking compatibility with old-school systems.
The primary difference is in the exact behavior of MPI_Init. On first-generation Beowulf systems, and many others, some sort of wrapper (mpirun, for example) is responsible for using rsh to run a copy of the program on all desired nodes. The command-line arguments are prepended with MPI arguments that assign the rank and specify how the child processes should connect back to the root rank.
On a Nimbus system, the program simply starts executing on the master as any non-MPI program would. Inside MPI_Init, a call is made to beomap to get a list of available CPUs that meets the requirements of the environment variables documented above. Then a series of bproc_rfork calls is made to fork child processes and migrate them to their intended CPUs. MPI_Init then returns in each process of the job.
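To make that startup sequence concrete, here is a small standalone sketch of the same dispatch pattern. It is NOT the MPICH source: it drives bproc_rfork directly from BEOWULF_JOB_MAP instead of calling beomap, it assumes the bproc headers are installed as <sys/bproc.h>, and it omits the connection setup that the real MPI_Init performs between the ranks afterwards.

    /* rfork_demo.c -- illustrative sketch of the bproc dispatch pattern
     * described above, driven directly by BEOWULF_JOB_MAP
     * (e.g. BEOWULF_JOB_MAP="-1:0:0:1").  Not the MPICH implementation.
     * Build on a bproc system: cc -o rfork_demo rfork_demo.c -lbproc */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <sys/bproc.h>

    int main(void)
    {
        const char *map = getenv("BEOWULF_JOB_MAP");
        if (map == NULL) {
            fprintf(stderr, "set BEOWULF_JOB_MAP, e.g. \"-1:0:0:1\"\n");
            return 1;
        }

        /* Parse the colon-separated node list (-1 means the master). */
        int nodes[64], nranks = 0;
        char *copy = strdup(map);
        for (char *tok = strtok(copy, ":"); tok != NULL && nranks < 64; tok = strtok(NULL, ":"))
            nodes[nranks++] = atoi(tok);

        /* The process started on the master plays the role of rank 0; each
         * bproc_rfork() returns 0 in a child that is already running on the
         * requested node, much as the modified MPI_Init does for the other
         * ranks. */
        int rank = 0;
        for (int i = 1; i < nranks; i++) {
            if (bproc_rfork(nodes[i]) == 0) {   /* child: now on nodes[i] */
                rank = i;
                break;
            }
        }

        printf("rank %d of %d on node %d\n", rank, nranks, nodes[rank]);

        if (rank == 0)                          /* rank 0 reaps the children */
            while (wait(NULL) > 0)
                ;
        free(copy);
        return 0;
    }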
As long as MPI_Init is the first thing done in the program (as it should be), there is no difference. If other initialization is done first, there may be subtle differences in behavior, and a few of them may be fatal. Foremost, files must NOT be opened before MPI_Init: the act of migration causes all open file handles other than stdin, stdout, and stderr to become invalid. Other cases where subtle behavior changes may be noticed include fetching a random seed value from /dev/random before MPI_Init: on a bproc system, that fetch happens only in the root rank, and the other ranks get a copy of the seed value instead. Don't do this unless you would then bcast the value anyway.
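For the /dev/random case, the portable pattern is to do the read on the root rank after MPI_Init and broadcast the result, roughly as in this sketch (the fixed fallback seed is just an arbitrary choice for the example):

    /* Portable pattern: read the seed on rank 0 *after* MPI_Init, then
     * broadcast it so every rank sees the same value on any MPI system. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        unsigned int seed = 0;

        MPI_Init(&argc, &argv);                /* no files opened before this point */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            FILE *f = fopen("/dev/random", "rb");
            if (f != NULL) {
                if (fread(&seed, sizeof(seed), 1, f) != 1)
                    seed = 12345;              /* arbitrary fallback seed */
                fclose(f);
            }
        }

        /* Every rank gets the same seed, regardless of how the job was started. */
        MPI_Bcast(&seed, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
        srand(seed);

        MPI_Finalize();
        return 0;
    }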
It may be tempting to do some setup steps like the above before calling MPI_Init as an elegant way to skip a bcast or twelve. While it IS elegant, it is not portable and thus should probably be avoided.
To emphasize: make MPI_Init the first thing your program does unless you have thought very carefully about the unintended consequences.
Note that the semantics of MPI_Init in Nimbus meet the requirements of the MPI standard. The differences lie in the undefined gray areas.