How to Submit an MPI Job

Jobs that run on multiple nodes generally use a parallel programming API called MPI (Message Passing Interface), which allows processes on multiple nodes to communicate with high throughput and low latency (especially over Talapas' InfiniBand network).  MPI is a standard and has multiple implementations—several are available on Talapas, notably Open MPI and Intel MPI.  The choice between these is largely a matter of personal taste and the specific needs of the situation.

The SLURM scheduler has built-in support for MPI jobs.  Jobs can be run in a generic way, or if needed, you can use extra parameters to carefully control how MPI processes are mapped to the hardware.

General Principles

Most parts of job setup are the same for all MPI flavors.  Notably, you'll want to decide how many simultaneous tasks (processes) you want to run your job across.

Specifying Node and Task Counts

The most common method is to determine how many nodes you want the job to run on and how many tasks should be run on each node.  Specify the sbatch script parameters like so

#SBATCH --nodes=3
#SBATCH --ntasks-per-node=28

This will result in an MPI job with 84 processes, with 28 processes on each of 3 nodes.  In this case, we've chosen 28 tasks per node with the knowledge that each node has 28 CPU cores.  Although it's not necessary to use all cores on a node, doing so is often efficient, since more of the communication between processes then stays within a single node.  That said, if the processes need more RAM than the default, you might need to run fewer tasks per node and specify a larger amount of memory per task.
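
If you're not sure how many CPU cores or how much memory the nodes in a partition have, you can ask SLURM directly.  Here is a minimal sketch using standard sinfo format codes (the partition name is just an example)

# list node names, CPU counts, and memory (in MB) for a partition
sinfo --partition=short --format="%N %c %m"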

Alternatively, you can simply specify the number of tasks and let SLURM place them on available nodes as it sees fit, like so

#SBATCH --ntasks=84

The primary advantage of this approach is that the job will probably be scheduled sooner, since SLURM is free to use any available cores rather than having to wait for nodes with sufficient free cores to become available.  Depending on the I/O properties of the job, it might run more slowly in this configuration, and runtime will vary somewhat depending on exactly how the slots are spread across nodes.  If it works for your job, though, the earlier start time can be a big win overall.

Whichever method you use, also consider the effect of job size on your wait time.  In particular, the more CPU cores you ask for, the longer you are likely to wait for your job to start.  For some jobs, there is a minimum CPU core count (because of the requirements of the software).  For others, the core count might be relatively arbitrary.  Usually adding more cores would be expected to make the job run more quickly.  Using fewer cores might lead to earlier job completion, though, if it results in your job starting significantly sooner.
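
One way to get a rough sense of wait time is to ask SLURM for its estimated start times for your pending jobs.  The estimates are approximate and change as the queue evolves, but they can help you compare configurations, like so

# show estimated start times for pending jobs
squeue --start --user=$USER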

Specifying Memory

For single-node jobs, it's common to use the SLURM --mem flag to specify the entire amount of memory the job will be allocated.  For multi-node jobs, though, you will probably find it more intuitive and predictable to specify the amount of memory available to each individual task, like so

#SBATCH --mem-per-cpu=12G

Strictly speaking, this is only needed if the job will require more than the default amount of RAM, but it's a good idea to specify it explicitly.
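
For example, if each task needs roughly 8 GiB of RAM, the relevant sbatch lines might look like the sketch below.  The numbers are purely illustrative; choose counts that match your program's actual memory use and the capacity of the nodes.

#SBATCH --nodes=3
#SBATCH --ntasks-per-node=14 ### fewer tasks per node, leaving room for more memory per task
#SBATCH --mem-per-cpu=8G     ### memory allocated per task (one CPU per task)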

Specifying SLURM Invocation

SLURM provides two slightly different ways to invoke your MPI program.  The preferred way is to invoke it directly with the srun command within your sbatch script.  This provides a few minor additional features and is arguably a bit simpler.

The alternative is to invoke it using the mpirun program within your sbatch script.  This is a bit more generic, but might work better in some cases.  Note that you will probably need to include the SLURM_CPU_BIND setting below to prevent multiple tasks from being bound to a single CPU core, which would make your program run much more slowly.
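
In short, the two invocation styles inside an sbatch script look something like this (my_mpi_program is a placeholder, and which style works best can depend on the MPI implementation; see the complete examples below)

# preferred: let SLURM launch the MPI tasks directly
srun ./my_mpi_program

# alternative: use the MPI launcher, and disable SLURM's CPU binding
export SLURM_CPU_BIND=none
mpirun -np $SLURM_NTASKS ./my_mpi_program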

See the SLURM MPI guide for more information.

Open MPI Examples

For Open MPI, you can use either the Intel or GNU compilers.  In this example, we'll assume the former.  To set up for job submission, load these modules

module load intel/17
module load openmpi
module load mkl
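
If you still need to build the executable, the Open MPI compiler wrappers (e.g., mpicc for C) should be available once the modules are loaded.  A sketch, assuming a C source file named hello_mpi.c

mpicc -O2 -o hello_mpi hello_mpi.c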

Then create a batch script like this

#!/bin/bash
#SBATCH --partition=short    ### Partition
#SBATCH --job-name=HelloMPI  ### Job Name
#SBATCH --time=00:10:00      ### WallTime (10 minutes)
#SBATCH --nodes=3            ### Number of Nodes
#SBATCH --ntasks-per-node=28 ### Number of tasks (MPI processes)
#SBATCH --account=hpcrcf     ### Account used for job submission

mpirun -np $SLURM_NTASKS ./hello_mpi

Note carefully that starting an Open MPI job directly with srun is not currently supported.  Doing so might not produce an obvious error, but in some cases will simultaneously start many independent single-process jobs instead of a single MPI job with all processes working together.  At best this will be very slow, and at worst the output may be incorrect.
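
Once the script is saved (here as hello_mpi.sbatch, an arbitrary name), submit it and check on it with the usual SLURM commands

sbatch hello_mpi.sbatch
squeue --user=$USER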

Intel MPI Examples

In order to run an Intel MPI job, you will first need an executable that's been compiled with an Intel MPI compiler wrapper, such as mpiicc or mpiifort.  (Alternatively, if you have a Java or Python program, you'll need to set up the proper usage of the Intel MPI library interfaces.)  To access the compilers and to set up for job submission, load these modules

module load intel/17
module load intel-mpi
module load mkl
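
If you need to build the executable first, the Intel MPI compiler wrappers named above can be used once these modules are loaded.  A sketch, again assuming a C source file named hello_mpi.c

mpiicc -O2 -o hello_mpi hello_mpi.c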

Then create a batch script.  To use the recommended (srun) approach, use a script like this

#!/bin/bash
#SBATCH --partition=long
#SBATCH --job-name=HelloMPI
#SBATCH --time=00:10:00
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=28
#SBATCH --account=hpcrcf

srun ./hello_mpi

Alternatively, to use the mpirun approach, use a script like this.  Be aware that we've seen problems with this approach, so use the srun script above if at all possible.

#!/bin/bash
#SBATCH --partition=long     ### Partition
#SBATCH --job-name=HelloMPI  ### Job Name
#SBATCH --time=00:10:00      ### WallTime
#SBATCH --nodes=3            ### Number of Nodes
#SBATCH --ntasks-per-node=28 ### Number of tasks (MPI processes)
#SBATCH --account=hpcrcf     ### Account used for job submission

# failure to unset this will likely result in a hung job
unset I_MPI_PMI_LIBRARY

# avoid binding all tasks to single CPU core
export SLURM_CPU_BIND=none

# -IB to ensure comms use InfiniBand
mpirun -IB -np $SLURM_NTASKS ./hello_mpi

Pitfalls

Choosing parameters for MPI job submission can unfortunately be rather complicated.  One pitfall you may encounter is accidentally failing to make use of all requested CPU cores, leading to needlessly long job times and wasted resources.  To verify that all is well, check that you are getting significant speedups with increasing process count.  If your jobs don't run faster when you add cores, something is probably wrong.  You can also log into the compute nodes while your job is running to observe the processes (e.g., that compute-bound processes are using 100% CPU).
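
As a sketch of one way to spot-check a running job (the job ID and node name below are placeholders, and this assumes you can ssh to compute nodes where your job holds an allocation), find the job's node list, log in, and watch the processes

# find which nodes the job is running on
squeue --job=123456 --format="%N"

# then, from a login node, connect to one of them and watch your processes
ssh n001
top -u $USER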

One combination that we've seen work quite poorly is specifying --nodes and --ntasks (with no --ntasks-per-node).  This sometimes seems to lead to the wasted-resource problem described above.

If you have concerns, please reach out to us—we'd be happy to check for problems and make recommendations.
