General Principles

Jobs that run on multiple nodes generally use a parallel programming API called MPI (Message Passing Interface), which allows processes on multiple nodes to communicate with high throughput and low latency (especially over Talapas' InfiniBand network).  MPI is a standard and has multiple implementations—several are available on Talapas, notably Intel MPI and MPICH.

The Slurm scheduler has built-in support for MPI jobs.  Jobs can be run in a generic way, or, if needed, you can use extra parameters to carefully control how MPI processes are mapped to the hardware.

...

Specifying Node and Task Counts

...

You can simply specify the number of tasks and let Slurm place them on available nodes as it sees fit, for example:

Code Block
languagetext
#SBATCH --partition=compute
#SBATCH --ntasks=500
#SBATCH --ntasks-per-core=1
#SBATCH --mem-per-cpu=500m
#SBATCH --constraint=7713

With this approach the job will probably be scheduled sooner, since Slurm is free to use any available cores rather than having to wait for nodes with sufficient free cores to become available.  It's recommended, though not required, to keep the job tied to cores of the same type through use of Slurm's --constraint flag.

Another method is to specify the number of nodes you want the job to run on and the number of tasks to run on each node; see the example below.  Also see sbatch script parameters for more information.

Code Block
languagetext
#SBATCH --partition=compute_intel
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=28
#SBATCH --mem-per-cpu=500m
#SBATCH --constraint=intel,e5-2690

This will result in an MPI job with 84 processes (28 processes on each of the 3 nodes).  In this case, we've specified 28 tasks per node with the knowledge that our constraint of 'intel,e5-2690' will allocate only nodes with 28 CPU cores to the job.

See sinfo -o "%10R %8D %25N %8c %10m %40f %35G" for a complete list of node properties and features for each partition.

Although it's not necessary to use all cores on a node, doing so is often efficient, since more of the communication between processes stays within a single node.  That said, if the processes need more RAM than the default, you might need to run fewer tasks per node and specify a larger amount of memory per task.

Note that with the simpler task-count approach shown first, the job might run somewhat more slowly, depending on the I/O properties of the job and the number of nodes allocated, and runtime will vary a bit depending on exactly how the tasks are spread across nodes.  If it works for your job, though, getting your job started sooner can be a big win.

Whichever method you use, also consider the effect of job size on your wait time.  In particular, the more CPU cores you ask for, the longer you are likely to wait for your job to start.  For some jobs, there is a minimum CPU core count (because of the requirements of the software).  For others, the core count might be relatively arbitrary.  Usually adding more cores would be expected to make the job run more quickly.  Using fewer cores might lead to earlier job completion, though, if it results in your job starting significantly sooner.
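
If you want a rough sense of when a pending job might start, Slurm can report its current estimate.  A minimal sketch (the script name and job ID below are placeholders, and estimates change as the queue changes):

Code Block
languagebash
# Validate a batch script and print an estimated start time without submitting it
sbatch --test-only myjob.sbatch

# For a job that is already queued (12345 is a placeholder job ID)
squeue --start -j 12345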

...

their partitions.

Also see /packages/racs/bin/slurm-show-features

Memory

For single-node jobs, it's common to use the Slurm --mem flag to specify the entire amount of memory to allocate to the job.

For multi-node jobs, though, you will probably find it more intuitive and predictable to specify the amount of memory available to each individual task, like so:

...

languagetext

...

use the Slurm --mem-per-cpu flag to specify the amount of memory to allocate to each task.

This is strictly only needed if the job will require more than the default amount of RAM, but it's always a good idea.
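
For illustration, the contrast looks like this (the specific numbers below are only examples, not recommendations):

Code Block
languagetext
# Single-node job: request the total memory for the whole job
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=4G

# Multi-node MPI job: request memory per task instead
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=28
#SBATCH --mem-per-cpu=500m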


Slurm Invocation

Slurm provides two slightly different ways to invoke your MPI program. 

  1. The preferred way is to invoke it directly with the srun command within your sbatch script. 

  2. An alternative is to invoke it using the mpirun/mpiexec program within your sbatch script.

See the Slurm MPI guide for more information.
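
A minimal sketch contrasting the two styles, reusing the helloworld_mpi.x example built in the next section (the mpirun line is shown commented out as the alternative):

Code Block
languagebash
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --ntasks=84

# Load the same modules that were used to build the program
module load intel-oneapi-compilers/2023.1.0
module load intel-oneapi-mpi/2021.9.0

# Preferred: let Slurm launch the MPI ranks directly
srun ./helloworld_mpi.x

# Alternative: use the MPI library's own launcher, which typically detects
# the Slurm allocation to determine the ranks and hosts
# mpirun ./helloworld_mpi.x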

MPI Compilers

Intel

...

To access the Intel oneAPI MPI compilers, such as mpiicc or mpiifort:

Code Block
languagetext
module load intel-oneapi-compilers/2023.1.0
module load intel-oneapi-mpi/2021.9.0
mpiicc helloworld_mpi.c -o helloworld_mpi.x

Next, create a batch script. For example, to use the recommended srun approach:

Code Block
languagebash
#!/bin/bash
#SBATCH --account=racs
#SBATCH --partition=compute
#SBATCH --job-name=intel-mpi
#SBATCH --output=intel-mpi.out
#SBATCH --error=intel-mpi.err
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=28
#SBATCH --ntasks-per-core=1
module load intel-oneapi-compilers/2023.1.0
module load intel-oneapi-mpi/2021.9.0
srun ./helloworld_mpi.x
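
To run it, submit the script with sbatch and check the output and error files named in the directives (the script filename here is just an example):

Code Block
languagebash
sbatch intel-mpi.sbatch   # submit the batch script above (filename is illustrative)
squeue -u $USER           # watch the job while it is pending/running
cat intel-mpi.out         # program output, once the job completes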

...

MPICH

GNU + MPICH

To access the MPICH compilers:

Code Block
module load gcc/13.1.0
module load mpich/4.1.1

...

mpicc helloworld_mpi.c -o helloworld_mpich.x

Then create a batch script, for example:
Code Block
#!/bin/bash
#SBATCH --account=racs
#SBATCH --partition=compute,computelong
#SBATCH --job-name=mpich-mpi-test
#SBATCH --output=mpich-mpi-test.out
#SBATCH --error=mpich-mpi-test.err
#SBATCH --ntasks=200
#SBATCH --ntasks-per-core=1
#SBATCH --mem-per-cpu=500m
#SBATCH --constraint=7713
module load gcc/13.1.0 
module load mpich/4.1.1
srun ./helloworld_mpich.x

Pitfalls

Choosing parameters for MPI job submission can unfortunately be rather complicated.  One pitfall you may encounter is accidentally failing to make use of all requested CPU cores, leading to needlessly long job times and wasted resources.  To verify that all is well, check that you are getting significant speedups with increasing process count.  If your jobs don't run faster when you add cores, something is probably wrong.  You can also log into the compute nodes while your job is running to observe the processes and check that compute-bound processes are using 100% CPU; the htop command is useful here.
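
For example, to spot-check a running job (the job ID and node name below are placeholders):

Code Block
languagebash
squeue -j 12345 -o "%N"   # list the nodes allocated to job 12345
ssh n0123                 # log into one of those nodes
htop                      # compute-bound MPI ranks should each sit near 100% CPU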

...

If you have concerns, please reach out to us—we'd be happy to check for problems and make recommendations.

Also see sbatch script parameters for more information.


...