A hybrid job is one that uses both multiple processes and multiple threads within a process. Typically, MPI is used to start the multiple processes, and each process will use a multithreading library like OpenMP to perform computation with multiple threads.
Starting such jobs is very similar to starting an MPI job. See the How-to Submit an MPI Job and How-to Submit an OpenMP Job pages for details.
The main difference is that you will have to tell SLURM that you intend for each task to be bound to multiple CPU cores (rather than just one). If you omit this step, all threads for a process will be bound to the same core, and it's unlikely that you will see any speedup from threading.
To do this, add a parameter like the following to your sbatch script:
#SBATCH --cpus-per-task=28
In this case, we're choosing 28 cores with the knowledge that our hardware has 28 cores per node. A smaller number could be chosen, but the number cannot be larger than the number of cores available on the hardware.
Note also that if you specify the --ntasks-per-node parameter, the product of these two parameters cannot be greater than the number of cores available on the hardware.
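For example, on the 28-core nodes described above, one combination that satisfies this constraint (the exact split here is just an illustration, not a recommendation) would be:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=7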
Extended Example
In this section, we'll show a simple example of a hybrid MPI/OpenMP program. The program prints some useful diagnostic information and performs a trivial CPU-bound computation. It uses the Intel MPI implementation, but it could be adapted to use Open MPI instead.
Here's the code:
// MPI/OpenMP example with diagnostic output
// example compile command:
// module load intel intel-mpi
// mpiicpc -mt_mpi -qopenmp test-verbose.cpp -o test-verbose
#include <errno.h>
#include <error.h>
#include <iostream>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "mpi.h"
#include <omp.h>
int main(int argc, char *argv[]) {
  int usemem_gb = 0;
  if (argc > 1)
    usemem_gb = atoi(argv[1]);
  if (errno || usemem_gb < 0)
    error(1, errno,
          "argument (if present) must be integer number of GB to allocate");
  if (usemem_gb > 0) {
    const size_t usemem_bytes = size_t(usemem_gb) * 1024 * 1024 * 1024;
    void *mem = malloc(usemem_bytes);
    if (!mem)
      error(1, errno, "malloc failed (not enough memory?)");
    // writing memory to force it into RSS
    memset(mem, 86, usemem_bytes);
    // NB: not freeing memory until exit()
  }
  MPI::Init(argc, argv);
  int size = MPI::COMM_WORLD.Get_size();
  int rank = MPI::COMM_WORLD.Get_rank();
  bool am_master = (rank == 0);
  char name[MPI_MAX_PROCESSOR_NAME];
  int namelen;
  MPI::Get_processor_name(name, namelen);
  cpu_set_t cpuset;
  if (sched_getaffinity(0, sizeof cpuset, &cpuset))
    error(1, errno, "sched_getaffinity failed");
  int cpucount = CPU_COUNT(&cpuset);
  if (am_master) {
    std::cout << "rank " << rank << " of " << size << " running on " << name
              << " alloc " << usemem_gb << "GB" << " cpu count "
              << cpucount << " mask ";
    for (int cpu=0; cpu < CPU_SETSIZE; cpu++)
      if (CPU_ISSET(cpu, &cpuset))
        std::cout << cpu << ",";
    std::cout << std::endl;
    // collect and print the same diagnostics from every other rank
    for (int i=1; i < size; i++) {
      MPI::Status stat;
      MPI::COMM_WORLD.Recv(&rank, 1, MPI_INT, i, 1, stat);
      MPI::COMM_WORLD.Recv(&size, 1, MPI_INT, i, 1, stat);
      MPI::COMM_WORLD.Recv(&namelen, 1, MPI_INT, i, 1, stat);
      MPI::COMM_WORLD.Recv(name, namelen + 1, MPI_CHAR, i, 1, stat);
      std::cout << "rank " << rank << " of " << size << " running on " << name
                << " alloc " << usemem_gb << "GB";
      int cpucount_i;
      MPI::COMM_WORLD.Recv(&cpucount_i, 1, MPI_INT, i, 1, stat);
      std::cout << " cpu count " << cpucount_i << " mask ";
      for (int cpu_i=0; cpu_i < cpucount_i; cpu_i++) {
        int cpu;
        MPI::COMM_WORLD.Recv(&cpu, 1, MPI_INT, i, 1, stat);
        std::cout << cpu << ",";
      }
      std::cout << std::endl;
    }
  } else {
    MPI::COMM_WORLD.Send(&rank, 1, MPI_INT, 0, 1);
    MPI::COMM_WORLD.Send(&size, 1, MPI_INT, 0, 1);
    MPI::COMM_WORLD.Send(&namelen, 1, MPI_INT, 0, 1);
    MPI::COMM_WORLD.Send(name, namelen + 1, MPI_CHAR, 0, 1);
    MPI::COMM_WORLD.Send(&cpucount, 1, MPI_INT, 0, 1);
    for (int cpu=0; cpu < CPU_SETSIZE; cpu++)
      if (CPU_ISSET(cpu, &cpuset))
        MPI::COMM_WORLD.Send(&cpu, 1, MPI_INT, 0, 1);
  }
  // junk computation to spin the CPUs in OpenMP threads
  int result=0;
  int iam=0, np=1;
  #pragma omp parallel default(shared) private(iam, np)
  {
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
    int s=iam*rank;
    // each rank-thread to do approx equal share of a fixed-size workload
    for (long int i=0; i<(1024*1024*1000/(np*size)); i++)
      for (int j=0; j<10000; j++)
        s += j;
    // accumulate atomically to avoid a data race on the shared result
    #pragma omp atomic
    result += s;
  }
  // output the result to keep compiler from optimizing it away
  // (use am_master; the master's rank variable was overwritten above)
  if (am_master)
    std::cout << "(meaningless) result " << result << std::endl;
  // synchronize to wait for all workers to finish
  // (maybe use MPI::Barrier instead?)
  int value=0;
  if (am_master) {
    for (int i=1; i < size; i++) {
      MPI::Status stat;
      MPI::COMM_WORLD.Recv(&value, 1, MPI_INT, i, 1, stat);
    }
  } else {
    MPI::COMM_WORLD.Send(&value, 1, MPI_INT, 0, 1);
  }
  MPI::Finalize();
  if (usemem_gb > 0)
    sleep(90); // long enough for SLURM to kill us if RAM usage too high
  return 0;
}
The program takes one optional integer parameter. If specified, each task will allocate that number of gigabytes of RAM. This is useful for testing the effects of SLURM memory allocation.
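For a quick interactive check outside of a batch job, the program could also be launched directly with Intel MPI's mpirun (assuming the modules listed in the compile comment are loaded and that running MPI this way is permitted on the node you're using); here, each of two ranks would allocate 1 GB:
mpirun -n 2 ./test-verbose 1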
Here is an example SLURM sbatch script to invoke the program:
#!/bin/bash
#SBATCH --partition=preempt
#SBATCH --job-name=test-verbose-srun
#SBATCH --output=%x.out
#SBATCH --time=4:00:00
#SBATCH --account=hpcrcf
###SBATCH --nodes=2
###SBATCH --ntasks-per-node=28
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G
module purge
module load slurm
module load intel
module load intel-mpi
module load mkl
# print some diagnostic info
echo env for main task
env | sort | egrep -e '^(SLURM|OMP|I_MPI)_'
echo
camask=$(fgrep -ie cpus_allowed: /proc/self/status | awk '{ print $2 }' | tr -d ,)
ca=$(python -c 'print(bin(0x'${camask}').count("1"))')
echo Cpus_allowed count for main task before mpi startup is $ca
echo
# end diagnostic info
# could choose another mpi lib config, but release_mt is the default
# source mpivars.sh {debug,release}{,_mt}
# https://software.intel.com/en-us/mpi-developer-reference-linux-interoperability-with-openmp
# export I_MPI_PIN_DOMAIN=omp
# possibly redundant?
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# https://software.intel.com/en-us/mpi-developer-reference-linux-other-environment-variables
# export I_MPI_DEBUG=5,host,level
\time srun ./test-verbose ${1+"$@"}
Note that only the --ntasks flag is used to determine the number of MPI worker processes. This means that the individual tasks may be spread across available nodes as SLURM sees fit. This is usually a good approach if the program does little inter-worker communication, and indeed, this example program does very little: most of its computation is performed by each thread without communication.
If the MPI program does a lot of inter-worker communication, it's generally better to specify a layout of tasks on nodes. By uncommenting the --nodes and --ntasks-per-node flags (and commenting out the --ntasks flag), we could specify that two compute nodes are to be used, each with 28 tasks (in that case, --cpus-per-task would also need to be reduced so that the product stays within the node's core count). The advantage of packing lots of tasks onto a minimal number of nodes is that most of the inter-worker communication happens locally on the node, which is faster and more efficient. The disadvantage is that it may take longer for SLURM to schedule such a job, since we're being more demanding about what we require.
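As a sketch, the packed-layout variant of the header might look like the following (the values are illustrative; with 28 tasks per node on 28-core hardware, --cpus-per-task has to drop to 1):
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH --cpus-per-task=1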
With either approach, the SLURM srun command will ensure that the total number of tasks is communicated to our program. We don't need to pass this in explicitly.
The --cpus-per-task flag specifies how many CPU cores are assigned to each task. This can make the program run much faster if the tasks are able to make use of multiple CPUs. In this example, we do so using OpenMP. SLURM will constrain each task to use only the CPUs assigned to it (see below). Note that we tell OpenMP how many CPUs are available to each task by setting the $OMP_NUM_THREADS variable.
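One small defensive variation (our suggestion, not something the script above requires) is to fall back to a single thread if SLURM_CPUS_PER_TASK happens to be unset, for example when --cpus-per-task is omitted:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}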
This script prints some diagnostic information, but these lines could be removed or commented out.
Sample Timings
Here are some sample times (in seconds) for a workload split across n MPI workers, each using c OpenMP threads, all on Skylake nodes. These times are from single runs, so they include noticeable noise, and for the shorter runs, startup time plays a significant role. Broadly speaking, though, we see almost linear speedup as more cores are added. Since the computations in this example program are entirely independent, the particular combination of worker count and thread count makes little difference.
|      | c=1  | c=2  | c=4  | c=8 | c=16 | c=32 |
|------|------|------|------|-----|------|------|
| n=1  | 4479 | 2350 | 1272 | 642 | 318  | 160  |
| n=2  | 2348 | 1272 | 637  | 319 | 160  | 85   |
| n=4  | 1183 | 637  | 322  | 160 | 87   | 41   |
| n=8  | 639  | 319  | 160  | 81  | 42   | 22   |
| n=16 | 309  | 161  | 81   | 56  | 21   | 13   |
| n=32 | 162  | 83   | 42   | 24  | 23   | 8    |
Sample Output
For the n=4, c=16 case, here is some sample output:
env for main task
I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/17.11/lib/libpmi.so
I_MPI_ROOT=/packages/intel/17/linux/mpi
SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
SLURM_CLUSTER_NAME=slurm_cluster
SLURM_CPUS_ON_NODE=32
SLURM_CPUS_PER_TASK=16
SLURM_GTIDS=0
SLURM_JOBID=6112602
SLURM_JOB_ACCOUNT=hpcrcf
SLURM_JOB_CPUS_PER_NODE=32(x2)
SLURM_JOB_GID=2000
SLURM_JOB_ID=6112602
SLURM_JOB_NAME=test-verbose-srun
SLURM_JOB_NODELIST=n[177,180]
SLURM_JOB_NUM_NODES=2
SLURM_JOB_PARTITION=preempt
SLURM_JOB_QOS=normal
SLURM_JOB_UID=1665
SLURM_JOB_USER=mkc
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=1024
SLURM_NNODES=2
SLURM_NODEID=0
SLURM_NODELIST=n[177,180]
SLURM_NODE_ALIASES=(null)
SLURM_NPROCS=4
SLURM_NTASKS=4
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/gpfs/home/mkc/mpi/test
SLURM_SUBMIT_HOST=talapas-ln2
SLURM_TASKS_PER_NODE=2(x2)
SLURM_TASK_PID=349764
SLURM_TOPOLOGY_ADDR=ibcore[1-4].ib8.n177
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_WORKING_CLUSTER=slurm_cluster:hpc-hn1:6817:8192
Cpus_allowed count for main task before mpi startup is 32
rank 0 of 4 running on n177 alloc 0GB cpu count 16 mask 6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,
rank 1 of 4 running on n177 alloc 0GB cpu count 16 mask 1,3,5,7,9,11,13,15,17,19,21,23,27,31,33,38,
rank 2 of 4 running on n180 alloc 0GB cpu count 16 mask 1,6,8,10,12,14,16,20,22,24,28,30,32,34,36,38,
rank 3 of 4 running on n180 alloc 0GB cpu count 16 mask 3,5,9,11,15,17,19,21,23,25,27,29,31,35,37,39,
We can see the various SLURM parameters passed in as environment variables.
Also, note that each MPI worker is restricted to its own set of CPUs, with no overlap. (In the main task on n177, before MPI starts, the initial CPU mask has 32 CPUs, but when MPI starts, it allots 16 each to the two workers that end up running on n177.)
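If you want to confirm this binding for your own job without adding code to the program, one quick check (assuming Linux compute nodes) is to have each task print its allowed CPU list, with srun's --label option prefixing each line with the task number:
srun -l grep Cpus_allowed_list /proc/self/status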
Finally, if we add a parameter (16) so that each task uses more RAM than we have requested from SLURM for it, the job will fail with an error like this:
slurmstepd: error: Detected 1 oom-kill event(s) in step 6139636.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: n126: task 8: Out Of Memory
slurmstepd: error: Detected 2 oom-kill event(s) in step 6139636.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
0.01user 0.01system 0:06.34elapsed 0%CPU (0avgtext+0avgdata 5116maxresident)k
0inputs+0outputs (0major+1619minor)pagefaults 0swaps
slurmstepd: error: Detected 2 oom-kill event(s) in step 6139636.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
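For reference, this over-allocation test can be driven entirely from the submission command, since the script forwards its arguments to the program via ${1+"$@"}. Assuming the script above were saved as test-verbose.sbatch (the filename here is hypothetical), the run would be submitted as:
sbatch test-verbose.sbatch 16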