
Here is information about memory on Talapas hosts, how to request it with SLURM, and related details.

...

Different memory quantities will almost certainly become available as new hardware is acquired over time.

To see the actual available memory for hosts in each partition, you can use a command like this:

Code Block
sinfo -e -O partition,cpus,memory,nodes

...

Virtual Memory

Currently, the compute nodes have no swap space configured.  In almost all scenarios, swapping on compute nodes is unhelpful or even counterproductive.

...

For every SLURM job you should specify the amount of memory the job needs, with the --mem or --mem-per-cpu flags.  If you don't do so, default values are used, but these are often less than ideal.

The --mem flag specifies the amount of RAM requested for each node in the job.  Unless you're an advanced user, this flag is probably more intuitive than the --mem-per-cpu flag.  See the documentation on the sbatch and srun man pages.
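
For example, a job script might request memory like this.  This is only a sketch; the memory amount, CPU count, time limit, and program name are illustrative placeholders, not recommendations:

Code Block
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G               # total RAM requested per node
#SBATCH --time=0-01:00:00
./my_program                   # placeholder for your actual command

With 4 CPUs allocated, specifying --mem-per-cpu=2G instead would request the same total amount, but the limit would be accounted per CPU rather than per node.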

For memory, the default is an allocation of a proportional share of the RAM available on the node.  So, for example, if you request one CPU core on a node that has 28 cores and 104GB of RAM available for job use, the default will be about 3.7GB of memory.  A corollary is that if you request all of the cores on a node, you'll get all of the available RAM by default.
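
To see how much memory a pending or running job was actually allocated (for example, when relying on the default), you can inspect the job record.  This is only a sketch; the job ID is a placeholder, and the exact fields shown vary by SLURM version:

Code Block
scontrol show job <jobid> | grep -i mem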

...

SLURM uses several mechanisms to enforce the memory limit specified for a job.  Currently, the main mechanism is a periodic check of memory usage.  About once per minute, the node SLURM daemon will check to see how much RAM (RSS) is in use.  If that amount is over the limit, the job will be sent a SIGTERM (followed by a SIGKILL if necessary), and the job will be put into the CANCELLED state (which can be viewed using the sacct command).  From the user point of view, in this case it will typically appear that the processes in question were killed by a signal.  Depending on how the programs being run handle signals, however, there might be other, less-than-obvious error messages as a result.  If your job is dying unexpectedly, consider the possibility that it is exceeding the amount of memory requested for it.
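
To check after the fact whether a job ran up against its memory limit, sacct can report the job state alongside the requested and peak memory.  The job ID below is a placeholder, and MaxRSS is only recorded when job accounting is gathering memory usage:

Code Block
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS

A State of CANCELLED together with a MaxRSS near or above ReqMem is a strong hint that the job exceeded its memory request.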

...

There are relatively few fat nodes, so in general you can expect to wait longer before your job is scheduled.

Partition default memory settings

To view the default memory settings for each partition, run:

Code Block
/packages/racs/bin/slurm-show-def-mem

To override the defaults, use SLURM's --mem flag.
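
For instance, to submit a job requesting 16GB per node rather than the partition default (the memory amount and script name here are illustrative):

Code Block
sbatch --mem=16G my_job.sbatch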