This page describes how memory works on Talapas hosts and how to request it with SLURM.

...

For every SLURM job, you should specify the amount of memory the job needs with the --mem or --mem-per-cpu flag.  If you don't, default values are used, and these are often less than ideal.

The --mem flag specifies the amount of RAM requested for each node in the job.  Unless you're an advanced user, this flag is probably more intuitive than the --mem-per-cpu flag.  See the sbatch and srun man pages for details.
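As a minimal sketch, a batch script might request memory like this (the job name, CPU count, memory amount, and program name below are placeholders, not Talapas-specific values):

    #!/bin/bash
    #SBATCH --job-name=memtest       # placeholder job name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G                # request 16GB of RAM on the node

    ./my_program                     # placeholder for your actual command

The same request can be phrased per core instead; with 4 CPUs, --mem-per-cpu=4G asks for the same 16GB total:

    srun --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G ./my_program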

By default, the memory allocation is proportional to the fraction of the node's cores you request.  So, for example, if you request one CPU core on a node that has 28 cores and 104GB of RAM available for job use, the default will be about 3.7GB of memory.  A corollary is that if you request all of the cores on a node, you'll get all of the available RAM by default.
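If you're unsure how much RAM a node offers, or whether a cluster-wide default is configured, SLURM's own tools will report it.  For example (a generic sketch, not Talapas-specific output):

    # List each node with its CPU count and total memory (in MB)
    sinfo -N -o "%N %c %m"

    # Show any cluster-wide default memory-per-CPU/node setting
    scontrol show config | grep -i defmemper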

...

SLURM uses several mechanisms to enforce the memory limit specified for a job.  Currently, the main mechanism is a periodic check of memory usage: about once per minute, the node's SLURM daemon checks how much RAM (RSS) the job is using.  If that amount is over the limit, the job is sent a SIGTERM (followed by a SIGKILL if necessary) and is put into the CANCELLED state (which can be viewed with the sacct command).  From the user's point of view, it will typically appear that the processes in question were killed by a signal.  Depending on how the programs being run handle signals, however, there might be other, less-than-obvious error messages as a result.  If your job is dying unexpectedly, consider the possibility that it is exceeding the amount of memory it requested.
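To check after the fact whether a job ran up against its memory limit, sacct can report the job's final state alongside the memory it requested and the peak resident memory SLURM observed (the job ID below is just an example):

    # Final state, requested memory, and peak RSS for job 12345 (example ID)
    sacct -j 12345 --format=JobID,JobName,State,ReqMem,MaxRSS,Elapsed

Because usage is only sampled periodically, MaxRSS can underestimate a short-lived memory spike, so a job killed for exceeding its limit may show a MaxRSS somewhat below the requested amount.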

...