...
The default time limit for submitted jobs is seven days, and the default memory is about 4200MB (the same as the short compute partition). However, because all of the compute nodes are available for scheduling, you can request any combination of resources that can be satisfied by at least one of our compute nodes. For example, you could request 800GB of memory; this would cause the job to run on one of our "fat" nodes, since only those nodes have that much memory. Similarly, you could request one or more GPUs, which would restrict the job to nodes that have GPUs. As always, the fewer resources you request, the sooner your job is likely to run.
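As a rough sketch, a high-memory request might look like the following batch script. The partition name "preempt" is an assumption based on the surrounding text, and the program name is a placeholder; check sinfo and your site's documentation for the real names and limits.

    #!/bin/bash
    # Hypothetical high-memory job; only the "fat" nodes can satisfy --mem=800G,
    # so SLURM will schedule it there. Partition name is an assumption.
    #SBATCH --partition=preempt
    #SBATCH --time=2-00:00:00        # 2 days (the default limit is 7 days)
    #SBATCH --mem=800G               # forces placement on a "fat" node
    #SBATCH --cpus-per-task=4

    ./my_bigmem_program             # placeholder for your executable

    # For a GPU job, you would instead add a request such as:
    #   #SBATCH --gres=gpu:1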
...
Multi-node MPI jobs are a distinct case. There is often a choice between requesting whole nodes and allowing SLURM to place tasks wherever is expedient. Although the latter option is attractive for other reasons, empirically it seems to increase the chance of preemption. One conjecture is that some tasks end up on the popular club partitions (e.g., 'shortcompute'); because small jobs are submitted to those partitions frequently, there is a higher chance that one of them will collide with the preemptable job. And the loss of even a single task (CPU) usually crashes the entire MPI job.
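A minimal sketch of the two submission styles is shown below, assuming a 64-task MPI job; the task and node counts are illustrative assumptions, not site-specific values, and the program name is a placeholder.

    #!/bin/bash
    # Two ways to lay out a 64-task MPI job; choose one block of directives.

    # (a) Whole-node placement: pack tasks onto dedicated nodes. This avoids
    #     sharing nodes with the small jobs that often trigger preemption.
    #     (Node size of 32 cores is an assumption; adjust to your hardware.)
    ##SBATCH --nodes=2
    ##SBATCH --ntasks-per-node=32
    ##SBATCH --exclusive

    # (b) Scattered placement: let SLURM put the 64 tasks wherever there is
    #     room. Convenient, but empirically more likely to be preempted.
    #SBATCH --ntasks=64

    srun ./my_mpi_program           # placeholder for your MPI executable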
...