If you are new to cluster computing, you may be wondering why you can't simply run your applications the same way you do on your laptop or desktop. First, Talapas is a shared resource, so we need a method of coordinating the work of many researchers simultaneously so that users aren't stepping on each other's toes.

Second, when you first log into Talapas you will be connected to one of our "login nodes". These nodes are essentially a lobby in which users can do file management, write scripts, and submit jobs.

Info

Login nodes are not an appropriate place to run applications or conduct simulations.

A good rule of thumb is: If it takes more than one second to complete, it's not appropriate for a login node.

Instead, these tasks should be conducted on a "compute node." Compute nodes are purpose-built for running intensive computations and can only be accessed via the Slurm job scheduler. Slurm ensures that the compute nodes are allocated in a fair and equitable manner that prevents resource conflicts. The primary method by which you will run simulations on Talapas will be to "submit a job."

Step-by-step guide

Create a job script. A job script is a description of the computational resources your job requires and the executables you wish to run. Let's walk through a "hello world" example of a job script:

Components of a Slurm job script

The Shell

Code Block
#!/bin/bash

Slurm resource requests

See Slurm sbatch documentation for more information.

Code Block
#SBATCH --account=<myPIRG>
#SBATCH --job-name=HiWorld
#SBATCH --output=Hi.out
#SBATCH --error=Hi.err
#SBATCH --partition=computelong
#SBATCH --time=0-00:01:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500M
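
As a rough sanity check on what these directives add up to, the sketch below computes the total memory and the time limit in seconds implied by the values above. The variable names are illustrative only (they are not Slurm variables), and it assumes --mem-per-cpu is given in megabytes and --time in D-HH:MM:SS form.

```shell
#!/bin/bash
# Illustrative variables only; they mirror the #SBATCH values above.
ntasks=1
cpus_per_task=1
mem_per_cpu_mb=500
time_limit="0-00:01:00"   # D-HH:MM:SS

# Total memory = tasks x CPUs per task x memory per CPU.
total_mem_mb=$(( ntasks * cpus_per_task * mem_per_cpu_mb ))

# Split D-HH:MM:SS into its fields and convert to seconds.
days=${time_limit%%-*}
IFS=: read -r hours minutes seconds <<< "${time_limit#*-}"
# 10# forces base 10 so that values such as "08" are not read as octal.
total_seconds=$(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))

echo "Total memory requested: ${total_mem_mb}M"
echo "Time limit: ${total_seconds} seconds"
```

For the values shown, this works out to 500M of memory and a 60-second time limit, which is plenty for a "hello world" job.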

Program dependencies

Load all the software your job requires, for example Python:

Code Block
module load python3/3.11.4

Call the program or job steps

Code Block
./a.out

Slurm job script

Example of a job script:

hello.sbatch

Code Block
#!/bin/bash

#SBATCH --account=<myPIRG>
#SBATCH --job-name=hiworld
#SBATCH --output=hi.out
#SBATCH --error=hi.err
#SBATCH --partition=compute
#SBATCH --time=0-00:01:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500M

module load python3/3.11.4

./a.out    # run your actual program

Above we see the contents of our Slurm script (aka job script), called hello.sbatch. (The script name and suffix are arbitrary; use whatever name you like.) Notice that the script begins with #!/bin/bash. This line tells Linux which shell interpreter to use when executing the script. Here we used bash (the Bourne Again Shell), which is by far the most common choice, but other interpreters could be used (e.g., tcsh, python, etc.). Whatever your choice, every script should begin with an interpreter directive.

Next, we see a collection of specially formatted comments, each beginning with #SBATCH followed by option definitions. These are used by the sbatch command to set job options. (As comments, they are ignored by bash.) This allows us to describe our job to the scheduler and ensure that we reserve the appropriate resources (cores, memory, time). While the specified --time needs to be long enough for the job to complete (lest it be killed when time runs out), it's also good not to needlessly overestimate the amount of time required. Shorter jobs are more likely to run sooner, as they can fill in between longer jobs that aren't yet runnable.
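
To see that the #SBATCH lines really are ordinary comments as far as bash is concerned, the sketch below writes a cut-down copy of a job script to a temporary file, counts its directives, and runs it directly with bash. The temporary file and the echo payload are stand-ins chosen for illustration, and nothing here contacts the scheduler.

```shell
#!/bin/bash
# Write a cut-down job script to a temporary file (illustrative stand-in).
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hiworld
#SBATCH --time=0-00:01:00
#SBATCH --ntasks=1

echo "hello world"   # stand-in for the real program
EOF

# Count the directive lines that sbatch would read.
directive_count=$(grep -c '^#SBATCH' "$script")
echo "Directives found: $directive_count"

# Run the script with plain bash: the #SBATCH lines are skipped as comments.
script_output=$(bash "$script")
echo "Script output: $script_output"

rm -f "$script"
```

Run this way, no resources are reserved and the program executes wherever you typed the command, which is why job scripts should be handed to sbatch rather than run by hand on a login node.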

Submit your job

Submit your job to the scheduler using the sbatch command.

Code Block
$ sbatch hello.sbatch
Submitted batch job 20190

The job has been submitted and is assigned the job number 20190 which will serve as its primary identifier.
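
Because later commands take the job number as an argument, it can be handy to capture it in a shell variable. The sketch below pulls the number out of the confirmation message shown above. The function name job_id_from_message is hypothetical, and the sample message is hard-coded rather than produced by a real submission.

```shell
#!/bin/bash
# Hypothetical helper: print the last word of sbatch's confirmation message,
# which is the job number in "Submitted batch job <number>".
job_id_from_message() {
    local message=$1
    echo "${message##* }"
}

# Sample message copied from the example above (not a live submission).
message="Submitted batch job 20190"
job_id=$(job_id_from_message "$message")
echo "Job ID: $job_id"
```

On the cluster, the same variable could be filled with job_id=$(job_id_from_message "$(sbatch hello.sbatch)"). sbatch also offers a --parsable option that prints the job number without the surrounding text (followed by a cluster name on multi-cluster systems).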

Check job status

Use the squeue command.

Code Block
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20190 computelo  HiWorld   duckID CG       1:09      1 n074

Here we see that our job, number 20190, is in the CG (completing) state. Jobs in the system may also be R (running) or PD (pending). Jobs are pending when there are insufficient resources available to accommodate the request as specified in the job script.

To view only your jobs:

Code Block
$ squeue -u duckid
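
When many jobs are listed, it can help to summarize them by state. The sketch below counts jobs per state code from squeue-style output using awk. The function name count_states is hypothetical, and the sample text is hard-coded: only job 20190 comes from the example above, and the other two rows are invented for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read squeue-style text on stdin and print a count of
# jobs per state code (column 5, "ST"), skipping the header row.
count_states() {
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# Hard-coded sample. Only job 20190 comes from the example above;
# the other two rows are invented for illustration.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20190 computelo HiWorld duckID CG 1:09 1 n074
20191 computelo HiWorld duckID R 0:12 1 n075
20192 computelo HiWorld duckID PD 0:00 1 (Resources)'

summary=$(printf '%s\n' "$sample" | count_states)
echo "$summary"
```

On the cluster, the helper would be fed live output instead of the sample, for example squeue -u duckID | count_states.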

If necessary, cancel your job using the scancel command followed by the job number of the job you wish to cancel.

Code Block
[duckID@login1 helloworld]$ scancel 20190
[duckID@login1 helloworld]$ squeue -u duckID
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[duckID@login1 helloworld]$
