If you are new to cluster computing, you may be wondering why you can't simply run your applications in the same way that you do on your laptop or desktop. First, because Talapas is a shared resource, we need a method of coordinating the work of many researchers simultaneously so that users aren't stepping on each other's toes.

Second, when you first log into Talapas you will be connected to one of our "login nodes". These nodes are essentially a lobby in which users can do file management, write scripts, and submit jobs. They are not an appropriate place to run applications or conduct simulations. A good rule of thumb is: If it takes more than one second to complete, it's not appropriate for a login node. 

Instead, these tasks should be conducted on a "compute node." These are purpose-built for running intensive computations and can only be accessed via the SLURM job scheduler. SLURM will ensure that the compute nodes are allocated in a fair and equitable manner that prevents resource conflicts. The primary method by which you will run simulations on Talapas will be to "submit a job."

Step-by-step guide

  1. Create a job script. A job script is a description of the computational resources your job requires and the executables you wish to run. Let's look at a "hello world" example of a job script:

    hello.srun
    #!/bin/bash
    #SBATCH --partition=long        ### Partition (like a queue in PBS)
    #SBATCH --job-name=HiWorld      ### Job Name
    #SBATCH --output=Hi.out         ### File in which to store job output
    #SBATCH --error=Hi.err          ### File in which to store job error messages
    #SBATCH --time=0-00:01:00       ### Wall clock time limit in Days-HH:MM:SS
    #SBATCH --nodes=1               ### Number of nodes needed for the job
    #SBATCH --ntasks-per-node=1     ### Number of tasks to be launched per Node
    #SBATCH --account=<myPIRG>      ### Account used for job submission
    
    ./a.out                         ### Run your actual program

    Above we see the contents of our SLURM script (aka job script), called hello.srun.  (The name is arbitrary; use whatever name you like.)  Notice that the script begins with #!/bin/bash. This line tells Linux which shell interpreter to use when executing the script.  Here we used bash (the Bourne Again Shell), which is by far the most common choice, but other interpreters could be used (e.g., tcsh or python).  Whatever your choice, every script should begin with an interpreter directive.

    Next, we see a collection of specially formatted comments, each beginning with #SBATCH followed by option definitions.  These are used by the sbatch command to set job options.  (As comments, they are ignored by bash.)  This allows us to describe our job to the scheduler and ensure that we reserve the appropriate resources (cores, memory, etc.) for an appropriate amount of time.
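To illustrate that the #SBATCH lines are ordinary comments as far as bash is concerned, here is a minimal sketch. The helper name list_sbatch_options is hypothetical (it is not a SLURM command), and the temporary file is a cut-down copy of hello.srun used only for the demonstration.

```shell
#!/bin/bash
# list_sbatch_options is a hypothetical helper: it prints the options set by
# "#SBATCH" comment lines in a job script, with trailing "###" notes removed.
list_sbatch_options() {
    sed -n 's/^#SBATCH[[:space:]]*//p' "$1" | sed 's/[[:space:]]*###.*$//'
}

# Write a cut-down copy of hello.srun to a temporary file for the demo.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --partition=long        ### Partition (like a queue in PBS)
#SBATCH --time=0-00:01:00       ### Wall clock time limit in Days-HH:MM:SS
echo "hello"
EOF

list_sbatch_options "$script"   # prints the two option strings
bash "$script"                  # bash skips the #SBATCH comments; prints hello
rm -f "$script"
```

Running the file with bash directly only executes the echo line; the options take effect only when the script is handed to sbatch.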

    The specified --time needs to be long enough for the job to complete; otherwise the job is killed when time runs out.  It is also good not to needlessly overestimate it: shorter jobs are more likely to run sooner, as they can fill in between longer jobs that aren't yet runnable.

    Note that the script suffix we used is unimportant. You can name your job scripts whatever you wish.
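To make the Days-HH:MM:SS format used by --time in the script above concrete, the sketch below converts such a value to seconds. The function walltime_to_seconds is a hypothetical helper, not part of SLURM, and it assumes the value is written in exactly the Days-HH:MM:SS form shown in hello.srun; SLURM accepts other forms that this sketch does not handle.

```shell
#!/bin/bash
# walltime_to_seconds is a hypothetical helper (not a SLURM command).
# It assumes the limit is written as Days-HH:MM:SS, as in hello.srun.
walltime_to_seconds() {
    local spec=$1 days hms h m s
    days=${spec%%-*}      # text before the first "-"
    hms=${spec#*-}        # text after the first "-"
    IFS=: read -r h m s <<< "$hms"
    # 10# forces base 10 so values like "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

walltime_to_seconds "0-00:01:00"   # the limit used in hello.srun: 60
walltime_to_seconds "1-12:00:00"   # one and a half days: 129600
```

A quick check like this can help confirm that a limit really says what you intended before you submit.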

  2. Submit your job to the scheduler using the sbatch command.

    [duckID@ln1 helloworld]$ sbatch hello.srun 
    Submitted batch job 20190
    [duckID@ln1 helloworld]$

    Our job has been submitted and assigned the job number 20190, which will serve as its primary identifier.
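If you want to reuse the job number in later commands, one option is to capture it from the line that sbatch prints. The sketch below assumes the output has the "Submitted batch job N" form shown in the transcript; job_id_from_sbatch is a hypothetical helper name, and the sample line stands in for real sbatch output so the snippet runs anywhere.

```shell
#!/bin/bash
# job_id_from_sbatch is a hypothetical helper: it pulls the job number
# out of the "Submitted batch job <N>" line that sbatch prints.
job_id_from_sbatch() {
    local line=$1
    echo "${line##* }"    # keep everything after the last space
}

# On Talapas you would capture the real output, for example:
#   jobid=$(job_id_from_sbatch "$(sbatch hello.srun)")
# Here the line from the transcript stands in for it.
jobid=$(job_id_from_sbatch "Submitted batch job 20190")
echo "$jobid"
```

sbatch also has a --parsable option intended for scripting, which may be simpler than parsing the message; check the sbatch man page on Talapas for the exact output.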

  3. Check on your job using the squeue command.

    [duckID@ln1 helloworld]$ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 20190      long  HiWorld   duckID CG       1:09      1 n074
          20123_[1-35]      long RSA_09_c    user1 PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:hpc-hn2,ln[1-2],n[005,120,122])
          20017_[5-20]   longgpu pressure    user2 PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:hpc-hn2,ln[1-2],n[005,120,122])
               20017_4   longgpu pressure    user2  R 1-03:53:46      1 n110
               20017_3   longgpu pressure    user2  R 1-06:34:31      1 n109
                 19468   longfat     bash    user3  R 11-21:16:00      1 n123
               20017_2   longgpu pressure    user2  R 1-21:25:54      1 n119
              19995_20   longgpu pressure    user2  R 3-10:28:49      1 n106
              19995_19   longgpu pressure    user2  R 3-19:37:26      1 n104
               19995_3   longgpu pressure    user2  R 4-05:01:45      1 n111
               19995_4   longgpu pressure    user2  R 4-05:01:45      1 n112
               19995_5   longgpu pressure    user2  R 4-05:01:45      1 n113
              19995_11   longgpu pressure    user2  R 4-05:01:45      1 n100
               20017_0   longgpu pressure    user2  R 2-03:41:35      1 n107
               20017_1   longgpu pressure    user2  R 2-03:41:35      1 n118
                 20189       fat build-R-    user4  R       7:42      1 n121
                 20188       fat build-R-    user4  R      13:00      1 n124
                 20177      defq make-sil    user4  R    1:43:44     72 n[006-073,075-078]
    [duckID@ln1 helloworld]$

    Here we see that our job, number 20190, is in the CG (completing) state. Notice that other jobs in the system are in the R (running) or PD (pending) state. Jobs are pending when there are insufficient resources available to accommodate the request as specified in the job script. In this case, the system was scheduled for maintenance, and the wall clock limit specified in those jobs would have allowed them to run into the maintenance period. The jobs will run once the maintenance is complete and the reservation is removed from the system. To view only your jobs, use the -u option followed by your user ID, e.g. squeue -u duckID
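When the queue is long, a summary by state can be easier to read than the full listing. The following is a rough sketch that counts jobs per value of the ST column. The helper count_by_state is hypothetical, and it assumes the default column layout shown above, with ST as the fifth field and no spaces in job names. A few rows copied from the listing stand in for live squeue output.

```shell
#!/bin/bash
# count_by_state is a hypothetical helper: it reads squeue-style output
# on stdin and prints how many jobs are in each state (the ST column).
count_by_state() {
    awk 'NR > 1 { count[$5]++ }
         END { for (st in count) print st, count[st] }' | sort
}

# A few rows copied from the squeue listing stand in for live output.
# On Talapas you would run:  squeue | count_by_state
count_by_state <<'EOF'
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   20190      long  HiWorld   duckID CG       1:09      1 n074
 20017_4   longgpu pressure    user2  R 1-03:53:46      1 n110
 20017_3   longgpu pressure    user2  R 1-06:34:31      1 n109
   20189       fat build-R-    user4  R       7:42      1 n121
EOF
```

For these sample rows the sketch reports one job in CG and three in R. The header line is skipped with NR > 1, so the column title itself is not counted.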

  4. If necessary, cancel your job using the scancel command followed by the job number of the job you wish to cancel.

    [duckID@ln1 helloworld]$ scancel 20190
    [duckID@ln1 helloworld]$ squeue -u duckID
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    [duckID@ln1 helloworld]$

 You are now ready to submit jobs on the Talapas cluster. For instructions on more advanced topics like parallel jobs, job arrays, and interactive jobs, see the other "How-to" guides on this site.
