This guide provides the basics of connecting to and using the Talapas HPC cluster.
Getting a Talapas Account
If you don't already have an account, see the Request Access page to get one.
Logging on to Talapas
For UO users, your username on Talapas is your Duck ID. (That is, if your email address is alice@uoregon.edu, your Talapas username will be "alice".) Your password is the same one you use university-wide, and can be managed at the UO password reset page.
For non-UO users, you will have received a username and password in the email granting you access.
Talapas currently has two login nodes:
talapas-ln1.uoregon.edu
talapas-ln2.uoregon.edu
These hosts are entirely equivalent. You can use whichever seems less busy, or use the hostname talapas-login.uoregon.edu to be routed to one of them at random.
If you are logging in from a Linux or Mac OS X workstation, open a terminal and type
```
ssh myusername@talapas-ln1.uoregon.edu
```
If you are logging in from Windows, download an SSH client like PuTTY or MobaXterm and do the equivalent.
If you are logging in from outside of the UO network or from behind a firewall, you might need to connect to the UO VPN first.
See How-to Log Into Talapas for more details.
Note: The login nodes are for light tasks needed to set up and submit your work. They're not for running significant applications, simulations, etc. Processes that use a lot of memory or CPU will be killed. Run heavier tasks on the compute nodes instead. This will keep the login nodes responsive for everyone.
Transferring Files to and from Talapas
If you're accessing Talapas from Linux or Mac OS X, you can use scp or rsync to transfer files. For example,

```
scp chr1.fasta myusername@talapas-ln1.uoregon.edu:.
```

will copy the named file to your Talapas home directory.
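rsync is often more convenient for copying whole directories or resuming interrupted transfers. Here is a minimal sketch, assuming a local directory named results/ (a made-up name) and the project directory layout described below:

```
rsync -av results/ myusername@talapas-ln1.uoregon.edu:/projects/GROUP/USERNAME/results/
```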
There are also GUI tools available for file transfer via SCP or SFTP. FileZilla is available for all common platforms.
Also, /wiki/spaces/HPC/pages/1769536 is available, and is particularly useful for the transfer of large files.
Storing your Data
As a Talapas user, you have these directories available for storing and manipulating your files.
- Your Linux home directory. This is the directory you start in when you log in, and it's shared across all Talapas hosts. It's intended for small configuration files, etc. The quota is limited to 10GB and cannot be expanded.
- Your project directory, /projects/GROUP/USERNAME, where GROUP is the name of your PIRG (and also your Linux primary group). This is the directory you should use for large data files, etc. It shares a large quota with your PIRG.
- Your shared PIRG directory, /projects/GROUP/shared. This directory is for data that you want to share with other members of your PIRG. It also shares the PIRG quota, which can be expanded as needed.
- The host temporary directory, /tmp. Unlike the other directories, this one is on a local host disk drive and is not shared across hosts. It's for temporary data during a job, and using it for intermediate data may speed up your job. Please delete files you create here at the end of your job.
Note: You are responsible for backing up your data on Talapas. This data is not backed up by RACS. Some snapshotting is performed for all directories except the host temporary directory, and in some cases lost files can be recovered from a snapshot. This cannot be relied on as a backup, however.
See Storage for more details.
Software
Talapas uses the Lmod environment module software to provide access to the various installed software packages. Type
```
module spider blast
```
to see the available modules that are named like "blast", for example.
To load a module so that you can use its commands, type
```
module load NAME
```
where NAME is the specific name of the module you'd like to load. (You can specify the version as well, or omit it to get the default version.)
This may also load other modules needed by the module in question. If you'd like to see all modules currently loaded, type
```
module list
```
Note that the module is loaded just for the current shell instance. You'll need to do this again for each new shell.
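This applies to batch jobs as well: since each job runs in a fresh shell, it's common practice to load the modules a job needs inside the job script itself rather than relying on what was loaded at submission time. A minimal sketch, where NAME is just a placeholder, not a real module:

```
#!/bin/bash
# Load the software this job needs; NAME is a placeholder module name.
module load NAME
# ...commands that use the module go here...
```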
If you need to unload a module, type
```
module unload NAME
```
(Lmod separates modules into groups in order to avoid having incompatible modules loaded simultaneously. To search only for modules that can be loaded without switching groups, use module avail instead of module spider.)
See How-to Use LMOD for more details.
Job Submission with SLURM
Talapas uses the SLURM job scheduler and resource manager to provide a way to submit large computational tasks to the cluster in batch fashion. You can also use it to get an interactive shell on a compute node, for interactive tasks that aren't suitable for the login nodes.
Batch Job Submission
To run a job on Talapas you must first create a SLURM job script describing the resources your job requires and the executables to be run. You then submit your job script to the scheduler using the sbatch command. If the necessary resources are currently available, your job will run immediately. If not, your job will be placed in the job queue and will be run when the necessary resources become available. To check on the status of your job, use the squeue command. To cancel a job you've submitted, use the scancel command. To list the partitions on the cluster and see their status use the sinfo command.
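Put together, the basic workflow looks something like this; the script name and job ID below are placeholders:

```
sbatch myjob.batch     # submit the job script; SLURM prints the assigned job ID
squeue -u $USER        # check the status of your queued and running jobs
scancel 12345          # cancel a job by its ID, if needed
sinfo                  # list the cluster's partitions and their states
```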
Interactive Shells
To start an interactive shell on Talapas, use the srun command. In most regards, this works the same as a batch job submission, but will open a shell connection to the selected compute node (as with ssh), and you can then execute commands, etc.
SLURM Partitions
Partitions are essentially "queues" that provide different resource sets to their jobs. There are partitions for "short" and "long" jobs, partitions with and without GPU nodes, and partitions that provide large-RAM nodes.
Talapas is run on a dual club/condo model. Members of the compute club have access to all University-owned compute resources, while condo users have access to the condo partition corresponding to the resources they have purchased. (Note that users may be members of both the condo and compute club.) For a list of partitions and which PIRGs (Principal Investigator Research Groups) have access to them, see the Partition List.
Storage
Storage space on Talapas is made available via the Talapas storage club and can be purchased by a PIRG. Storage is accounted for according to the group ownership of each file, so it's important that ownership is correctly attributed.
Except for local scratch space, all directories are on shared GPFS storage, visible on all cluster hosts.
Home Space
Each user on Talapas is assigned a personal home directory located at /home/<username>. This is limited to 10GB. Each user also has an individual PIRG directory at /home/<PIRG>/<username> for larger datasets, etc.
By default, permissions on both are set to drwx------, i.e., user only.
Project Space
Each PIRG on the system has a shared project space located at /projects/<PIRG>/shared. By default, permissions are set to drwxrws---, i.e., group permissions.
Local Scratch
Each compute node has a local scratch disk. The size of the local disk depends on the type of compute node and can be found on the Machine Specifications page. Please have your jobs remove their scratch files when they finish.
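A common pattern is to write intermediate files to local scratch and clean them up before the job exits. Here is a minimal sketch; the program and file names are made up, and GROUP/USERNAME are the placeholders used earlier:

```
#!/bin/bash
#SBATCH --partition=short
#SBATCH --time=60
#SBATCH --mem=1G

# Create a private scratch directory on this node's local disk.
SCRATCH=/tmp/$USER.$SLURM_JOB_ID
mkdir -p "$SCRATCH"

# Write intermediate output to local scratch (my_analysis is a hypothetical program).
my_analysis --input /projects/GROUP/USERNAME/input.dat --tmpdir "$SCRATCH"

# Copy anything worth keeping back to shared storage, then clean up.
cp "$SCRATCH"/results.out /projects/GROUP/USERNAME/
rm -rf "$SCRATCH"
```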
Software
Talapas uses the Lmod environment module software to control the Linux environment variables and provide multiple software versions. Users can run the module spider command to search for particular software packages on the system. The module avail command will show a list of packages whose dependencies are currently loaded. Use module load to add a software package to your environment and module unload to remove it.
Running a Batch Job
As a basic example, here is a simple batch script to run the hostname command on one compute node. To run it, first create a file hostname.batch containing these lines:
```
#!/bin/bash
#SBATCH --partition=short       ### queue to submit to
#SBATCH --job-name=myhostjob    ### job name
#SBATCH --output=hostname.out   ### file in which to store job stdout
#SBATCH --error=hostname.err    ### file in which to store job stderr
#SBATCH --time=5                ### wall-clock time limit, in minutes
#SBATCH --mem=100M              ### memory limit per node, in MB
#SBATCH --nodes=1               ### number of nodes to use
#SBATCH --ntasks-per-node=1     ### number of tasks to launch per node
#SBATCH --cpus-per-task=1       ### number of cores for each task

hostname
```
Then, to actually submit the job for execution, type
```
sbatch hostname.batch
```
When a job is submitted, SLURM considers all of the resources requested (e.g., CPU cores, memory, time, GPUs) together with the set of currently running and queued jobs. Once compute nodes are available to run your job, it will be run.
Note that your job will not run until the resources you have requested are available. This means that you may be able to get your job to run sooner by requesting fewer resources. For example, if your job will run in 1000MB, it's better not to request 10000MB, as it may take significantly longer for those resources to become available.
Batch jobs are run in a manner that's not dependent on your current terminal, so once a job is submitted, it no longer depends on your being logged in, etc. Job output will be available in the specified files when it has completed. You can monitor your job's progress using the squeue and sacct commands. For example:
```
$ sbatch hostname.batch
Submitted batch job 12345

# checking with squeue, we see that the job has started running (state is 'R')
$ squeue -j 12345
  JOBID PARTITION      NAME     USER ST   TIME  NODES NODELIST(REASON)
  12345     short myhostjob     joec  R   5:08      1 n003

# later, we see that the job has successfully completed
$ sacct -j 12345
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12345        myhostjob+      short     joegrp          1  COMPLETED      0:0
12345.batch       batch                joegrp          1  COMPLETED      0:0
12345.exte+      extern                joegrp          1  COMPLETED      0:0
```
If needed, you can also kill a job using its job id. This will terminate it if it has already started running, or remove it from the queue if it hasn't started yet.
```
scancel 12345
```
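scancel can also take a username instead of a job ID, which is handy if you want to cancel all of your queued and running jobs at once:

```
scancel -u $USER
```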
SLURM uses a number of queues, which it calls partitions, to run jobs with different properties. Normal jobs will use the short partition, but there are other partitions for jobs that need to run longer than a day, need more memory, or need to use GPUs, for example. You can use the sinfo command to see detailed information about the partitions and their states.
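For example, to list all partitions or to look at just one of them (the partition name here is taken from the examples above):

```
sinfo                 # summarize every partition and its node states
sinfo -p short        # show just the short partition
sinfo -p short -N -l  # one line per node, with extra detail
```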
See Partition List for information on the available partitions.
Running an Interactive Job
For long-running computational tasks, batch jobs are the way to go. But sometimes you will want to get an interactive shell on a compute node. This is useful for quickly testing your scripts, or for running interactive programs like MATLAB or SAS that are too computationally intensive to run on a login node.
You can start a shell like this
```
srun --pty --partition=short --mem=1024M --time=240 bash
```
Most of the flags work just as in batch scripts, above. This command requests a single core (the default) on a node in the short partition, with 1024MB of memory, for four hours. (After four hours, the shell and all commands running under it will be killed.)
This will often start your shell immediately, but if all nodes are in use, you might have to wait a while. The shell will start once resources become available.
In most regards, this works like a typical remote shell. If your local workstation crashes, for example, the shell will be logged out. For this reason, batch jobs are better when feasible.
The above command is good for running single programs. If you'll be doing something that will invoke multiple parallel processes or threads, like a parallel make or a multi-threaded program, you might want to add a --cpus-per-task=N flag to allocate more cores.
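For instance, this variant (the specific values are only illustrative) requests four cores for a four-hour interactive session:

```
srun --pty --partition=short --cpus-per-task=4 --mem=4G --time=240 bash
```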
Running Graphical Interactive Jobs
If your interactive job will involve a program like RStudio, which uses X11 to provide graphical output, you can use a command like this to forward your X11 connection back to your workstation's X server:
```
xrun --partition=short --mem=1024M --time=4 bash
```
(Although the other parameters are similar, note that time is in hours rather than minutes for this command.)
In order for this to work correctly, you will need to be running an X server on your local workstation, and you'll have to forward X traffic when you connect to a login node. This is typically done by using ssh -Y or its equivalent.
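Putting those steps together, a session might look something like this, reusing the example values above:

```
# On your workstation: connect to a login node with X11 forwarding enabled.
ssh -Y myusername@talapas-ln1.uoregon.edu

# On the login node: start a graphical interactive job on a compute node.
xrun --partition=short --mem=1024M --time=4 bash

# On the compute node: launch the graphical program (e.g., rstudio).
```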
Requesting Space and Time
For every SLURM job you should specify the amount of memory the job needs (with --mem and related flags) and the amount of time it needs (with --time). If you don't do so, default values are used, but these are often less than ideal.
As a general rule, the fewer resources your job requires, the sooner it will run. This is because smaller/shorter jobs are easier for SLURM to schedule sooner. (It also benefits other users by allowing SLURM to use cluster resources more efficiently.) That said, if you specify less space or time than your job needs, it will be killed before it can complete, since these specifications are enforced. So, you want to err somewhat on the high side. For any given application, you might have to experiment some to get this right.
For memory, the default is to be allocated a proportional amount of available RAM from the node. So, for example, if you request one CPU core on a node that has 28 cores and 125GB of RAM, the default will be about 4.4GB of memory. (A corollary is that if you request all of the cores for a node, you'll get all of the available RAM by default.) Some nodes have more cores and/or RAM, so this may vary, but 4.4GB is currently the smallest default for all Talapas nodes. See Machine Specifications for more detailed information about the available nodes.
If your job needs more than the default, you must explicitly specify a larger value. Alternatively, if your job needs less, you might want to specify less, increasing the odds that it will be scheduled sooner.
For time, the default varies by partition, but is generally the maximum available for the partition. For the short partition, the default is currently 24 hours. For the long partition, it's currently two weeks. If your job will take significantly less time, you can specify a shorter duration, to increase the odds that it will be scheduled sooner.
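Both requests are ordinary #SBATCH directives in your job script; the values here are purely illustrative:

```
#SBATCH --mem=2G          ### request 2GB of memory rather than the default
#SBATCH --time=02:00:00   ### request two hours rather than the partition maximum
```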
In order to help estimate your job's requirements, you can use sacct to see what prior jobs actually used:
```
sacct -j JOBID --format=JobID,JobName,ReqMem,MaxRSS,Elapsed
```
where JOBID is the numeric ID of a job that previously completed. This will produce output like this:
```
sacct -j 301111 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed
       JobID    JobName     ReqMem     MaxRSS    Elapsed
------------ ---------- ---------- ---------- ----------
301111        myjobname     3800Mc               16:00:28
301111.batch      batch     3800Mc    197180K    16:00:30
301111.exte+     extern     3800Mc      2172K    16:00:29
```
This job used about 193MB (197180K) of RAM at its peak, and ran for a bit over 16 hours.
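You might then size the next run's requests a little above what was observed, leaving headroom without over-reserving; these values are just one reasonable choice for the example above:

```
#SBATCH --mem=400M        ### comfortably above the ~193MB observed peak
#SBATCH --time=18:00:00   ### a bit more than the ~16 hours observed
```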
Getting Help
If you run into trouble, see the more detailed documentation on this site, or feel free to submit a ticket describing your problem. It will be very helpful if you can provide the job ID in question, together with the script you're trying to run and any error messages, etc.