The HPC cluster has been updated from Talapas (pronounced tah-lah-pus) to Talapas2, with newer hardware, operating system, and infrastructure.

Notable updates

  • Operating system - Red Hat Enterprise Linux 8 (RHEL8)

  • Kernel - 4.18

  • Processors - 3rd generation Intel (Ice Lake) and AMD (Milan)

  • New GPUs - Nvidia 80GB Ampere A100s

  • Memory - DDR4 3200MT/s and Intel Optane memory in the ‘memory’ partitions

  • Storage - 250GB home directories and 10TB scratch space for job I/O per PIRG

Login

DuckIDs

Talapas uses the UO Identity Access Management system, Microsoft Active Directory, for authentication, which requires all users to have a valid UO DuckID.

...

Talapas VPN

A virtual private network (VPN) connection is recommended to access the cluster. This adds an extra layer of security.

We have a Talapas profile in UO VPN, which should provide all the same capabilities as UO VPN while adding access to Talapas.

Instructions here: Article - Getting Started with UO VPN (uoregon.edu)

Use “uovpn.uoregon.edu/talapas” as the connection URL, along with your DuckID and password.

...

  • login1.talapas.uoregon.edu

  • login2.talapas.uoregon.edu

  • login3.talapas.uoregon.edu

  • login4.talapas.uoregon.edu

Load balancer

If you can’t use the UO VPN, you can also connect to the login load balancer at login.talapas.uoregon.edu.

A load balancer is used to redirect SSH connections to different login nodes to spread the load. The load balancer’s choice of login node is “sticky”: repeated connections from your IP address will go to the same login node, as long as there has been some activity within the last 24 hours.

Slurm

List of shared partitions

Code Block
compute
computelong
gpu
gpulong
interactive
interactivegpu
memory
memorylong

Job control

  • Default memory is set to 4GB per CPU; use the --mem or --mem-per-cpu flag to adjust.

  • There is no default partition; you must specify one or more partitions with --partition.

  • A Slurm account (PIRG) is still required for each job; specify it with --account=<your-PIRG>.
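The job-control rules above can be combined in a single batch script. The following is a minimal sketch, not a recommended configuration: `<your-PIRG>` is a placeholder, and the CPU count, memory size, and time limit are illustrative assumptions rather than cluster policy.

```shell
#!/bin/bash
# Placeholder: replace <your-PIRG> with your PIRG's Slurm account.
#SBATCH --account=<your-PIRG>
# Required, because there is no default partition.
#SBATCH --partition=compute
# Illustrative values (assumptions): 2 CPUs, 8GB per CPU instead of the 4GB default, 10 minutes.
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=8G
#SBATCH --time=00:10:00

# Slurm sets these variables inside a job; fall back to "unset" anywhere else.
job_summary="job=${SLURM_JOB_ID:-unset} partition=${SLURM_JOB_PARTITION:-unset} cpus=${SLURM_CPUS_PER_TASK:-unset}"
echo "$job_summary"
```

Saved under a hypothetical name such as example.sbatch, the script would be submitted with `sbatch example.sbatch`.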

GPU

Three GPU memory sizes are available: 10GB, 40GB, and 80GB.

...

  • Default memory per CPU is 4GB; use --mem=<size>, --mem-per-cpu, or --mem-per-gpu to adjust as needed.

Slurm features

Each node in the cluster has, at a minimum, Slurm feature tags for processor make, generation, and model. For example,

Code Block
n0173 amd,milan,7713

Nodes with GPUs include Slurm feature tags with GPU model and GPU memory size. For example,

Code Block
n0172 amd,milan,7413,a100,gpu-40gb

Nodes with large memory include Slurm feature tags with memory size. For example,

Code Block
n0142 intel,icelake,6348,mem-4tb
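Feature tags in this two-column shape (node name, then comma-separated tags) are easy to filter. The sketch below uses only the three sample lines shown above. `nodes_with_feature` is a hypothetical helper name. On the cluster, similar input could likely be produced with the standard Slurm command `sinfo -N -h -o "%N %f"`, though whether its output matches the site tool exactly is an assumption.

```shell
# Sample lines copied from the examples above (node name, then feature tags).
features='n0173 amd,milan,7713
n0172 amd,milan,7413,a100,gpu-40gb
n0142 intel,icelake,6348,mem-4tb'

# nodes_with_feature TAG: read "node tags" lines on stdin and print the
# names of nodes whose tag list contains TAG as an exact entry.
nodes_with_feature() {
  awk -v tag="$1" '{
    n = split($2, f, ",")
    for (i = 1; i <= n; i++) if (f[i] == tag) { print $1; break }
  }'
}

printf '%s\n' "$features" | nodes_with_feature milan     # prints n0173 and n0172
printf '%s\n' "$features" | nodes_with_feature mem-4tb   # prints n0142
```

Matching whole tags, rather than using a plain grep, avoids false hits where one tag is a substring of another.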

Request a node based on processor

Nodes with AMD and Intel processors are available on Talapas2.

Constrain a job to allocate a node with a legacy Intel Broadwell processor,

Code Block
#SBATCH --constraint=intel,broadwell

Request a node based on GPU feature

Nodes with 10GB, 40GB, and 80GB of GPU memory are available on Talapas2.

Constrain a job to allocate a node with 10GB of GPU memory,

Code Block
#SBATCH --gpus=1
#SBATCH --constraint=gpu-10gb

For the complete list of GPU features available run,

Code Block
/packages/racs/bin/slurm-show-features | grep gpu

CUDA A100 MIG slicing

Due to limitations with CUDA MIG slicing, it appears that a job can only use one slice (GPU) per host. That means one GPU per job, unless MPI is being used to orchestrate GPU usage on multiple hosts. See NVIDIA Multi-Instance GPU User Guide :: NVIDIA Tesla Documentation.

MIG mode is not enabled on nodes that have 80GB GPUs. Request these nodes using --constraint=gpu-80gb,no-mig
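A batch script requesting a full (non-MIG) 80GB GPU might look like the sketch below. It assumes that the gpu partition contains such nodes and that Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs, which is the usual behaviour but is not confirmed on this page. `<your-PIRG>` is a placeholder.

```shell
#!/bin/bash
# Placeholder: replace <your-PIRG> with your PIRG's Slurm account.
#SBATCH --account=<your-PIRG>
# Assumption: the 80GB nodes are reachable through the gpu partition.
#SBATCH --partition=gpu
#SBATCH --gpus=1
# Constraint taken from the text above: 80GB GPUs with MIG mode disabled.
#SBATCH --constraint=gpu-80gb,no-mig

# Record which device(s) the job was given; "unset" when run outside a GPU job.
visible_gpus="${CUDA_VISIBLE_DEVICES:-unset}"
echo "visible GPUs: $visible_gpus"

# List the devices when the NVIDIA tool is present on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L || true
fi
```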

Storage

/home

  • /home/<user>: store your data here. Your home directory now has a 250GB quota.

/projects

  • /projects/<pirg>/<user>: store your PIRG-specific work files here.

  • /projects/<pirg>/shared: store PIRG collaboration data here.

  • /projects/<pirg>/scratch: store job-specific I/O data here. Each PIRG has a 10TB quota, and this directory is purged every 30 days (any data older than 30 days is deleted).
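Because scratch data older than 30 days is deleted, it can help to check which files are approaching that age. The sketch below is a rough approximation: it assumes the purge is based on modification time, which this page does not state, and `list_purge_candidates` is a hypothetical helper name. The demonstration runs in a temporary directory and relies on GNU `touch -d`.

```shell
# list_purge_candidates DIR: print regular files under DIR whose last
# modification is more than 30 days old. find rounds ages down to whole
# days, so -mtime +30 matches files that are at least 31 days old.
list_purge_candidates() {
  find "$1" -type f -mtime +30 -print
}

# Self-contained demonstration in a temporary directory.
demo_dir="$(mktemp -d)"
touch -d '45 days ago' "$demo_dir/old.dat"   # backdated file (GNU touch)
touch "$demo_dir/new.dat"                    # fresh file
candidates="$(list_purge_candidates "$demo_dir")"
echo "$candidates"                           # only old.dat is listed
rm -rf "$demo_dir"
```

On the cluster, the same function could be pointed at /projects/<pirg>/scratch, with <pirg> replaced by your PIRG name.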

Request a node based on memory feature

Nodes with 1TB, 2TB, and 4TB of memory are available on Talapas2.

Constrain a job to allocate a node with 1TB of memory,

Code Block
#SBATCH --constraint=mem-1tb

For the complete list of features run,

Code Block
/packages/racs/bin/slurm-show-features

Processor architectures

Talapas2 is composed of nodes from multiple separate purchases over the course of several years. Therefore, it has several generations of processors from multiple vendors (Intel and AMD).

Here is the current architecture layout (subject to change):

Code Block
compute: AMD

...

 Milan and Intel Broadwell
gpu: AMD Milan

...


interactive: Intel Broadwell and IceLake

...


interactivegpu: AMD Milan

...


memory: Intel IceLake and Intel Broadwell
login: Intel Broadwell
preempt: All

Software

Some existing software will run fine on the new cluster.

However, with the operating system update to RHEL8, there will likely be cases where software requires rebuilding.

Generally, issues would be due to differences with the new shared libraries in RHEL8 and perhaps new CPU architectures. If you compile software in a way that specifically assumes one architecture (e.g., Intel IceLake), it might not run on all nodes.

Conda

In addition to the original ‘miniconda’ instance, there is now a ‘miniconda-t2’ instance. Talapas2 uses miniconda-t2, and new Conda environments will be built with this base environment. If you have personal Conda environments, you might need or want to recreate them using miniconda-t2. Note that using existing Conda environments on either cluster should work fine; it’s making changes that might cause problems.

Spack

Similarly, in addition to our original ‘racs-spack’ Spack instance, there is now a new ‘spack-rhel8’ instance. An additional factor is that most Spack software is compiled locally, whereas Conda software is generally compiled upstream. Also, by default, Spack will compile software to assume the CPU architecture of the host it’s compiling on. So, as above, if you compile software on a new login node, it won’t necessarily run on all compute nodes.

Talapas2 uses spack-rhel8, and software provided centrally will be built using this instance with gcc 13.1.0 on Intel Broadwell nodes. To build your own package for a CPU architecture that’s compatible with all of our existing hosts, something like this should work:

Code Block
spack install <package>@<version> %gcc@13.1.0 arch=linux-rhel8-broadwell
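To avoid retyping the compiler and architecture for every package, the spec can be assembled by a small helper. This is only a convenience sketch: `talapas_spec` is a hypothetical function name, and zlib 1.3 is an illustrative package and version, not a site recommendation.

```shell
# talapas_spec PACKAGE VERSION: print a Spack spec pinned to the compiler
# and target architecture quoted on this page.
talapas_spec() {
  printf '%s@%s %%gcc@13.1.0 arch=linux-rhel8-broadwell\n' "$1" "$2"
}

# zlib 1.3 is only an illustrative package and version (assumption).
spec="$(talapas_spec zlib 1.3)"
echo "spack install $spec"
# prints: spack install zlib@1.3 %gcc@13.1.0 arch=linux-rhel8-broadwell
```

Running `spack arch` on a node prints the architecture Spack detects there, which is a quick way to see what an unpinned build would target.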

...

Open OnDemand

An updated Open OnDemand is available on Talapas2. Use Google Chrome or Firefox and navigate to:

https://ondemand.talapas.uoregon.edu/

Use your DuckID to log in.

Globus

Technical Differences

These probably won’t affect you, but they are visible differences that you might notice.

  • The Talapas2 domain name is talapas.uoregon.edu.

  • Hostnames now use the long form (e.g., login1.talapas.uoregon.edu).

  • Use the long form of hostnames to access other campus hosts (e.g., some-other-host.uoregon.edu).

  • Linux user IDs (UID) are centrally managed in Active Directory (AD).

  • Linux group IDs (GID) are centrally managed in Active Directory (AD), and the group names are longer: for example, is.racs.pirg.racs instead of just racs.

Coming soon

  • New oneAPI Intel compilers