Release Notes for the new (2023) Talapas

The new HPC cluster keeps the name Talapas (pronounced TA-la-pas) but has newer hardware and a newer operating system.

Notable updates

New operating system - Red Hat Enterprise Linux 8 (RHEL8)
New processors - 3rd generation Intel (Ice Lake) and AMD (Milan)
New GPUs - Nvidia 80GB A100s
Faster memory - DDR4 3200MT/s and Optane memory in the ‘memory’ partitions
More storage - 250GB home directories and 10TB scratch space for job I/O per PIRG

DuckIDs

Talapas authenticates through UO’s Identity Access Management system, Microsoft Active Directory, so all users must have a valid UO DuckID.

Links are provided below for external collaborators or graduating researchers to continue their access to the cluster.

External collaborators (2 options):
- https://service.uoregon.edu/TDClient/2030/Portal/Requests/ServiceDet?ID=20228
- https://hr.uoregon.edu/courtesy-campus-associate-and-other-unpaid-appointments
Graduating researchers:
- https://hr.uoregon.edu/courtesy-campus-associate-and-other-unpaid-appointments

Talapas VPN

A virtual private network (VPN) connection is required to access the cluster. This adds an extra layer of security. The Talapas VPN should provide the same capabilities as the UO VPN, plus access to Talapas.

Instructions: see the article “Getting Started with UO VPN” (uoregon.edu).

Use “uovpn.uoregon.edu/talapas” as the connection URL, and log in with your DuckID and password.

Advanced users might want to use the OpenConnect VPN client instead, which supports connecting with a command such as:

sudo openconnect --protocol=anyconnect uovpn.uoregon.edu/talapas

Note: do not repeatedly attempt to log in when you’re getting error messages. As with other uses of your DuckID at UO, if you generate a large number of login failures, all DuckID access (including things like e-mail) will be locked University-wide. Similarly, be aware of automated processes like cron jobs that might trigger this situation without your notice.

Blocked ports

Note that inbound access to the new cluster is only allowed for SSH and Open OnDemand. All other ports are blocked.

Load balancer

The preferred method of access is via login.talapas.uoregon.edu

The new cluster has 3 login nodes. A load balancer redirects SSH connections to different login nodes to spread the load. The load balancer’s choice of login node is “sticky”: repeated connections from your IP address will go to the same login node, as long as there has been some activity within the last 24 hours.
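
A minimal sketch of connecting through the load balancer from a shell. The helper name talapas_ssh and the NumberOfPasswordPrompts setting are illustrative choices, not site requirements; the single password prompt is only meant to keep a mistyped password from turning into several failed logins.

```shell
# Hostname of the load balancer, as given above.
TALAPAS_LOGIN_HOST="login.talapas.uoregon.edu"

# talapas_ssh is a hypothetical helper name. Usage: talapas_ssh <duckid>
talapas_ssh() {
    # NumberOfPasswordPrompts=1 makes ssh give up after one failed password
    # instead of the default three, to limit accidental login failures.
    ssh -o NumberOfPasswordPrompts=1 "$1@${TALAPAS_LOGIN_HOST}"
}
```

With the Talapas VPN connected, running talapas_ssh with your DuckID should open a session on whichever login node the load balancer picks.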

Slurm

Job control

Default memory is set to 4GB per CPU; use the --mem or --mem-per-cpu flag to adjust.
There is no default partition; you must specify partition(s) with --partition.
The Slurm account (PIRG) is still required; specify it with --account.
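
As a rough sketch, a batch script that sets all three options explicitly might look like the following. The account name mypirg, the time limit, and the CPU and memory figures are placeholders; --time and --cpus-per-task are standard sbatch options that these notes do not otherwise discuss.

```shell
#!/bin/bash
#SBATCH --account=mypirg        # placeholder: replace with your PIRG
#SBATCH --partition=compute     # required: there is no default partition
#SBATCH --cpus-per-task=4       # placeholder CPU count
#SBATCH --mem-per-cpu=8G        # overrides the 4GB-per-CPU default
#SBATCH --time=01:00:00         # placeholder wall-clock limit

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 so the script also runs outside of Slurm.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${threads} CPU(s)"
```

Saved as, say, job.sh, this would be submitted with sbatch job.sh.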

Partition list

compute
computelong
gpu
gpulong
interactive
interactivegpu
memory
memorylong

GPU

Three GPU memory sizes are available: 10GB, 40GB, and 80GB. Specify the GPU size with --constraint, for example: --constraint=gpu-10gb

CUDA A100 MIG slicing

Due to limitations with CUDA MIG slicing, it appears that a job can only use one slice (GPU) per host. That means one GPU per job unless MPI is being used to orchestrate GPU usage on multiple hosts. See the NVIDIA Multi-Instance GPU User Guide in the NVIDIA Tesla documentation.
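
Putting the GPU notes together, a single-GPU batch script might look like this sketch. The account name is a placeholder, and requesting the GPU with --gres=gpu:1 is an assumption about how GPUs are exposed on this cluster; only the gpu-10gb constraint name comes from the text above.

```shell
#!/bin/bash
#SBATCH --account=mypirg          # placeholder: replace with your PIRG
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1              # assumption: one GPU (MIG slice) per job
#SBATCH --constraint=gpu-10gb     # GPU memory size, as in the example above
#SBATCH --time=00:30:00           # placeholder wall-clock limit

# Slurm normally exports CUDA_VISIBLE_DEVICES for GPU jobs; report it so the
# job log shows which device was assigned ("none" if the variable is unset).
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Assigned GPU(s): ${visible_gpus}"
```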

Storage

/home

/home/<user>: store your data here. Your home directory now has a 250GB quota.

/projects

/projects/<pirg>/<user>: store your PIRG-specific work files here.
/projects/<pirg>/shared: store PIRG collaboration data here.
/projects/<pirg>/scratch: store job-specific I/O data here. Each PIRG has a 10TB quota, and this directory is purged every 30 days (any data older than 30 days is deleted).
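
The helpers below sketch one way to work with the scratch directory: keep each job's I/O in its own subdirectory, and check which files are old enough to be purged. The function names and the per-user, per-job layout are hypothetical, and using file modification time to judge age is an assumption; the purge may use a different criterion.

```shell
# Create a per-job working directory under a scratch root and print its path.
# Example: workdir=$(make_job_dir /projects/mypirg/scratch)   # mypirg is a placeholder
make_job_dir() {
    scratch_root="$1"
    job_dir="${scratch_root}/${USER:-unknown}/job_${SLURM_JOB_ID:-manual}"
    mkdir -p "$job_dir" && printf '%s\n' "$job_dir"
}

# List files under a directory that were last modified more than 30 days ago.
# Example: list_purge_candidates /projects/mypirg/scratch
list_purge_candidates() {
    find "$1" -type f -mtime +30
}
```

Anything that list_purge_candidates reports and that you still need should be copied to your /projects/<pirg>/<user> directory.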

Processor architectures

Talapas comprises nodes from multiple separate purchases over the course of several years. It therefore has several generations of processors from two vendors (Intel and AMD).

Here is the current architecture layout (this will change as condo nodes are moved to the new cluster):

login nodes: Intel Broadwell and Ice Lake (moving to all Broadwell soon)
compute: AMD Milan
gpu: AMD Milan
interactive: Intel Broadwell and Ice Lake
interactivegpu: AMD Milan
memory: Intel Ice Lake

Software

Some existing software will run fine on the new cluster. However, the operating system update to RHEL8 means that some software will likely need to be recompiled or rebuilt.

Generally, issues come from differences in the new shared libraries in RHEL8 and, in some cases, from new CPU architectures. If you compile software in a way that specifically assumes one architecture (e.g., Intel Ice Lake), it might not run on all nodes.
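
One way to avoid baking in the build host's architecture is to name a conservative target explicitly. The sketch below assumes GCC and uses broadwell as the baseline, mirroring the Spack suggestion later in these notes; it has not been verified on every node type. The function name and the TALAPAS_BASELINE_ARCH variable are hypothetical.

```shell
# Build a C source file for a fixed CPU baseline rather than the build host's CPU.
# Usage: build_portable hello.c hello
build_portable() {
    src="$1"
    out="$2"
    # -march selects the instruction set the compiler may use. Avoid
    # -march=native here, since it targets whatever node you compile on.
    gcc -O2 -march="${TALAPAS_BASELINE_ARCH:-broadwell}" -o "$out" "$src"
}
```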

Conda

In addition to the original ‘miniconda’ instance, we now have a ‘miniconda-t2’ instance. To avoid compatibility issues, we will create and update Conda environments only in the latter instance on the new cluster. (Similarly, we won’t make updates on the original instance on the new cluster.) If you have personal Conda environments, you might wish to follow a policy like this as well. Note that using existing Conda environments on either cluster should work fine; it’s making changes that might cause problems.
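
If you would rather rebuild a personal environment on the new cluster than modify an existing one in place, one approach is to export it and create a fresh copy under a new name. This is only a sketch: the function names are hypothetical, and it assumes the conda command from the instance you want is already on your PATH (for example via something like module load miniconda-t2, which is our guess at the module name).

```shell
# Export an existing environment's package list to a YAML file.
# --no-builds omits build strings, which may not exist for the new OS.
# Usage: export_env myenv myenv.yml
export_env() {
    conda env export --name "$1" --no-builds > "$2"
}

# Create a new environment from that file, leaving the original untouched.
# Usage: create_env_from_file myenv-rhel8 myenv.yml
create_env_from_file() {
    conda env create --name "$1" --file "$2"
}
```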

Spack

Similarly, in addition to our original ‘racs-spack’ Spack instance, there is now a new ‘spack-rhel8’ instance. An additional factor is that most Spack software is compiled locally, whereas Conda software is generally compiled upstream. Also, by default, Spack will compile software to assume the CPU architecture of the host it’s compiling on. So, as above, if you compile software on a new login node, it won’t necessarily run on all compute nodes.

One solution is to specify a CPU architecture that’s compatible with all of our existing hosts. We think something like this will work:

spack install your-package arch=linux-rhel8-broadwell

If you’re using your own Spack instance, you might want to take similar measures.

Technical Differences

These probably won’t affect you, but they are visible differences that you might notice.

Hostnames now use the long form (e.g., “login1.talapas.uoregon.edu”).
Use the long form of hostnames to access other campus hosts. That is, somehost.uoregon.edu.
Linux group names have changed to reflect their Active Directory name and are now longer. For example, is.racs.pirg.hpcrcf instead of hpcrcf.
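
To see these differences from a shell, hostname -f prints the long-form hostname and id -Gn lists your group names. The small helper below, whose name is hypothetical, checks whether your account is in a given group; it assumes group names contain no spaces.

```shell
# Return success if the current user belongs to the named group.
# Usage: in_group is.racs.pirg.hpcrcf && echo "member"
in_group() {
    # id -Gn prints group names separated by spaces; put one per line,
    # then look for an exact, literal match (dots are not wildcards).
    id -Gn | tr ' ' '\n' | grep -qxF -- "$1"
}
```

Scripts that hard-code the old short group names (for example in chgrp calls) would need the new names substituted.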

Coming soon

New oneAPI Intel compilers