The new cluster has newer, better hardware and runs RHEL 8, a newer version of the base operating system. Notably, a number of NVIDIA A100 GPUs will be available; these are much faster than the existing K80s. Although some things have changed, most changes are for the better, and most software should continue to “just work”.
The least you need to know

- Talapas login nodes are now behind a load balancer. This means that ‘tmux’, ‘screen’, and other long-running server processes will no longer work as before. See below.
- The partitions have changed. You can see them with the ‘sinfo’ command, and the naming is intuitive. Time limits currently match those on the existing Talapas.
- The default memory for all jobs is now 4 GB. If your job needs more, you will need to request it explicitly.
- In some cases, RHEL shared library changes or other differences may break existing software. File a ticket, and we’ll get it fixed ASAP.
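As a sketch of how to override the new 4 GB default, a minimal Slurm batch script might look like the following. The job name, memory amount, and time limit are illustrative placeholders; adjust them for your partition.

```shell
#!/bin/bash
# Minimal sketch of a batch script with an explicit memory request.
# Without --mem, the job gets the new 4 GB default.
#SBATCH --job-name=mem-demo     # placeholder job name
#SBATCH --mem=16G               # explicit request instead of the 4 GB default
#SBATCH --time=0-01:00:00       # placeholder; keep within partition limits

echo "job started with an explicit memory request"
```

Submit it with `sbatch` as usual; you can confirm the allocated memory afterward with `scontrol show job <jobid>`.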
Logging in to the new Talapas
Talapas VPN is now required
To access the new cluster in any way, your laptop or other device will need to be on the Talapas VPN. This VPN works much like the UO VPN, if you’ve used that before, and is intended to provide the same capabilities, plus access to Talapas.
To do this, follow the instructions here: Article - Getting Started with UO VPN (uoregon.edu), but use “uovpn.uoregon.edu/talapas” as the connection URL. The username and password are your standard DuckID and its password.
[Some advanced users might want to use OpenConnect (the OpenConnect VPN client) instead. This supports connecting with a command like:

```
sudo openconnect --protocol=anyconnect uovpn.uoregon.edu/talapas
```

If you’re an ordinary user, you can ignore this option.]
An important detail: access to the Talapas VPN will be removed if your access to Talapas is removed. For example, if you’re a student using Talapas only for a course, your access will be removed at some point after the course has ended, and you will then see error messages like “login failed” when trying to connect to the Talapas VPN. The fix is to switch back to the UO VPN, if desired, or simply to stop using VPNs.
Most crucially, do not repeatedly attempt to log in when you’re getting error messages. As with other uses of your DuckID at UO, if you generate a large number of failures, all DuckID access (including things like e-mail) will be locked University-wide, and you will have to talk to IT about getting it unlocked again. Similarly, be aware of automated processes like cron jobs that might trigger this situation without your notice.
Blocked ports
Note that inbound access to Talapas is only allowed for SSH and (eventually) Open OnDemand. All other ports are blocked.
Talapas now uses a load balancer
The preferred method of accessing the new Talapas is via “login.talapas.uoregon.edu”.
The new Talapas uses a load balancer, which will redirect your SSH connection to a particular login node in a somewhat arbitrary way. In particular, connections from a particular IP address will go to a login node chosen on the basis of being up and having a light load. The choice of login node is “sticky”. That is, further connections from your IP address will go to the same login node, as long as there has been some activity within the last 24 hours.
This has some implications for workflow. First, tools like ‘tmux’ and ‘screen’ will no longer work reliably in some cases. In particular, if you have a ‘tmux’ session that you started at the University and you try to connect to it from home (which will have a different IP address), it probably won’t work. Separately, if a host sees no activity from you for 24 hours, even on campus, the “sticky” effect will expire, and trying to connect to your ‘tmux’ session probably won’t work. Note that your ‘tmux’ server won’t be killed; it will just hang around in an orphaned state. If this happens, you can send a ticket to RACS, and we’ll kill it for you.
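One way to cope is to note which login node the load balancer sent you to before detaching, and later reconnect to that node by name. This is a sketch; it assumes the individual login nodes also accept direct SSH connections, and “your-duckid” and the node name are placeholders.

```shell
# Print the name of the login node you landed on, so you can find
# your tmux session again later:
hostname        # e.g., login1.talapas.uoregon.edu

# Later, from a different network or after the 24-hour window
# (hypothetical; assumes direct SSH to individual login nodes is allowed):
#   ssh your-duckid@login1.talapas.uoregon.edu
#   tmux attach
```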
Not yet available but coming soon

- Open OnDemand
- The new Intel compilers (the existing compilers are down/gone due to licensing issues)
- More A100s
- cron jobs
Notable issues
CUDA MIG slicing (on A100s)
Due to limitations with CUDA MIG slicing, it appears that a job can only use one slice (GPU) per host; in practice, that means one GPU per job, unless MPI is used to orchestrate GPU usage across multiple hosts. See NVIDIA Multi-Instance GPU User Guide :: NVIDIA Tesla Documentation.
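As a sketch, a single-slice GPU job request might look like the following. The partition and gres names here are assumptions, not the cluster's actual values; check `sinfo` (e.g., `sinfo -o "%P %G"`) for the real ones.

```shell
#!/bin/bash
# Hypothetical job script requesting one MIG slice (one GPU).
# Partition and gres names are placeholders; verify with `sinfo`.
#SBATCH --partition=gpu     # placeholder partition name
#SBATCH --gres=gpu:1        # one slice; one per host is the effective limit

echo "requested one MIG slice"
```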
RHEL 8 libcrypto botch vs miniconda
Red Hat added a patch to their libcrypto libraries that collides with miniconda. See SSL library conflicts on CentOS 8 · Issue #10241 · conda/conda (github.com).
So, for example, you might see things like this:
```
$ module load miniconda
$ emacs
emacs: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b
$ ssh localhost
ssh: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b
$ curl https://www.google.com
curl: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b
```
Not all distribution commands will fail, but quite a few do. For now, the workaround is either to load miniconda only for the commands that need it, or to unload it before running a command that exhibits the bug. For example, something like this:
```
curl yada
(module load miniconda && conda activate myfavoriteenv && mycommandinthatenv someargs)
curl yada
```
or something like this:
```
module load miniconda
conda activate myfavoriteenv
(module purge && curl yada)
mycommandinthatenv someargs
(module purge && curl yada)
```
Obviously, both are pretty awful. We’ll look for a proper fix, but it might be a while.
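In the meantime, a small shell helper can make the second workaround less painful. This is just a sketch; the `module` command comes from the cluster's environment-modules system, and the helper name is our own invention.

```shell
# clean_run: run a single command with all modules purged, in a subshell,
# so distribution binaries see the system libcrypto again.
# The subshell keeps the purge from affecting your main session.
clean_run() {
  ( module purge && "$@" )
}

# usage (on the cluster): clean_run curl https://www.google.com
```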
Technical Differences
These probably won’t affect you, but they are visible differences that you might notice.
- Hostnames now use the long form (e.g., “login1.talapas.uoregon.edu”).
- You may need to use the long form of hostnames to access other campus hosts. That is, “somehost” may not work, but “somehost.uoregon.edu” will.
- Linux group names have changed and are now longer; for example, “is.racs.pirg.bgmp” instead of “bgmp”. Since this information now comes from the campus Active Directory server, a number of other mysterious AD groups are included. You can just ignore these.
- Currently, lookup of group names can be quite slow, taking 30 seconds or longer. We’ll work on speeding this up.
- Generally, RACS is discouraging the use of POSIX ACLs on the new cluster. You can still use them yourself, but there are now other alternatives. If you’re tempted to use ACLs to solve a problem, consider asking about the alternatives.