...

  • Open OnDemand

  • The new Intel compilers (the existing compilers are no longer available due to licensing issues)

  • More A100s

  • cron jobs

Notable issues

CUDA MIG slicing (on A100s)

Due to limitations with CUDA MIG slicing, it appears that a job can use only one slice (GPU) per host. In practice, that means one slice per job, unless MPI is being used to orchestrate GPU usage across multiple hosts. See NVIDIA Multi-Instance GPU User Guide :: NVIDIA Tesla Documentation.
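As a rough sanity check, a job script could count how many devices it has been handed before starting work. This is only a sketch: it assumes the scheduler exports a comma-separated CUDA_VISIBLE_DEVICES variable to the job, which this page does not confirm, and count_visible_devices is a made-up helper name.

```shell
# Hypothetical helper: count the entries in a CUDA_VISIBLE_DEVICES-style list.
# ASSUMPTION: the job environment sets CUDA_VISIBLE_DEVICES (not stated on this page).
count_visible_devices() {
    # Use the first argument if given, otherwise fall back to the environment.
    list="${1-${CUDA_VISIBLE_DEVICES-}}"
    if [ -z "$list" ]; then
        echo 0
        return 0
    fi
    # Split on commas and count the non-empty fields.
    printf '%s\n' "$list" | tr ',' '\n' | grep -c .
}

# Example use inside a job script:
#   n=$(count_visible_devices)
#   [ "$n" -gt 1 ] && echo "warning: $n devices listed; only one MIG slice may be usable" >&2
```

If the count is greater than one on a MIG-sliced host, the job is probably asking for more than it can use on that host.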

RHEL 8 libcrypto botch vs miniconda

Red Hat added a patch to their libcrypto libraries that collides with miniconda. See SSL library conflicts on CentOS 8 · Issue #10241 · conda/conda (github.com).

So, for example, you might see things like this:

Code Block
$ module load miniconda
$ emacs
emacs: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b
$ ssh localhost
ssh: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b
$ curl https://www.google.com
curl: symbol lookup error: /lib64/libk5crypto.so.3: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b

Not all distribution commands will fail, but quite a few do. For now, the workaround is either to load miniconda only for the commands that need it, or to unload it before running a command that exhibits the bug. For example, something like this:

Code Block
curl yada
(module load miniconda && conda activate myfavoriteenv && mycommandinthatenv someargs)
curl yada

or something like this:

Code Block
module load miniconda
conda activate myfavoriteenv
(module purge && curl yada)
mycommandinthatenv someargs
(module purge && curl yada)

Obviously, both are pretty awful. We’ll look for a proper fix, but it might be a while.
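In the meantime, the second pattern can be wrapped in a small shell function so the subshell and purge don't have to be typed each time. This is just a convenience sketch: without_modules is a made-up name, and it assumes the usual module shell function is available in your session.

```shell
# Hypothetical helper: run one command with all modules purged.
# The parentheses start a subshell, so the purge does not affect
# the modules loaded in the calling shell.
without_modules() {
    (module purge && "$@")
}

# Example use, with miniconda still loaded in the current shell:
#   without_modules curl yada
#   without_modules ssh localhost
```

Because the purge happens in a subshell, your loaded modules stay in place for the commands that follow. Whether a conda environment that is still activated interferes has not been checked here.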

Technical Differences

These probably won’t affect you, but they are visible differences that you might notice.

  • Hostnames now use the long form (e.g., “login1.talapas.uoregon.edu”).

  • You may need to use the long form of hostnames to access other campus hosts. That is, using “somehost” may not work, but “somehost.uoregon.edu” will.

  • Linux group names have changed and are now longer. For example, “is.racs.pirg.bgmp” instead of “bgmp”. Since this information is now coming from the campus Active Directory server, there are a number of other mysterious AD groups included. You can just ignore these.

  • Currently, lookup of group names can be quite slow, taking 30 seconds or longer. We’ll work on speeding this up.

  • Generally, RACS is discouraging the use of POSIX ACLs on the new cluster. You can still use them yourself, but there are now other alternatives. If you’re tempted to use ACLs to solve a problem, consider asking about the alternatives.

  • In RHEL 8, the distribution executables seem to be fully stripped, removing all debug symbols. There’s probably an alternate way to get the symbols back, and we’ll look for it eventually.
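If the longer group names and the extra AD groups make the output of id hard to read, a small filter can show just the familiar short names. This is a sketch only: it assumes that all of the relevant groups share the “is.racs.pirg.” prefix seen in the example above, which may not hold for every group, and short_pirg_groups is a made-up name.

```shell
# Hypothetical filter: keep only group names that start with "is.racs.pirg."
# and print them without that prefix. Reads names from standard input,
# separated by newlines or NULs.
# ASSUMPTION: the prefix is inferred from the single example on this page.
short_pirg_groups() {
    tr '\000' '\n' | sed -n 's/^is\.racs\.pirg\.//p'
}

# Example use (GNU id; -z separates names with NULs, so AD group
# names containing spaces stay intact):
#   id -Gnz | short_pirg_groups
```

Note that the id command itself may still be slow while group lookups are slow.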