cotainr still looks for old base images on LUMI #67
Comments
This is probably a LUMI packaging/configuration issue, since this repository doesn't specify a default base image for LUMI.
Sounds likely. Where would you suggest I file it instead?
@kaare-mikkelsen I believe @joasode is aware of this and is looking into it.
Yes, https://lumi-supercomputer.eu/user-support/need-help/ is the right place to report this. However, as @TheBlackKoala noted, we are already working with the LUMI User Support Team to sort this out, so no need to open a ticket this time. The core issue here is that there are currently no officially supported LUMI base images available following the recent LUMI maintenance break. We are looking at possible workarounds until such base images become available. Hopefully, we will have some recommendations ready later today or tomorrow.
**Some context**

There was a big system update to LUMI a couple of weeks ago. It included an update to the KFD/AMDGPU driver on LUMI to align with ROCm 6.0. This driver version officially supports ROCm 5.6-6.2. All the previous LUMI ROCm base images used with cotainr are no longer available following the update.

**Workarounds**

You can always manually pick a base image for use with cotainr via the `--base-image` option.
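A minimal sketch of what such a manual choice looks like on the command line; the output name, conda environment file, and base image path are placeholders, not specific recommendations from this comment:

```sh
# Pick a base image explicitly instead of relying on the --system=lumi-g
# default (all file names below are placeholders).
cotainr build my_container.sif \
    --base-image=/path/to/some/base-image.sif \
    --conda-env=my_environment.yml
```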
**A note on PyTorch versions**

As for PyTorch versions, we generally recommend `torch<=2.3`, since 2.4 currently causes crashes on LUMI in certain situations. Also, we have seen degraded performance for some PyTorch and TensorFlow training workflows following the maintenance break on LUMI. We are still investigating this.
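To act on this recommendation when building with cotainr, the pin can go into the conda environment file passed via `--conda-env`. A minimal sketch with an illustrative file name and package list; the ROCm-specific PyTorch wheel source is deliberately left out here:

```sh
# Write an example conda environment file that keeps torch at <=2.3
# (file name and package selection are illustrative; the ROCm-specific
# PyTorch wheel index is omitted for brevity).
cat > my_environment.yml << 'EOF'
name: my_environment
channels:
  - conda-forge
dependencies:
  - python=3.12
  - pip
  - pip:
      - torch<=2.3   # stay below 2.4 as recommended above
EOF
```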
On LUMI, new LUMI ROCm base images are now available under `/appl/local/containers/sif-images/` (see the listing sketch below).

On LUMI, you may use these with the `--base-image` option. We'll try to update the cotainr installation on LUMI ASAP to use these new base images by default.
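A rough sketch of how to discover and use one of these images; the PyTorch image name below is the one mentioned later in this thread, and the conda environment file name is a placeholder:

```sh
# See which base images are currently provided on LUMI.
ls /appl/local/containers/sif-images/

# Build against one of them explicitly; this replaces relying on the
# --system=lumi-g default for now.
cotainr build my_container.sif \
    --base-image=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif \
    --conda-env=my_environment.yml
```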
As of this PR to the LUMI Software stack, the cotainr installation on LUMI has been updated to use the new base images.
When building using `--system=lumi-g`, I get the following error:
```
SingularitySandbox.err:-: FATAL: Unable to build from /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: unable to open file /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: open /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: no such file or directory
```
Which is of course pretty reasonable.
Using `--base-image=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif` seems to work fine.
I am importing cotainr from CrayEnv.
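For completeness, a sketch of getting cotainr from CrayEnv on LUMI, assuming the usual module layout:

```sh
# Load cotainr from the CrayEnv software stack (module names assume the
# usual LUMI setup).
module load CrayEnv
module load cotainr

# Check which build options this cotainr installation accepts.
cotainr build --help
```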