cotainr still looks for old base images on LUMI #67

Closed
kaare-mikkelsen opened this issue Sep 25, 2024 · 8 comments

Comments

@kaare-mikkelsen

When building with --system=lumi-g, I get the following error:

SingularitySandbox.err:-: FATAL: Unable to build from /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: unable to open file /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: open /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif: no such file or directory

Which is of course pretty reasonable.

Using

--base-image=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif

seems to work fine.
I am importing cotainr from CrayEnv.

@akx
Contributor

akx commented Sep 26, 2024

This is probably a LUMI packaging/configuration issue, since this repository doesn't specify a systems.json that'd contain lumi-g (other than as an example in the readme and documentation).

@kaare-mikkelsen
Author

Sounds likely. Where would you suggest I file it instead?

@akx
Contributor

akx commented Sep 26, 2024

@TheBlackKoala
Contributor

@kaare-mikkelsen I believe @joasode is aware of this and is looking into it.

@Chroxvi
Contributor

Chroxvi commented Sep 26, 2024

Yes, https://lumi-supercomputer.eu/user-support/need-help/ is the right place to report this. However, as @TheBlackKoala noted, we are already working with the LUMI User Support Team to sort this out, so there is no need to open a ticket this time. The core issue is that there are currently no officially supported LUMI base images available following the recent LUMI maintenance break. We are looking at possible workarounds until such base images become available. Hopefully, we will have some recommendations ready later today or tomorrow.

@Chroxvi
Contributor

Chroxvi commented Sep 26, 2024

Some context

There was a big system update to LUMI a couple of weeks ago. It included an update to the KFD/AMDGPU driver on LUMI to align with ROCm 6.0. This driver version officially supports ROCm 5.6-6.2.

All the previous LUMI ROCm base images used with cotainr build --system=lumi-g ... on LUMI were deprecated following the maintenance break, since they were based on the now too old ROCm 5.4-5.5, built against a now incompatible version of the Cray Libfabric network stack on LUMI (which is used for fully hardware accelerated RCCL via the aws-ofi-rccl plugin when scaling to multiple compute nodes), or both. The old base images are still available on LUMI under /appl/local/containers/prior-sep2024-update/sif-images/. The ROCm 5.6.x images may still work depending on your specific use case.

So far the LUMI User Support Team has not released any new base image for ROCm 5.6-6.2 built against the new Cray Libfabric network stack on LUMI. Only a few new PyTorch containers are currently available under /appl/local/containers/sif-images/. It is unclear when new LUMI ROCm base images will be available. As soon as they become available, we will update the cotainr installation in the CrayEnv stack on LUMI to make them available via --system=lumi-g.

Workarounds

You can always manually pick a base image for use with cotainr via --base-image=<some_base_image_URI> instead of using --system. Until --system=lumi-g works again on LUMI, here are some suggested base images for different use cases. These are all suboptimal in different ways, so please only use them until a proper set of LUMI base images becomes available:

| Use case | Base image | Notes |
| --- | --- | --- |
| Need ROCm >=6.0, but only scales to a single node on LUMI | docker://rocm/dev-ubuntu-22.04:6.0.2-complete (or another tag, if needed) | Falls back to communication via sockets across multiple nodes which doesn't scale very well. |
| Need ROCm 5.7 or 6.0 and needs to scale to multiple nodes | /appl/local/containers/sif-images/lumi-pytorch-rocm-5.7.3-python-3.12-pytorch-v2.2.2.sif or /appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif | Cotainr will install an additional conda environment into these containers in addition to the one they already provide. The resulting image becomes very large and bloated, but should otherwise work. |
| Need ROCm =5.6 | /appl/local/containers/prior-sep2024-update/sif-images/lumi-rocm-rocm-5.6.1.sif | This may or may not work depending on your use case. |
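
As a rough sketch of how one might script the choice for the multi-node case, the snippet below checks which of the two PyTorch images from the table exists and prints a matching cotainr build command. The helper name first_existing_image, the preference order, and the file names my_container.sif and my_conda_env.yml are assumptions for illustration; the helper is not part of cotainr.

```shell
#!/bin/sh
# Hypothetical helper (not part of cotainr): print the first candidate
# base image that exists as a regular file, or return 1 if none does.
first_existing_image() {
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidates from the multi-node row of the table; the order is an assumption.
if base_image=$(first_existing_image \
    /appl/local/containers/sif-images/lumi-pytorch-rocm-6.0.3-python-3.12-pytorch-v2.3.1.sif \
    /appl/local/containers/sif-images/lumi-pytorch-rocm-5.7.3-python-3.12-pytorch-v2.2.2.sif); then
    # Print the command instead of running it, so it can be reviewed first.
    echo "cotainr build my_container.sif --base-image=$base_image --conda-env=my_conda_env.yml"
else
    echo "No candidate base image found on this system." >&2
fi
```

Off LUMI, none of the paths exist, so the script only prints the "not found" message to stderr.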

A note on PyTorch versions

As for PyTorch versions, we generally recommend torch<=2.3 since 2.4 currently causes crashes on LUMI in certain situations. Also, we have seen degraded performance for some PyTorch and TensorFlow training workflows following the maintenance break on LUMI. We are still investigating this.
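
A minimal sketch of a conda environment file that applies this pin, for use with --conda-env. It assumes the recommendation is meant to include 2.3.x patch releases: pip treats 2.3.1 as greater than 2.3, so the specifier torch<2.4 is used here rather than a literal torch<=2.3. The Python version, the conda-forge channel, and the PyTorch ROCm 6.0 wheel index URL are also assumptions that do not come from this thread, so adjust them to your setup.

```shell
#!/bin/sh
# Write a minimal conda environment file for use with --conda-env.
# Assumptions (not from this thread): the Python version, the conda-forge
# channel, and the PyTorch ROCm 6.0 wheel index URL.
cat > my_conda_env.yml <<'EOF'
name: my_conda_env
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip
  - pip:
    - --extra-index-url https://download.pytorch.org/whl/rocm6.0
    - torch<2.4
EOF

echo "Wrote my_conda_env.yml"
```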

@Chroxvi
Contributor

Chroxvi commented Oct 9, 2024

On LUMI, the following LUMI ROCm base images are now available under /appl/local/containers/sif-images/:

  • lumi-rocm-rocm-5.7.3.sif
  • lumi-rocm-rocm-6.0.3.sif
  • lumi-rocm-rocm-6.1.3.sif
  • lumi-rocm-rocm-6.2.0.sif
  • lumi-rocm-rocm-6.2.1.sif
  • lumi-rocm-rocm-6.2.2.sif

On LUMI, you may use these with the --base-image option, e.g. cotainr build my_container.sif --base-image=/appl/local/containers/sif-images/lumi-rocm-rocm-6.0.3.sif --conda-env=my_conda_env.yml.
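
If you switch between ROCm versions often, a small wrapper along these lines may help. It is only a sketch: lumi_rocm_image is a hypothetical helper (not part of cotainr) that maps one of the versions listed above to its image path and prints the build command for review.

```shell
#!/bin/sh
# Hypothetical helper (not part of cotainr): map a ROCm version to the
# matching LUMI base image path, accepting only the versions listed above.
lumi_rocm_image() {
    case "$1" in
        5.7.3|6.0.3|6.1.3|6.2.0|6.2.1|6.2.2)
            printf '/appl/local/containers/sif-images/lumi-rocm-rocm-%s.sif\n' "$1"
            ;;
        *)
            echo "No LUMI ROCm base image listed for version '$1'" >&2
            return 1
            ;;
    esac
}

# Print the build command for review instead of running it directly.
if image=$(lumi_rocm_image 6.2.2); then
    echo "cotainr build my_container.sif --base-image=$image --conda-env=my_conda_env.yml"
fi
```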

We'll try to update the cotainr installation on LUMI ASAP to use lumi-rocm-rocm-6.0.3.sif when specifying --system=lumi-g.

@joasode
Contributor

joasode commented Oct 23, 2024

As of this PR to the LUMI Software stack, --system=lumi-g now looks for the new base image /appl/local/containers/sif-images/lumi-rocm-rocm-6.0.3.sif

@joasode joasode closed this as completed Oct 23, 2024