Dockerfile and accompanying documentation #970
Codecov Report
@@ Coverage Diff @@
## main #970 +/- ##
==========================================
- Coverage 91.85% 91.77% -0.09%
==========================================
Files 74 72 -2
Lines 10712 10485 -227
==========================================
- Hits 9840 9623 -217
+ Misses 872 862 -10
Flags with carried forward coverage won't be shown. 7 files have indirect coverage changes.
Another container to try: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
Looks good so far. Only the mpi4py suggestion jumps out at me.
@Mystic-Slice @shahpratham @JedrzejMosiezny if you happen to have time, can you try this out as well? Thanks a lot!
Thanks @bhagemeier for providing this. My main problem with this PR is that I don't understand / can't follow the README.
docker/README.md
The [Dockerfile](./Dockerfile) that drives the build of the Docker image is located in this
directory. It is usually most convenient to `cd` into this directory and run the Docker build as:

$ docker build .
Instructions on how to run the container are needed here.
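A minimal sketch of what such run instructions could look like. It assumes Docker plus the NVIDIA Container Toolkit on the host (required for `--gpus`), that `python` and `bash` exist in the image, and that `heat:local` is a hypothetical local tag; the function names are also hypothetical. Setting `DRY_RUN=1` only prints the commands.

```shell
# Sketch only: build and smoke-test the Heat image locally.
# Assumes Docker and the NVIDIA Container Toolkit (for --gpus) on the host.

# Hypothetical local tag; override via the IMAGE environment variable.
IMAGE="${IMAGE:-heat:local}"

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build from the directory containing the Dockerfile and tag in one step,
# so no image ID has to be copied by hand.
heat_build() {
  run docker build -t "$IMAGE" .
}

# Start a throwaway container and check that Heat can be imported.
heat_smoke_test() {
  run docker run --rm --gpus all "$IMAGE" python -c "import heat"
}

# Open an interactive shell inside the container (assumes bash in the image).
heat_shell() {
  run docker run --rm -it --gpus all "$IMAGE" bash
}
```

After sourcing the file, `heat_build && heat_smoke_test` builds the image and checks the import; `heat_shell` drops into the container for interactive work.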
docker/README.md
The resulting image (ID) should then be tagged for subsequent upload (push) to a
repository, for example:

$ docker tag ea0a1040bf8a ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9
$ docker push ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9

Please ensure that you push the same tag that you just created.
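A minimal sketch of keeping the tag in one place so that exactly the tag that was created is also pushed. The naming scheme `<heat>_torch<torch>_cuda<cuda>_py<python>` is inferred from the example tag above; the function names are hypothetical, and `DRY_RUN=1` only prints the Docker commands. Pushing to `ghcr.io` additionally requires a prior `docker login ghcr.io` with write access.

```shell
# Compose an image tag following the pattern of the example above,
# e.g. 1.2.0_torch1.11_cuda11.5_py3.9.
heat_image_tag() {
  local heat="$1" torch="$2" cuda="$3" py="$4"
  printf '%s_torch%s_cuda%s_py%s\n' "$heat" "$torch" "$cuda" "$py"
}

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Tag a local image (ID or name) and push it, reusing one reference for
# both steps so the pushed tag cannot diverge from the created one.
heat_tag_and_push() {
  local source_image="$1" tag="$2"
  local target="ghcr.io/helmholtz-analytics/heat:${tag}"
  run docker tag "$source_image" "$target"
  run docker push "$target"
}
```

For example, `heat_tag_and_push ea0a1040bf8a "$(heat_image_tag 1.2.0 1.11 11.5 3.9)"` issues the same two commands as the snippet above.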
Is this for the devs or the users? I'm not sure what the use case is.
For me it is also unclear whether I have to run this when I just want to use Heat in a containerized version, or whether this just explains what had to be done to create the container.
docker/README.md
## Building for HPC

Since HeAT is a native HPC library, one would naturally also want to build the container
image for HPC systems, such as the ones available at [Juelich Supercomputing Centre
(JSC)](https://www.fz-juelich.de/jsc/ "Juelich Supercomputing Centre").

HPC centres may run either Apptainer or Singularity, which can limit the flexibility of
building images. For instance, the Singularity Image Builder (SIB) does not support the
build arguments mentioned above, so these have to be avoided.

However, SIB can use almost any Docker image from any registry, so a specific
Singularity image can be built simply by referencing an existing image. SIB thus serves
as a conversion tool.

A simple `Dockerfile` (in addition to the one above) to be used with SIB could look like
this:

FROM ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9
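The `FROM` line is the content of a file, not a shell command. A minimal sketch of creating that one-line Dockerfile in a separate directory, so the full Dockerfile in `docker/` is not overwritten; the directory name `sib` and the function name are hypothetical.

```shell
# Write the one-line Dockerfile that SIB converts into a Singularity image.
# A separate directory keeps it apart from the full Dockerfile.
write_sib_dockerfile() {
  local dir="$1" image="$2"
  mkdir -p "$dir"
  cat > "$dir/Dockerfile" <<EOF
FROM ${image}
EOF
}
```

For example, `write_sib_dockerfile sib ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9` creates `sib/Dockerfile`; the `sib` commands would then be run from inside that directory.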
I don't get it. I would like to start containerized Heat from source on a specific branch. What am I supposed to do? Should I load some modules first?
FROM ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9
-bash: FROM: command not found
I added a working multi-node example to the Docker README and a Singularity definition file. Both worked on HoreKa. It would be good if someone could confirm that they work on other systems as well.
The Dockerfile provides some flexibility in selecting which version of HeAT should be inside the Docker image. Also, one can choose whether to install from source or from PyPI.
Some code sections had a mix of spaces and tabs, which have now been converted into tabs.
Use pytorch 1.11
Fix problem with CUDA package repo keys
NVIDIA images come with HPC support that is desirable for our use cases. They work a little differently internally and required some changes. The tzdata configuration sets the CET/CEST timezone, which seems to be required when installing additional packages. An issue with pip caches in the image caused the final cache purge in the PyPI-release-based build to fail. This is fixed through a final invocation of `true`.
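The commit does not show the exact Dockerfile line, so the following is only a sketch of the pattern it describes: appending `|| true` (one way of adding a final `true`) makes a best-effort cleanup step succeed even when it fails, which keeps a `set -e` script, or a Docker `RUN` instruction that would otherwise fail on a non-zero exit status, from aborting. The failing `purge_pip_cache` function is a hypothetical stand-in; in the Dockerfile the step would be `pip cache purge`.

```shell
set -e

# Hypothetical stand-in for a cache purge that fails inside the image.
purge_pip_cache() {
  echo "purging pip cache" >&2
  return 1
}

# Without "|| true" the failing purge would abort the script here.
purge_pip_cache || true
echo "build continues"
```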
6a69b56 to 64b474f
for more information, see https://pre-commit.ci
docker/README.md
Dockerfile. This method does not support build arguments, so the version, branch and type of installation have to be
changed in the definition file.

$ singularity build heat_1.2.0_torch.11_cuda11.5_py3.9.sif heat-singularity-image.def
I tried this one on CARO, but got
FATAL: You must be the root user, however you can use --remote or --fakeroot to build from a Singularity recipe file
Option --remote
however yields
FATAL: Unable to submit build job: no authentication token, log in with `singularity remote login`
and option --fakeroot
yields
FATAL: could not use fakeroot: no mapping entry found in /etc/subuid for hopp_fa
Is this just caused by the configuration on our cluster, or can you add some hints here on how to resolve this problem in general?
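A possible workaround, assuming the published image already contains everything needed: building from a `docker://` URI instead of a definition file normally works without root, `--fakeroot` or a remote builder, because no recipe sections are executed. The trade-off is that nothing can be customised at build time. The `--fakeroot` error itself points to a missing `/etc/subuid` entry, which only an administrator can add (on SingularityCE, for example via `singularity config fakeroot --add <user>`). The function name is hypothetical, and `DRY_RUN=1` only prints the command.

```shell
# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Convert a published Docker image into a SIF file. Pulling from a
# docker:// URI does not execute a recipe, so root is normally not needed.
# SIF_TOOL selects `singularity` (default) or `apptainer`.
heat_pull_sif() {
  local tag="$1"
  local tool="${SIF_TOOL:-singularity}"
  run "$tool" build "heat_${tag}.sif" "docker://ghcr.io/helmholtz-analytics/heat:${tag}"
}
```

For example, `heat_pull_sif 1.2.0_torch1.11_cuda11.5_py3.9` produces `heat_1.2.0_torch1.11_cuda11.5_py3.9.sif` in the current directory.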
docker/README.md
A simple `Dockerfile` (in addition to the one above) to be used with SIB could look like
this:

FROM ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9
Do we need to replace the Dockerfile from above (both have the same name)?
docker/README.md

The invocation to build the image would be:

$ sib upload ./Dockerfile heat_1.2.0_torch.11_cuda11.5_py3.9
Maybe I am just a bit confused (and could not try out the commands because sib is not available on CARO), but where do I upload to and where do I download from here?
docker/README.md
$ sib build --recipe-name heat_1.2.0_torch.11_cuda11.5_py3.9
$ sib download --recipe-name heat_1.2.0_torch.11_cuda11.5_py3.9

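A minimal sketch that chains the three SIB steps shown above under a single recipe name, so upload, build and download cannot refer to different recipes. It only uses the `sib` subcommands and options that appear in the snippet; the function name is hypothetical, and `DRY_RUN=1` only prints the commands.

```shell
# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Upload the Dockerfile, build it as a Singularity image, then fetch the
# result, stopping at the first failing step.
heat_sib_convert() {
  local recipe="$1" dockerfile="${2:-./Dockerfile}"
  run sib upload "$dockerfile" "$recipe" &&
    run sib build --recipe-name "$recipe" &&
    run sib download --recipe-name "$recipe"
}
```

For example, `heat_sib_convert heat_1.2.0_torch.11_cuda11.5_py3.9` issues the same three commands as the snippet above.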
### Apptainer (formerly Singularity)
If this is the variant most likely to be used in HPC environments, maybe we should put this paragraph at the beginning and move Docker and SIB to an "expert" section?
When I run
I get the following error:
Is some file missing?
Thank you for the PR!
I tried it out. The build worked without problems on my workstation; however, running it resulted in the following problem:
Co-authored-by: Claudia Comito <39374113+ClaudiaComito@users.noreply.github.com>
I tested on our cluster as well. I needed to modify the above script a bit:
I don't know what
I removed the flag from the docs; I had missed it when creating the template.
Thanks @bhagemeier @JuanPedroGHM @coquelin77! Merging.
Description
Provide a Dockerfile and short README about how to use it.
Issue/s resolved: #897
Changes proposed:
Type of change
Repository structure extension, no code change.
Memory requirements
n/a
Performance
n/a
Due Diligence
Does this change modify the behaviour of other functions? If so, which?
no
skip ci