Dockerfile and accompanying documentation #970

Merged
merged 20 commits into from
Jun 20, 2023

Conversation

bhagemeier
Member

The Dockerfile provides some flexibility in selecting which version of HeAT should be inside the Docker image. Also, one can choose whether to install from source or from PyPI.

Description

Provide a Dockerfile and short README about how to use it.

Issue/s resolved: #897

Changes proposed:

  • add docker/{Dockerfile, README.md}

Type of change

Repository structure extension, no code change.

Memory requirements

n/a

Performance

n/a

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

no

skip ci

@codecov

codecov bot commented Apr 28, 2022

Codecov Report

Merging #970 (072d504) into main (6ddc295) will decrease coverage by 0.09%.
The diff coverage is n/a.

❗ Current head 072d504 differs from pull request most recent head 2cd4785. Consider uploading reports for the commit 2cd4785 to get more accurate results

@@            Coverage Diff             @@
##             main     #970      +/-   ##
==========================================
- Coverage   91.85%   91.77%   -0.09%     
==========================================
  Files          74       72       -2     
  Lines       10712    10485     -227     
==========================================
- Hits         9840     9623     -217     
+ Misses        872      862      -10     
Flag Coverage Δ
unit 91.77% <ø> (-0.09%) ⬇️

see 7 files with indirect coverage changes

@bhagemeier bhagemeier marked this pull request as ready for review October 5, 2022 13:58
@coquelin77
Member

another container to try: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch

@coquelin77 coquelin77 left a comment

looks good so far. only the mpi4py suggestion jumps out to me

@ClaudiaComito
Contributor

@Mystic-Slice @shahpratham @JedrzejMosiezny if you happen to have time, can you try this out as well? Thanks a lot!

@ClaudiaComito ClaudiaComito left a comment

Thanks @bhagemeier for providing this. My main problem with this PR is that I don't understand / can't follow the README.

docker/README.md Outdated
The [Dockerfile](./Dockerfile) guiding the build of the Docker image is located in this
directory. It is typically most convenient to `cd` over here and run the Docker build as:

$ docker build .
Contributor

Instructions to run the container needed here

docker/README.md Outdated
Comment on lines 20 to 28
The resulting image (ID) should then be tagged for subsequent upload (push) to a
repository, for example:

$ docker tag ea0a1040bf8a ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9
$ docker push ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9

Please ensure that you push the same tag that you just created.
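
The tag in the example above appears to follow a `<heat>_torch<torch>_cuda<cuda>_py<python>` pattern. A minimal Python sketch of composing such an image reference, so that the `docker tag` and `docker push` calls use exactly the same string. The function name `image_ref` and its parameters are hypothetical, and the naming scheme is inferred from the single example in the README rather than documented anywhere in this thread.

```python
def image_ref(heat: str, torch: str, cuda: str, python: str,
              repo: str = "ghcr.io/helmholtz-analytics/heat") -> str:
    """Compose an image reference following the pattern seen in the README example."""
    # Assumed scheme: <heat>_torch<torch>_cuda<cuda>_py<python>
    tag = f"{heat}_torch{torch}_cuda{cuda}_py{python}"
    return f"{repo}:{tag}"


ref = image_ref("1.2.0", "1.11", "11.5", "3.9")
print(ref)  # ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9
```

Building the reference once and reusing it for both commands is one way to make sure the pushed tag is the one that was just created.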
Contributor

Is this for the devs or the users? I'm not sure what the use case is.

Collaborator

For me it is also unclear whether I have to run this when I just want to use Heat in a containerized version, or whether this just explains what had to be done to create the container.

docker/README.md Outdated
Comment on lines 28 to 49
## Building for HPC

With HeAT being a native HPC library, one would naturally want to build the container
image also for HPC systems, such as the ones available at [Juelich Supercomputing Centre
(JSC)](https://www.fz-juelich.de/jsc/ "Juelich Supercomputing Centre").

HPC centres may run a choice of Apptainer or Singularity, which may incur limitations to
the flexibility of building images. For instance, the Singularity Image Builder (SIB)
does not work with the arguments mentioned above, such that these will have to be
avoided.

However, SIB is capable of using just about any available Docker image from any
registry, such that a specific Singularity image can be built by simply referencing the
available image. SIB is thus used as a conversion tool.

A simple `Dockerfile` (in addition to the one above) to be used with SIB could look like
this:

FROM ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9

Contributor

I don't get it. I would like to start containerized Heat from source on a specific branch. What am I supposed to do? Should I load some modules first?

FROM ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9
-bash: FROM: command not found
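
`FROM` is a Dockerfile instruction, not a shell command, which is why bash rejects it in the output above; the line is meant to be the content of a file. A minimal Python sketch that writes it to a Dockerfile in a separate directory, so that it does not overwrite the main Dockerfile. The helper name `write_sib_dockerfile` and the `sib` subdirectory are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path

# Image reference taken from the README excerpt above.
BASE_IMAGE = "ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9"


def write_sib_dockerfile(directory: Path, base_image: str = BASE_IMAGE) -> Path:
    """Write a one-line Dockerfile that only references an existing image."""
    directory.mkdir(parents=True, exist_ok=True)
    dockerfile = directory / "Dockerfile"
    dockerfile.write_text(f"FROM {base_image}\n")
    return dockerfile


# Demonstrate in a temporary directory so nothing in the repository is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = write_sib_dockerfile(Path(tmp) / "sib")
    print(path.read_text(), end="")
```

The resulting file would then be the one passed to the image builder, while the full Dockerfile stays untouched in its own directory.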

@JuanPedroGHM
Member

Added a working multi-node example to the Docker README and a Singularity definition file. Both worked on HoreKa. It would be good if someone could confirm that they work on other systems as well.

bhagemeier and others added 9 commits March 3, 2023 10:08
The Dockerfile provides some flexibility in selecting which version of HeAT should be inside
the Docker image. Also, one can choose whether to install from source or from PyPI.
Some code sections had a mix of spaces and tabs, which have now been
converted into tabs.
Use pytorch 1.11
Fix problem with CUDA package repo keys
NVidia images come with support for HPC systems desirable for our uses.
They work a little differently internally and required some changes.

The tzdata configuration configures the CET/CEST timezone, which seems
to be required when installing additional packages.

There is an issue with pip caches in the image, which caused the final
cache purge to fail in the PyPI release based build. This is fixed
through a final invocation of true.
@JuanPedroGHM JuanPedroGHM force-pushed the features/897-containerization branch from 6a69b56 to 64b474f Compare March 3, 2023 09:11
@ClaudiaComito ClaudiaComito added this to the 1.3.0 milestone Mar 29, 2023
docker/README.md Outdated
Dockerfile. This method does not support build arguments, so version, branch and type of installation have to
be changed in the definition file.

$ singularity build heat_1.2.0_torch.11_cuda11.5_py3.9.sif heat-singularity-image.def
Collaborator

I tried this one on CARO, but got

FATAL:   You must be the root user, however you can use --remote or --fakeroot to build from a Singularity recipe file

Option --remote however yields

FATAL:   Unable to submit build job: no authentication token, log in with `singularity remote login`

and option --fakeroot yields

FATAL:   could not use fakeroot: no mapping entry found in /etc/subuid for hopp_fa

Is this just caused by the configuration on our cluster, or can you add some hints here on how to resolve this problem in general?
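
The `--fakeroot` failure above reports that `/etc/subuid` has no mapping for the user. A small Python sketch for checking this before attempting a build. `has_subuid_entry` is a hypothetical helper; it assumes the usual `name:start:count` line format and only inspects text, since adding an entry generally requires an administrator.

```python
def has_subuid_entry(user: str, subuid_text: str) -> bool:
    """Return True if `user` has a `name:start:count` line in the given subuid text."""
    for line in subuid_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) == 3 and fields[0] == user:
            return True
    return False


# Illustrative content; on a real system, read /etc/subuid instead.
sample = "alice:100000:65536\n"
print(has_subuid_entry("alice", sample))    # True
print(has_subuid_entry("hopp_fa", sample))  # False
```

If the check fails on a cluster, building on a workstation and copying the image over, as reported later in this thread, is one way around it.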

docker/README.md Outdated
A simple `Dockerfile` (in addition to the one above) to be used with SIB could look like
this:

FROM ghcr.io/helmholtz-analytics/heat:1.2.0_torch1.11_cuda11.5_py3.9
Collaborator

Do we need to replace the Dockerfile from above (both have the same name)?

docker/README.md Outdated

The invocation to build the image would be:

$ sib upload ./Dockerfile heat_1.2.0_torch.11_cuda11.5_py3.9
Collaborator

Maybe I am just a bit confused (and could not try out the commands because sib is not available on CARO), but where do I upload and from where do I download here?

docker/README.md Outdated
$ sib build --recipe-name heat_1.2.0_torch.11_cuda11.5_py3.9
$ sib download --recipe-name heat_1.2.0_torch.11_cuda11.5_py3.9

### Apptainer (formerly singularity)
Collaborator

If this is the variant most likely to be used in HPC environments, maybe we should put this paragraph at the beginning and move Docker and SIB to an "expert" section?

@mrfh92
Collaborator

mrfh92 commented Apr 24, 2023

When I run

singularity build heat_1.2.0_torch.11_cuda11.5_py3.9.sif heat-singularity-image.def

I get the following error:

INFO:    User not listed in /etc/subuid, trying root-mapped namespace
INFO:    The %post section will be run under fakeroot
FATAL:   Unable to build from heat-singularity-image.def: while parsing definition: heat-singularity-image.def: failed to parse deffile header: header key cat heat-nvidia.def had no val

Is a file heat-nvidia.def missing?
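
The parser message suggests that a header line of `heat-singularity-image.def` reads `cat heat-nvidia.def`, that is, it lacks the `Key: value` form, possibly because a shell command was pasted into the file. This is an inference from the log, not something the thread confirms. A minimal Python sketch that flags such lines; `bad_header_lines` is a hypothetical helper, and it implements only a simplified reading of the header format (non-empty, non-comment lines before the first `%section` must contain a key, a colon and a value).

```python
from typing import List


def bad_header_lines(def_text: str) -> List[str]:
    """Return header lines (before the first %section) that lack a 'Key: value' form."""
    bad = []
    for raw in def_text.splitlines():
        line = raw.strip()
        if line.startswith("%"):
            break  # the header ends where the first section begins
        if not line or line.startswith("#"):
            continue
        _key, sep, value = line.partition(":")
        if not sep or not value.strip():
            bad.append(line)
    return bad


# Illustrative definition file; the base image and %post body are placeholders,
# not the contents of the file from this PR.
example = """cat heat-nvidia.def
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:22.05-py3

%post
    echo "install steps go here"
"""
print(bad_header_lines(example))  # ['cat heat-nvidia.def']
```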

@github-actions
Contributor

Thank you for the PR!

@mrfh92
Collaborator

mrfh92 commented May 8, 2023

I tried it out. The build was no problem on my workstation, however running resulted in the following problem:

hopp_fa@sc-030122l:~/heat/docker$ singularity run --nv heat.sif /bin/bash
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (492) bind mounts

=============
== PyTorch ==
=============

NVIDIA Release 22.05 (build 37432893)
PyTorch Version 1.12.0a0+8a1a93a

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.7 driver version 515.43.04 with kernel driver version 470.182.03.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Apptainer> python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import heat 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/heat/__init__.py", line 5, in <module>
    from .core import *
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/heat/core/__init__.py", line 5, in <module>
    from .arithmetics import *
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/heat/core/arithmetics.py", line 7, in <module>
    import torch
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/__init__.py", line 751, in <module>
    from .functional import *  # noqa: F403
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/functional.py", line 8, in <module>
    import torch.nn.functional as F
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Identity, Linear, Bilinear, LazyLinear
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 6, in <module>
    from .. import functional as F
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 18, in <module>
    from .._jit_internal import boolean_dispatch, _overload, BroadcastingList1, BroadcastingList2, BroadcastingList3
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/_jit_internal.py", line 24, in <module>
    import torch.distributed.rpc
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/distributed/__init__.py", line 55, in <module>
    from .distributed_c10d import *  # noqa: F403
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 188, in <module>
    reduce_op = _reduce_op()
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 176, in __init__
    for k, v in ReduceOp.__members__.items():
AttributeError: type object 'torch._C._distributed_c10d.ReduceOp' has no attribute '__members__'
>>> import heat as ht
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/heat/__init__.py", line 5, in <module>
    from .core import *
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/heat/core/__init__.py", line 5, in <module>
    from .arithmetics import *
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/heat/core/arithmetics.py", line 7, in <module>
    import torch
  File "/home/hopp_fa/.local/lib/python3.8/site-packages/torch/__init__.py", line 231, in <module>
    __all__ += [name for name in dir(_C)
NameError: name '_C' is not defined
>>> 
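
Every frame in the tracebacks above points into `/home/hopp_fa/.local/lib/python3.8/site-packages`, so `heat` and `torch` seem to be imported from the user's home directory rather than from the image; Singularity and Apptainer typically mount the home directory by default. This is one reading of the output, not a diagnosis confirmed in the thread. A minimal Python sketch for checking where a module would be loaded from; `is_under` and `shadowed_by_user_site` are hypothetical helpers.

```python
import importlib.util
import site
from pathlib import Path
from typing import Optional


def is_under(path: str, root: str) -> bool:
    """Return True if `path` lies inside directory `root`."""
    try:
        Path(path).resolve().relative_to(Path(root).resolve())
        return True
    except ValueError:
        return False


def shadowed_by_user_site(module: str) -> Optional[bool]:
    """True if `module` resolves into the per-user site-packages, None if not found."""
    spec = importlib.util.find_spec(module)
    if spec is None or not spec.origin:
        return None
    return is_under(spec.origin, site.getusersitepackages())


# Run inside the container; find_spec locates the modules without importing them.
for name in ("heat", "torch"):
    print(name, shadowed_by_user_site(name))
```

If this prints `True` inside the container, starting Python with `-s` or setting `PYTHONNOUSERSITE=1` keeps the per-user directory off `sys.path`, so the packages shipped in the image are used instead.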

JuanPedroGHM and others added 2 commits May 8, 2023 15:03
Co-authored-by: Claudia Comito <39374113+ClaudiaComito@users.noreply.github.com>

@mrfh92
Collaborator

mrfh92 commented May 8, 2023

I tested on our cluster as well. I needed to modify the above script a bit:

#!/bin/bash
#SBATCH --time 0:10:00
#SBATCH --nodes 2
#SBATCH --tasks-per-node 2

module load singularity # make singularity available (may be different on each system) 
srun --mpi="pmi2" singularity exec <path_to_your_heat_image>/heat.sif bash -c "cd ~/heat/examples/lasso; python demo.py"

I don't know what --bind /scratch is supposed to do, but it produced errors on my system. Without it, everything is fine...
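
The `--bind` option of `singularity exec` mounts a host path inside the container, so a plausible reason for the errors is that `/scratch` does not exist on that system; the thread does not confirm this. A minimal Python sketch that assembles the same `srun` line as the script above and adds `--bind` only for host directories that are present. `srun_singularity_cmd` is a hypothetical helper and `heat.sif` is a placeholder image path.

```python
import os
import shlex
from typing import List, Sequence


def srun_singularity_cmd(image: str, inner: str, binds: Sequence[str] = (),
                         mpi: str = "pmi2") -> List[str]:
    """Build the srun/singularity argument list, binding only host paths that exist."""
    cmd = ["srun", f"--mpi={mpi}", "singularity", "exec"]
    for path in binds:
        if os.path.isdir(path):  # skip directories missing on this host
            cmd += ["--bind", path]
    cmd += [image, "bash", "-c", inner]
    return cmd


cmd = srun_singularity_cmd(
    "heat.sif",  # placeholder image path
    "cd ~/heat/examples/lasso; python demo.py",
    binds=["/scratch"],
)
print(shlex.join(cmd))
```

On a host without `/scratch` the printed command matches the working script; where the directory exists, the bind option is added.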

@JuanPedroGHM
Member

JuanPedroGHM commented May 30, 2023

> I tested on our cluster as well. I needed to modify the above script a bit:
>
> #!/bin/bash
> #SBATCH --time 0:10:00
> #SBATCH --nodes 2
> #SBATCH --tasks-per-node 2
>
> module load singularity # make singularity available (may be different on each system)
> srun --mpi="pmi2" singularity exec <path_to_your_heat_image>/heat.sif bash -c "cd ~/heat/examples/lasso; python demo.py"
>
> I don't know what --bind /scratch is supposed to do, but it produced errors on my system. Without it, everything is fine...

I removed the flag from the docs; I missed it when creating the template.

@ClaudiaComito ClaudiaComito left a comment

@ClaudiaComito ClaudiaComito merged commit 966a7a8 into main Jun 20, 2023
@ClaudiaComito ClaudiaComito deleted the features/897-containerization branch June 20, 2023 13:02
Development

Successfully merging this pull request may close these issues.

Provide containerized HeAT
5 participants