Make torch available #520

bryanpaget · 2023-09-06T19:36:30Z

Torch has never worked, aside from being importable on the GPU image inside the torch conda env, and the CUDA drivers have never worked either. This PR fixes both.

The PyTorch install command pulls in the CUDA runtime libraries it needs (via pytorch-cuda), so I'll try relying on those instead of our custom install script.

gputil has nvidia-smi

bryanpaget · 2023-09-11T16:41:56Z

Before (on Prod):

Source: https://gist.github.com/bryanpaget/c88ec337f6c5d35187f979c6b7f27e59#file-cuda-not-available-ipynb

After (on Dev):

Source: https://gist.github.com/bryanpaget/c88ec337f6c5d35187f979c6b7f27e59#file-cuda-available-ipynb
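
For anyone who wants to reproduce the before/after comparison without opening the gists, a check along these lines should be enough. This is only a minimal sketch and not the notebook from the gists: `torch_cuda_report` is a hypothetical helper name, and it assumes nothing beyond `torch.cuda.is_available()`, `torch.cuda.get_device_name()` and `torch.version.cuda`.

```python
def torch_cuda_report():
    """Return a small dict describing whether torch can see a CUDA device.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    report = {
        "torch_installed": False,
        "cuda_build": None,       # CUDA version torch was built against, if any
        "cuda_available": False,  # True only if a usable GPU is visible
        "device_name": None,
    }
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing or its shared libraries failed to load
        return report

    report["torch_installed"] = True
    report["cuda_build"] = torch.version.cuda
    report["cuda_available"] = bool(torch.cuda.is_available())
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    print(torch_cuda_report())
```

On an image where the install is broken I would expect `cuda_available` to be `False`, and on a working GPU notebook it should be `True` with a device name filled in.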

bryanpaget · 2023-09-11T17:57:48Z

Makefile

I removed the CUDA drivers from the PyTorch and Tensorflow Dockerbits because those are handled by pytorch-cuda=11.8 in pytorch.

Also the rstudio-server docker-bit has been added due to rstudio-server being broken out into its own docker-bit.

bryanpaget · 2023-09-11T18:01:46Z

docker-bits/2_pytorch.Dockerfile

These changes update the torch virtual environment and install the required packages, including pytorch-cuda=11.8, which handles the CUDA drivers.

I also tweaked the pytorch image to use mamba for the clean command

bryanpaget · 2023-09-11T18:19:04Z

Since I removed the manual CUDA installation in favor of the Anaconda packaging, I also had to fix up the Tensorflow image to install cuda in the same way.

# Install Tensorflow
RUN mamba install -n tensorflow --quiet --yes -c anaconda -c conda-forge -c nvidia \
        tensorflow \
        tensorflow-gpu \
        cudatoolkit=11.8 \
        cudnn \
        # gputil has nvidia-smi
        gputil \
        ipykernel \
    && \
        mamba clean --all -f -y && \
        fix-permissions $CONDA_DIR && \
        fix-permissions /home/$NB_USER && \
        source activate tensorflow && \
        python -m ipykernel install --user --name tensorflow --display-name "TensorFlow"

bryanpaget · 2023-09-11T18:20:36Z

If the tensorflow stuff works out with Anaconda, then we can probably do away with the tensorflow image and pytorch image dichotomy and just have separate conda environments on the same jupyterlab-gpu image.

I also tweaked the pytorch image to use mamba for the clean command

tensorflow works for both gpu and cpu

The tensorflow tests are failing. I think they expect tensorflow to be installed in the base env, which is what I prefer, so I'll move pytorch into the base env as well.

bryanpaget · 2023-09-12T15:17:54Z

> If the tensorflow stuff works out with Anaconda, then we can probably do away with the tensorflow image and pytorch image dichotomy and just have separate conda environments on the same jupyterlab-gpu image.

This doesn't seem to work. I can get PyTorch to work with Anaconda's GPU-related packages, but Tensorflow doesn't seem to like the Anaconda GPU-related packages.

I removed the GPU test because GitHub Actions runners won't have a GPU.
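
An alternative to dropping a GPU test outright is to skip it when no GPU is visible. The sketch below is not what this PR does; it is one possible approach, assuming `nvidia-smi` is on the PATH whenever a GPU is usable. `gpu_visible` and `GpuSmokeTest` are hypothetical names.

```python
import shutil
import subprocess
import unittest


def gpu_visible():
    """Best-effort check: True only if nvidia-smi exists and lists a GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: ..."
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


class GpuSmokeTest(unittest.TestCase):
    # Skipped (not failed) on machines without a GPU, such as hosted CI runners.
    @unittest.skipUnless(gpu_visible(), "no NVIDIA GPU visible")
    def test_gpu_listed(self):
        self.assertTrue(gpu_visible())


if __name__ == "__main__":
    unittest.main()
```

With this shape the test still runs on Dev or Prod notebooks that have a GPU, while CI reports it as skipped.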

jupyter-dash caused a build failure, so I tried removing the version pinning. It wasn't breaking before, so I'm not sure what changed.

I also removed the pin on tidymodels, since it breaks on the newer versions of R. We don't have a consistent version of R across images, so unpinning tidymodels lets the system resolve a compatible version.

Jose-Matsuda

I see you got caught between a rock and a hard place here, having to update to 2204 in this branch so that this PR wouldn't become obsolete when the other branch got merged in.

Though if I could nitpick, I'd say the remote desktop / rstudio changes are a bit much to have in this PR, just because they're not directly related to its purpose. Depending on what gets merged first, they might be "lost" anyway since they would be the same as on the main branch.

bryanpaget · 2023-09-18T15:55:56Z

> I see you got caught between a rock and a hard place here, having to update to 2204 in this branch so that this PR wouldn't become obsolete when the other branch got merged in.
>
> Though if I could nitpick, I'd say the remote desktop / rstudio changes are a bit much to have in this PR, just because they're not directly related to its purpose. Depending on what gets merged first, they might be "lost" anyway since they would be the same as on the main branch.

Good point. I will modify my PR so that it is more focused.

tests/jupyterlab-tensorflow/test_tensorflow.py

We are not yet using a Tensorflow conda env.

bryanpaget · 2023-09-19T20:05:40Z

Testing

PyTorch

PyTorch seems to work!

Tensorflow

I'm testing Tensorflow just to make sure I didn't create any issues for that image since PyTorch and Tensorflow were using the same NVIDIA drivers before the update.
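
A quick way to sanity-check the Tensorflow image is to ask TensorFlow which GPUs it can see. This is a hedged sketch rather than the exact test I ran: `tensorflow_gpu_names` is a hypothetical helper, and it assumes a TensorFlow 2.x install where `tf.config.list_physical_devices` is available.

```python
def tensorflow_gpu_names():
    """Return the GPU device names TensorFlow can see.

    Returns None if TensorFlow cannot be imported, and an empty list if it
    imports fine but no GPU is visible. Hypothetical helper for illustration.
    """
    try:
        import tensorflow as tf
    except (ImportError, OSError):
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = tensorflow_gpu_names()
    if names is None:
        print("tensorflow is not importable in this environment")
    elif not names:
        print("tensorflow imported, but no GPU is visible")
    else:
        print("GPUs visible to tensorflow:", names)
```

An empty list on a GPU notebook would suggest the CUDA libraries are not being picked up, which is the kind of regression I want to rule out here.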


update(pytorch): remove virtual env

aba74b7

bryanpaget added the auto-deploy Trigger manual CI steps for this PR label Sep 6, 2023

Bryan Paget added 7 commits September 6, 2023 20:29

update(pytorch): remove virtual env

615dafb

update(cpu, pytorch): mamba install pytorch to base

d492f92

update(pytorch): adjust torch installation

211896c

update(pytorch): add ipykernel and conda env

bc43ff8

update(pytorch): remove CUDA

3230cfe

The PyTorch install command includes CUDA drivers, so I'll try installing those instead of our custom install script.

update(pytorch): add ipykernel

c24dcb8

update(pytorch): add gputil

fd3456e

gputil has nvidia-smi


bryanpaget marked this pull request as ready for review September 11, 2023 18:02

update(tensorflow): add cuda to mamba command

1debe15

I also tweaked the pytorch image to use mamba for the clean command

Bryan Paget added 10 commits September 11, 2023 19:34

update(tensorflow): add cuda to mamba command

1126a63

I also tweaked the pytorch image to use mamba for the clean command

update(tensorflow): remove tensorflow-gpu

ad4869f

tensorflow works for both gpu and cpu

update(gpu-notebooks): remove conda env

4680033

tensorflow tests are failing, I think they expect tensorflow to be installed in the base env, which is what I prefer, so I'll move pytorch into the base env as well.

update(cpu, pytorch, tensorflow): consistency

34fc30f

update(test_tensorflow): use tensorflow env

4348cc3

update(test_packages): add gputil to exclude list

a69f041

update(test_packages): add cudnn, cudatoolkit to exclude list

92990db

update(pytorch, tensorflow): ipykernel install

39688b0

revert(cpu): fix cpu conda env

c3e888d

update(tests): gpu available

271649d

Bryan Paget added 3 commits September 12, 2023 15:27

update(makefile): restore tensorflow build

08b5969

update(tests): remove GPU test

f369386

Github Actions won't have a GPU

update(jupyterlab): jupyter-dash caused build fail

fbc9b7f

so I tried removing the version pinning, it wasn't breaking before so I'm not sure what changed.

Bryan Paget added 3 commits September 13, 2023 13:59

update(PR): based on comments

7ecd7c8

update(rstudio): remove pin on tidymodels

229e6a1

since it breaks on the newer versions of R but we don't have a consistent version of R across images so it might be helpful to unpin tidymodels so we can let the system resolve a compatible version.

update(get-nvidia-stuff): 1804 to 2204

7d805db

bryanpaget linked an issue Sep 13, 2023 that may be closed by this pull request

Unable to import torch on pytorch image. #514

Closed


Jose-Matsuda approved these changes Sep 14, 2023


bryanpaget changed the base branch from master to update-base-image-to-22.04 September 18, 2023 15:09

Merge branch 'update-base-image-to-22.04' into make-torch-available

2032b25

bryanpaget added 3 commits September 18, 2023 17:51

revert(2_tensorflow): prev working configuration

899728d

update(0_Rocker): remove whitespace delta

d184caf

update(2_tensorflow): new line

cc1909d


Update test_tensorflow.py: revert test

ab4100e

We are not yet using a Tensorflow conda env.

Merge branch 'update-base-image-to-22.04' into make-torch-available

3740283

bryanpaget merged commit 777e07e into update-base-image-to-22.04 Sep 21, 2023

bryanpaget deleted the make-torch-available branch September 26, 2023 13:21
