[BUG] Cannot Install NVTabular along PyTorch 1.9 #1175

Closed
init27 opened this issue Oct 10, 2021 · 11 comments
Labels: bug, PyTorch

Comments

@init27

init27 commented Oct 10, 2021

Hi Team,
I have been trying to run through the examples and wanted to set up my local environment for them:

Steps/Code to reproduce bug

  • I install NVTabular using conda install -c nvidia -c rapidsai -c numba -c conda-forge nvtabular python=3.7 cudatoolkit=11.0
  • After that, PyTorch 1.9 won't install. I follow the instructions from the PyTorch website: conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

Based on my previous encounters, I know PyTorch 1.9 doesn't like Python <=3.8 due to a torchvision dependency that is still being looked into, and it appears NVTabular won't install with Python 3.9?

I tried installing PyTorch 1.7; however, it seems that version 1.9 is required to make the examples work?

Apologies if I'm missing something straightforward here. I've tried different iterations of making the install work, but I'm still stuck at getting things up and running.

Could you please point me to the right steps for getting everything up and running?

Thanks in advance for your help!

Aha! Link: https://nvaiinfra.aha.io/features/MERLIN-504

@init27 init27 added the bug Something isn't working label Oct 10, 2021
@jperez999
Contributor

Hello @init27,

Based on the details you provided above, I think you may want to try using pip to install torch instead of conda. Unfortunately, conda will require you to download two separate cuda toolkits, one for nvtabular (11.0) and one for pytorch (10.2). That combination is already unstable and may lead to strange behavior. If you want to use conda to install nvtabular, try installing pytorch with pip from inside your activated conda environment. You may also need to update your path to include the pip package install location.
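Before installing torch with pip inside a conda environment, it can help to confirm that the pip on your path belongs to the activated environment. The following is a minimal sketch using only the Python standard library; the helper name pip_matches_env is hypothetical and is not part of NVTabular, PyTorch, or conda.

```python
import os
import shutil
import sys


def pip_matches_env(pip_path, env_prefix):
    """Return True if the pip executable lives inside the given environment prefix."""
    if pip_path is None:
        return False
    pip_real = os.path.realpath(pip_path)
    prefix_real = os.path.realpath(env_prefix)
    try:
        # commonpath raises ValueError for paths on different drives (Windows).
        return os.path.commonpath([pip_real, prefix_real]) == prefix_real
    except ValueError:
        return False


if __name__ == "__main__":
    pip_path = shutil.which("pip")
    print("interpreter:", sys.executable)
    print("env prefix :", sys.prefix)
    print("pip on PATH:", pip_path)
    print("pip belongs to this env:", pip_matches_env(pip_path, sys.prefix))
```

If the last line reports False, running pip as python -m pip from the activated environment is one way to make sure packages land in that environment.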

Alternatively you could use our prepackaged container hosted publicly on NGC. You would probably be interested in our Merlin Pytorch container, which can be found here: https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-pytorch-training. This has everything you need for all the pytorch examples (excluding inference) to execute successfully.

@init27
Author

init27 commented Oct 12, 2021

Hello @jperez999!

Thank you for the reply!

If it's helpful, I also tried installing pytorch for cuda 11.1 with conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia, but that install failed as well.

I will try the pip approach. Thank you for suggesting it and for the NGC links. I was skipping the NGC approach since I'm absolutely new to docker, but in case pip is too difficult, I will try the NGC approach!

I had another question: while trying to run the first few example notebooks, I had to install a few packages and ran into an error that required me to downgrade the sklearn version. If it's helpful, may I raise a PR with a requirements.txt file that works with the examples?

I think there's some value in it for someone who wants to run through the example notebooks and get the hang of the API. Please let me know.

TIA! :)

@viswa-nvidia viswa-nvidia added P0 bug Something isn't working and removed bug Something isn't working P0 labels Oct 18, 2021
@pintonos

pintonos commented Oct 22, 2021

Also having problems installing NVTabular with PyTorch.
I already tried all possible combinations of conda and pip install, but it's not possible to use NVTabular as a dataloader via PyTorch:

Traceback (most recent call last):
  File "main.py", line 89, in <module>
    trainer.train()
  File "/home/.../miniconda3/envs/nvtabular10.2/lib/python3.8/site-packages/transformers/trainer.py", line 1093, in train
    train_dataloader = self.get_train_dataloader()
  File "/home/.../miniconda3/envs/nvtabular11.0/lib/python3.8/site-packages/transformers4rec/torch/trainer.py", line 128, in get_train_dataloader
    return T4RecDataLoader.parse(self.args.data_loader_engine).from_schema(
  File "/home/.../miniconda3/envs/nvtabular11.0/lib/python3.8/site-packages/transformers4rec/torch/utils/data_utils.py", line 53, in parse
    return dataloader_registry.parse(class_or_str)
  File "/home/.../miniconda3/envs/nvtabular11.0/lib/python3.8/site-packages/merlin_standard_lib/registry.py", line 265, in parse
    return self[class_or_str]
  File "/home/.../miniconda3/envs/nvtabular11.0/lib/python3.8/site-packages/merlin_standard_lib/registry.py", line 232, in __getitem__
    raise KeyError(
KeyError: 'nvtabular never registered with registry torch.dataloader_loader. Available:\n '

Note: I use NVTabular via the transformers4rec lib.
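The empty "Available:" list in the KeyError suggests that no data loader was registered at all. One plausible cause (an assumption, not confirmed anywhere in this thread) is that an optional import failed at registration time, so the later lookup fails with a KeyError rather than the original import error. The sketch below illustrates that pattern with a toy registry; Registry and register_optional_loader are hypothetical names and this is not the merlin_standard_lib or transformers4rec implementation.

```python
import importlib


class Registry:
    """Toy name-to-object registry (hypothetical; not merlin_standard_lib's code)."""

    def __init__(self, name):
        self.name = name
        self._items = {}

    def register(self, key, value):
        self._items[key] = value

    def __getitem__(self, key):
        if key not in self._items:
            available = "\n".join(sorted(self._items))
            raise KeyError(
                f"{key} never registered with registry {self.name}. "
                f"Available:\n {available}"
            )
        return self._items[key]


def register_optional_loader(registry, key, module_name):
    """Register a loader only if its backing module can be imported.

    Returns the ImportError if the import failed, otherwise None.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # The import error is swallowed here, so a later lookup fails
        # with a KeyError instead of pointing at the real cause.
        return exc
    registry.register(key, module)
    return None


if __name__ == "__main__":
    loaders = Registry("torch.dataloader_loader")
    error = register_optional_loader(loaders, "nvtabular", "nvtabular")
    print("import error (None means the import worked):", error)
```

If that is what is happening, importing nvtabular and torch directly in the same environment may surface the underlying error.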

@benfred
Member

benfred commented Oct 22, 2021

We don't have a conda package for python 3.9 - mainly because one of our major dependencies (cudf) only provides conda packages for python 3.7 and 3.8.

We do have a docker container with pytorch, nvtabular, transformers4rec, and all their dependencies here: docker pull nvcr.io/nvidia/merlin/merlin-pytorch-training:21.09 (see https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-pytorch-training for more info). If you're stuck installing nvtabular yourself, can you try the docker container?

@pintonos : can you post what you tried? How are you installing nvtabular?

@pintonos

@benfred
In this order (including activating the environment):
conda create -n nvtabular102 -c nvidia -c rapidsai -c numba -c conda-forge nvtabular python=3.8 cudatoolkit=10.2

pip3 install torch==1.10.0+cu102 torchvision==0.11.1+cu102 torchaudio===0.10.0+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html

pip install transformers4rec[all]

@benfred
Member

benfred commented Nov 3, 2021

@pintonos - cudf doesn't support cuda 10.2 and requires cuda 11.0+ (see https://github.com/rapidsai/cudf#cudagpu-requirements). nvtabular itself is pretty easy to install (you can install it without cudf by running pip install nvtabular, but this means nvtabular runs only on the CPU, without gpu support). The problem is getting cudf installed.

Can you try the docker container to see if that works for you (docker pull nvcr.io/nvidia/merlin/merlin-pytorch-training:21.09)? This container includes cuda 11.4 and user mode drivers, so it should work from your host machine even if that only has cuda 10.2 installed.
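The cuda 11.0+ requirement mentioned above can be checked against a toolkit version string before attempting a cudf install. This is a minimal sketch; the helper names parse_version and meets_minimum_cuda are hypothetical, and the minimum of 11.0 is taken from the comment above rather than from an authoritative compatibility matrix.

```python
def parse_version(text):
    """Turn a version string such as '11.0' into a tuple of ints, e.g. (11, 0)."""
    parts = [int(part) for part in text.split(".")]
    # Pad so that '11' compares equal to '11.0'.
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def meets_minimum_cuda(cuda_version, minimum="11.0"):
    """Return True if cuda_version is at least the minimum version."""
    return parse_version(cuda_version) >= parse_version(minimum)


if __name__ == "__main__":
    for version in ("10.2", "11.0", "11.4"):
        print(version, meets_minimum_cuda(version))
```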

@benfred benfred moved this to In Progress in 21.12 Release Nov 4, 2021
@benfred
Member

benfred commented Nov 15, 2021

Closing - let me know if you are still having any issues getting this up and running

@benfred benfred closed this as completed Nov 15, 2021
Repository owner moved this from In Progress to Done in 21.12 Release Nov 15, 2021
@NegatioN

NegatioN commented Nov 24, 2021

Side-point: I'm curious why there's a cuda 10.2 cudf provided by the rapidsai channel when docs explicitly say it doesn't work? cuda_10.2_py37_gab3b3f653a_0 rapidsai/linux-64. This seems to crash doing a random task I'm trying to run.

Main-point: Doing a simple/naive install of NVTabular+Transformers4rec+Pytorch is extremely frustrating as it stands now, because it resolves to cpu packages for Pytorch by default:
conda install -c pytorch -c fastai -c nvidia -c conda-forge -c anaconda -c rapidsai nvtabular pytorch transformers4rec cudf python=3.7 cudatoolkit=11.*

It's not like Pytorch doesn't support cuda 11, so isn't this simply a misspecification in some conda files somewhere? I would think enough of your target audience uses Pytorch that it would be a priority to have a working conda setup?

Sorry if my wording seems harsh, but I've just spent a few hours trying to work around this. I cannot use the preinstalled container you've previously linked @benfred, as I am already working in a containerized environment.
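One quick way to tell whether the solver resolved to a CPU-only PyTorch package is to inspect the installed build from Python. This is a minimal sketch; the helper name torch_build_info is hypothetical, while torch.version.cuda and torch.cuda.is_available() are existing PyTorch attributes.

```python
import importlib.util


def torch_build_info():
    """Report whether torch is installed and which CUDA version it was built with."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "cuda_build": None, "gpu_visible": False}

    import torch

    return {
        "installed": True,
        # torch.version.cuda is None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "gpu_visible": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(torch_build_info())
```

A cuda_build of None with installed set to True would indicate a CPU-only package.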

@rnyak
Contributor

rnyak commented Nov 24, 2021

@NegatioN sorry to hear that you are having issues. What's the reason you cannot use the merlin docker containers? Can you try these steps:

docker pull nvcr.io/nvidia/merlin/merlin-pytorch-training:21.11   # current latest container
docker run -it --gpus all -p 9999:8888 -p 9797:9998 -p 9796:8777 --ipc=host nvcr.io/nvidia/merlin/merlin-pytorch-training:21.11 
cd /nvtabular
pip install torch==1.10.0
jupyter-lab --allow-root --ip='0.0.0.0' --NotebookApp.token=''

Then open a browser on the host OS and access the jupyter-lab server at http://<MachineIP>:9999.

Let us know what issue you run into when launching the container.
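The browser address uses port 9999 because each -p value in the docker run command above maps a host port to a container port (HOST:CONTAINER). The sketch below looks up the host port for a given container port; it assumes jupyter-lab listens on its default port 8888 inside the container, handles only the simple HOST:CONTAINER form, and uses the hypothetical helper names parse_port_mapping and host_port_for.

```python
def parse_port_mapping(mapping):
    """Split a Docker '-p HOST:CONTAINER' value into two integer ports."""
    host, container = mapping.split(":")
    return int(host), int(container)


def host_port_for(container_port, mappings):
    """Return the host port published for container_port, or None if not mapped."""
    for mapping in mappings:
        host, container = parse_port_mapping(mapping)
        if container == container_port:
            return host
    return None


if __name__ == "__main__":
    # Mappings copied from the docker run command in this comment.
    mappings = ["9999:8888", "9797:9998", "9796:8777"]
    # Assumption: jupyter-lab uses its default port 8888 in the container.
    print("open http://<MachineIP>:%s" % host_port_for(8888, mappings))
```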

@NegatioN

Hi @rnyak. The reason I can't use the container you're linking is that I'm already running from inside a container. I don't want to run the merlin container inside my already running container, which itself runs on top of Kubernetes. I know it's technically possible, but that would force a lot of complexity on me. So I'm not struggling with getting the container itself to work; it probably works fine. I just don't want to use it.

I'm still curious what the issue with installing Pytorch + Cudf (+ Nvtabular + transformers4rec) all from Conda is, though. If it works when we do a lot of contrived steps to install it, there shouldn't be any incompatibilities, other than package requirements not being specified properly?
If the installation process is so brittle that a Docker image is recommended, surely that means the installation process should be fixed? That way there would surely be more users of these libraries as well. Even if I get this working now in my environment, I'm worried about using it in production if the installation process isn't stable.

@NegatioN

NegatioN commented Nov 29, 2021

Bump and tag @rnyak: I guess the major issue is that [conda] Cudf is compatible with Cuda 11.0 & 11.2, while [conda] Pytorch[1.10] is compatible with cuda 11.3. Is there no option for Nvidia to build Cudf against all versions of cuda from 11.0 to 11.4 and release this in the conda repo?

That wouldn't help with potential incompatibilities in the NVTabular & Transformers4Rec repos that might arise with each bump of Pytorch, but shouldn't it make the build process a lot smoother for pleb end-users like me? And it seems like a nice feature to have, unless it's utterly impossible.

Edit: Seems like there's already a thread for this going on in Cudf rapidsai/cudf#8510
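The conflict described in this comment can be pictured as an empty intersection of supported cuda versions. The sketch below is illustrative only: the version sets are the ones stated in the comment above, not an authoritative compatibility matrix, and common_cuda_versions is a hypothetical helper, not how conda's solver actually works.

```python
def common_cuda_versions(requirements):
    """Return the CUDA versions supported by every package in requirements."""
    supported_sets = [set(versions) for versions in requirements.values()]
    if not supported_sets:
        return set()
    return set.intersection(*supported_sets)


if __name__ == "__main__":
    # Versions as stated in the comment above (assumed, not verified here).
    conda_builds = {
        "cudf": {"11.0", "11.2"},
        "pytorch 1.10": {"11.3"},
    }
    shared = common_cuda_versions(conda_builds)
    if shared:
        print("shared cudatoolkit versions:", sorted(shared))
    else:
        print("no cudatoolkit version satisfies every package")
```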
