
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. #25

Open · PCanavelli opened this issue Feb 6, 2025 · 2 comments


@PCanavelli

Hey there,

First things first: massive congrats, and even more massive thanks, for JoyCaption. This is by far the best general-purpose captioner I have ever tried, especially for creating Flux datasets.

In a local Ubuntu environment (WSL 2, python 3.12), the model runs flawlessly. Unfortunately, I have been stuck with a very strange error when trying to run it from within a Docker image.

When calling the generate method, I get:

  File "/opt/src/captioning/oh/captioner.py", line 67, in __call__
    generate_ids = self.model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2228, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3209, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llava/modeling_llava.py", line 491, in forward
    inputs_embeds = self.get_input_embeddings()(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 164, in forward
    return F.embedding(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2267, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet.
If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html
If you're using Caffe2, Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
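
For context, the call that triggers this is nothing exotic; the captioner essentially follows the standard transformers Llava flow, roughly like this (the model path, prompt, and generation parameters below are simplified placeholders, not my actual code):

```python
# Rough sketch of what captioner.py does (simplified; real paths/prompts differ).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_PATH = "fancyfeast/llama-joycaption-alpha-two-hf-llava"  # illustrative

processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="cuda"
)
model.eval()

image = Image.open("example.jpg")
convo = [
    {"role": "system", "content": "You are a helpful image captioner."},
    {"role": "user", "content": "Write a descriptive caption for this image."},
]
prompt = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to("cuda")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.no_grad():
    generate_ids = model.generate(**inputs, max_new_tokens=300)  # <-- fails here in Docker

print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```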

This error is wild. I could only find a couple of results for it, none of which had a helpful solution. It looks like it comes from the C++ back-end of Torch. This screams CUDA / dependency issues, but I haven't been able to find any details on the requirements / install best practices for JoyCaption.

Is this a known issue, and is there a recommended environment (OS + CUDA + dependencies) for running JoyCaption?

For more context: my Dockerfile builds on top of nvidia/cuda:12.4.0-runtime-ubuntu22.04, and my requirements.txt is:

accelerate==1.1.1
bitsandbytes==0.43.3
boto3==1.36.11
diffusers==0.30.2
fastapi==0.115.8
gputil==1.4.0
loguru==0.7.2
nest_asyncio==1.6.0
peft==0.14.0
protobuf==3.20.3
pydantic==2.10.6
pytest==7.4.4
pytest-cov==5.0.0
pytest-lazy-fixture==0.6.3
pytest-loguru==0.3.0
runpod==1.7.0
sentencepiece==0.2.0 
sentry-sdk==2.14.0
torch==2.5.1
torchaudio==2.5.1
torchvision==0.20.1
transformers==4.48.0
uvicorn==0.33.0

I'm omitting the bulk of my code / Dockerfile for brevity, since all they do is basically spin up a FastAPI app and forward requests to the captioner, but let me know if they would actually help.

Also, this happens regardless of the model version (alpha one / two), and doesn't happen when running a standard Llava model in the exact same Docker image.

Cheers,

Pierre.

@fpgaminer (Owner)

Yeah, that's a weird one. The only difference between a standard Llava model and JoyCaption would be the vision module, which uses so400m instead of OpenAI CLIP. So maybe it's some weird implementation detail of so400m that's triggering a bug. In either case, my gut reaction is that it's an issue with a specific dependency version or the Docker container. All too common with PyTorch...
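
If you want to double-check that difference on your end, comparing the vision tower configs is quick (the model ids below are just examples):

```python
# Compare the vision towers: JoyCaption uses a SigLIP (so400m) tower, while
# stock Llava 1.5 uses an OpenAI CLIP tower.
from transformers import AutoConfig

joy = AutoConfig.from_pretrained("fancyfeast/llama-joycaption-alpha-two-hf-llava")
stock = AutoConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")
print(joy.vision_config.model_type)    # siglip_vision_model
print(stock.vision_config.model_type)  # clip_vision_model
```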

Give some of the containers here a try: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags
I've had better luck with those than with the ones hosted on Docker Hub, for whatever reason. But they tend to use bleeding-edge versions, so you might have to pull a few to find one that works.

You could also try swapping in different versions of either the PyTorch packages or the transformers package.

And finally, it could be a mismatch between the CUDA version running in the Docker container and the driver on the host system. That usually isn't an issue, but Windows might be more sensitive.
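
A quick way to check that from inside the running container:

```python
# What CUDA build PyTorch shipped with, and whether it can actually reach the driver.
import torch

print(torch.__version__, torch.version.cuda)  # e.g. 2.5.1, 12.4
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```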

@PCanavelli (Author)

Hey @fpgaminer, thanks for the reply.

Honestly, this is the weirdest Torch bug I've ever hit.

I tried one of the images you linked (nvcr.io/nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04), as well as downgrading to 12.4: still no change.

Any chance you could share a pip freeze of the environment you're using? That would save me quite some time trying out way too many version combinations of torch / transformers / diffusers / etc.

I'll gladly post a working Dockerfile + requirements once I get something running, as I suspect I won't be the only one hitting this issue when trying to deploy JoyCaption in a container.
