
fix assisted decoding #31401

Merged · 12 commits merged into huggingface:main on Jul 3, 2024

Conversation

jiqing-feng (Contributor)

Hi @gante. This PR fixes assisted decoding when the model and the assistant model are on different devices.

The issue can be reproduced with:

model = model.to("cuda")
model.generate(**inputs, assistant_model=assistant_model.to("cpu"))

jiqing-feng (Contributor, Author)

The failing CI jobs do not seem to be related to my changes.

gante (Member) left a comment

Hi @jiqing-feng! Thank you for opening this PR 🤗

To the best of my knowledge, the changes you're suggesting should not be needed. As such, I've asked a few questions below to understand why we need these changes :)

Review threads (outdated, resolved): src/transformers/generation/logits_process.py, src/transformers/generation/utils.py
jiqing-feng (Contributor, Author) commented Jun 17, 2024

Hi @gante. Sorry for not making it clear earlier. Could you run this script:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


model_id = "meta-llama/Llama-2-7b-chat-hf"
assistant_model_id = "Felladrin/Llama-68M-Chat-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")  # main model on GPU
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_model_id, torch_dtype=torch.bfloat16).to("cpu")  # assistant model on CPU

prompt = "Assisted decoding is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=8, min_new_tokens=8, do_sample=False)

Running it fails with: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!

Full traceback:

Traceback (most recent call last):
  File "/workspace/jiqing/hete_specdecode/test_assisted.py", line 16, in <module>
    model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=8, min_new_tokens=8, do_sample=False)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/jiqing/transformers/src/transformers/generation/utils.py", line 1853, in generate
    result = self._assisted_decoding(
  File "/workspace/jiqing/transformers/src/transformers/generation/utils.py", line 3698, in _assisted_decoding
    candidate_input_ids, candidate_logits = candidate_generator.get_candidates(input_ids)
  File "/workspace/jiqing/transformers/src/transformers/generation/candidate_generator.py", line 229, in get_candidates
    assistant_output = self.assistant_model.generate(**assistant_generation_kwargs, **self.assistant_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/jiqing/transformers/src/transformers/generation/utils.py", line 1896, in generate
    result = self._sample(
  File "/workspace/jiqing/transformers/src/transformers/generation/utils.py", line 2648, in _sample
    next_token_scores = logits_processor(input_ids, next_token_logits)
  File "/workspace/jiqing/transformers/src/transformers/generation/logits_process.py", line 98, in __call__
    scores = processor(input_ids, scores)
  File "/workspace/jiqing/transformers/src/transformers/generation/logits_process.py", line 157, in __call__
    eos_token_mask = torch.isin(vocab_tensor, self.eos_token_id)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument test_elements in method wrapper_CUDA_isin_Tensor_Tensor)
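
(Editor's illustration, not part of the original comment.) The failing call can be reproduced in isolation. The sketch below assumes the assistant's scores live on the CPU while the eos_token_id tensor was inherited from the main model on cuda:0; the values 32000 and 2 are Llama-2's vocabulary size and EOS id.

import torch

# vocab_tensor follows the assistant's scores, which are on the CPU
vocab_tensor = torch.arange(32000, device="cpu")
# eos_token_id tensor inherited from the main model, which sits on cuda:0
eos_token_id = torch.tensor([2], device="cuda:0")

# Raises a RuntimeError about tensors being on two devices (cpu and cuda:0),
# matching the traceback above.
eos_token_mask = torch.isin(vocab_tensor, eos_token_id)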

jiqing-feng (Contributor, Author)

Hi @gante. I just found where the real issue happens; please take a look. Thanks!

jiqing-feng (Contributor, Author)

I would like to add a test for this. Do you know where I should add it? Thanks!

gante (Member) left a comment

This makes sense, thank you for digging deeper and iterating @jiqing-feng ! 💛

Regarding tests: it's a bit tricky to test two devices on our CI AFAIK 🤔 @amyeroberts do you have suggestions on how to test it? [TL;DR @jiqing-feng found that assisted generation fails if the two models are on different devices, because the special tokens are copied from the main model to the assistant model]

gante requested a review from amyeroberts on June 17, 2024 at 17:07
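
(Editorial sketch, not the actual diff in this PR.) One minimal way to realize the fix described above, assuming a hypothetical helper that moves any tensor-valued special tokens copied from the main model onto the assistant model's device before the assistant generates:

import torch

def move_special_tokens_to_assistant_device(assistant_kwargs, assistant_device):
    # Hypothetical helper: tensor-valued entries copied from the main model
    # (e.g. eos_token_id as a tensor on cuda:0) are moved to the device the
    # assistant model actually runs on.
    return {
        key: value.to(assistant_device) if isinstance(value, torch.Tensor) else value
        for key, value in assistant_kwargs.items()
    }

# Usage sketch: before calling assistant_model.generate(**assistant_kwargs),
# align tensors such as eos_token_id with the assistant's device:
# assistant_kwargs = move_special_tokens_to_assistant_device(assistant_kwargs, assistant_model.device)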
jiqing-feng (Contributor, Author)

> This makes sense, thank you for digging deeper and iterating @jiqing-feng ! 💛
>
> Regarding tests: it's a bit tricky to test two devices on our CI AFAIK 🤔 @amyeroberts do you have suggestions on how to test it? [TL;DR @jiqing-feng found that assisted generation fails if the two models are on different devices, because the special tokens are copied from the main model to the assistant model]

I think we can just run the test on a machine with a GPU; there is almost no constraint on the CPU side, since we can run a very tiny model on the CPU just to check functionality.

amyeroberts (Collaborator)

> Regarding tests: it's a bit tricky to test two devices on our CI AFAIK 🤔 @amyeroberts do you have suggestions on how to test it? [TL;DR @jiqing-feng found that assisted generation fails if the two models are on different devices, because the special tokens are copied from the main model to the assistant model]

@gante There are certain tests in our suite which require multiple devices, e.g. test_model_parallelization; we mark these with the require_torch_multi_accelerator and require_torch_multi_gpu decorators.

In this case, I'd suggest having two tests: one for the single-accelerator case, and another which only runs in the multi-device case.

gante (Member) commented Jun 18, 2024

Derp, of course a GPU is enough (it always comes paired with a CPU), what a brain fart on my end :D

@jiqing-feng could you add two tests like the script in this comment of yours to this file (see the sketch after this list)? More precisely:

  1. Inside GenerationIntegrationTests;
  2. Using the @slow decorator;
  3. One test with the @require_torch_multi_gpu decorator and each model on a different GPU, and another with @require_torch_gpu and the assistant on the CPU;
  4. Let's use one of our tiny test models, like hf-internal-testing/tiny-random-MistralForCausalLM (as both the main model and the assistant).
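
(Editorial sketch, not the tests that were actually merged.) Under the constraints above, and reusing the test names that appear later in this thread, such tests could look roughly like this; they are written as module-level functions for brevity, whereas the real ones would be methods of GenerationIntegrationTests:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.testing_utils import require_torch_gpu, require_torch_multi_gpu, slow

MODEL_ID = "hf-internal-testing/tiny-random-MistralForCausalLM"


@slow
@require_torch_multi_gpu
def test_assisted_decoding_in_different_gpu():
    # Main model and assistant on two different GPUs.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to("cuda:0")
    assistant = AutoModelForCausalLM.from_pretrained(MODEL_ID).to("cuda:1")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    inputs = tokenizer("test", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, assistant_model=assistant, max_new_tokens=8)
    assert out.shape[-1] > inputs["input_ids"].shape[-1]


@slow
@require_torch_gpu
def test_assisted_decoding_in_gpu_cpu():
    # Main model on the GPU, assistant on the CPU.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to("cuda")
    assistant = AutoModelForCausalLM.from_pretrained(MODEL_ID).to("cpu")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    inputs = tokenizer("test", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, assistant_model=assistant, max_new_tokens=8)
    assert out.shape[-1] > inputs["input_ids"].shape[-1]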

jiqing-feng (Contributor, Author) commented Jun 19, 2024

Hi @gante. I have added the tests, could you please take a look? Thanks!

BTW, the failing CI jobs do not seem to be related to my changes.

jiqing-feng (Contributor, Author)

Hi @amyeroberts. Could you please take a look? The failing CI jobs are not related to my changes :)

amyeroberts (Collaborator)

@jiqing-feng Regarding the failing tests, could you rebase on main to include upstream changes? This should resolve the failures on CI.

Could you also run and share the output of executing the following in a multi-GPU environment:

RUN_SLOW=1 pytest -k "test_assisted_decoding_in_different_gpu or test_assisted_decoding_in_gpu_cpu"

gante (Member) commented Jun 22, 2024

@jiqing-feng rebasing the PR should get CI green 🤗

jiqing-feng (Contributor, Author)

Hi @amyeroberts. I ran the two tests individually and they passed:
[screenshot: both tests passing]

I also ran your command and got the following output:
[screenshot: pytest output with failures]

These failed tests are due to an import error:
[screenshot: import error traceback]

jiqing-feng (Contributor, Author)

Hi @amyeroberts. Do you need anything else from me before merging? Please let me know, thanks!

jiqing-feng (Contributor, Author)

Hi @amyeroberts @gante. I think this PR should be ready to merge :)

amyeroberts (Collaborator)

@jiqing-feng OK, sorry, I think I messed up with the pytest command. Could you try these instead:

RUN_SLOW=1 pytest tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_decoding_in_different_gpu
RUN_SLOW=1 pytest tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_decoding_in_gpu_cpu 

jiqing-feng (Contributor, Author)

> @jiqing-feng OK, sorry, I think I messed up with the pytest command. Could you try these instead:
>
> RUN_SLOW=1 pytest tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_decoding_in_different_gpu
> RUN_SLOW=1 pytest tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_decoding_in_gpu_cpu

All passed:
[screenshot: both tests passing]

jiqing-feng (Contributor, Author)

Hi @amyeroberts. The failing CI jobs are not related to my changes; would you please review the PR?

jiqing-feng (Contributor, Author)

Hi @amyeroberts @gante, would you please help merge this PR? Thanks!

amyeroberts (Collaborator) left a comment

Thanks for fixing!

amyeroberts (Collaborator)

Hi @jiqing-feng, we had to wait for some things to be resolved upstream and for a new CI run (which I triggered last night).

amyeroberts merged commit 7f91f16 into huggingface:main on Jul 3, 2024
20 checks passed