Enable compilation #6
When direct assignment is used on inputs, the compiled model can apparently give wrong results, as explained in this issue: pytorch/xla#6796. The workaround seems to be to use the `index_put_` method instead.
This improves inference time a lot.
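For context, here is a minimal sketch of the pattern being changed. The tensor names, shapes, and index are illustrative assumptions, not the PR's actual code:

```python
import torch

batch_size, seq_len = 4, 16  # illustrative sizes
attention_mask = torch.zeros((batch_size, seq_len), dtype=torch.int64)
slot_attention_mask = torch.ones((seq_len,), dtype=torch.int64)
i = 2  # hypothetical index of the batch row being updated

# Direct assignment -- reported to give wrong results once the model is
# compiled with torch_xla (see pytorch/xla#6796):
# attention_mask[i, :] = slot_attention_mask

# Workaround: write the row through index_put_ with an explicit index tensor.
attention_mask.index_put_([torch.tensor([i])], slot_attention_mask)
```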
LGTM! Niiiiiicezzz
```diff
@@ -493,8 +492,8 @@ def decode(self, batches: List[CachedBatch]) -> Tuple[List[Generation], CachedBatch]:
                 dtype=torch.int64,
                 device=self.model.device,
             )
-            attention_mask[i, :] = slot.attention_mask
-            position_ids[i, 0] = slot.cur_position
+            attention_mask.index_put_([torch.tensor([i])], slot.attention_mask)
```
Can't we just put `1` here? I.e., the new token in the `attention_mask` that we want to attend to.
you are right
arf, nope, it was correct, and I just broke it: I need to put the slot attention mask in the i-th row (corresponding to batch i). I am going to fix it.
It's useless to put another value, and the index was put there by mistake.
What does this PR do?
This PR implements a workaround for a problem that appears when compilation is used with torch_xla on TPU and direct assignment is used on inputs.
After fixing that, it is possible to enable compilation by default on models that support static cache, such as gemma, leading to improved decoding performance after the first token is decoded.
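As an illustration only (the helper name and the condition are assumptions about the setup described here, not code from this repository), enabling compilation for such models could look roughly like this, using the `openxla` dynamo backend provided by torch_xla:

```python
import torch

def maybe_compile_for_decoding(model, supports_static_cache: bool):
    # With a static KV cache, input shapes stay fixed across decoding steps,
    # so the XLA graph traced by torch.compile can be reused for every token
    # after the first one instead of triggering recompilation.
    if supports_static_cache:
        return torch.compile(model, backend="openxla")
    return model
```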