Add static KV cache and test on Gemma-2B #4
Conversation
It previously used n_positions in some cases, but that attribute is not available on some model configs.
Force-pushed from c342e4f to 5542841.
If the DBG_DEVICE env var is set, it will be used to set the device for the model.
This will avoid loading the model twice.
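A minimal sketch of what such an override can look like (the helper name and default are assumptions, not the PR's actual code):

import os

import torch


def resolve_device(default: str = "xla") -> torch.device:
    # If DBG_DEVICE is set, it takes precedence over the default device.
    return torch.device(os.environ.get("DBG_DEVICE", default))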
Make compilation optional; it can be enabled with the environment variable DBG_COMPILE. This is because:
1. Some models produce bugs when compiled (notably Gemma).
2. The shapes of the models' inference input params change, triggering recompilation and leading to slow performance.
3. With the added xm.mark_step, performance is actually better when the model is not compiled: XLA builds a graph anyway, so performance is going to be good.
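A hedged sketch of gating compilation behind an env var; the helper and the torch.compile backend string are assumptions:

import os

import torch
import torch_xla.core.xla_model as xm


def maybe_compile(model):
    # Compile only when explicitly requested: some models (notably Gemma)
    # produce bugs when compiled, and changing input shapes trigger slow
    # recompilations.
    if os.environ.get("DBG_COMPILE"):
        model = torch.compile(model, backend="openxla")  # backend name is an assumption
    return model


# Even without torch.compile, XLA traces lazily; calling xm.mark_step()
# after each decoding step cuts and executes the accumulated graph.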
This is to reduce useless gradient calculations.
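This typically amounts to running the inference path under torch.no_grad(); a minimal sketch:

import torch


@torch.no_grad()  # skip autograd bookkeeping during generation
def decode_step(model, input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask)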
This will make it possible to handle passing different params for different model configurations later.
Some models, like Gemma and Llama, support static KV cache in transformers. For these, it is possible to use this feature, leading to much higher performance.
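A minimal sketch of opting into the static cache, assuming a transformers release recent enough to support it on these models:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
# Supported models (e.g. Gemma, Llama) can pre-allocate the KV cache to a
# fixed shape, which keeps XLA graph shapes stable across decoding steps.
model.generation_config.cache_implementation = "static"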
Also, manually install accelerate to avoid memory issues when loading Gemma.
Force-pushed from 5542841 to 27a2669.
The test produces different results because some operations are now done in a slightly different order.
LGTM - a few more comments for further reflection moving forward - congrats!
self._id = id
self._tokenizer = tokenizer
self.clear()
self._device = device
Maybe let's do the conversion from str to torch.device() right away here to ensure we can fail fast if this device doesn't exist and avoid overhead later down the road?
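What the suggestion amounts to, as a sketch (the class and parameter names are illustrative):

import torch


class Slot:
    def __init__(self, device: str):
        # Convert once at construction; a malformed device string fails
        # fast here instead of at first use.
        self._device = torch.device(device)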
The conversion does not check that the device is available. The only way I found to check whether the device is available is to invoke the torch_xla API directly. I can add a check before mapping the model if you wish.
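For reference, a sketch of probing availability through the torch_xla API directly; exact behavior depends on the torch_xla version:

import torch_xla.core.xla_model as xm

# Lists the XLA devices the runtime can actually see, e.g. ['xla:0'].
devices = xm.get_xla_supported_devices()
if not devices:
    raise RuntimeError("No XLA device available")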
As discussed offline, adding such a check is probably unnecessary, given that the check will happen implicitly while mapping the model.
)
# Update mask only if it was set previously
if self._mask is not None:
    self._mask = torch.cat([self._mask, torch.tensor([1], device=self._device, dtype=self._mask.dtype)])
Maybe for later: can this concatenation be replaced by an in-place set from 0 to 1?
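What the in-place alternative could look like, as a sketch (the pre-allocated length and positions are assumptions):

import torch

max_len = 1024  # assumed pre-allocated maximum sequence length
mask = torch.zeros(max_len, dtype=torch.int64)
mask[:16] = 1   # positions filled at prefill (16 is illustrative)
mask[16] = 1    # flip 0 -> 1 in place instead of torch.cat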
Sure, I'll make a note of it.
Having said that: this is handled transparently by models that use the static cache, so I guess they already do that inside the model.
What does this PR do?
This PR adapts the TGI server to better take advantage of PyTorch/XLA graphs. Relevant changes:
All this leads to general performance enhancements: even though I added a test with a new model, the tests run in 4m40s, whereas before they were running in 5m21s.