TGI: optimize continuous batching and improve export #506
Conversation
Force-pushed from 04d0948 to c987db5, then from 7c4ee46 to 121f197.
LGTM, just left a question.
```python
import time

from optimum.neuron import NeuronModelForCausalLM

start = time.time()
model = NeuronModelForCausalLM.from_pretrained(model_id, export=True, **export_kwargs)
end = time.time()
logger.info(f"Model successfully exported in {end - start:.2f} s.")
logger.info(f"Saving exported model to local storage under {export_path}.")
log_cache_size()
```
Will there be anything cached in `HF_HUB_CACHE`? I thought there would only be a cache in `neuron_cache_path` (`/var/tmp/neuron-compile-cache`) after a fresh export.
We need some space to expand the checkpoints and store the cached artifacts when we export a hub model.
Whenever we use the hub client library to fetch something (when calling `from_pretrained`), it ends up under `HF_HUB_CACHE`, so that location is supposed to be writable. I therefore use the same destination when exporting.
When we deploy our own containers, we mount a writable `/data` volume. With SageMaker, we only have a writable `/tmp` volume of 30 GB that can be expanded to 512 GB.
This is why I added those logs: to identify disk-space issues when exporting models.
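For context, here is a minimal sketch of what a disk-usage helper like the `log_cache_size` call quoted above could look like; this implementation is an assumption, not necessarily what the PR ships.

```python
import logging
import os
import shutil

logger = logging.getLogger(__name__)

def log_cache_size():
    # Assumption: fall back to the same default location huggingface_hub
    # uses when HF_HUB_CACHE is unset.
    path = os.environ.get("HF_HUB_CACHE", os.path.expanduser("~/.cache/huggingface/hub"))
    if os.path.exists(path):
        # Filesystem-level usage for the volume holding the cache.
        usage = shutil.disk_usage(path)
        gib = 1024 ** 3
        logger.info(
            f"Disk usage for {path}: {usage.used / gib:.2f} GiB used, "
            f"{usage.free / gib:.2f} GiB free of {usage.total / gib:.2f} GiB."
        )
    else:
        logger.info(f"Cache path {path} does not exist yet.")
```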
What does this PR do?
- Optimizes continuous batching in the `transformers-neuronx` implementation: instead of dropping the KV cache when adding new requests and rebuilding it from cached texts, we simply omit the pending requests when calling `model.forward`, specifying only the indices of the new requests to prefill (see the first sketch below).
- Adds a llama TGI unit test that specifically verifies the results are still correct after that change (for Llama and Mistral, `transformers-neuronx` continuous batching is always on).
- For SageMaker deployment, adds some disk usage logs when fetching/exporting a model.
- During export, fetches the model generation config to provide default values (see the second sketch below).
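To make the batching change concrete, here is a schematic sketch of the new prefill path. Everything below (`Slot`, `Batch`, the `forward(input_ids=..., seq_ids=...)` signature) is an illustrative assumption, not the actual optimum-neuron or transformers-neuronx API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Slot:
    input_ids: List[int]      # prompt tokens for this request
    prefilled: bool = False   # whether its KV cache rows are filled

@dataclass
class Batch:
    slots: List[Slot] = field(default_factory=list)

def prefill(model, batch: Batch):
    # Old behaviour: drop the whole KV cache and re-prefill every slot from
    # its cached text. New behaviour: pass only the indices of the slots that
    # were just added, leaving the KV cache rows of running requests intact.
    new_indices = [i for i, slot in enumerate(batch.slots) if not slot.prefilled]
    input_ids = [batch.slots[i].input_ids for i in new_indices]
    # Hypothetical call: seq_ids tells the model which KV cache rows to fill.
    model.forward(input_ids=input_ids, seq_ids=new_indices)
    for i in new_indices:
        batch.slots[i].prefilled = True
```

The point is that prefill becomes incremental: running requests keep their KV cache entries, so adding a request no longer costs a full re-prefill of the batch.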
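And a minimal sketch of the generation-config fetch from the last point, reusing the `model_id` from the export snippet above; the exact integration point in the PR may differ.

```python
from transformers import GenerationConfig

# Fetch the hub generation config so its values (e.g. eos_token_id, top_k,
# temperature) can serve as defaults when a request does not override them.
generation_config = GenerationConfig.from_pretrained(model_id)
```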