Tgi correct clear implementation #609

dacorvo · 2024-05-27T07:14:03Z

What does this PR do?

This pull-request fixes an issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called on the TGI server:

- If the request is cancelled after prefill, the router asks the server to "filter" the corresponding request out of the decoding batch. This is correctly implemented.
- If the request is cancelled during prefill, the router asks the server to clear the whole prefill batch. This was not correctly implemented: in that configuration we cleared all requests, even those not included in that prefill batch.

We now remember which prefill batch each request belongs to, so that we can clear only the relevant requests.
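As a rough illustration of the bookkeeping described above, here is a minimal sketch. The `Slot` and `Generator` names, their fields, and the method signatures are hypothetical and are not taken from the actual server code. The sketch only shows the idea of tagging each request with its prefill batch so that a clear can be restricted to that batch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Slot:
    """Hypothetical per-request state (the real generator tracks much more)."""
    request_id: int
    batch_id: Optional[int] = None  # prefill batch this request was assigned to


class Generator:
    """Hypothetical generator that remembers the prefill batch of each request."""

    def __init__(self) -> None:
        self.slots: Dict[int, Slot] = {}

    def prefill(self, batch_id: int, request_ids: List[int]) -> None:
        # Record the owning prefill batch when the request is assigned a slot.
        for request_id in request_ids:
            self.slots[request_id] = Slot(request_id, batch_id)

    def clear(self, batch_id: Optional[int] = None) -> List[int]:
        """Release the requests of one prefill batch, or all of them if no id is given."""
        released = [
            request_id
            for request_id, slot in self.slots.items()
            if batch_id is None or slot.batch_id == batch_id
        ]
        for request_id in released:
            del self.slots[request_id]
        return released
```

With this layout, clearing prefill batch 0 leaves the requests of any other batch untouched, which is the behaviour the fix aims for.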

Note that this pull-request also fixes a bug with models deployed on TGI that do not support continuous batching (i.e. gpt2 and older models).

When all requests from a prefill batch are cancelled, the router will not send a filter request, but rather a clear cache request with the batch_id. We previously ignored that value and cleared everything.
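The dispatch implied by this paragraph could look roughly like the sketch below. `ClearCacheRequest` here is a plain stand-in dataclass, not the real router message, and `handle_clear_cache` and the `batches` mapping are hypothetical names chosen for illustration. The assumption is that the request carries an optional batch id, which is honoured when present.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ClearCacheRequest:
    """Stand-in for the router's message; `id` is None when no batch is targeted."""
    id: Optional[int] = None


def handle_clear_cache(request: ClearCacheRequest, batches: Dict[int, List[int]]) -> None:
    """Drop one cached batch when an id is given, otherwise drop everything.

    `batches` maps a batch id to the request ids it contains (hypothetical layout).
    """
    if request.id is None:
        # No batch id: the whole cache is cleared.
        batches.clear()
    else:
        # Batch id provided: only that batch is released; unknown ids are ignored.
        batches.pop(request.id, None)
```

Ignoring the id, as described above for the previous behaviour, would correspond to always taking the first branch.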

JingyaHuang

LGTM, thanks for the fix!

This clears a potential issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called on the TGI server:

- if the request is cancelled after prefill, then the router asks the server to "filter" the decoding batch from the corresponding request. This is correctly implemented,
- if the request is cancelled during prefill, then the router asks the server to clear the whole prefill batch. This was not correctly implemented because in that configuration we cleared all requests, even those not included in that prefill batch.

This is now fixed, basically reproducing TGI Neuron fix: huggingface/optimum-neuron#609

* fix(tgi): remove all the variables from entrypoint.sh
* fix(tgi): correct version
* fix(tgi): pin numpy version <2.0
* feat(tgi): entrypoint adds GKE specific command
* fix(generator): correct CachedBatch serialization when it's None

  This was generating a tricky error when calling "/health" at the server startup: this was calling prefill and returning None as the cached batch, that was failing to be serialized.
* feat(generator): prefill input preparation is done on CPU

  Doing that on TPU seems to slow down (due to compilation?) and takes a lot of memory.
* feat(generator): decode input preparation is done on CPU
* feat(generator): support TGI truncate parameter in Request
* fix(generator): warmup clears after prefill

  This allows to correctly handle warmup.
* fix(tgi): correct clear implementation

  This clears a potential issue when clearing TGI requests. When a client cancels a TGI request, two different methods can be called on the TGI server:

  - if the request is cancelled after prefill, then the router asks the server to "filter" the decoding batch from the corresponding request. This is correctly implemented,
  - if the request is cancelled during prefill, then the router asks the server to clear the whole prefill batch. This was not correctly implemented because in that configuration we cleared all requests, even those not included in that prefill batch.

  This is now fixed, basically reproducing TGI Neuron fix: huggingface/optimum-neuron#609
* feat(ci): release TGI images only when release is published
* chore(generator): turn log info -> debug on clear

---------

Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>

dacorvo added 2 commits May 27, 2024 07:15

fix(tgi): allow wrapper script to catch SIGTERM

cf838d0

fix(tgi): bogus max_new_tokens with static batching

99f1937

dacorvo force-pushed the tgi_correct_clear_implementation branch from fcce2f8 to 0e32b71 Compare May 27, 2024 07:15

fix(tgi): allow clearing requests from a single batch

58eafcb

When all requests from a prefill batch are cancelled, the router will not send a filter request, but rather a clear cache request with the batch_id. We previously ignored that value and cleared everything.

dacorvo force-pushed the tgi_correct_clear_implementation branch from 0e32b71 to 58eafcb Compare May 27, 2024 08:12

dacorvo marked this pull request as ready for review May 27, 2024 08:50

dacorvo requested review from michaelbenayoun and JingyaHuang May 27, 2024 08:50

JingyaHuang approved these changes May 27, 2024


dacorvo merged commit 4a21d96 into main May 27, 2024
1 check passed

dacorvo deleted the tgi_correct_clear_implementation branch May 27, 2024 09:47
