Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tgi correct clear implementation #609

Merged
merged 3 commits into from
May 27, 2024
Merged

Conversation

dacorvo
Copy link
Collaborator

@dacorvo dacorvo commented May 27, 2024

What does this PR do?

This pull-request fixes an issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called on the TGI server:

  • if the request is cancelled after prefill, then the router asks the server to "filter" the decoding batch from the corresponding request. This is correctly implemented,
  • if the request is cancelled during prefill, then the router asks the server to clear the whole prefill batch. This was not correctly implemented because in that configuration we cleared all requests, even those not included in that prefill batch.

We now remember to which prefill batch each request belongs to be able to clear only the relevant requests.

Note that this pull-request also fixes a bug with models deployed on TGI that do not support continuous batching (i.e. gpt2 and older models).

@dacorvo dacorvo force-pushed the tgi_correct_clear_implementation branch from fcce2f8 to 0e32b71 Compare May 27, 2024 07:15
When all requests from a prefill batch are cancelled, the router will not send
a filter request, but rather a clear cache request with the batch_id.
We previously ignored that value and cleared everything.
@dacorvo dacorvo force-pushed the tgi_correct_clear_implementation branch from 0e32b71 to 58eafcb Compare May 27, 2024 08:12
@dacorvo dacorvo marked this pull request as ready for review May 27, 2024 08:50
Copy link
Collaborator

@JingyaHuang JingyaHuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix!

@dacorvo dacorvo merged commit 4a21d96 into main May 27, 2024
1 check passed
@dacorvo dacorvo deleted the tgi_correct_clear_implementation branch May 27, 2024 09:47
tengomucho added a commit to huggingface/optimum-tpu that referenced this pull request Jul 1, 2024
This clears a potential issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called
on the TGI server:

- if the request is cancelled after prefill, then the router asks the
  server to "filter" the decoding batch from the corresponding request.
  This is correctly implemented,
- if the request is cancelled during prefill, then the router asks the
  server to clear the whole prefill batch. This was not correctly
  implemented because in that configuration we cleared all requests,
  even those not included in that prefill batch.

This is now fixed, basically reproducing TGI Neuron fix:
huggingface/optimum-neuron#609
tengomucho added a commit to huggingface/optimum-tpu that referenced this pull request Jul 1, 2024
This clears a potential issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called
on the TGI server:

- if the request is cancelled after prefill, then the router asks the
  server to "filter" the decoding batch from the corresponding request.
  This is correctly implemented,
- if the request is cancelled during prefill, then the router asks the
  server to clear the whole prefill batch. This was not correctly
  implemented because in that configuration we cleared all requests,
  even those not included in that prefill batch.

This is now fixed, basically reproducing TGI Neuron fix:
huggingface/optimum-neuron#609
tengomucho added a commit to huggingface/optimum-tpu that referenced this pull request Jul 2, 2024
This clears a potential issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called
on the TGI server:

- if the request is cancelled after prefill, then the router asks the
  server to "filter" the decoding batch from the corresponding request.
  This is correctly implemented,
- if the request is cancelled during prefill, then the router asks the
  server to clear the whole prefill batch. This was not correctly
  implemented because in that configuration we cleared all requests,
  even those not included in that prefill batch.

This is now fixed, basically reproducing TGI Neuron fix:
huggingface/optimum-neuron#609
tengomucho added a commit to huggingface/optimum-tpu that referenced this pull request Jul 2, 2024
This clears a potential issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called
on the TGI server:

- if the request is cancelled after prefill, then the router asks the
  server to "filter" the decoding batch from the corresponding request.
  This is correctly implemented,
- if the request is cancelled during prefill, then the router asks the
  server to clear the whole prefill batch. This was not correctly
  implemented because in that configuration we cleared all requests,
  even those not included in that prefill batch.

This is now fixed, basically reproducing TGI Neuron fix:
huggingface/optimum-neuron#609
tengomucho added a commit to huggingface/optimum-tpu that referenced this pull request Jul 3, 2024
* fix(tgi): remove all the variables from entrypoint.sh

* fix(tgi): correct version

* fix(tgi): pin numpy version <2.0

* feat(tgi): entrypoint adds GKE specific command

* fix(generator): correct CachedBatch serialization when it's None

This was generating a tricky error when calling "/health" at the server
startup: this was calling prefill and returning None as the cached
batch, that was failing to be serialized.

* feat(generator): prefill input preparation is done on CPU

Doing that on TPU seems to slow down (due to compilation?) and takes a
lot of memory.

* feat(generator): decode input preparation is done on CPU

* feat(generator): support TGI truncate parameter in Request

* fix(generator): warmup clears after prefill

This allows to correctly handle warmup.

* fix(tgi): correct clear implementation

This clears a potential issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called
on the TGI server:

- if the request is cancelled after prefill, then the router asks the
  server to "filter" the decoding batch from the corresponding request.
  This is correctly implemented,
- if the request is cancelled during prefill, then the router asks the
  server to clear the whole prefill batch. This was not correctly
  implemented because in that configuration we cleared all requests,
  even those not included in that prefill batch.

This is now fixed, basically reproducing TGI Neuron fix:
huggingface/optimum-neuron#609

* feat(ci): release TGI images only when release is published

* chore(generator): turn log info -> debug on clear

---------

Co-authored-by: Morgan Funtowicz <funtowiczmo@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants