feat(router): drop requests when client closes the channel #202

OlivierDehaene · 2023-04-19T14:25:06Z

No description provided.

njhill · 2023-04-19T15:46:05Z

@OlivierDehaene I notice that you've incorporated much of #138 here. I had started attempting to break that into smaller PRs, but I guess there's probably no point in continuing that.

I am still curious what you think about moving the stopping evaluation logic to the router as I had done in #138?

OlivierDehaene · 2023-04-19T16:29:31Z

hey @njhill!
Sorry that you were porting #138 to a smaller PR at the same time =/
I need this garbage collection logic asap for one of our deployments.
I'm adding you on this PR as co-author.

> I am still curious what you think about moving the stopping evaluation logic to the router as I had done in #138?

For now I'm not sure how that would work. I'm against supporting only Rust tokenizers, so having the decoding logic in the server while having the stop logic in the router would be a bit strange.

I want to investigate if it would be possible to spawn a Python interpreter in the router with PyO3 when we don't have a Rust tokenizer. If it's easy enough then we can move everything to the router.

But I think right now the prio for this repo is to stabilize and go for a v1.0.0.

njhill · 2023-04-19T16:46:06Z

Thanks @OlivierDehaene

> For now I'm not sure how that would work.

Since you were against moving the detokenization to the router, I was just referring to the stopping criteria. I think that could still be done there even if it's still strings being streamed back from the shards. But fair enough, we could always reevaluate after these changes.
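To make the idea concrete, here is a minimal sketch of evaluating stop sequences on text chunks as they arrive, rather than on token ids. It is not the code from #138 or from the router; `StopSequenceChecker` and `consume` are hypothetical names, and the sketch assumes the shards stream plain strings.

```python
from typing import Iterable, List, Optional, Tuple


class StopSequenceChecker:
    """Hypothetical helper: detects stop sequences in text streamed chunk by chunk."""

    def __init__(self, stop_sequences: List[str]) -> None:
        self.stop_sequences = [s for s in stop_sequences if s]
        # Only the trailing text that could still start a match needs to be kept.
        self.max_len = max((len(s) for s in self.stop_sequences), default=0)
        self.tail = ""

    def push(self, chunk: str) -> Optional[str]:
        """Feed one streamed chunk; return the matched stop sequence, if any."""
        self.tail += chunk
        for stop in self.stop_sequences:
            if stop in self.tail:
                return stop
        # A stop sequence can straddle chunks, so retain max_len - 1 characters.
        keep = max(self.max_len - 1, 0)
        self.tail = self.tail[-keep:] if keep else ""
        return None


def consume(chunks: Iterable[str], stop_sequences: List[str]) -> Tuple[List[str], Optional[str]]:
    """Collect chunks until a stop sequence is seen (assumed router-side loop)."""
    checker = StopSequenceChecker(stop_sequences)
    emitted: List[str] = []
    for chunk in chunks:
        emitted.append(chunk)
        matched = checker.push(chunk)
        if matched is not None:
            return emitted, matched
    return emitted, None
```

For example, `consume(["Hello", " wor", "ld\nUs", "er:", " ignored"], ["\nUser:"])` stops after the fourth chunk, even though the stop sequence is split across two chunks.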

> I want to investigate if it would be possible to spawn a Python interpreter in the router with PyO3 when we don't have a Rust tokenizer.

That would be cool. FWIW I haven't encountered any models with tokenizers that didn't work if converted (i.e. doing AutoTokenizer.from_pretrained() followed by .save_pretrained()).

> I need this garbage collection logic asap for one of our deployments.

Is this related to flash attention? I'd coincidentally also made some related changes that I was about to open a PR for. But I'll hold off until you merge this to avoid the churn.

OlivierDehaene · 2023-04-19T17:12:14Z

> I was just referring to the stopping criteria.

Yes, but you need to handle the generated_text in the final token payload. Today it acts as a ground truth for some users when the streaming method does not work perfectly, so we have to keep decoding it independently from the full sequence of ids, and we can only do that in the server. So it will require a back and forth between the router and the server, right?
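As an illustration of why the streamed pieces and the full decode can disagree, here is a toy sketch. It assumes a byte-level vocabulary in which one character is split across two tokens; the `VOCAB` table and both function names are made up for the example and do not come from the server code.

```python
from typing import Dict, List

# Toy byte-level vocabulary (hypothetical ids): "é" is split across ids 1 and 2.
VOCAB: Dict[int, bytes] = {0: b"caf", 1: b"\xc3", 2: b"\xa9", 3: b"!"}


def decode(ids: List[int]) -> str:
    """Decode a whole id sequence at once, as the final generated_text would be."""
    return b"".join(VOCAB[i] for i in ids).decode("utf-8", errors="replace")


def stream_decode(ids: List[int]) -> str:
    """Decode each id on its own and concatenate, as a naive stream would."""
    return "".join(decode([i]) for i in ids)
```

With `ids = [0, 1, 2, 3]`, the full decode yields the intended text while the per-token decode yields replacement characters for the split character, which is the kind of mismatch a full-sequence generated_text protects against.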

> FWIW I haven't encountered any models with tokenizers that didn't work if converted (i.e. doing AutoTokenizer.from_pretrained() followed by .save_pretrained()).

I think THUDM/chatglm-6b is an example of such a model.

> Is this related to flash attention?

No, it's just a model that is under heavy load, with requests that can take a while, so some requests end up waiting in the queue for tens of seconds. I want to make sure that once they go through, the client hasn't already timed out.
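A rough sketch of that behaviour, written in Python for brevity even though the router itself is Rust: entries whose client has gone away are skipped when the next batch is built. `Entry`, `Queue`, and the `client_connected` flag are hypothetical stand-ins for a check on whether the response channel is still open, not the names used in this PR.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Entry:
    """Hypothetical queued request; the flag stands in for a channel-open check."""
    request_id: int
    client_connected: bool = True


class Queue:
    def __init__(self) -> None:
        self.entries: Deque[Entry] = deque()
        self.dropped = 0  # could back a metric counting dropped requests

    def append(self, entry: Entry) -> None:
        self.entries.append(entry)

    def next_batch(self, max_size: int) -> List[Entry]:
        """Pop up to max_size live entries, discarding those with no client left."""
        batch: List[Entry] = []
        while self.entries and len(batch) < max_size:
            entry = self.entries.popleft()
            if not entry.client_connected:
                # The client timed out or disconnected while waiting: skip it.
                self.dropped += 1
                continue
            batch.append(entry)
        return batch
```

Checking at batching time means a request that waited a long time in the queue costs nothing on the model side once its client has given up.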

OlivierDehaene · 2023-04-20T15:05:07Z

@njhill, I'm sorry I completely forgot to add you as co-author =/
I wish github had a better way of doing this...

njhill · 2023-04-20T17:43:41Z

@OlivierDehaene no worries. I have a few more things to contribute :)

OlivierDehaene force-pushed the feat/time_limit branch from 30cf546 to b908ca7 Compare April 19, 2023 16:17

OlivierDehaene force-pushed the feat/time_limit branch from 6d5a5f0 to 766c0ab Compare April 19, 2023 18:00

OlivierDehaene mentioned this pull request Apr 19, 2023

GPU Memory Cache not cleared. #209

Closed

njhill mentioned this pull request Apr 20, 2023

feat(router): Dynamic batch sizing #210

Closed

OlivierDehaene added 9 commits April 20, 2023 10:52

wip

4e63d9c

wip

9476170

wip

2ad7a63

fix tests for causal lm

d957815

fix queue

118f33d

make batch optional again

94ff101

push test image

ca98470

add metrics

521f620

revert build

3652d82

OlivierDehaene force-pushed the feat/time_limit branch from 766c0ab to 3652d82 Compare April 20, 2023 08:53

OlivierDehaene merged commit 709d893 into main Apr 20, 2023

OlivierDehaene deleted the feat/time_limit branch April 20, 2023 09:07

gsaivinay mentioned this pull request Jun 26, 2023

Proper way for client to stop generate_stream #495

Closed

drbh mentioned this pull request Jan 29, 2024

proposal: Move token decoding and stopping evaluation to router #138

Closed
