server: stop generation at n_ctx_train if n_predict is not set #6638

Conversation
Before we make the change, we should see if the …
@ggerganov I have created a web application to stress-test the server and see how it handles multiple clients sending random questions and documents simultaneously. I tested it with four clients using Mixtral 8x7B Q8_0 on 3x RTX 3090 for one hour against master, and the server didn't encounter any issues. But I will keep investigating in the meantime.
Agreed, I have been running performance and capacity tests for 2+ months, and there is no such bug. The server is stable and production-ready.
Alongside an optional cap, I think we should make the server stop generating when the connection is closed for whatever reason (clients may well have a timeout or interrupt things manually, but the server keeps going / stays busy needlessly). Maybe an interrupt-check callback to call before generating each token?
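A rough sketch of that idea, with illustrative names (completion_slot, is_connection_alive) rather than the server's actual internals:

```cpp
#include <functional>

// One slot per in-flight completion; the HTTP layer supplies a probe that
// reports whether the client connection is still open.
struct completion_slot {
    std::function<bool()> is_connection_alive;
    bool has_next_token = true;
};

// Called before generating each token: if the client has gone away, stop
// generating so the slot is freed instead of staying busy needlessly.
void check_interrupt(completion_slot & slot) {
    if (slot.is_connection_alive && !slot.is_connection_alive()) {
        slot.has_next_token = false;
    }
}
```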
Yeah, it is identified in: …
We can conclude that the user was using an old version. That's it.
@ggerganov, finally, I would prefer not to go this way but to stop the generation at n_ctx_train.
Ok. I'm not sure we have to do anything at this point - it seems the latest version works OK.
There should still be some limit to avoid getting into an infinite loop in the server.
@ggerganov @slaren please have a look at this proposal.
When this happens, the response contains:

```json
"truncated": true,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": false,
```

I am not familiar with the meaning of each of these flags. Should this be different? Maybe stopped_limit should be true?
server: infinite loop: set stop limit to true
Maybe it would be simpler to set a default value for n_predict?
Yeah, that was the first version, but I feel it is noisy to log this warning on each request: 6fd5ad5
That's not exactly what I mean. Basically I would just change the default to …
I see; I am OK with both solutions, even if setting n_predict all the time will be sort of a breaking change. AFAIK not all models hallucinate, and not on every completion; normally the model should always emit an EOS token if the trained chat template is used in the chat completion endpoint. @ggerganov, up to you, but we need to put a stop to this recurring infinite-loop concern one way or another.
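A minimal sketch of the "set a default for n_predict" option being discussed; the helper and its name are hypothetical:

```cpp
// If the client did not set n_predict (conventionally -1), cap it at the
// model's training context so an EOS-less generation cannot loop forever.
int effective_n_predict(int n_predict, int n_ctx_train) {
    if (n_predict < 0) {
        return n_ctx_train;
    }
    return n_predict;
}
```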
This would be simpler if context shifting were opt-in; then there would always be a hard limit of n_ctx.
Oh yes, and it is so slow in the current implementation, blocking the whole server.
@ggerganov I think with the removal of hard-coded stop tokens, this PR is becoming more important.
Yes, let's do that. Context-shift has to be refactored and become optional (in a future PR).
@ggerganov @slaren Finally, I prefer to keep checking at each token that we do not exceed n_ctx_train.
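A sketch of that per-token check, including the prompt tokens in the limit as the final commits do; the struct and field names are modeled on the discussion, not copied from the server:

```cpp
// State for one completion slot (illustrative subset of fields).
struct slot_state {
    int  n_prompt_tokens = 0;  // tokens consumed by the prompt
    int  n_decoded       = 0;  // tokens generated so far
    int  n_ctx_train     = 0;  // context size the model was trained with
    bool has_next_token  = true;
    bool truncated       = false;
    bool stopped_limit   = false;
};

// Called once per generated token (in the spirit of process_token): if the
// client did not set n_predict, stop once prompt + generated tokens reach
// n_ctx_train, and report it through the stop flags.
void check_n_ctx_train_limit(slot_state & slot, int n_predict) {
    if (n_predict < 0 && slot.n_prompt_tokens + slot.n_decoded >= slot.n_ctx_train) {
        slot.truncated      = true;
        slot.stopped_limit  = true;  // matches "set stop limit to true" above
        slot.has_next_token = false;
    }
}
```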
server: stop generation at n_ctx_train if n_predict is not set (ggerganov#6638)
* server: cap n_predict if not set to n_ctx_train
* server: fix infinite loop
* server: infinite loop, move in process_token; server: infinite loop: set stop limit to true
* minor: spaces
* minor: spaces
* server: include prompt tokens in the EOS limit
…n_predict (ggerganov#6935)
* ci: server: fix python env
* ci: server: fix server tests after ggerganov#6638
* ci: server: fix windows is not building PR branch
Context
If the model hallucinates (EOS-less generation), the server will go into an infinite loop if n_predict is not set. It is a wrong usage of the server or the model: see the --ctx-size discussion in #6617 (comment).

But as it brings confusion, I propose to stop the generation at the size of the context with which the model was trained (n_ctx_train) if self-extend context is disabled.
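A minimal sketch of the gating described above, assuming self-extend is controlled by a group-attention factor (called ga_n here, with ga_n > 1 meaning enabled); the function name and parameters are illustrative, not the server's actual code:

```cpp
// Illustrative gate: only cap generation at n_ctx_train when the client did
// not set n_predict and self-extend is disabled.
bool should_cap_at_n_ctx_train(int n_predict, int ga_n) {
    return n_predict < 0   // n_predict not provided in the request
        && ga_n == 1;      // self-extend (group attention) disabled
}
```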
Tests
```sh
curl http://localhost:8080/completion --data '{"prompt": "hallucinate"}'
```