server: stop generation at n_ctx_train if n_predict is not set #6638

Conversation
Before we make the change, we should see if the …
@ggerganov I have created a web application to stress-test the server and see how it handles multiple clients sending random questions and documents simultaneously. I tested it with four clients using Mixtral 8x7B Q8_0 on 3x RTX 3090 for one hour against master, and the server didn't encounter any issues. But I will keep investigating in the meantime.
Agreed, I have been running performance and capacity tests for 2+ months, and there is no such bug. The server is stable and production-ready.
Alongside an optional cap, I think we should make the server stop generating when the connection is closed for whatever reason (clients may well have a timeout or interrupt things manually, but the server keeps going / stays busy needlessly). Maybe an interrupt-check callback to call before generating each token?
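A rough sketch of that idea, with illustrative names (completion_slot, is_connection_alive) rather than the server's actual internals:

```cpp
#include <functional>

// One slot per in-flight completion; the HTTP layer supplies a probe that
// reports whether the client connection is still open.
struct completion_slot {
    std::function<bool()> is_connection_alive;
    bool has_next_token = true;
};

// Called before generating each token: if the client has gone away, stop
// generating so the slot is freed instead of staying busy needlessly.
void check_interrupt(completion_slot & slot) {
    if (slot.is_connection_alive && !slot.is_connection_alive()) {
        slot.has_next_token = false;
    }
}
```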
Yeah, it is identified in: …
We can conclude that the user was using an old version. That's it.
@ggerganov, finally, I would prefer not to go this way but to stop the generation at n_ctx_train.
Ok. I'm not sure we have to do anything at this point - it seems the latest version works OK.
There should still be some limit to avoid getting into an infinite loop in the server.
@ggerganov @slaren please have a look at this proposal.
When this happens, the response contains:

```json
"truncated": true,
"stopped_eos": false,
"stopped_word": false,
"stopped_limit": false,
```

I am not familiar with the meaning of each of these flags. Should this be different? Maybe stopped_limit should be true?
server: infinite loop: set stop limit to true
Maybe it would be simpler to set a default value for n_predict?
Yeah, that was the first version, but I feel it is noisy to log this warning on each request: 6fd5ad5
That's not exactly what I mean. Basically I would just change the default to …
I see; I am OK with both solutions, even if setting n_predict all the time will be sort of a breaking change. AFAIK not all models hallucinate, and not on every completion; normally the model should always emit an EOS token if the trained chat template is used in the chat completion endpoint. @ggerganov, up to you, but we need to put a stop to this recurring infinite-loop concern one way or another.
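A minimal sketch of the "set a default for n_predict" option being discussed; the helper and its name are hypothetical:

```cpp
// If the client did not set n_predict (conventionally -1), cap it at the
// model's training context so an EOS-less generation cannot loop forever.
int effective_n_predict(int n_predict, int n_ctx_train) {
    if (n_predict < 0) {
        return n_ctx_train;
    }
    return n_predict;
}
```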
This would be simpler if context shifting were opt-in; then there would always be a hard limit of n_ctx.
Oh yes, and it is so slow in the current implementation, blocking the whole server.
@ggerganov I think with the removal of hard-coded stop tokens, this PR is becoming more important.
Yes, let's do that. Context-shift has to be refactored and become optional (in a future PR).
@ggerganov @slaren Finally, I prefer to keep checking at each token that we do not exceed n_ctx_train.
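A sketch of that per-token check, including the prompt tokens in the limit as the final commits do; the struct and field names are modeled on the discussion, not copied from the server:

```cpp
// State for one completion slot (illustrative subset of fields).
struct slot_state {
    int  n_prompt_tokens = 0;  // tokens consumed by the prompt
    int  n_decoded       = 0;  // tokens generated so far
    int  n_ctx_train     = 0;  // context size the model was trained with
    bool has_next_token  = true;
    bool truncated       = false;
    bool stopped_limit   = false;
};

// Called once per generated token (in the spirit of process_token): if the
// client did not set n_predict, stop once prompt + generated tokens reach
// n_ctx_train, and report it through the stop flags.
void check_n_ctx_train_limit(slot_state & slot, int n_predict) {
    if (n_predict < 0 && slot.n_prompt_tokens + slot.n_decoded >= slot.n_ctx_train) {
        slot.truncated      = true;
        slot.stopped_limit  = true;  // matches "set stop limit to true" above
        slot.has_next_token = false;
    }
}
```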
server: stop generation at n_ctx_train if n_predict is not set (ggerganov#6638)
* server: cap n_predict if not set to n_ctx_train
* server: fix infinite loop
* server: infinite loop, move in process_token; server: infinite loop: set stop limit to true
* minor: spaces
* minor: spaces
* server: include prompt tokens in the EOS limit
…n_predict (ggerganov#6935)
* ci: server: fix python env
* ci: server: fix server tests after ggerganov#6638
* ci: server: fix windows is not building PR branch
Context
If the model hallucinates (EOS-less generation), the server will go into an infinite loop if n_predict is not set. It is a wrong usage of the server or the model: see the --ctx-size discussion in #6617 (comment).

But as it brings confusion, I propose to stop the generation at the size of the context with which the model was trained (n_ctx_train) if self-extend context is disabled.
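A minimal sketch of the gating described above, assuming self-extend is controlled by a group-attention factor (called ga_n here, with ga_n > 1 meaning enabled); the function name and parameters are illustrative, not the server's actual code:

```cpp
// Illustrative gate: only cap generation at n_ctx_train when the client did
// not set n_predict and self-extend is disabled.
bool should_cap_at_n_ctx_train(int n_predict, int ga_n) {
    return n_predict < 0   // n_predict not provided in the request
        && ga_n == 1;      // self-extend (group attention) disabled
}
```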
Tests
```sh
curl http://localhost:8080/completion --data '{"prompt": "hallucinate"}'
```