Server: Add prompt processing progress endpoint? #6586
Comments
Have you looked at the
I can't get a response from the server on that endpoint. Maybe it's already supposed to be working during prompt processing, in which case there's probably a bug.
It's not a bug. Prompt processing blocks the main loop during a batch iteration. You can reduce the batch size. We also have in mind to better split concurrent prompt processing fairly. More info in:
Ok, so decreasing the batch size allows the server to respond on that endpoint between batches during prompt processing, but
Which metrics do you want to see?
The current response JSON contains these metrics:
[
  {
    "next_token": {
      "n_remain": -1,
      "n_decoded": 0,
      ...
    },
    ...
  }
]
During prompt processing, these stay at their default values of -1 and 0, and during token generation, they both get updated as the tokens get generated, so they add up to the value of
From my understanding of batch processing, this information is not knowable (though it's possible I'm misunderstanding something). During prompt processing, the prompt is split into batches of
But it might still be possible to get an estimate of the progress within a ubatch with some heuristic based on how many nodes in the compute graph have been computed compared to the total node count of the graph, though I don't know whether that information can be extracted at all, and whether it can be done reliably for all backends. Maybe there's a way. But if what you're asking for is progress at batch granularity, that should be easier.
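As a minimal sketch of that batch-granularity idea (this is not the existing server code; the report_progress callback and the loop structure are assumptions for illustration, while llama_batch_init, llama_decode, and llama_batch_free are the real public API):

```cpp
// Sketch only: process a prompt in chunks of n_batch tokens and report coarse
// progress after each llama_decode() call. Progress resolution is n_batch tokens,
// because nothing can be reported while a single batch is being decoded.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

#include "llama.h"

static bool process_prompt_with_progress(
        llama_context * ctx,
        const std::vector<llama_token> & tokens,
        int32_t n_batch,
        const std::function<void(float)> & report_progress) { // hypothetical callback
    const int32_t n_prompt = (int32_t) tokens.size();

    llama_batch batch = llama_batch_init(n_batch, 0, 1);

    bool ok = true;
    for (int32_t i = 0; i < n_prompt && ok; i += n_batch) {
        const int32_t n_eval = std::min(n_batch, n_prompt - i);

        batch.n_tokens = n_eval;
        for (int32_t j = 0; j < n_eval; ++j) {
            batch.token   [j]    = tokens[i + j];
            batch.pos     [j]    = i + j;
            batch.n_seq_id[j]    = 1;
            batch.seq_id  [j][0] = 0;
            batch.logits  [j]    = false;
        }
        // only the last token of the whole prompt needs logits
        batch.logits[n_eval - 1] = (i + n_eval == n_prompt);

        ok = llama_decode(ctx, batch) == 0;

        // a progress endpoint could read this value from another thread
        report_progress((float) (i + n_eval) / n_prompt);
    }

    llama_batch_free(batch);
    return ok;
}
```

With this structure, a progress endpoint would only see the value change once per decoded batch, which matches the earlier observation that the server cannot respond while a single batch is being processed.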
Maybe the cb_eval approach on the server can also help:
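The hook being referred to does exist in the public API: llama_context_params has cb_eval and cb_eval_user_data fields using the ggml_backend_sched_eval_callback type. Below is a rough sketch of how it could be used to count evaluated graph nodes; the eval_progress bookkeeping and the idea of estimating a total node count are assumptions, not existing server code:

```cpp
// Sketch only: count computed graph nodes via the eval callback, so a progress
// endpoint could expose how far the currently running ubatch has gotten.
#include <atomic>
#include <cstdint>

#include "ggml.h"
#include "llama.h"

struct eval_progress {
    std::atomic<int32_t> n_nodes_done{0};
    int32_t              n_nodes_total = 0; // would have to be estimated per graph (assumption)
};

// ggml_backend_sched_eval_callback semantics:
//   ask == true  -> return whether we want to observe this tensor once it is computed
//   ask == false -> the tensor has been computed; return false to abort evaluation
static bool progress_eval_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    auto * prog = (eval_progress *) user_data;

    if (ask) {
        return true; // observe every node (forces node-by-node evaluation, adding overhead)
    }

    (void) t;
    prog->n_nodes_done.fetch_add(1, std::memory_order_relaxed);
    return true; // keep evaluating
}

// Wiring it up when creating the context (cb_eval / cb_eval_user_data are real
// fields of llama_context_params):
//
//   eval_progress progress;
//   llama_context_params cparams = llama_context_default_params();
//   cparams.cb_eval           = progress_eval_cb;
//   cparams.cb_eval_user_data = &progress;
//
// A progress endpoint could then report n_nodes_done against an estimated
// n_nodes_total for the graph of the current ubatch.
```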
Feature Description
It would be nice to have an endpoint on the server example to fetch information about the progress of an ongoing prompt processing. It could return something like this:
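(The field names below are purely illustrative; this is only a sketch of the kind of payload such an endpoint might return, not a settled format.)

```json
{
  "slot_id": 0,
  "processing_prompt": true,
  "n_prompt_tokens": 8192,
  "n_prompt_tokens_processed": 2048,
  "progress": 0.25
}
```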
Motivation
For longer prompts, or when the processing speed is very slow, it would be nice to get a clue about the progress of the prompt processing. This would possibly also be useful for other projects, not just the server.
Possible Implementation
I haven't yet looked too deeply into the current server implementation, so I can't really tell how this would work, but I imagine it would require some deeper changes in the backend too.
I did add a similar feature on a very old project based on an ancient version of llama.cpp a year ago: stduhpf/fastLLaMa@1ebd5ba. This is now very much outdated, but the feature was nice to have.