feat: Add Get Token Metrics to GRPC server #3687
Conversation
Signed-off-by: Siddharth More <siddimore@gmail.com>
// Define the empty request
message MetricsRequest {}

message MetricsResponse {
What other metrics would be good to expose here?
Maybe not related to the processing request, but it could also be useful to expose info such as how many parallel requests can be served.
mmm, I wonder if it would also make sense to add an HTTP REST route to leverage this, but that shouldn't block merging this PR.
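To make the HTTP suggestion concrete, here is a minimal Fiber sketch of what such a route could look like; the route name and the stubbed payload are assumptions for illustration, not what this PR ends up wiring in:

package main

import "github.com/gofiber/fiber/v2"

func main() {
	app := fiber.New()
	// Hypothetical route: a real handler would ask the active backend for its
	// pb.MetricsResponse over gRPC and marshal it, instead of returning a stub.
	app.Get("/v1/tokenMetrics", func(c *fiber.Ctx) error {
		return c.JSON(fiber.Map{"tokens_per_second": 0})
	})
	app.Listen(":8080")
}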
@@ -374,3 +374,21 @@ func (c *Client) Rerank(ctx context.Context, in *pb.RerankRequest, opts ...grpc.
	client := pb.NewBackendClient(conn)
	return client.Rerank(ctx, in, opts...)
}

func (c *Client) GetTokenMetrics(ctx context.Context, in *pb.MetricsRequest, opts ...grpc.CallOption) (*pb.MetricsResponse, error) {
	if !c.parallel {
The code doesn't look formatted here; it might need a go fmt pass.
Forgot to fix this.
@mudler yup, I already made those changes; I just haven't pushed them yet because I had to investigate a build issue.
gotcha 👍 changes look good here otherwise!
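For reference, a rough sketch of how the formatted client wrapper might look, mirroring the Rerank pattern quoted above; the mutex field, address field, and dial options are assumptions based on that pattern rather than the exact code in this PR:

func (c *Client) GetTokenMetrics(ctx context.Context, in *pb.MetricsRequest, opts ...grpc.CallOption) (*pb.MetricsResponse, error) {
	if !c.parallel {
		// Serialize the call when the backend cannot serve parallel requests
		// (assumed lock field, following the other wrapper methods).
		c.opMutex.Lock()
		defer c.opMutex.Unlock()
	}
	// Dial the backend and invoke the generated gRPC method (name assumed to
	// mirror the wrapper), as the other methods in this file do.
	conn, err := grpc.Dial(c.address, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	client := pb.NewBackendClient(conn)
	return client.GetTokenMetrics(ctx, in, opts...)
}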
Signed-off-by: Siddharth More <siddimore@gmail.com>
@mudler added a route to the HTTP endpoints. I will update the tests and have this PR ready for review this week.
// @Summary Get TokenMetrics for Active Slot.
// @Accept json
// @Produce audio/x-wav
// @Success 200 {string} binary "generated audio/wav file"
binary "generated audio/wav file"?
// @Success 200 {string} binary "generated audio/wav file"
// @Router /v1/tokenMetrics [get]
// @Router /tokenMetrics [get]
func TokenMetricsEndpoint(cl *config.BackendConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) func(c *fiber.Ctx) error {
That endpoint isn't used anywhere at all.
If it gets included, it needs to be called in parallel with a running slot to get any output, given the logic in the modified grpc-server.cpp; but I guess loadModel just waits until the slot is released, producing empty output every time.
This is very unfinished, and the exposed Swagger API documentation makes it even more confusing. To make it work, every request would need to be stored somewhere together with its statistics (we have an id field in the OpenAI response). The less intrusive but hacky and unreliable alternative is to fire that route right after firing the inference request, so the metrics call waits until the inference finishes and returns that additional data.
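To illustrate that second, "hacky" option, here is a minimal client-side sketch; the server address and route are assumptions, and in a federated setup the follow-up request may land on a different node, which is exactly why it is unreliable:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	base := "http://localhost:8080" // assumed LocalAI address

	// 1. Fire the inference request here (omitted for brevity), e.g. a POST
	//    to base + "/v1/chat/completions".

	// 2. Immediately afterwards, query the metrics route; per the logic
	//    described above, it returns once the running slot finishes.
	resp, err := http.Get(base + "/v1/tokenMetrics")
	if err != nil {
		fmt.Println("metrics request failed:", err)
		return
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}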
llama_client_slot* get_active_slot() {
    for (llama_client_slot& slot : slots) {
        // Check if the slot is currently processing
        if (slot.is_processing()) {
If we replace that with slot.available(), then if there has been at least one inference we get statistics for the last call; if not, we get a 'Server error error="json: unsupported value: NaN"'.
@mudler maybe we can introduce a request flag that adds some statistics to the response, as a simple addition to the OpenAI response? For my goals I need the token speed and the precise inference run time after each request, across multiple nodes running in federated mode; with the tokenMetrics endpoint the call can be routed to the wrong machine after each request, which is not reliable at all. What do you think about that?
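As a rough illustration of that proposal, the per-request statistics could ride along with the usual OpenAI-style response whenever a request flag is set; every name below is hypothetical and does not exist in LocalAI today:

package metricsdemo

// TokenMetrics is a hypothetical block returned only when the request sets a
// flag such as "return_token_metrics".
type TokenMetrics struct {
	PredictedTokens     int     `json:"predicted_tokens"`
	TokensPerSecond     float64 `json:"tokens_per_second"`
	InferenceTimeMillis int64   `json:"inference_time_ms"`
}

// Sketch of how the OpenAI-style response could carry one optional extra field.
type ChatCompletionResponse struct {
	ID      string        `json:"id"`
	Object  string        `json:"object"`
	Created int64         `json:"created"`
	Model   string        `json:"model"`
	Metrics *TokenMetrics `json:"token_metrics,omitempty"`
}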
Description
This PR creates a gRPC method to expose some of the token metrics.
TODO
Add tests
Signed commits