
feat: Add Get Token Metrics to GRPC server #3687

Merged: 3 commits into mudler:master on Oct 1, 2024

Conversation

@siddimore (Contributor) commented Sep 29, 2024

Description

This PR adds a gRPC method that exposes some of the token metrics.

TODO
Add tests

Signed commits

  • Yes, I signed my commits.

Signed-off-by: Siddharth More <siddimore@gmail.com>
netlify bot commented Sep 29, 2024

Deploy Preview for localai ready!

Latest commit: b44218f
Latest deploy log: https://app.netlify.com/sites/localai/deploys/66fad364e93a7c0008058471
Deploy Preview: https://deploy-preview-3687--localai.netlify.app

@siddimore changed the title from "(feat) Add Get Token Metrics to GRPC server" to "feat: Add Get Token Metrics to GRPC server" on Sep 29, 2024
// Define the empty request
message MetricsRequest {}

message MetricsResponse {
Contributor Author (siddimore):

What other metrics would be good to expose?

Owner (mudler):

Maybe not related to the processing request, but it could also be useful to expose info like how many parallel requests can be served.
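
As a purely illustrative sketch of the kind of data such a response could carry, including the parallel-capacity idea above, here is a hypothetical shape expressed as a Go struct. None of these field names are taken from the PR's actual MetricsResponse definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TokenMetrics is a hypothetical shape for the data a MetricsResponse
// could expose; the field names here are assumptions, not the PR's fields.
type TokenMetrics struct {
	SlotID                int     `json:"slot_id"`                 // llama.cpp slot the metrics refer to
	PromptTokensProcessed int     `json:"prompt_tokens_processed"` // tokens evaluated for the prompt
	TokensGenerated       int     `json:"tokens_generated"`        // completion tokens produced
	TokensPerSecond       float64 `json:"tokens_per_second"`       // generation throughput
	ParallelSlots         int     `json:"parallel_slots"`          // how many requests can be served concurrently
}

func main() {
	m := TokenMetrics{SlotID: 0, PromptTokensProcessed: 128, TokensGenerated: 256, TokensPerSecond: 42.5, ParallelSlots: 4}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```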

@mudler (Owner) commented Sep 29, 2024

Hmm, I wonder if it would make sense to also add an HTTP REST route to leverage this, but that shouldn't block merging this PR.

@@ -374,3 +374,21 @@ func (c *Client) Rerank(ctx context.Context, in *pb.RerankRequest, opts ...grpc.
client := pb.NewBackendClient(conn)
return client.Rerank(ctx, in, opts...)
}

func (c *Client) GetTokenMetrics(ctx context.Context, in *pb.MetricsRequest, opts ...grpc.CallOption) (*pb.MetricsResponse, error) {
if !c.parallel {
Owner (mudler):

Code doesn't look formatted here - might need a go fmt

Contributor Author (siddimore):

forgot to fix this
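
For reference, a minimal sketch of how the new RPC might be exercised directly against a running backend. The import path, the backend address, and the generated stub's RPC name (shown here as GetMetrics) are assumptions, not taken from the diff; only NewBackendClient, MetricsRequest, and MetricsResponse appear in the PR itself.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Assumed import path for the generated protobufs; adjust to the repo layout.
	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Dial a running backend; the address is an assumption for this sketch.
	conn, err := grpc.Dial("127.0.0.1:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := pb.NewBackendClient(conn)
	// MetricsRequest is empty (see the proto snippet above). The RPC name
	// "GetMetrics" is an assumption; the real name is whatever the PR's
	// proto change declares.
	resp, err := client.GetMetrics(ctx, &pb.MetricsRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("metrics: %+v\n", resp)
}
```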

@siddimore (Contributor) commented Sep 29, 2024

> Hmm, I wonder if it would make sense to also add an HTTP REST route to leverage this, but that shouldn't block merging this PR.

@mudler yup, I already made those changes, I just didn't add them yet because I had to investigate a build issue.

@mudler (Owner) commented Sep 29, 2024

> Hmm, I wonder if it would make sense to also add an HTTP REST route to leverage this, but that shouldn't block merging this PR.
>
> @mudler yup, I already made those changes, I just didn't add them yet because I had to investigate a build issue.

gotcha 👍 changes look good here otherwise!

mudler previously approved these changes Sep 29, 2024
Signed-off-by: Siddharth More <siddimore@gmail.com>
@siddimore (Contributor) commented:

> Hmm, I wonder if it would make sense to also add an HTTP REST route to leverage this, but that shouldn't block merging this PR.

@mudler I added a route to the HTTP endpoints. I will update the tests and have this PR ready for review this week.

@siddimore marked this pull request as ready for review on September 30, 2024, 16:35
@mudler added the "enhancement" (New feature or request) label on Oct 1, 2024
@mudler merged commit f84b55d into mudler:master on Oct 1, 2024
29 checks passed
// @Summary Get TokenMetrics for Active Slot.
// @Accept json
// @Produce audio/x-wav
// @Success 200 {string} binary "generated audio/wav file"
Contributor:

binary "generated audio/wav file"?
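The annotation indeed looks copied from an audio endpoint. A corrected sketch for a JSON metrics route might read as follows; the response type name "schema.TokenMetricsResponse" is a placeholder, not a type from the PR.

```go
// Hypothetical corrected swag annotations for a JSON metrics endpoint;
// the response type name below is a placeholder.
//
// @Summary Get TokenMetrics for Active Slot.
// @Accept  json
// @Produce json
// @Success 200 {object} schema.TokenMetricsResponse "token metrics for the active slot"
// @Router  /v1/tokenMetrics [get]
```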

// @Success 200 {string} binary "generated audio/wav file"
// @Router /v1/tokenMetrics [get]
// @Router /tokenMetrics [get]
func TokenMetricsEndpoint(cl *config.BackendConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) func(c *fiber.Ctx) error {
Contributor:

That endpoint isn't used anywhere at all.

If it gets included, it needs to be called in parallel with a running slot to get any output, given the logic in the modified grpc-server.cpp, but I guess loadModel just waits until the slot is released, producing empty output every time.

@mintyleaf (Contributor) commented Jan 2, 2025

This is very unfinished, and the exposed Swagger API documentation makes it even more confusing.

To make it work, every request's statistics would need to be stored somewhere (we have an id field in the OpenAI response).

The less intrusive but hacky and unreliable alternative is to hit that route right after firing an inference request, so that the metrics call waits until the inference finishes and returns that additional data.
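
As a rough sketch of that hacky approach, assuming a LocalAI instance at localhost:8080, the /v1/tokenMetrics route added in this PR, and a placeholder model name; the exact request and response shapes are assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

const base = "http://localhost:8080" // assumed LocalAI address

func main() {
	// Fire an inference request in the background so a slot becomes busy.
	done := make(chan struct{})
	go func() {
		defer close(done)
		body := []byte(`{"model":"my-model","messages":[{"role":"user","content":"Hello"}]}`)
		resp, err := http.Post(base+"/v1/chat/completions", "application/json", bytes.NewReader(body))
		if err != nil {
			log.Println("inference request failed:", err)
			return
		}
		defer resp.Body.Close()
		io.Copy(io.Discard, resp.Body)
	}()

	// Immediately query the metrics route; per the discussion above it waits
	// for the active slot to finish. Note the race between the two requests,
	// which is part of why this approach is unreliable.
	resp, err := http.Get(base + "/v1/tokenMetrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	metrics, _ := io.ReadAll(resp.Body)
	fmt.Println("token metrics:", string(metrics))

	<-done
}
```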

llama_client_slot* get_active_slot() {
    for (llama_client_slot& slot : slots) {
        // Check if the slot is currently processing
        if (slot.is_processing()) {
Contributor:

If we replace that with slot.available(), then as long as there has been at least one inference, we get the statistics for the last call.

If not, we get a 'Server error error="json: unsupported value: NaN"'.

@mintyleaf (Contributor) commented:

@mudler
We already have all the needed data in the backend, ready to be packed into the message.

Maybe we could introduce a request flag that adds some statistics to the response, as just an addition to the OpenAI response?

For my goals, I need the token speed and the precise inference run time after each request, for multiple nodes running in federated mode.

A call to the tokenMetrics endpoint can be routed to the wrong machine after each request, which is not reliable at all.

What do you think about that?
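
As a purely hypothetical illustration of that proposal (neither the request flag nor the metrics block exists in LocalAI; all names below are placeholders), the opt-in could look like this:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical request flag; this field does not exist in LocalAI today.
type ChatRequest struct {
	Model          string `json:"model"`
	IncludeMetrics bool   `json:"include_metrics,omitempty"` // proposed opt-in flag
}

// Hypothetical per-request metrics attached to the usual OpenAI-style response.
type UsageMetrics struct {
	TokensPerSecond float64 `json:"tokens_per_second"`
	InferenceMs     int64   `json:"inference_ms"`
}

type ChatResponse struct {
	ID      string        `json:"id"`
	Object  string        `json:"object"`
	Metrics *UsageMetrics `json:"metrics,omitempty"` // present only when requested
}

func main() {
	resp := ChatResponse{
		ID:      "chatcmpl-123",
		Object:  "chat.completion",
		Metrics: &UsageMetrics{TokensPerSecond: 35.2, InferenceMs: 842},
	}
	out, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(out))
}
```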
