Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GRPC metrics middleware implemented as a Tower Layer #3931

Open
hdevalence opened this issue Mar 4, 2024 · 1 comment
Open

GRPC metrics middleware implemented as a Tower Layer #3931

hdevalence opened this issue Mar 4, 2024 · 1 comment
Assignees
Labels
A-telemetry Area: Metrics, logging, and other observability-related features _P-V2 Priority: after mainnet

Comments

@hdevalence
Copy link
Member

Is your feature request related to a problem? Please describe.

Recently, the public RPC fell into a degraded state as someone made too many requests to it. In this case, we identified the cause via an accidental backchannel. However, had that not happened we would have been totally unable to determine what the cause was, what kind of requests we were getting, and what was happening to them, because we only have metrics on the requests we already had a reason to care about.

Instead, we need to have generic GRPC metrics that work with any GRPC method.

Describe the solution you'd like

Scope and implement a tower Layer that we can apply to our GRPC services. The trace middleware is probably a good reference implementation to study.

We should identify a few relevant metrics and then emit them with the GRPC method name as a metrics key. That would allow us to take cross-sections of each metric by rpc method and identify performance culprits.

Suggestions to get started:

  • Request count (allows computing rates)
  • Request latency (will require inspecting the request, this is ~easy using Tower)

Ideas for later:

  • Some way to attribute load to the request
  • Bandwidth (related to above)

We should implement this specifically as a Tower Layer rather than spending time adding additional specific metrics; it will be a bit more work upfront but will have much better results long term.

@github-project-automation github-project-automation bot moved this to 🗄️ Backlog in Penumbra Mar 4, 2024
@cratelyn cratelyn added the A-telemetry Area: Metrics, logging, and other observability-related features label Mar 4, 2024
@cratelyn cratelyn added this to the Sprint 4 milestone Apr 8, 2024
@cratelyn cratelyn moved this from Backlog to Todo in Penumbra Apr 8, 2024
@aubrika aubrika added the _P-V2 Priority: after mainnet label Apr 17, 2024
@cratelyn cratelyn modified the milestones: Sprint 4, Sprint 5 Apr 22, 2024
@aubrika aubrika added the friction something made this fall into the following milestone & the reason should be noted in a comment label May 6, 2024
@aubrika aubrika modified the milestones: Sprint 5, Sprint 6 May 6, 2024
@conorsch
Copy link
Contributor

conorsch commented May 6, 2024

Haven't made progress on this lately, but did sit down with @cratelyn a few weeks ago for a pairing session, and documented the state of play in a branch: https://github.com/penumbra-zone/penumbra/tree/tonic-metrics-spike I still view this work as must-have, but in the immediate near-term, I'm going to prioritize testing (#4323) and release (#4325).

@aubrika aubrika removed this from the Sprint 6 milestone May 6, 2024
@cratelyn cratelyn removed the friction something made this fall into the following milestone & the reason should be noted in a comment label May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-telemetry Area: Metrics, logging, and other observability-related features _P-V2 Priority: after mainnet
Projects
Status: Todo
Development

No branches or pull requests

4 participants