GRPC metrics middleware implemented as a Tower Layer #3931

hdevalence · 2024-03-04T06:48:54Z

Is your feature request related to a problem? Please describe.

Recently, the public RPC fell into a degraded state because someone made too many requests to it. In this case, we identified the cause via an accidental backchannel. Had that not happened, however, we would have been unable to determine the cause, what kinds of requests we were receiving, or what was happening to them, because we only have metrics on the requests we already had a reason to care about.

Instead, we need to have generic GRPC metrics that work with any GRPC method.

Describe the solution you'd like

Scope and implement a Tower Layer that we can apply to our GRPC services. The trace middleware is probably a good reference implementation to study.

We should identify a few relevant metrics and then emit them with the GRPC method name as a metrics key. That would allow us to take cross-sections of each metric by RPC method and identify performance culprits.

Suggestions to get started:

Request count (allows computing rates)
Request latency (will require inspecting the request, this is ~easy using Tower)
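
To make the shape of this concrete, here is a minimal sketch of the wrapping pattern, not a proposed implementation. It assumes simplified, synchronous stand-ins for `tower::Layer` and `tower::Service` (the real `Service` trait is async and also has `poll_ready`), and every name in it (`MetricsLayer`, `MetricsService`, `Registry`, `Request`, `Echo`) is hypothetical. It records the two suggested metrics, request count and latency, keyed by the request path, which for gRPC has the form `/package.Service/Method`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Simplified, synchronous stand-in for `tower::Service`.
/// The real trait is async and also has `poll_ready`.
pub trait Service<Req> {
    type Response;
    fn call(&mut self, req: Req) -> Self::Response;
}

/// Simplified stand-in for `tower::Layer`.
pub trait Layer<S> {
    type Service;
    fn layer(&self, inner: S) -> Self::Service;
}

/// Hypothetical request type carrying only the gRPC path,
/// e.g. "/package.Service/Method".
pub struct Request {
    pub path: String,
}

/// Per-method counters: request count and summed latency.
#[derive(Default, Debug, Clone)]
pub struct MethodStats {
    pub count: u64,
    pub total_latency: Duration,
}

/// In-memory registry keyed by method path (an assumption for this sketch;
/// a real layer would emit to whatever metrics backend is in use).
pub type Registry = Arc<Mutex<HashMap<String, MethodStats>>>;

/// Layer that wraps any service with generic per-method metrics.
#[derive(Clone, Default)]
pub struct MetricsLayer {
    registry: Registry,
}

impl MetricsLayer {
    pub fn new(registry: Registry) -> Self {
        Self { registry }
    }
}

impl<S> Layer<S> for MetricsLayer {
    type Service = MetricsService<S>;

    fn layer(&self, inner: S) -> Self::Service {
        MetricsService { inner, registry: self.registry.clone() }
    }
}

/// Middleware service produced by `MetricsLayer`.
pub struct MetricsService<S> {
    inner: S,
    registry: Registry,
}

impl<S> Service<Request> for MetricsService<S>
where
    S: Service<Request>,
{
    type Response = S::Response;

    fn call(&mut self, req: Request) -> Self::Response {
        // Capture the method name before the request is moved into the inner service.
        let method = req.path.clone();
        let start = Instant::now();
        let response = self.inner.call(req);
        let elapsed = start.elapsed();

        // Record count and latency under the method name.
        let mut stats = self.registry.lock().unwrap();
        let entry = stats.entry(method).or_default();
        entry.count += 1;
        entry.total_latency += elapsed;
        response
    }
}

/// Trivial inner service, used only to exercise the layer.
pub struct Echo;

impl Service<Request> for Echo {
    type Response = String;

    fn call(&mut self, req: Request) -> String {
        req.path
    }
}
```

Because the layer is generic over the inner service, wrapping a service with `MetricsLayer::new(registry.clone()).layer(Echo)` requires no knowledge of which methods exist; each distinct path shows up in the registry the first time it is called. With the real Tower traits, the latency measurement would need to wrap the returned future rather than a synchronous call.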

Ideas for later:

Some way to attribute load to the request
Bandwidth (related to above)

We should implement this specifically as a Tower Layer rather than spending time adding additional specific metrics; it will be a bit more work upfront but will have much better results long term.

conorsch · 2024-05-06T20:20:36Z

Haven't made progress on this lately, but I did sit down with @cratelyn a few weeks ago for a pairing session, and documented the state of play in a branch: https://github.com/penumbra-zone/penumbra/tree/tonic-metrics-spike. I still view this work as must-have, but in the immediate near term I'm going to prioritize testing (#4323) and release (#4325).

hdevalence assigned conorsch Mar 4, 2024

github-project-automation bot added this to Penumbra Mar 4, 2024

github-project-automation bot moved this to 🗄️ Backlog in Penumbra Mar 4, 2024

cratelyn added the A-telemetry Area: Metrics, logging, and other observability-related features label Mar 4, 2024

cratelyn added this to the Sprint 4 milestone Apr 8, 2024

cratelyn moved this from Backlog to Todo in Penumbra Apr 8, 2024

aubrika added the _P-V2 Priority: after mainnet label Apr 17, 2024

cratelyn modified the milestones: Sprint 4, Sprint 5 Apr 22, 2024

aubrika added the friction something made this fall into the following milestone & the reason should be noted in a comment label May 6, 2024

aubrika modified the milestones: Sprint 5, Sprint 6 May 6, 2024

aubrika removed this from the Sprint 6 milestone May 6, 2024

cratelyn removed the friction something made this fall into the following milestone & the reason should be noted in a comment label May 6, 2024
