
kafka: Add fetch plan and execute latency metric #13485

Merged
1 commit merged into dev on Nov 8, 2023

Conversation

@StephanDollberg (Member)

Adds a histogram metric to measure the time it takes to create the fetch
plan and execute it, i.e. a single fetch poll.

It approximates the time it takes to process the data in a fetch request
once the data is available.

I have separated it into two series: one tracking empty fetches and one
tracking non-empty fetches.

Further, the count of the histogram can be used to calculate the ratio of
fetch requests to polls like so:

```
sum(irate(vectorized_kafka_handler_requests_completed_total{...,
handler="fetch"}[$__rate_interval])) by ($aggr_criteria) /
sum(irate(vectorized_fetch_stats_plan_and_execute_latency_us_count{...}[$__rate_interval])) by
($aggr_criteria)
```
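The same calculation the PromQL query performs can be sketched outside Prometheus. This is not Redpanda code; the sample values below are hypothetical, and `per_second_rate` mimics what `irate` does with the last two samples of a counter:

```python
# Sketch (hypothetical values, not from the PR): the ratio of fetch
# requests to polls, computed from two counter samples taken dt seconds
# apart, as the PromQL query does with irate().

def per_second_rate(prev: float, curr: float, dt: float) -> float:
    """irate-style instantaneous rate between two counter samples."""
    return (curr - prev) / dt

# Counter samples 10s apart: fetch-handler requests completed, and the
# _count series of the plan-and-execute latency histogram (one increment
# per poll).
requests_rate = per_second_rate(prev=10_000, curr=10_370, dt=10)  # 37 req/s
polls_rate = per_second_rate(prev=50_000, curr=51_000, dt=10)     # 100 polls/s

ratio = requests_rate / polls_rate
print(round(ratio, 2))  # 0.37 fetch requests completed per poll
```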

Looking at some scenarios, we get the following values:

  • 500MB/s, 4P/4C, 288P, ~110k batch, 1ms debounce: ~0.37
  • 500MB/s, 4P/4C, 288P, ~110k batch, 10ms debounce: ~0.66
  • 125MB/s, 8kP/8kC, 40k partitions, 1ms debounce: ~0.012
  • 125MB/s, 8kP/8kC, 40k partitions, 10ms debounce: ~0.035
  • 125MB/s, 8kP/8kC, 40k partitions, 100ms debounce: ~0.24
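One way to read these numbers (my interpretation, not stated in the PR): since the ratio is requests completed per poll, its reciprocal is roughly how many polls a single fetch request spans.

```python
# Sketch: interpreting the measured request/poll ratios from the PR
# description as polls per fetch request (the reciprocal).
scenarios = {
    "500MB/s, 288P, 1ms debounce": 0.37,
    "500MB/s, 288P, 10ms debounce": 0.66,
    "125MB/s, 40k partitions, 1ms debounce": 0.012,
    "125MB/s, 40k partitions, 100ms debounce": 0.24,
}
for name, ratio in scenarios.items():
    # e.g. ratio 0.37 -> ~2.7 polls per request; 0.012 -> ~83.3
    print(f"{name}: ~{1 / ratio:.1f} polls per fetch request")
```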

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Improvements

  • Adds a metric to track fetch plan and execute latency

@ballard26 (Contributor) previously approved these changes Nov 2, 2023

LGTM. Are we planning on removing or renaming the existing kafka_latency_fetch_latency_us metric, which records per-shard latency for each poll?

@StephanDollberg (Member, Author)

> LGTM. Are we planning on removing or renaming the existing kafka_latency_fetch_latency_us metric, which records per-shard latency for each poll?

That is my long-term goal, yes.

@StephanDollberg force-pushed the stephan/fetch-plan-and-execute-latency branch from 37de022 to 97a0e9d on November 6, 2023
@travisdowns (Member) left a comment

LGTM and thanks for the example numbers in the patch description: very useful!

@StephanDollberg force-pushed the stephan/fetch-plan-and-execute-latency branch from 97a0e9d to 220b11e on November 7, 2023
@StephanDollberg (Member, Author)

The CI failure is #14254.

@dotnwat (Member) left a comment

as always, amazing commit messages @StephanDollberg

@piyushredpanda merged commit 3f361a0 into dev on Nov 8, 2023 (30 of 32 checks passed)
@piyushredpanda deleted the stephan/fetch-plan-and-execute-latency branch on November 8, 2023
@vbotbuildovich (Collaborator)

/backport v23.2.x

@vbotbuildovich (Collaborator)

/backport v23.1.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to the v23.1.x branch. I tried:

```
git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-13485-v23.1.x-543 remotes/upstream/v23.1.x
git cherry-pick -x 220b11ed449bc553cd4c69b830b976f8d01db646
```

Workflow run logs.

@vbotbuildovich (Collaborator)

Failed to create a backport PR to the v23.2.x branch. I tried:

```
git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-13485-v23.2.x-8 remotes/upstream/v23.2.x
git cherry-pick -x 220b11ed449bc553cd4c69b830b976f8d01db646
```

Workflow run logs.
