
proxy+pageserver: shared leaky bucket impl #8539

Merged 21 commits from pageserver-leaky-bucket into main on Aug 29, 2024
Conversation

@conradludgate (Contributor) commented Jul 29, 2024

In proxy I switched to a leaky-bucket impl using the GCRA algorithm. I figured I could share the code with pageserver and replace the leaky_bucket crate dependency with some very basic tokio timers and queues for fairness.

How the underlying algorithm works should be fairly clear from the comments I have left in the code.
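
For orientation, here is a minimal sketch of the GCRA bookkeeping (simplified, with hypothetical names; not the exact code in this PR). Per bucket, GCRA tracks a single instant, the time at which the bucket would drain empty, and admits a request iff adding the request's cost keeps that instant within one bucket width of now:

```rust
use std::time::{Duration, Instant};

/// Static parameters, shared by every bucket enforcing the same limit.
struct LeakyBucketConfig {
    /// Time cost of a single token: 1 / rate.
    cost: Duration,
    /// Burst capacity expressed as a duration (capacity * cost).
    bucket_width: Duration,
}

/// Per-bucket state: GCRA needs only a single instant.
struct LeakyBucketState {
    /// Theoretical instant at which the bucket drains empty.
    empty_at: Instant,
}

impl LeakyBucketState {
    /// Try to take `tokens` tokens; on failure, return the earliest retry time.
    fn try_acquire(
        &mut self,
        config: &LeakyBucketConfig,
        tokens: u32,
        now: Instant,
    ) -> Result<(), Instant> {
        // A drain point in the past just means the bucket is empty now.
        let empty_at = self.empty_at.max(now);
        let new_empty_at = empty_at + config.cost * tokens;
        // Admit iff the new drain point stays within one bucket width of now.
        if new_empty_at.duration_since(now) <= config.bucket_width {
            self.empty_at = new_empty_at;
            Ok(())
        } else {
            Err(new_empty_at - config.bucket_width)
        }
    }
}
```

Note that under this check a request costing more than one `bucket_width` can never be admitted; that corner case comes up in the discussion below.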


In benchmarking pageserver, @problame found that the new implementation fixes a getpage throughput discontinuity in pageserver under the `pagebench get-page-latest-lsn` benchmark with the clickbench dataset (`test_perf_olap.py`).
The discontinuity is that for any of `--num-clients={2,3,4}`, getpage throughput remains at 10k.
With `--num-clients=5` and greater, getpage throughput then jumps to the configured 20k rate limit.
With the changes in this PR, the discontinuity is gone, and throughput scales linearly with `--num-clients` up to the configured rate limit.

More context in https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641.

closes https://github.com/neondatabase/cloud/issues/16886

@conradludgate conradludgate changed the title from "Pageserver leaky bucket" to "proxy+pageserver: shared leaky bucket impl" Jul 29, 2024
@conradludgate conradludgate changed the base branch from proxy-leaky-bucket-gcra to main July 29, 2024 13:51
github-actions bot commented Jul 29, 2024

3853 tests run: 3747 passed, 0 failed, 106 skipped (full report)


Flaky tests (7) across Postgres 16, 15, and 14

Code coverage* (full report)

  • functions: 32.5% (7408 of 22766 functions)
  • lines: 50.7% (60163 of 118653 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
2aa98e3 at 2024-08-29T11:36:07.181Z

@conradludgate conradludgate marked this pull request as ready for review July 29, 2024 16:58
@conradludgate conradludgate requested review from a team as code owners July 29, 2024 16:58
@conradludgate conradludgate requested review from khanova and skyzh July 29, 2024 16:58
@conradludgate (Contributor, Author) commented Jul 29, 2024

Upon testing, it seems the current leaky-bucket crate supports waiting for more tokens than can ever fit in the bucket. This isn't working properly with my impl just yet: it technically still works, but it first resets the bucket to be empty, which breaks the time tracking, so it ends up waiting longer than it should.

I'm not sure if this feature is important for pageserver.

@koivunej (Member)

> I'm not sure if this feature is important for pageserver.

I think all of the waits are for one token, so no.

@conradludgate (Contributor, Author)

It should now be 100% compatible: it tracks the start time and only adjusts the empty-bucket position based on that start time.
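
A sketch of the shape of such a fix, extending the one above (`reserve` is a hypothetical name, not necessarily this PR's API): commit the full cost up front, anchored at the instant the wait started, and return when the reservation matures. Because `empty_at` advances from the fixed start time instead of being reset to 'now' on each retry, a request larger than the bucket waits proportionally longer but never loses the time it has already waited:

```rust
impl LeakyBucketState {
    /// Commit `tokens` immediately; return the instant the caller may proceed.
    /// `start` is when the caller began waiting, not "now" at retry time.
    fn reserve(
        &mut self,
        config: &LeakyBucketConfig,
        tokens: u32,
        start: Instant,
    ) -> Instant {
        // The bucket cannot have drained before the wait began.
        let empty_at = self.empty_at.max(start);
        self.empty_at = empty_at + config.cost * tokens;
        // The reservation matures once the drain point is back within one
        // bucket width; for oversized requests this may be several bucket
        // widths after `start`, but the debt is only paid once.
        self.empty_at
            .checked_sub(config.bucket_width)
            .unwrap_or(start)
    }
}
```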

conradludgate and others added 4 commits July 30, 2024 22:01
…er-leaky-bucket/rebase

Conflicts:
	Cargo.lock
	proxy/src/rate_limiter.rs
	proxy/src/rate_limiter/leaky_bucket.rs

Due to split-up of leaky_bucket.rs, the merge would have lost the
(minor) changes that were made to proxy/src/rate_limiter/leaky_bucket.rs
since #8539 was created.

Backported them manually, see the commits in the first parent.

git log -p 2416da3..origin/main -- proxy/src/rate_limiter/
@problame (Contributor)

I just resolved the conflicts by merging from main.

See commit message for how I resolved conflicts: 5b9d371

@problame (Contributor)

Context why I'm looking into this: https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641

=> @conradludgate please review my recent pushes and let's get this merged.

Here's my proposed updated PR description: [EXISTING TEXT] plus the benchmark results and issue links that now appear in the PR description at the top.

@problame problame requested review from problame and removed request for khanova August 28, 2024 13:14
@problame (Contributor) left a comment

Reviewed libs/utils/src/leaky_bucket.rs. I didn't know GCRA before and only skimmed the Wikipedia article. Some comments:

Review threads (outdated, resolved):
  • pageserver/src/tenant/throttle.rs
  • libs/utils/src/leaky_bucket.rs (3 threads)
@problame (Contributor)

Call with Conrad:

  1. remove the discrete refill - pageserver doesn't rely on it
  2. make the queue mandatory (see the sketch after this list)
  3. move RateLimiter into the shared crate
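
On item 2, one way to get a mandatory, fair queue from plain tokio primitives (again a sketch building on the ones above, not necessarily the PR's exact approach): funnel all waiters through tokio's async `Mutex`, which queues waiters in FIFO order, commit the reservation while holding the lock, and sleep outside it:

```rust
use std::time::Instant;
use tokio::sync::Mutex;

struct RateLimiter {
    config: LeakyBucketConfig,
    // tokio's async Mutex queues waiters in FIFO order, so the oldest
    // waiter commits its reservation first.
    state: Mutex<LeakyBucketState>,
}

impl RateLimiter {
    async fn acquire(&self, tokens: u32) {
        let start = Instant::now();
        let ready_at = {
            // Hold the lock only long enough to commit the reservation.
            let mut state = self.state.lock().await;
            state.reserve(&self.config, tokens, start)
        };
        // Sleep outside the lock; returns immediately if already mature.
        tokio::time::sleep_until(ready_at.into()).await;
    }
}
```

Committing the reservation while holding the lock means no later waiter can jump the queue, and sleeping outside the lock keeps waiters from blocking one another.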

@problame (Contributor) left a comment

Did a perf test after the recent changes pushed by Conrad. All looking good.

Review thread (outdated, resolved): libs/pageserver_api/src/models.rs
@conradludgate conradludgate enabled auto-merge (squash) August 29, 2024 07:54
@cloneable (Contributor) left a comment

LGTM

Not sure why state and config are kept completely separate instead of keeping both inside a LeakyBucket type.

@conradludgate (Contributor, Author)

> Not sure why state and config are kept completely separate instead of keeping both inside a LeakyBucket type.

In proxy we have one config globally (32 bytes), then one state per endpoint (16 bytes), hence the split.
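
In concrete terms, a hypothetical illustration of that split (`EndpointId` is made up here; the byte counts match the types from the earlier sketch on a 64-bit Linux target, where `Duration` and `Instant` are 16 bytes each):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical endpoint key, purely for illustration.
type EndpointId = String;

struct EndpointRateLimiter {
    /// One config for the whole process: two Durations, 32 bytes.
    config: LeakyBucketConfig,
    /// One small state per endpoint: one Instant, 16 bytes each.
    states: Mutex<HashMap<EndpointId, LeakyBucketState>>,
}
```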

@conradludgate conradludgate merged commit a644f01 into main Aug 29, 2024
67 checks passed
@conradludgate conradludgate deleted the pageserver-leaky-bucket branch August 29, 2024 11:26
@MMeent (Contributor) commented Aug 29, 2024

> I think all of the waits are for one token, so no.

@koivunej this is incorrect; the vectored path uses as many tokens as the number of pages it is processing in that vectored request.

@conradludgate (Contributor, Author)

I fixed the issue regardless, so it should not matter.

@koivunej (Member)

I was replying on the basis of page_service requests, which I think are the only throttled ones. I think my answer still holds. Please let me know if that is wrong.

@Bodobolero (Contributor)

The dashboard still doesn't show the 44-second elapsed time for query-1 that we had before (see the green line in the benchmark history).

@problame (Contributor) commented Sep 2, 2024

@Bodobolero let's continue the discussion in the investigation issue https://github.com/neondatabase/cloud/issues/16886#issuecomment-2324742299
