
proxy+pageserver: shared leaky bucket impl #8539

Merged 21 commits from pageserver-leaky-bucket into main on Aug 29, 2024
Conversation

@conradludgate (Contributor) commented Jul 29, 2024

In proxy I switched to a leaky-bucket impl using the GCRA algorithm. I figured I could share the code with pageserver and replace the leaky_bucket crate dependency with some very basic tokio timers and queues for fairness.

How the underlying algorithm works should be fairly clear from the comments I have left in the code.
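
For orientation, here is a minimal sketch of the GCRA bookkeeping (simplified, with hypothetical names; not the exact code in this PR). Per bucket, GCRA tracks a single instant, the time at which the bucket would drain empty, and admits a request iff adding the request's cost keeps that instant within one bucket width of now:

```rust
use std::time::{Duration, Instant};

/// Static parameters, shared by every bucket enforcing the same limit.
struct LeakyBucketConfig {
    /// Time cost of a single token: 1 / rate.
    cost: Duration,
    /// Burst capacity expressed as a duration (capacity * cost).
    bucket_width: Duration,
}

/// Per-bucket state: GCRA needs only a single instant.
struct LeakyBucketState {
    /// Theoretical instant at which the bucket drains empty.
    empty_at: Instant,
}

impl LeakyBucketState {
    /// Try to take `tokens` tokens; on failure, return the earliest retry time.
    fn try_acquire(
        &mut self,
        config: &LeakyBucketConfig,
        tokens: u32,
        now: Instant,
    ) -> Result<(), Instant> {
        // A drain point in the past just means the bucket is empty now.
        let empty_at = self.empty_at.max(now);
        let new_empty_at = empty_at + config.cost * tokens;
        // Admit iff the new drain point stays within one bucket width of now.
        if new_empty_at.duration_since(now) <= config.bucket_width {
            self.empty_at = new_empty_at;
            Ok(())
        } else {
            Err(new_empty_at - config.bucket_width)
        }
    }
}
```

Note that under this check a request costing more than one `bucket_width` can never be admitted; that corner case comes up in the discussion below.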


In benchmarking pageserver, @problame found that the new implementation fixes a getpage throughput discontinuity in pageserver under the `pagebench get-page-latest-lsn` benchmark with the clickbench dataset (`test_perf_olap.py`).
The discontinuity is that for any of `--num-clients={2,3,4}`, getpage throughput remains at 10k.
With `--num-clients=5` and greater, getpage throughput then jumps to the configured 20k rate limit.
With the changes in this PR, the discontinuity is gone, and throughput scales linearly with `--num-clients` up to the configured rate limit.

More context in https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641.

closes https://github.com/neondatabase/cloud/issues/16886

@conradludgate conradludgate changed the title from "Pageserver leaky bucket" to "proxy+pageserver: shared leaky bucket impl" Jul 29, 2024
@conradludgate conradludgate changed the base branch from proxy-leaky-bucket-gcra to main July 29, 2024 13:51
github-actions bot commented Jul 29, 2024

3853 tests run: 3747 passed, 0 failed, 106 skipped (full report)


Flaky tests (7) across Postgres 16, 15, and 14

Code coverage* (full report)

  • functions: 32.5% (7408 of 22766 functions)
  • lines: 50.7% (60163 of 118653 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
2aa98e3 at 2024-08-29T11:36:07.181Z

@conradludgate conradludgate marked this pull request as ready for review July 29, 2024 16:58
@conradludgate conradludgate requested review from a team as code owners July 29, 2024 16:58
@conradludgate conradludgate requested review from khanova and skyzh July 29, 2024 16:58
@conradludgate (Contributor, Author) commented Jul 29, 2024

Upon testing, it seems the current leaky-bucket crate supports waiting for more tokens than can ever fit in the bucket. This isn't working properly with my impl just yet: it technically still works, but it first resets the bucket to be empty, which breaks the time tracking, so it ends up waiting longer than it should.

I'm not sure if this feature is important for pageserver.

@koivunej (Member)

> I'm not sure if this feature is important for pageserver.

I think all of the waits are for one token, so no.

@conradludgate (Contributor, Author)

It should now be 100% compatible: it tracks the start time and only adjusts the empty-bucket position based on that start time.
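
A sketch of the shape of such a fix, extending the one above (`reserve` is a hypothetical name, not necessarily this PR's API): commit the full cost up front, anchored at the instant the wait started, and return when the reservation matures. Because `empty_at` advances from the fixed start time instead of being reset to 'now' on each retry, a request larger than the bucket waits proportionally longer but never loses the time it has already waited:

```rust
impl LeakyBucketState {
    /// Commit `tokens` immediately; return the instant the caller may proceed.
    /// `start` is when the caller began waiting, not "now" at retry time.
    fn reserve(
        &mut self,
        config: &LeakyBucketConfig,
        tokens: u32,
        start: Instant,
    ) -> Instant {
        // The bucket cannot have drained before the wait began.
        let empty_at = self.empty_at.max(start);
        self.empty_at = empty_at + config.cost * tokens;
        // The reservation matures once the drain point is back within one
        // bucket width; for oversized requests this may be several bucket
        // widths after `start`, but the debt is only paid once.
        self.empty_at
            .checked_sub(config.bucket_width)
            .unwrap_or(start)
    }
}
```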

conradludgate and others added 4 commits July 30, 2024 22:01
…er-leaky-bucket/rebase

Conflicts:
	Cargo.lock
	proxy/src/rate_limiter.rs
	proxy/src/rate_limiter/leaky_bucket.rs

Due to split-up of leaky_bucket.rs, the merge would have lost the
(minor) changes that were made to proxy/src/rate_limiter/leaky_bucket.rs
since #8539 was created.

Backported them manually, see the commits in the first parent.

git log -p 2416da3..origin/main -- proxy/src/rate_limiter/
@problame (Contributor)

I just resolved the conflicts by merging from main.

See commit message for how I resolved conflicts: 5b9d371

@problame (Contributor)

Context why I'm looking into this: https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641

=> @conradludgate please review my recent pushes and let's get this merged.

Here's my proposed updated PR description: [EXISTING TEXT] plus the benchmark results and issue links that now appear in the PR description at the top.

@problame problame requested review from problame and removed request for khanova August 28, 2024 13:14
@problame (Contributor) left a comment

Reviewed libs/utils/src/leaky_bucket.rs. I didn't know GCRA before and only skimmed the Wikipedia article. Some comments:

Review threads (outdated, resolved):
  • pageserver/src/tenant/throttle.rs
  • libs/utils/src/leaky_bucket.rs (3 threads)
@problame (Contributor)

Call with Conrad:

  1. remove the discrete refill - pageserver doesn't rely on it
  2. make the queue mandatory (see the sketch after this list)
  3. move RateLimiter into the shared crate
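
On item 2, one way to get a mandatory, fair queue from plain tokio primitives (again a sketch building on the ones above, not necessarily the PR's exact approach): funnel all waiters through tokio's async `Mutex`, which queues waiters in FIFO order, commit the reservation while holding the lock, and sleep outside it:

```rust
use std::time::Instant;
use tokio::sync::Mutex;

struct RateLimiter {
    config: LeakyBucketConfig,
    // tokio's async Mutex queues waiters in FIFO order, so the oldest
    // waiter commits its reservation first.
    state: Mutex<LeakyBucketState>,
}

impl RateLimiter {
    async fn acquire(&self, tokens: u32) {
        let start = Instant::now();
        let ready_at = {
            // Hold the lock only long enough to commit the reservation.
            let mut state = self.state.lock().await;
            state.reserve(&self.config, tokens, start)
        };
        // Sleep outside the lock; returns immediately if already mature.
        tokio::time::sleep_until(ready_at.into()).await;
    }
}
```

Committing the reservation while holding the lock means no later waiter can jump the queue, and sleeping outside the lock keeps waiters from blocking one another.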

@problame (Contributor) left a comment

Did a perf test after the recent changes pushed by Conrad. All looking good.

Review thread (outdated, resolved): libs/pageserver_api/src/models.rs
@conradludgate conradludgate enabled auto-merge (squash) August 29, 2024 07:54
@cloneable (Contributor) left a comment

LGTM

Not sure why state and config are kept completely separate instead of keeping both inside a LeakyBucket type.

@conradludgate (Contributor, Author)

> Not sure why state and config are kept completely separate instead of keeping both inside a LeakyBucket type.

In proxy we have one config globally (32 bytes), then one state per endpoint (16 bytes), hence the split.
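
In concrete terms, a hypothetical illustration of that split (`EndpointId` is made up here; the byte counts match the types from the earlier sketch on a 64-bit Linux target, where `Duration` and `Instant` are 16 bytes each):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical endpoint key, purely for illustration.
type EndpointId = String;

struct EndpointRateLimiter {
    /// One config for the whole process: two Durations, 32 bytes.
    config: LeakyBucketConfig,
    /// One small state per endpoint: one Instant, 16 bytes each.
    states: Mutex<HashMap<EndpointId, LeakyBucketState>>,
}
```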

@conradludgate conradludgate merged commit a644f01 into main Aug 29, 2024
67 checks passed
@conradludgate conradludgate deleted the pageserver-leaky-bucket branch August 29, 2024 11:26
@MMeent (Contributor) commented Aug 29, 2024

> I think all of the waits are for one token, so no.

@koivunej this is incorrect; the vectored path uses as many tokens as the number of pages it is processing in that vectored request.

@conradludgate (Contributor, Author)

I fixed the issue regardless, so it should not matter.

@koivunej (Member)

I was replying on the basis of page_service requests, which I think are the only throttled ones. I think my answer still holds. Please let me know if that is wrong.

@Bodobolero (Contributor)

The dashboard still doesn't show the 44-second elapsed time for query-1 that we had before (see the green line in the benchmark history).

@problame (Contributor) commented Sep 2, 2024

@Bodobolero let's continue the discussion in the investigation issue https://github.com/neondatabase/cloud/issues/16886#issuecomment-2324742299
