Cancel-able reservations for ratelimiters #6030

Merged: 22 commits merged into cadence-workflow:master from ratelimiter-locked-limiter on May 28, 2024

Conversation

@Groxx (Member) commented May 20, 2024

Implements a fix for #6029, allowing normal ratelimiter usage to reserve tokens and be able to cancel them, assuming no racing claims on the limiter.

In the course of benchmarking this, I discovered what I believe to be some fundamental flaws with how *rate.Limiter works, which effectively require a wrapper like I've built here. So unless this has unsustainable latency or CPU costs, IMO it should be preferred over the underlying *rate.Limiter on correctness alone.

The good news is that latency seems entirely survivable, less than 2x slower in basically all cases (and usually much less).
Surprisingly, it's even slightly faster for one of our common uses (reserve -> cancel, used in multi-stage ratelimits, which I suspect we will use more heavily in the future).

What this wrapper does

A few core things:

  1. re-implements *rate.Limiter in terms of an interface, to allow mocks and wrappers (*rate.Limiter has Reserve() *rate.Reservation which can't be mocked in any reasonable way)
  2. implements ^ the interface while using a TimeSource so tests can resist CPU noise more easily (will also require a context-wrapper in some cases, which will be done later when needed. there's a simple prototype in this commit.)
  3. prevents *rate.Limiter-internal time from going backwards (leads to irrational behavior / violated limits)
  4. synchronously cancels any reservations that are not immediately available (more efficient, matches our intended use, and has much more correct throughput when contended)
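
For illustration, here is a minimal sketch of the interface shape items 1 and 2 describe. The exact identifiers (Ratelimiter, Reservation, and the method set) are assumptions for this sketch, not necessarily the PR's final API:

```go
package clock

import (
	"context"

	"golang.org/x/time/rate"
)

// Ratelimiter mirrors the subset of *rate.Limiter behavior used here, so it
// can be mocked and wrapped. (Illustrative sketch; names are assumptions.)
type Ratelimiter interface {
	Allow() bool
	Wait(ctx context.Context) error
	// Returning an interface instead of *rate.Reservation is what makes
	// reservations mockable.
	Reserve() Reservation
	Limit() rate.Limit
	Burst() int
}

// Reservation is a cancelable claim on a future token.
type Reservation interface {
	// Allow reports whether a token was immediately available when reserved.
	Allow() bool
	// Used marks the reservation as consumed (true) or rolls it back (false),
	// possibly restoring the token.
	Used(wasUsed bool)
}
```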

The flaw

Time travel does weird things.

Essentially, it's a combination of:

  • *rate.Limiter's use of time.Now() is not monotonic, despite time.Now() itself being monotonic: Now() is collected before the ratelimiter's mutex is held, so values may be arbitrarily delayed and out of order while waiting to acquire the mutex.
  • *rate.Limiter allows its internal time to advance and rewind by design, but it does not preserve enough information to e.g. rewind 1s and then advance 1s without affecting the next call.
  • as a mitigating factor on all this, *rate.Limiter only updates its last time when an event is allowed. this means ^ this essentially only applies if time skew is greater than your limiter's rate, especially if it's >=2x or other time-changing methods are called.

With fast benchmarks this is very easy to trigger, IRL... I'm unsure. Some of our limits are set to >100,000 per second, or <10µs of mutex-imposed delay tolerated. With a tight loop in a benchmark I regularly exceed 100µs of latency between locks even without parallel calls (zero contention), so I suspect it's not all that rare.

In any case, because it explicitly allows time to rewind and the simpler methods do not acquire the lock before gathering time.Now(), concurrent use can lead to the following behavior. Assuming the limit refills a token every 5 time-periods, and the calls acquire the mutex in the following global order (as observed through the limiter's mutex) but with their original T=time.Now() values:

  1. Allow() called at T0: successfully claims a token, emptying the token bucket.
    • The limiter's internal time, last, is now T0.
  2. Allow() called at T1: fails, 0 tokens available.
    • The limiter's internal time does not advance, nothing changes.
  3. Allow() called at T10: successfully claims a token.
    • The limiter's internal time advances from T0 to T10, and adds 2 tokens.
    • The 1 full token is claimed, the call is allowed, and 1 token remains after the call.
  4. Allow() called at T0: successfully claims a token.
    • The limiter's internal time advances from T10 to T0. No tokens have been added because the new time is lower.
    • There is 1 token in the bucket, so the call is allowed. 0 tokens remain.
  5. Allow() called at T5: successfully claims a token.
    • The limiter's internal time advances from T0 to T5, and adds a token to the bucket.
    • The 1 full token is claimed, the call is allowed, and 0 tokens remain after the call.

From an outside observer who knows "real" time has only moved from T0 to T10, four calls have been allowed when only three should have been (one each at T0, T5, and T10). This can happen any number of times between real-T0 and real-T10 if more old/new stale-time-alternating calls occur (especially with larger skew), and some benchmarks show this allowing thousands of times more events than it should.
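
As a compressed illustration of this failure mode (not one of the PR's benchmarks), feeding the stock *rate.Limiter alternating stale and fresh timestamps through its time-taking APIs reproduces the same unbounded over-admission: the fresh time re-mints a token on every round trip, and the rewound stale time spends the leftover.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// 1 token per second, burst 2. Over a 10-second real-time window a
	// correct limiter can allow at most ~12 events (burst + 10 refills).
	lim := rate.NewLimiter(rate.Every(time.Second), 2)

	base := time.Now()
	stale := base                       // a Now() captured before a long wait for the mutex
	fresh := base.Add(10 * time.Second) // a Now() from a caller that locked promptly

	allowed := 0
	for i := 0; i < 1000; i++ {
		// Stale call: rewinds the limiter's internal time back to `base` and
		// spends whatever token the previous fresh call left behind.
		if lim.AllowN(stale, 1) {
			allowed++
		}
		// Fresh call: advances 10 "seconds" again and refills the bucket.
		if lim.AllowN(fresh, 1) {
			allowed++
		}
	}
	fmt.Println("allowed:", allowed) // 2000 here, vastly more than the ~12 that real time permits
}
```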

The fix

Never allow the internal time to rewind.

It's pretty straightforward conceptually, and I am absolutely thrilled that *rate.Limiter has these "pass in a time" APIs: they allow this kind of wrapper, both for mocking time and for implementing the fix. Implementing it is a bit subtle and fiddly (as shown in this PR), but thankfully possible.

This eliminates rapid thrashing (all old-time events simply occur "at the same time"), and as long as "real" time is used, it eventually converges on real-time-like throughput because real time keeps advancing at a steady rate when viewed at larger time scales. At small time scales there is no correct time available anyway, so only larger-scale behavior is important.

I've also chosen to "pin" reservations' time, because this allows canceling them well after they are claimed if no other time-advancing calls have occurred.

Conceptually I believe this is sound, and benchmarks show essentially perfect timing between allowed events:

  • Events that occur in the real-past / at the "same" time as the reservation (or after) can be allowed to use the returned token because no possibly-over-limit calls have occurred after it (in fact, none at all).
    • If using time.Now(), this token likely could not be returned because time has usually advanced, and canceling does nothing when that occurs.
    • You can see an isolated version of this in the pinned-time benchmarks, notice how its sequential call throughput is essentially perfect.
  • If other calls have occurred "after" the reservation, returning a consumed token cannot be guaranteed safe / may not correctly limit throughput, so it is not returned.
    • e.g. burst-1 limit-1/s and allowing late cancels would allow: reserve (0 delay) -> wait 1s -> Allow() (true) -> cancel() -> Allow() (true), which means 2 allowed events in the same second, violating the limit.
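
A minimal sketch of that "never rewind" clamp, assuming a lockNow-style helper like the one this PR adds (the now function and the latest field are illustrative names, and the real implementation also has to cover Reserve, Wait, and pinned reservation times):

```go
package clock

import (
	"sync"
	"time"

	"golang.org/x/time/rate"
)

type ratelimiter struct {
	mut     sync.Mutex
	limiter *rate.Limiter
	now     func() time.Time // stand-in for the PR's TimeSource
	latest  time.Time        // newest time ever handed to the wrapped limiter (assumed field)
}

// lockNow locks the limiter and returns a time that never moves backwards
// relative to what the wrapped *rate.Limiter has already seen.
func (r *ratelimiter) lockNow() (now time.Time, unlock func()) {
	r.mut.Lock() // lock FIRST, then read the clock, so times reach the limiter in order
	now = r.now()
	if now.Before(r.latest) {
		// never hand the wrapped limiter an older time than it has already seen
		// (e.g. from a pinned reservation time or a mock time source): treat this
		// call as happening "at the same time" instead of rewinding.
		now = r.latest
	}
	r.latest = now
	return now, r.mut.Unlock
}

func (r *ratelimiter) Allow() bool {
	now, unlock := r.lockNow()
	defer unlock()
	return r.limiter.AllowN(now, 1)
}
```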

A simpler fix we can contribute upstream

Lock before acquiring time.Now(), not after.

For "normal" use (Wait(ctx) and Allow()), this would ensure that time progresses monotonically, which seems like an obvious assumption for users. Even high frequency events should behave correctly in this case.

Even if this is done, this wrapper still has some use to protect against rewinds when canceling while recovering more tokens. In fact, the entire wrapper would likely be completely unchanged, because we have no way to know what the internal time is when canceling...

... but all non-wrapper and non-at-time-passing users Go-ecosystem-wide would be much more correct. And adding some AllowN(int) (not AllowN(time, int)) time-free non-singular APIs would likely take care of the VAST majority of the remaining cases, leaving only Reserve-users to suffer through the details.

(I would need to think harder about it, but changing how CancelAt works to compare against r.lim.last might resolve it for cancellation too)

@Groxx changed the title from "ratelimiter locked limiter" to "Cancel-able reservations for ratelimiters" on May 20, 2024

codecov bot commented May 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.16%. Comparing base (5f45da5) to head (fd398d7).

Additional details and impacted files
Files Coverage Δ
common/clock/ratelimiter.go 100.00% <100.00%> (ø)

... and 9 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@taylanisikdemir (Member) left a comment

kind of makes sense at high level. added some nit comments. will do another round of review once marked ready for review.

// implementation if they are handled separately.
// - All MethodAt APIs have been excluded as we do not use them, and they would
// have to bypass the contained TimeSource, which seems error-prone.
// - All `MethodN` APIs have been excluded because we do not use them, but they

Member

side note: these would be useful if we decide to have different weights per call or its content

@Groxx (Member Author)

their interaction with burst is a bit funky and I'm not sure that's something we want to play around with (as opposed to just repeatedly calling Allow())... but yea. easy and has useful scenarios, I'm pretty confident there won't be any surprises if we add it.

Member

They can be useful, if the limit is not based on QPS but BPS.

@Groxx (Member Author)

do we have any of those? I can add them, I just haven't seen it

common/clock/ratelimiter_test.go: two outdated review threads (resolved)
Groxx added 2 commits May 20, 2024 23:38
Benchmarking and checking logs carefully showed a flaw!
Using "now" rather than the reserved-at time when checking `Allow()` is incorrect,
and can lead to far more allowed calls than intended.

This is now fixed, and full benchmark runs are saved in testdata.
// in high-CPU benchmarks, if you use the limiter's "now", you can sometimes
// see this allowing many times more requests than it should, due to the
// interleaving time advancement.
return r.res.OK() && r.res.DelayFrom(r.reservedAt) == 0

@Groxx (Member Author)

Found a flaw while benchmarking!

Previously this was using r.limiter.lockNow()'s value, and under high concurrency it could allow millions rather than the intended couple-thousand (afaict due to the limiter's internal-"now" advancing to a time where the reservation would be allowed).

Now it's fixed, and I have not seen it misbehave again.

return d
}

type timesourceContext struct {

@Groxx (Member Author) commented May 22, 2024

possibly useful to break out separately: context-timeouts need a timesource-wrapper.

main caveat is that using it in prod will mean losing stdlib-context-internal optimizations (cancels are propagated more efficiently if it's an stdlib implementation).
build tags could solve that, but we have generally avoided them because they make a few things annoying... but with a standardized and checked approach I'd be entirely in favor of it.

Comment on lines 240 to 247
// unusable, immediately cancel it to get rid of the future-reservation
// since we don't allow waiting for it.
//
// this restores MANY more tokens when heavily contended, as time skews
// less before canceling, so it gets MUCH closer to the intended limit,
// and it performs very slightly better.
res.CancelAt(now)
return deniedReservation{}

@Groxx (Member Author) commented May 22, 2024

noticed some super-low allow rates in a mock-time benchmark when advancing time slowly (and sometimes in real-time too), tracked the issue down to "cancels often arrive later than desired in this test", leading to the ratelimiter mostly staying at negative tokens (due to not-allowed pending reservations) and only allowing ~0.001% through.

synchronously canceling solves that, and should also improve behavior IRL.

@Groxx Groxx marked this pull request as ready for review May 24, 2024 05:12
@Groxx Groxx requested a review from neil-xie as a code owner May 24, 2024 05:12

@Groxx (Member Author) commented May 24, 2024

Tests cleaned up and flattened, docs updated, minor optimizations optimized, and 100% coverage even without the fuzz test.

Seems good to go :)
Plenty open to suggestions for test-design. I've fiddled with it a fair bit and most options have been substantially worse and I'm not sure how to make it more compact... but maybe it's possible? I probably need to sleep on it a few times to see anything structurally-new at this point though.

// - If false, the Reservation will be rolled back, possibly restoring a Ratelimiter token.
// This is equivalent to calling Cancel() on a [rate.Reservation], but it is more likely
// to return the token.
Used(wasUsed bool)

Member

dumb question: what's the intent one might try to convey by specifying that a reservation was used, even when it wasn't allowed? What does that do to the underlying limiter?

@Groxx (Member Author) commented May 27, 2024

In this implementation: nothing, it's always a no-op.

In the global ratelimiter thing that led to all this: so you have an "accepted/rejected" event that can be monitored, for counting usage.

With just "Reserve -> Cancel", you have to assume all reserves are "used" immediately, and subtract 1 if cancel is called. And then you may have to deal with counts going negative briefly.
With something called in all cases, that's not a concern, just defer the count until it's marked as used-or-not. It'll even be more time-correct since you're "using" it closer to the Used(true) call than the Reserve() call.

I suspect it'll also be useful in tests, as this way you always EXPECT a call (but the arg varies) rather than "it might be called or it might not".
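
A small hypothetical sketch of that counting pattern; Reservation here stands in for this PR's reservation interface, and the counter type is invented for the example:

```go
package clock

import "sync/atomic"

// Hypothetical sketch of the counting pattern described above: defer the
// accepted/rejected count until Used(bool) is called, instead of assuming
// every Reserve() was used and subtracting one on Cancel.

type Reservation interface {
	Allow() bool
	Used(wasUsed bool)
}

type usageCounter struct {
	accepted, rejected atomic.Int64
}

type countedReservation struct {
	wrapped Reservation
	counts  *usageCounter
}

func (c countedReservation) Allow() bool { return c.wrapped.Allow() }

func (c countedReservation) Used(wasUsed bool) {
	if wasUsed {
		c.counts.accepted.Add(1) // counted at the moment it is actually consumed
	} else {
		c.counts.rejected.Add(1) // rolled back; the count never has to go negative
	}
	c.wrapped.Used(wasUsed) // a no-op in this limiter, but keeps the contract
}
```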

@Groxx (Member Author) commented May 27, 2024

Ah, you probably meant with the real limiter / a somewhat different question.

*rate.Limiter reservations don't really have an "allowed" concept. We (currently and in this wrapper) consider a reservation "allowed" if it is OK() (possible to reserve, literally just "N <= burst") and Delay() == 0 (the limiter had a full token available right now, which has already been claimed).

Delay can be positive, if there wasn't a full token in the bucket. You can take advantage of this if you wanted to reserve a token and wait for it to become available.
At that point, reservations are essentially identical to Wait() but you can tell ahead of time how long you would have to wait, instead of it being a black box that you can give up on (cancel the context) but not predict.

When you get one of these positive-delay reservations, a token has still been removed. The limiter goes from e.g. 0.5 tokens to -0.5 tokens. Reserve another and you'll be at -1.5 tokens, and it'll be 2.5 ratelimiter-periods before any Allow() call can succeed, as it refills the bucket from negative values. But you can continue Reserve()ing (or Wait()ing) as much as you like, and push that to-be-allowed-call backlog out arbitrarily far, each one just waits long enough to produce a steady throughput.

You can cancel these reservations and return a token while waiting, allowing new reservations to occur sooner and possibly even an Allow() call*. But not after your time has passed; at that point the Cancel() call just no-ops.

*: ... I'm not actually sure this is reasonable, now that I think about it. I suspect it can allow arbitrarily large bursts. But it is how *rate.Limiter works.
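
For a concrete feel of that accounting, here is a small standalone example against the stock *rate.Limiter (printed values are approximate, since Delay() measures from the wall clock):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	lim := rate.NewLimiter(rate.Every(time.Second), 1) // 1 token/s, burst 1

	lim.Allow() // spend the one available token; the bucket is now ~0

	r1 := lim.Reserve()
	fmt.Println(r1.OK(), r1.Delay()) // true, ~1s: token reserved, bucket is ~-1

	r2 := lim.Reserve()
	fmt.Println(r2.OK(), r2.Delay()) // true, ~2s: bucket is ~-2

	fmt.Println(lim.Allow()) // false: no full token will exist for ~3 more seconds

	// Canceling a still-pending reservation returns its token, so later
	// reservations (or an eventual Allow) can happen sooner.
	r2.Cancel()
	r3 := lim.Reserve()
	fmt.Println(r3.Delay()) // back to ~2s instead of ~3s
}
```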

@Groxx (Member Author)

*: ... I'm not actually sure this is reasonable, now that I think about it. I suspect it can allow arbitrarily large bursts.

Ah, no, that's handled. It may take me a while to fully understand the details, but it's tracking "reserves since reserved" and not exceeding that value, so carefully-timed reserve+cancel interactions don't seem to allow scheduling a bunch of reserved events at the same time.

// and cause undefined behavior.
func (r *ratelimiter) lockNow() (now time.Time, unlock func()) {
r.mut.Lock()
// important design note: `r.timesource.Now()` MUST be gathered while the lock is held,

Member

good point, but repeated above

@Groxx (Member Author) commented May 28, 2024

I had intended this as "callers of this func must hold the lock as long as the limiter is being modified (often handled by defer unlock())" being a different concern from "this func's implementation must lock before gathering time.Now()" (which can't be observed by callers of lockNow()).
"do less inside the lock" -> "move Now() outside the lock" is a fairly predictable future optimization idea, but it'd reproduce the underlying *rate.Limiter flaw.

I can reword it, but I do think both need to be documented explicitly, in a way that'd be noticed during code review for any changes.

@Groxx (Member Author)

ah, wait, you were probably referring to the comment above timesource in the struct definition. yea, that feels like a duplicate. I'll clean that up.


// ensure we are unlocked, which will not have happened if an err
// was returned immediately.
once.Do(unlock)

Member

slightly confused about the number of unlocks here. My recollection is that defers are LIFO and you're manually unlocking here. Won't this be called a second time on 327?

Member

no, that's what you're using once for 🤦 ignore me

@Groxx (Member Author)

Yea - it just felt (a lot) easier to once it and not worry about handling specific branches. Much harder to screw up.

I can figure out exactly what that'd take if we wanted a different route / see the comparison, and it doesn't actually have to be a sync-once to work so it can be optimized a bit. I just fought with it briefly and then realized "wait this approach is obviously correct, I'll just do that" and never looked back.
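
A compressed sketch of the shape under discussion, reusing the lockNow helper sketched earlier and glossing over the real implementation's details: the point is just that every early-return path and the block-and-wait path funnel through a single once-guarded unlock.

```go
func (r *ratelimiter) Wait(ctx context.Context) error {
	now, unlock := r.lockNow()
	var once sync.Once
	defer once.Do(unlock) // covers every early return below that did not unlock

	res := r.limiter.ReserveN(now, 1)
	if !res.OK() {
		return errors.New("request exceeds the limiter's burst size")
	}
	delay := res.DelayFrom(now)
	if delay == 0 {
		return nil // a token was immediately available
	}
	if deadline, ok := ctx.Deadline(); ok && now.Add(delay).After(deadline) {
		res.CancelAt(now) // we still hold the lock here, so the cancel is safe
		return context.DeadlineExceeded
	}

	// About to block: release the lock exactly once before sleeping.
	once.Do(unlock)

	timer := time.NewTimer(delay)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		// the real implementation also re-locks and cancels the reservation here
		return ctx.Err()
	}
}
```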

@davidporter-id-au (Member) left a comment

can't fault anything to the extent that I was following along the minutiae of the rate-limiter behaviour

if delay == 0 {
return nil // available token, allow it
}
if deadline, ok := ctx.Deadline(); ok && now.Add(delay).After(deadline) {

Member

We may want a few milliseconds of buffer (now+delay < deadline-buffer). Ideally specified by the endpoint being protected but due to lack of annotations that will be a mess. Probably fine to leave like this for now

@Groxx (Member Author) commented May 28, 2024

from IRL: this kind of "reserve a buffer for the caller to use the thing protected by the limiter" is better handled by layers above this, by making a context with a deadline before that buffer.

which does very much seem worth making easier to do, it's quite verbose with plain contexts and I suspect that's part of why we don't do it much. and it may be worth building a Wait(ctx, reserveTimeBeforeCtxTimesOut) API to make it super easy and clear. we just don't really have anything that works this way right now, and the decision of how much time to buffer wouldn't be implemented at this layer anyway.
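
As a tiny sketch of that layering (the 50ms buffer and the function name are made up for the example; the anonymous interface stands in for whatever limiter type the caller holds):

```go
// Reserve a response buffer one layer above the limiter: shrink the context
// deadline before calling Wait, so the limiter itself never needs to know
// how much headroom the endpoint wants.
func waitWithBuffer(ctx context.Context, limiter interface{ Wait(context.Context) error }) error {
	const buffer = 50 * time.Millisecond // hypothetical value, would come from the caller/endpoint
	if deadline, ok := ctx.Deadline(); ok {
		var cancel context.CancelFunc
		ctx, cancel = context.WithDeadline(ctx, deadline.Add(-buffer))
		defer cancel()
	}
	return limiter.Wait(ctx)
}
```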

@Groxx Groxx merged commit f8765f0 into cadence-workflow:master May 28, 2024
20 checks passed

@Groxx (Member Author) commented May 28, 2024

Merging since I got approvals, very much open to documentation improvements / things to try though. And I'll probably present it internally to the team and make some tweaks based on how that goes.

@Groxx Groxx deleted the ratelimiter-locked-limiter branch May 28, 2024 22:56
timl3136 pushed a commit to timl3136/cadence that referenced this pull request Jun 6, 2024