makeBackoffMachine: Improve backoff sleep logic by making it local. #3841
Force-pushed from c126fb7 to ed75f0f.
The rename commit feels like a misstep. `makeBackoffMachine` has a different API and no shared implementation details; it would be more straightforward to:
- create `makeBackoffMachine` directly in `async.js`, with new tests;
- replace the various uses of `progressiveTimeout`; and
- remove `progressiveTimeout`.
Mm, that approach makes a lot more sense than what I did! The rename commit does preserve functionality and keep the diff pretty minimal, but at the cost of being semantically incorrect, as noted in the commit message. Thanks for proposing this strategy, which avoids that awkward hiccup.
OK! I'll await #3727 before switching to Lolex and adding those additional tests, since it looks like there's some question about what file the Lolex shim will eventually be in.
Force-pushed from 24d8fc8 to 212d261.
I just reorganized my commits, but you may want to postpone another review until I've added the Lolex changes (pending resolution of that PR) and those additional tests you mentioned; up to you. Also, the actual resolution of the bug you spotted with …
Force-pushed from 212d261 to 0c62b6f.
Removed the commit repealing the `no-undef-init` rule, and fixed the `startTime` declaration to comply, as discussed yesterday with Ray.
To minimize blockage, the Lolex timer work has been separated from the presence-heartbeat work into a new PR, #3886.
Following discussion about private fields on classes, resolved in https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/private.20class.20fields/near/815229, I'm happy to turn this into a `class`.
The other day I happened to learn about some science! It shows there's a way to choose how long to wait before retrying that's much better than the pure exponential version we've been using. The takeaway is that it's important to add "jitter", and specifically quite a radical amount of jitter, like this:
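Roughly, the "full jitter" variant from the linked post (https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, per the commit message further down). This is my own sketch; the base and cap values are chosen just for illustration, not taken from this branch:

```js
const BASE = 100; // ms; illustrative
const CAP = 10 * 1000; // ms; illustrative

function fullJitterDuration(numSleepsCompleted) {
  const pureExponential = Math.min(CAP, BASE * 2 ** numSleepsCompleted);
  // "Full jitter": sleep for a uniformly random fraction of the capped
  // exponential duration, instead of the whole thing every time.
  return Math.random() * pureExponential;
}
```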
The analysis is in the linked post, and it makes all kinds of sense in retrospect. I think this will be a pretty easy tweak to the code in this branch 🙂 when we come back to it -- the bulk of the work here is basically orthogonal to the particular formula used.
That’s great!! I’d seen jitter being used in some examples I found while working on this, but didn’t see the performance data or the different possible ways to use jitter.
Force-pushed from 36ad8b1 to a94febf.
OK, I've pushed a commit that adds jitter. I've also (hopefully) clarified the commit message for the commit that introduces `makeBackoffMachine`.
Lolex is now merged, via #3886 -- so this is now unblocked to use that.
Force-pushed from a94febf to 6b513e3.
And, now with Lolex! Thanks, @ray-kraesig and @gnprice!
@@ -32,6 +32,8 @@ rules:
   # Formatting and naming
   camelcase: off
 
+  no-await-in-loop: off # Promise.all is not always desirable
I'm not sure how turning this off will prevent – or even make less likely – the failure state you describe in the commit message. Our existing improper uses of `array.forEach` seem more likely to have been induced by AirBnB's anti-loop position, which we've since thoughtfully revoked. (See the bottom of this file.)

Which is not to say that I'm against removing this lint! I suspect we almost invariably do want sequential `await` whenever we have multiple Awaitables, simply because we rarely – if ever – want to fire off a hundred network requests at once.
> I'm not sure how turning this off will prevent – or even make less likely – the failure state you describe in the commit message.

Well, with this rule, if you want to `await` sequentially in a loop, you have to choose between an `eslint-disable` and the failure of using a `forEach`. I think it's reasonable to assume that someone will choose wrong, because they trust our choice of rules and they're not looking closely at the implementation of `forEach`. I'll see if I can make this clearer in the commit message.

> Our existing improper uses of `array.forEach` seem more likely to have been induced by AirBnB's anti-loop position, which we've since thoughtfully revoked. (See the bottom of this file.)

That may be true, but it seems anecdotal; you're talking about the outbox issue, right? I didn't know about that when I committed this, and in fact my reasoning doesn't depend on any improper uses I saw. It sounds like you're not opposed to repealing `no-await-in-loop` anyway, but I hope the reason I'd like to do so is clear.
expect(minFromAllTrials).toBeLessThan(expectedMax * 0.25);
expect(maxFromAllTrials).toBeGreaterThan(expectedMax * 0.75);
My first reaction on seeing this is "wow, that's likely to flake someday, isn't it?"

Which is of course wrong. But perhaps this could use a comment noting that the chances of this flaking are on the order of `2 * (1-(1-(0.25**100))**11)` ≈ `1 / 7.3e58`?
> Which is of course wrong. But perhaps this could use a comment noting that the chances of this flaking are on the order of `2 * (1-(1-(0.25**100))**11)` ≈ `1 / 7.3e58`?

I have no idea what I was thinking here with `0.25**100` rather than `0.75**100`. (The rest is due to considering the aggregate flake chances rather than the chance of an individual assertion failing. I'm skeptical of the `2 *`, in retrospect, but it's probably at least close to that.)
... anyway, the comment you've added in the current head is 👍.
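For concreteness, the corrected arithmetic (my own re-derivation; it assumes 100 uniformly jittered durations behind each min/max assertion and reuses the `11` and `2 *` aggregation factors from the earlier estimate):

```js
// Chance that all 100 uniform durations land at or above 0.25 * expectedMax,
// i.e. that the `minFromAllTrials` assertion fails in one run:
const perAssertion = 0.75 ** 100; // ≈ 3.2e-13
// Aggregate over the same factors as above (11 runs, times 2 for the
// symmetric max-side assertion):
const aggregate = 2 * (1 - (1 - perAssertion) ** 11); // ≈ 7.1e-12
console.log(perAssertion, aggregate);
```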
> ... anyway, the comment you've added in the current head is 👍.
Many thanks to @gnprice for this, I can't take the credit! 😄
The docs at https://eslint.org/docs/rules/no-await-in-loop say, "Usually, the
code should be refactored to create all the promises then get access to the
results using `Promise.all()`. Otherwise, each successive operation will not
start until the previous one has completed."

But sometimes invoking one asynchronous task depends on the resolution of the
previous task, or it's important to have only one task in progress at a time.
If either of these is important, the programmer must use an `eslint-disable`
within their `for` or `while` loop. The only other choice that seems
reasonable is to use Array.prototype.forEach, but this will introduce a bug:
a promise returned by the callback passed to forEach is not `await`ed before
the next invocation of the callback.

  ['a', 'b', 'c'].forEach(async letter => {
    await new Promise(resolve => setTimeout(resolve, 1000));
    console.log(letter);
  });

The desired result is

  // (one second passes)
  // 'a'
  // (one second passes)
  // 'b'
  // (one second passes)
  // 'c'

but the actual result is

  // (one second passes)
  // 'a'
  // 'b'
  // 'c'

With this rule, if you want to `await` sequentially in a loop, you have to
choose between an `eslint-disable` and the failure of using a forEach. It's
reasonable to assume that someone will choose wrong because they trust our
choice of rules and they're not looking closely at the implementation of
forEach. So, we're better without it.
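For contrast, a sketch (not code from this diff) of the straightforward sequential version, which is exactly what `no-await-in-loop` would have flagged; with the rule off, this produces the desired result above:

```js
async function logSequentially() {
  for (const letter of ['a', 'b', 'c']) {
    // Each iteration waits for the previous one's timer to finish.
    await new Promise(resolve => setTimeout(resolve, 1000));
    console.log(letter);
  }
}
```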
Force-pushed from 6b513e3 to 0439707.
I've made the requested changes, and an additional change that Greg and I discussed (making it into a `class`). This isn't ready to be merged in its current state. There seems to be Lolex-related interference between the existing tests in … and the newly added ones.
The failures in the newly added tests go away when the old tests are commented out, and vice versa.
And that's what triggered it. Jest is running both Lolex-based tests in the same environment in parallel. 😞 One would hope this would be resolved out of the box in Jest 26; it might even be safe in Jest 25.1.0 with a custom environment. In the meantime, I think placing the …
Argh, Jest!! OK, I'll put the tests back in their own file. Or, argh, JavaScript? On the face of it, I wouldn't expect any interference, because the two tests running in parallel each use their own Lolex object (…
Improve state logic for sleep durations in API request retry loops.

progressiveTimeout uses a global state, with the effect that one network
request's retry attempts affect the sleep durations used for any other
request. This has the benefit of enabling a general throttle on all
request retry loops that used progressiveTimeout. But different requests
may have transient failures for different reasons, so it's not obvious
that we want this general throttle. And hard-to-trace bugs can creep in
when the behavior of progressiveTimeout can't be determined from a
particular call site.

Also, progressiveTimeout uses a 60-second threshold as a heuristic to
distinguish request retry loops from each other, with the simplifying
assumption that different types of requests will not occur within 60
seconds of each other. This distinction is more effectively done by
managing the state per-loop in the first place, and doing so eliminates
more of those hard-to-trace bugs mentioned in the previous paragraph.

So, introduce the ability to handle state locally. Capped exponential
backoff still mitigates high request traffic, but per-request.

Preparation for zulip#3829, to be completed in this series of commits.
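For reference, a very rough sketch of the global-state shape described above. The structure and constants here are my guesses; only the described behavior (shared state, capped exponential growth, and the 60-second reset heuristic) is taken from the commit message:

```js
// Sketch, not the real implementation: every caller shares one counter,
// which resets only when 60 seconds have passed since the last call.
let lastCallTime = 0;
let sharedCallCount = 0;

async function progressiveTimeoutSketch() {
  const now = Date.now();
  if (now - lastCallTime > 60 * 1000) {
    // Heuristic: a 60-second gap is assumed to mean a new retry loop.
    sharedCallCount = 0;
  }
  lastCallTime = now;
  const duration = Math.min(10 * 1000, 100 * 2 ** sharedCallCount); // constants assumed
  sharedCallCount += 1;
  await new Promise(resolve => setTimeout(resolve, duration));
}
```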
The benefits of "jitter" are large, and the implementation is not complex.

With plain capped exponential backoff, if there is a burst of client
requests, they will all back off together, resulting in more bursts, with
gaps in between. Introducing randomness, particularly a lot of
randomness, smooths out those bursts.

Greg refers to this article:
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
and he discusses the advantages of "jitter" in more detail at
https://chat.zulip.org/#narrow/stream/92-learning/topic/exponential.20backoff.20--.20with.20jitter.
There, he claims that the work to be done with "full jitter" is
O(N log N), a great improvement over O(N^2) without any jitter.
Enable the use of makeBackoffMachine in tryUntilSuccessful.

We need to call `new BackoffMachine()` in the body of tryUntilSuccessful,
but above the loop. This isn't possible when tryUntilSuccessful is
recursive. Now we can call `new BackoffMachine()` inside the function but
above the loop.
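A sketch of the shape change this enables (illustrative only; the real tryUntilSuccessful and its error handling differ):

```js
// Before: recursive shape. No single scope outlives all the retries, so
// there is nowhere natural to keep one backoff machine per retry loop.
async function tryUntilSuccessfulRecursive(func) {
  try {
    return await func();
  } catch (e) {
    await progressiveTimeout(); // shared global state was the workaround
    return tryUntilSuccessfulRecursive(func);
  }
}

// After: iterative shape. `new BackoffMachine()` sits inside the function
// but above the loop, so its state spans exactly one retry loop.
async function tryUntilSuccessful(func) {
  const backoffMachine = new BackoffMachine();
  while (true) {
    try {
      return await func();
    } catch (e) {
      await backoffMachine.wait();
    }
  }
}
```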
…ne(). Fixes: 3840
Force-pushed from 0439707 to a3a442d.
OK, I've put them back into `backoffMachine-test.js`.

EDIT: Um, oops? Does Command+Enter trigger "Close Pull Request" instead of "Comment"??
I could make a decent argument that it's "Argh, Ray!" I didn't write the core of our Lolex shim, but I did write most of the shell.
You don't need to dig too deep. For starters, they're both (presumably?) modifying the same global …. This problem is actually noted in the Lolex shim. I wasn't expecting this use case to fail either, though; I hadn't considered that Jest might run different test-groups in the same environment in parallel. I don't know how Jest v26 is supposed to handle this – but then, I don't know how Jest handles mocking the …
Depends on exactly what the unfulfilled expectation in question is, I think. It's certainly possible to see similar behavior in any dynamic language with a mutable global environment.
Sorry about the delay! Not sure how this one fell off my radar. Looks good. Merged!
No problem! Thanks!! 🙂
We disabled this rule for the whole tree in 4fcfb76, so now we get to delete these lines that disabled it in the specific places where it had popped up.
Fixes: #3840.
Replace progressiveTimeout with makeBackoffMachine: call it above a loop,
and call .wait() inside the loop on the returned backoffMachine. Give
access to numSleepsCompleted and timeElapsedSoFar so that "give up" logic
(e.g., on too many network requests with transient failures) can be
handled near the .wait() call sites. Prepare for work on #3829 (a
permanent timeout on sending outbox messages).

Previously, the state in progressiveTimeout was global. That enabled a
potentially desirable general throttle on all network requests that used
it, but the drawbacks were greater: a particular call to progressiveTimeout
was nondeterministic, because other calls may have been made recently,
affecting the resulting delay duration. A 60-second threshold was used as
a heuristic to distinguish request retry loops from each other, but this
is done much more effectively with the new interface.
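For illustration, a minimal sketch of the call-site pattern described above. The names .wait(), numSleepsCompleted, and timeElapsedSoFar come from that description; the exact signatures, the import path, and the 10-retry / 60-second limits are assumptions:

```js
import { makeBackoffMachine } from './async'; // path assumed

// Hypothetical wrapper: retry `send` with capped, jittered backoff, and
// give up after ten sleeps or one minute, whichever comes first.
async function sendWithRetry(send) {
  const backoffMachine = makeBackoffMachine();
  while (true) {
    try {
      return await send();
    } catch (e) {
      // "Give up" logic lives right next to the .wait() call site.
      if (
        backoffMachine.numSleepsCompleted >= 10
        || backoffMachine.timeElapsedSoFar >= 60 * 1000
      ) {
        throw e;
      }
      await backoffMachine.wait();
    }
  }
}
```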