makeBackoffMachine: Improve backoff sleep logic by making it local. #3841
Force-pushed from c126fb7 to ed75f0f.
The rename commit feels like a misstep. `makeBackoffMachine` has a different API and no shared implementation details; it would be more straightforward to:
- create `makeBackoffMachine` directly in `async.js`, with new tests;
- replace the various uses of `progressiveTimeout`; and
- remove `progressiveTimeout`.
Mm, that approach makes a lot more sense than what I did! The rename commit does preserve functionality and keep the diff pretty minimal, but at the cost of being semantically incorrect, as noted in the commit message. Thanks for proposing this strategy, which avoids that awkward hiccup.
OK! I'll await #3727 before switching to Lolex and adding those additional tests, since it looks like there's some question about what file the Lolex shim will eventually be in.
Force-pushed from 24d8fc8 to 212d261.
I just reorganized my commits, but you may want to postpone another review until I've added the Lolex changes (pending resolution of that PR) and those additional tests you mentioned; up to you. Also, the actual resolution of the bug you spotted with …
Force-pushed from 212d261 to 0c62b6f.
Removed the commit repealing the `no-undef-init` rule, and fixed the `startTime` declaration to comply, as discussed yesterday with Ray.
To minimize blockage, the Lolex timer work has been separated from the presence-heartbeat work into a new PR, #3886.
Following discussion about private fields on classes, resolved in https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/private.20class.20fields/near/815229, I'm happy to turn this into a `class`.
The other day I happened to learn about some science! It shows there's a way to choose how long to wait before retrying that's much better than the pure exponential version we've been using. The takeaway is that it's important to add "jitter", and specifically quite a radical amount of jitter, like this:
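Roughly, the "full jitter" variant from the linked post (https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, per the commit message further down). This is my own sketch; the base and cap values are chosen just for illustration, not taken from this branch:

```js
const BASE = 100; // ms; illustrative
const CAP = 10 * 1000; // ms; illustrative

function fullJitterDuration(numSleepsCompleted) {
  const pureExponential = Math.min(CAP, BASE * 2 ** numSleepsCompleted);
  // "Full jitter": sleep for a uniformly random fraction of the capped
  // exponential duration, instead of the whole thing every time.
  return Math.random() * pureExponential;
}
```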
The analysis is in the linked post, and it makes all kinds of sense in retrospect. I think this will be a pretty easy tweak to the code in this branch 🙂 when we come back to it -- the bulk of the work here is basically orthogonal to the particular formula used.
That’s great!! I’d seen jitter being used in some examples I found while working on this, but didn’t see the performance data or the different possible ways to use jitter.
Force-pushed from 36ad8b1 to a94febf.
OK, I've pushed a commit that adds jitter. I've also (hopefully) clarified the commit message for the commit that introduces `makeBackoffMachine`.
Lolex is now merged, via #3886 -- so this is now unblocked to use that.
Force-pushed from a94febf to 6b513e3.
And, now with Lolex! Thanks, @ray-kraesig and @gnprice!
@@ -32,6 +32,8 @@ rules:
   # Formatting and naming
   camelcase: off
 
+  no-await-in-loop: off # Promise.all is not always desirable
I'm not sure how turning this off will prevent – or even make less likely – the failure state you describe in the commit message. Our existing improper uses of `array.forEach` seem more likely to have been induced by AirBnB's anti-loop position, which we've since thoughtfully revoked. (See the bottom of this file.)

Which is not to say that I'm against removing this lint! I suspect we almost invariably do want sequential `await` whenever we have multiple Awaitables, simply because we rarely – if ever – want to fire off a hundred network requests at once.
> I'm not sure how turning this off will prevent – or even make less likely – the failure state you describe in the commit message.

Well, with this rule, if you want to `await` sequentially in a loop, you have to choose between an `eslint-disable` and the failure of using a `forEach`. I think it's reasonable to assume that someone will choose wrong, because they trust our choice of rules and they're not looking closely at the implementation of `forEach`. I'll see if I can make this clearer in the commit message.

> Our existing improper uses of `array.forEach` seem more likely to have been induced by AirBnB's anti-loop position, which we've since thoughtfully revoked. (See the bottom of this file.)

That may be true, but it seems anecdotal; you're talking about the outbox issue, right? I didn't know about that when I committed this, and in fact my reasoning doesn't depend on any improper uses I saw. It sounds like you're not opposed to repealing `no-await-in-loop` anyway, but I hope the reason I'd like to do so is clear.
expect(minFromAllTrials).toBeLessThan(expectedMax * 0.25);
expect(maxFromAllTrials).toBeGreaterThan(expectedMax * 0.75);
My first reaction on seeing this is "wow, that's likely to flake someday, isn't it?"

Which is of course wrong. But perhaps this could use a comment noting that the chances of this flaking are on the order of `2 * (1-(1-(0.25**100))**11)` ≈ `1 / 7.3e58`?
> Which is of course wrong. But perhaps this could use a comment noting that the chances of this flaking are on the order of `2 * (1-(1-(0.25**100))**11)` ≈ `1 / 7.3e58`?

I have no idea what I was thinking here with `0.25**100` rather than `0.75**100`. (The rest is due to considering the aggregate flake chances rather than the chance of an individual assertion failing. I'm skeptical of the `2 *`, in retrospect, but it's probably at least close to that.)
... anyway, the comment you've added in the current head is 👍.
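For concreteness, the corrected arithmetic (my own re-derivation; it assumes 100 uniformly jittered durations behind each min/max assertion and reuses the `11` and `2 *` aggregation factors from the earlier estimate):

```js
// Chance that all 100 uniform durations land at or above 0.25 * expectedMax,
// i.e. that the `minFromAllTrials` assertion fails in one run:
const perAssertion = 0.75 ** 100; // ≈ 3.2e-13
// Aggregate over the same factors as above (11 runs, times 2 for the
// symmetric max-side assertion):
const aggregate = 2 * (1 - (1 - perAssertion) ** 11); // ≈ 7.1e-12
console.log(perAssertion, aggregate);
```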
> ... anyway, the comment you've added in the current head is 👍.
Many thanks to @gnprice for this, I can't take the credit! 😄
The docs at https://eslint.org/docs/rules/no-await-in-loop say, "Usually, the
code should be refactored to create all the promises then get access to the
results using `Promise.all()`. Otherwise, each successive operation will not
start until the previous one has completed."

But sometimes invoking one asynchronous task depends on the resolution of the
previous task, or it's important to have only one task in progress at a time.
If either of these is important, the programmer must use an `eslint-disable`
within their `for` or `while` loop. The only other choice that seems
reasonable is to use Array.prototype.forEach, but this will introduce a bug:
a promise returned by the callback passed to forEach is not `await`ed before
the next invocation of the callback.

  ['a', 'b', 'c'].forEach(async letter => {
    await new Promise(resolve => setTimeout(resolve, 1000));
    console.log(letter);
  });

The desired result is

  // (one second passes)
  // 'a'
  // (one second passes)
  // 'b'
  // (one second passes)
  // 'c'

but the actual result is

  // (one second passes)
  // 'a'
  // 'b'
  // 'c'

With this rule, if you want to `await` sequentially in a loop, you have to
choose between an `eslint-disable` and the failure of using a forEach. It's
reasonable to assume that someone will choose wrong because they trust our
choice of rules and they're not looking closely at the implementation of
forEach. So, we're better without it.
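For contrast, a sketch (not code from this diff) of the straightforward sequential version, which is exactly what `no-await-in-loop` would have flagged; with the rule off, this produces the desired result above:

```js
async function logSequentially() {
  for (const letter of ['a', 'b', 'c']) {
    // Each iteration waits for the previous one's timer to finish.
    await new Promise(resolve => setTimeout(resolve, 1000));
    console.log(letter);
  }
}
```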
Force-pushed from 6b513e3 to 0439707.
I've made the requested changes, and an additional change that Greg and I discussed (making it into a `class`). This isn't ready to be merged in its current state. There seems to be Lolex-related interference between the existing tests in … and the newly added ones.
The failures in the newly added tests go away when the old tests are commented out, and vice versa.
And that's what triggered it. Jest is running both Lolex-based tests in the same environment in parallel. 😞 One would hope this would be resolved out of the box in Jest 26; it might even be safe in Jest 25.1.0 with a custom environment. In the meantime, I think placing the …
Argh, Jest!! OK, I'll put the tests back in their own file. Or, argh, JavaScript? On the face of it, I wouldn't expect any interference, because the two tests running in parallel each use their own Lolex object (…
Improve state logic for sleep durations in API request retry loops.

progressiveTimeout uses a global state, with the effect that one network
request's retry attempts affect the sleep durations used for any other
request. This has the benefit of enabling a general throttle on all
request retry loops that used progressiveTimeout. But different requests
may have transient failures for different reasons, so it's not obvious
that we want this general throttle. And hard-to-trace bugs can creep in
when the behavior of progressiveTimeout can't be determined from a
particular call site.

Also, progressiveTimeout uses a 60-second threshold as a heuristic to
distinguish request retry loops from each other, with the simplifying
assumption that different types of requests will not occur within 60
seconds of each other. This distinction is more effectively done by
managing the state per-loop in the first place, and doing so eliminates
more of those hard-to-trace bugs mentioned in the previous paragraph.

So, introduce the ability to handle state locally. Capped exponential
backoff still mitigates high request traffic, but per-request.

Preparation for zulip#3829, to be completed in this series of commits.
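For reference, a very rough sketch of the global-state shape described above. The structure and constants here are my guesses; only the described behavior (shared state, capped exponential growth, and the 60-second reset heuristic) is taken from the commit message:

```js
// Sketch, not the real implementation: every caller shares one counter,
// which resets only when 60 seconds have passed since the last call.
let lastCallTime = 0;
let sharedCallCount = 0;

async function progressiveTimeoutSketch() {
  const now = Date.now();
  if (now - lastCallTime > 60 * 1000) {
    // Heuristic: a 60-second gap is assumed to mean a new retry loop.
    sharedCallCount = 0;
  }
  lastCallTime = now;
  const duration = Math.min(10 * 1000, 100 * 2 ** sharedCallCount); // constants assumed
  sharedCallCount += 1;
  await new Promise(resolve => setTimeout(resolve, duration));
}
```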
The benefits of "jitter" are large, and the implementation is not complex.

With plain capped exponential backoff, if there is a burst of client
requests, they will all back off together, resulting in more bursts, with
gaps in between. Introducing randomness, particularly a lot of
randomness, smooths out those bursts.

Greg refers to this article:
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
and he discusses the advantages of "jitter" in more detail at
https://chat.zulip.org/#narrow/stream/92-learning/topic/exponential.20backoff.20--.20with.20jitter.
There, he claims that the work to be done with "full jitter" is
O(N log N), a great improvement over O(N^2) without any jitter.
Enable the use of makeBackoffMachine in tryUntilSuccessful.

We need to call `new BackoffMachine()` in the body of tryUntilSuccessful,
but above the loop. This isn't possible when tryUntilSuccessful is
recursive. Now we can call `new BackoffMachine()` inside the function but
above the loop.
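A sketch of the shape change this enables (illustrative only; the real tryUntilSuccessful and its error handling differ):

```js
// Before: recursive shape. No single scope outlives all the retries, so
// there is nowhere natural to keep one backoff machine per retry loop.
async function tryUntilSuccessfulRecursive(func) {
  try {
    return await func();
  } catch (e) {
    await progressiveTimeout(); // shared global state was the workaround
    return tryUntilSuccessfulRecursive(func);
  }
}

// After: iterative shape. `new BackoffMachine()` sits inside the function
// but above the loop, so its state spans exactly one retry loop.
async function tryUntilSuccessful(func) {
  const backoffMachine = new BackoffMachine();
  while (true) {
    try {
      return await func();
    } catch (e) {
      await backoffMachine.wait();
    }
  }
}
```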
…ne(). Fixes: 3840
Force-pushed from 0439707 to a3a442d.
OK, I've put them back into `backoffMachine-test.js`.

EDIT: Um, oops? Does Command+Enter trigger "Close Pull Request" instead of "Comment"??
I could make a decent argument that it's "Argh, Ray!" I didn't write the core of our Lolex shim, but I did write most of the shell.
You don't need to dig too deep. For starters, they're both (presumably?) modifying the same global …. This problem is actually noted in the Lolex shim. I wasn't expecting this use case to fail either, though; I hadn't considered that Jest might run different test-groups in the same environment in parallel. I don't know how Jest v26 is supposed to handle this – but then, I don't know how Jest handles mocking the …
Depends on exactly what the unfulfilled expectation in question is, I think. It's certainly possible to see similar behavior in any dynamic language with a mutable global environment.
Sorry about the delay! Not sure how this one fell off my radar. Looks good. Merged!
No problem! Thanks!! 🙂
We disabled this rule for the whole tree in 4fcfb76, so now we get to delete these lines that disabled it in the specific places where it had popped up.
Fixes: #3840.
Replace progressiveTimeout with makeBackoffMachine: call it above a loop,
and call .wait() inside the loop on the returned backoffMachine. Give
access to numSleepsCompleted and timeElapsedSoFar so that "give up" logic
(e.g., on too many network requests with transient failures) can be
handled near the .wait() call sites. Prepare for work on #3829 (a
permanent timeout on sending outbox messages).

Previously, the state in progressiveTimeout was global. That enabled a
potentially desirable general throttle on all network requests that used
it, but the drawbacks were greater: a particular call to progressiveTimeout
was nondeterministic, because other calls may have been made recently,
affecting the resulting delay duration. A 60-second threshold was used as
a heuristic to distinguish request retry loops from each other, but this
is done much more effectively with the new interface.
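For illustration, a minimal sketch of the call-site pattern described above. The names .wait(), numSleepsCompleted, and timeElapsedSoFar come from that description; the exact signatures, the import path, and the 10-retry / 60-second limits are assumptions:

```js
import { makeBackoffMachine } from './async'; // path assumed

// Hypothetical wrapper: retry `send` with capped, jittered backoff, and
// give up after ten sleeps or one minute, whichever comes first.
async function sendWithRetry(send) {
  const backoffMachine = makeBackoffMachine();
  while (true) {
    try {
      return await send();
    } catch (e) {
      // "Give up" logic lives right next to the .wait() call site.
      if (
        backoffMachine.numSleepsCompleted >= 10
        || backoffMachine.timeElapsedSoFar >= 60 * 1000
      ) {
        throw e;
      }
      await backoffMachine.wait();
    }
  }
}
```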