client: always wait 200ms before sending updates #9435

Merged

3 commits merged into master from f-allocupdate-timer on Dec 1, 2020

Conversation

schmichael
Member

Always wait 200ms before calling the Node.UpdateAlloc RPC to send
allocation updates to servers.

Prior to this change we only reset the update ticker when an error was
encountered. This meant the 200ms ticker was running while the RPC was
being performed. If the RPC was slow due to network latency or server
load and took >=200ms, the ticker would tick during the RPC.

Then on the next loop only the select would randomly choose between the
two viable cases: receive an update or fire the RPC again.

If the RPC case won it would immediately loop again due to there being
no updates to send.

When the update chan receive is selected a single update is added to the
slice. The odds are then 50/50 that the subsequent loop will send the
single update instead of receiving any more updates.

This could cause a couple of problems:

1. Since only a small number of updates are sent, the chan buffer may
   fill, applying backpressure, and slowing down other client
   operations.
2. The small number of updates sent may already be stale and not
   represent the current state of the allocation locally.

A risk here is that it's hard to reason about how this will interact
with the 50ms batches on servers when the servers are under load.

A further improvement would be to completely remove the alloc update
chan and instead use a mutex to build a map of alloc updates. I wanted
to test the lowest risk possible change on loaded servers first before
making more drastic changes.
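To make the race concrete, here is a minimal, self-contained Go illustration (not the Nomad client code; the buffered channel and the sleep stand in for the alloc update chan and a slow Node.UpdateAlloc call):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()

	// Stand-in for the client's buffered alloc update chan.
	updates := make(chan string, 8)
	updates <- "alloc-1" // an update that arrived while we were busy

	// Simulate a Node.UpdateAlloc RPC that takes longer than 200ms;
	// the ticker keeps running the whole time.
	time.Sleep(300 * time.Millisecond)

	// Both cases are now ready, so select picks one at random: roughly
	// half the time we "send" again immediately instead of spending a
	// full 200ms batching updates.
	select {
	case u := <-updates:
		fmt.Println("received update:", u)
	case <-ticker.C:
		fmt.Println("ticker already fired; would call Node.UpdateAlloc again")
	}
}
```

Run it a few times and either branch fires, which is the 50/50 race described above.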

@krishicks (Contributor) left a comment

In looking at the old code vs. the new, it's not clear to me how it is functionally different. It looks like the same behavior occurs, with what used to be a ticker Stop/re-initialization being converted to a Reset.

It is possible I am missing something, however. The difference between Stop/re-initialization and Reset is lost on me, and the godoc doesn't help.

@schmichael (Member, Author) commented Nov 25, 2020

> In looking at the old code vs. the new, it's not clear to me how it is functionally different.

It's super subtle and my commit message is probably far too verbose.

The case I ran into was:

  1. client batches 200ms worth of alloc updates by receiving on the update chan until the ticker ticks
  2. ticker ticks, Node.UpdateAlloc is called with the batch of updates
  3. the RPC takes >=200ms to return (I observed this when a server was under load, but it could also be caused by a slow network or even local load)
  4. while this is happening alloc updates are being buffered in the alloc update chan
  5. Node.UpdateAlloc returns and the for loop comes back around
  6. Here's the problem: at this point there's a backlog of alloc updates waiting in the update chan, but the ticker has already ticked! Since select randomly chooses between eligible cases we may only receive a single alloc update before calling Node.UpdateAlloc again, completely defeating the intended batching in the client.

If Node.UpdateAlloc continues to take >200ms, we continue to send 1 (or maybe 2 or 3 if we're very lucky) updates per Node.UpdateAlloc call.

But it gets worse! At this point not only might the update chan buffer be full (64 updates), but it's plausible many of the updates are now stale and should be replaced by allocs deeper in the chan's buffer (or blocking on sending to the buffer!).

So this timing approach causes a cascading load effect when servers are under such load that Node.UpdateAlloc takes >200ms: we end up writing more allocs to raft because clients aren't able to coalesce a series of updates to a single allocation into the latest update. Instead each update is applied 1 at a time despite being out of date with reality.
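Roughly, the pre-change loop had the following shape. This is a hedged sketch with made-up names (Update, allocSyncOld, send), not the actual client/client.go code:

```go
package sketch

import "time"

// Update stands in for an allocation update; hypothetical type for this sketch.
type Update struct{ ID string }

// allocSyncOld sketches the pre-change loop shape.
func allocSyncOld(updates <-chan *Update, send func([]*Update) error) {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()

	var batch []*Update
	for {
		select {
		case u := <-updates:
			batch = append(batch, u)

		case <-ticker.C:
			if len(batch) == 0 {
				continue
			}
			// The ticker keeps running during this call. If the RPC
			// takes >=200ms, the ticker case is already ready on the
			// next iteration and races the update-chan case 50/50, so
			// the next "batch" may hold a single, stale update.
			if err := send(batch); err != nil {
				// The old code only rebuilt the ticker on error
				// (retry stagger); a successful send left it untouched.
				ticker.Reset(time.Second)
				continue
			}
			batch = nil
		}
	}
}
```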

The fix

Stop including the Node.UpdateAlloc call in the 200ms timer!

That's it. That's the whole fix.

Now it doesn't matter how long Node.UpdateAlloc takes, we will always spend an additional 200ms batching allocation updates before sending them. Since this is such a tight loop a client would have to be under severe load to not consume the entire update chan's buffer (max 64 items) within 200ms, so this change should be sufficient to prevent the applying-stale-updates problem.
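A sketch of the post-change shape, continuing the same made-up names and imports as the sketch above (again, not the actual code): the ticker is reset after the RPC returns, so a full 200ms batching window always follows the call.

```go
// allocSyncNew sketches the post-change loop: reset the ticker *after*
// Node.UpdateAlloc returns, regardless of how long the RPC took.
// (Continues the sketch above: same package, Update type, and imports.)
func allocSyncNew(updates <-chan *Update, send func([]*Update) error) {
	const syncInterval = 200 * time.Millisecond

	ticker := time.NewTicker(syncInterval)
	defer ticker.Stop()

	var batch []*Update
	for {
		select {
		case u := <-updates:
			batch = append(batch, u)

		case <-ticker.C:
			if len(batch) == 0 {
				continue
			}
			err := send(batch) // stand-in for Node.UpdateAlloc
			if err == nil {
				batch = nil // clear only on success; on error keep the batch and retry
			}
			// Reset unconditionally: the next 200ms batching window
			// starts now, after the (possibly slow) RPC, not while it
			// was still in flight. (Error handling such as a longer
			// retry interval is omitted in this sketch.)
			ticker.Reset(syncInterval)
		}
	}
}
```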

The downside is that now we're delaying alloc updates further -- I observed this RPC taking 1-2.5s in high load tests:

[Graph: Max of nomad.nomad.client.update_alloc, 95th percentile]

This means that alloc updates could be delayed by up to 2700ms in that cluster. That affects not only operators monitoring nomad alloc status but more importantly it affects how quickly the scheduler makes progress on drains, deployments, and rescheduling.

The reason I think this is an ok tradeoff is because it provides natural backpressure on heavily loaded servers. If the server is under so much load that Node.UpdateAlloc takes 2s, it may help alleviate that load to slow down drains, deployments, and rescheduling. The pathological case for heavily loaded servers is two-fold:

  1. A heavily loaded server may be unable to process client heartbeats quickly enough, causing the clients to be treated as down, marking their allocations as lost, and causing a significant amount of extra server load to create and process evaluations. This is the nightmare scenario as it could lead to widespread Nomad and service outages.
  2. A heavily loaded server may be unable to heartbeat raft quickly enough and either drop out as a follower or lose leadership if it is the leader. A follower dropping out means extra load is now put on the other servers, which will likely cause them to fail as well. A leader dropping out similarly creates extra load on the remaining servers.

Effectively Node.UpdateAlloc becomes a gross (as in naive and coarse-grained... but also a little icky) server->client backpressure mechanism: the slower a server is to respond, the longer clients wait before resubmitting updates, and the more opportunity clients have to batch those updates.

This is not optimal (as alluded to by the "gross" description). Servers should have intrinsic mechanisms for avoiding the 2 pathological overload cases I mention above. They should not rely on well-behaved clients. Ideas here are welcome but outside the scope of this modest improvement.

On the client-side we could be smarter about when we update allocs. I put those ideas in a new issue here: #9451

schmichael marked this pull request as ready for review November 25, 2020 20:24
@krishicks (Contributor) left a comment

The update looks good, and makes it clearer that the critical change is removing the if staggered bit.

schmichael merged commit c60d9a9 into master Dec 1, 2020
schmichael deleted the f-allocupdate-timer branch December 1, 2020 16:45
schmichael added a commit that referenced this pull request Sep 26, 2022
Extension of #14673

Once Vault is initially fingerprinted, extend the period since changes
should be infrequent and the fingerprint is relatively expensive since
it is contacting a central Vault server.

Also move the period timer reset *after* the fingerprint. This is
similar to #9435 where the idea is to ensure the retry period starts
*after* the operation is attempted. 15s will be the *minimum* time
between fingerprints now instead of the *maximum* time between
fingerprints.

In the case of Vault fingerprinting, the original behavior might cause
the following:

1. Timer is reset to 15s
2. Fingerprint takes 16s
3. Timer has already elapsed so we immediately Fingerprint again

Even if fingerprinting Vault only takes a few seconds, that may very
well be due to excessive load and backing off our fingerprints is
desirable. The new behavior ensures we always wait at least 15s between
fingerprint attempts and should allow some natural jittering based on
server load and network latency.
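The same reset-after-the-operation idea, sketched for the fingerprint case with hypothetical names (fingerprintLoop, fingerprint, stop), not the actual Vault fingerprinter code:

```go
package sketch

import "time"

// fingerprintLoop rearms the timer only after the fingerprint attempt
// finishes, so 15s becomes the minimum gap between attempts instead of
// the maximum.
func fingerprintLoop(fingerprint func() error, stop <-chan struct{}) {
	const period = 15 * time.Second

	timer := time.NewTimer(period)
	defer timer.Stop()

	for {
		select {
		case <-timer.C:
			// This can be slow: it contacts a central Vault server.
			_ = fingerprint()
			// Start the next 15s only after the attempt finishes, so a
			// slow fingerprint naturally backs off the next one.
			timer.Reset(period)
		case <-stop:
			return
		}
	}
}
```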