[WIP] Do not merge - collecting performance patches for scale testing #9436

schmichael · 2020-11-24T17:42:27Z

Contains:

… by job

Prior to this change, getAllocs work was split before and after the limiter. This moves all work after the limiter.

**WIP** I'm unclear how to test this change or assert its benefits, so I'm leaving it as a draft for now unless others have ideas.

Always wait 200ms before calling the Node.UpdateAlloc RPC to send allocation updates to servers.

Prior to this change we only reset the update ticker when an error was encountered. This meant the 200ms ticker was running while the RPC was being performed. If the RPC was slow due to network latency or server load and took >=200ms, the ticker would tick during the RPC.

Then, on the next loop, the select would randomly choose between the two viable cases: receive an update or fire the RPC again.

If the RPC case won, it would immediately loop again because there were no updates to send.

When the update chan receive is selected, a single update is added to the slice. The odds are then 50/50 that the subsequent loop will send that single update instead of receiving any more updates.

This could cause a couple of problems:

1. Since only a small number of updates are sent, the chan buffer may fill, applying backpressure and slowing down other client operations.
2. The small number of updates sent may already be stale and not represent the current state of the allocation locally.

A risk here is that it's hard to reason about how this will interact with the 50ms batches on servers when the servers are under load.

A further improvement would be to completely remove the alloc update chan and instead use a mutex to build a map of alloc updates. I wanted to test the lowest-risk change possible on loaded servers first before making more drastic changes.
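To make the ticker behavior described above concrete, here is a minimal sketch of a batching loop that always waits a full interval after each send attempt. It is not the code from this PR: `update`, `syncLoop`, `demo`, and the `send` callback (standing in for the Node.UpdateAlloc RPC) are hypothetical names, and the 50ms interval is chosen only to keep the demo short.

```go
package main

import (
	"fmt"
	"time"
)

// update is a hypothetical stand-in for an allocation update.
type update struct {
	AllocID string
	Status  string
}

// syncLoop batches updates and calls send at most once per interval.
// send stands in for the Node.UpdateAlloc RPC. The ticker is replaced
// after every send attempt, so a tick that fired while a slow send was
// in flight is discarded and a full interval elapses before the next one.
func syncLoop(updates <-chan update, stop <-chan struct{}, interval time.Duration, send func([]update) error) {
	ticker := time.NewTicker(interval)
	defer func() { ticker.Stop() }() // stops whichever ticker is current

	var pending []update
	for {
		select {
		case <-stop:
			return
		case u := <-updates:
			pending = append(pending, u)
		case <-ticker.C:
			if len(pending) == 0 {
				continue // nothing to send, keep waiting
			}
			if err := send(pending); err == nil {
				pending = nil
			}
			// Start a fresh ticker whether or not the send succeeded.
			ticker.Stop()
			ticker = time.NewTicker(interval)
		}
	}
}

// demo runs the loop with a first send that is slower than the interval
// and returns the size of each of the first two batches.
func demo() []int {
	const interval = 50 * time.Millisecond
	updates := make(chan update, 16)
	stop := make(chan struct{})
	sizes := make(chan int, 2)

	calls := 0 // only touched by the syncLoop goroutine
	send := func(batch []update) error {
		calls++
		if calls == 1 {
			// Updates that arrive while the slow call is in flight.
			for i := 0; i < 4; i++ {
				updates <- update{AllocID: fmt.Sprintf("late-%d", i), Status: "running"}
			}
			time.Sleep(3 * interval) // slower than the interval
		}
		sizes <- len(batch)
		return nil
	}

	for i := 0; i < 3; i++ {
		updates <- update{AllocID: fmt.Sprintf("early-%d", i), Status: "pending"}
	}
	go syncLoop(updates, stop, interval, send)

	first, second := <-sizes, <-sizes
	close(stop)
	return []int{first, second}
}

func main() {
	fmt.Println(demo())
}
```

Because the ticker is replaced after the slow first send, the four updates queued during that send should be collected into one batch rather than trickling out one at a time. The sketch still uses a channel and a slice; the mutex-and-map approach mentioned as a further improvement is not shown.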

hashicorp-cla · 2022-03-12T17:22:44Z

All committers have signed the CLA.

nickethier and others added 6 commits November 13, 2020 12:31

scheduler: only add allocs to network and device trackers if required…

4192bd2

… by job

scheduler: move creating dev allocator into once func

7342826

nomad: try to avoid slice resizing when batching

59ca810

test with much higher alloc sync interval

8acaa32

vercel bot temporarily deployed to Preview November 24, 2020 20:26 Inactive

Base automatically changed from master to main March 8, 2021 19:25

tgross added the stage/needs-rebase label (This PR needs to be rebased on main before it can be backported to pick up new BPA workflows) May 17, 2024
