allocator: write a discrete event simulator #70552

Closed
irfansharif opened this issue Sep 22, 2021 · 1 comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-distribution Relating to rebalancing and leasing. A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments


irfansharif commented Sep 22, 2021

Is your feature request related to a problem? Please describe.

It's difficult to make any changes to the allocator with confidence that they aren't breaking something. Our (unit) test coverage mostly checks that individual actions are generated correctly ("given X, will we decide to move R to B?"). Given that our implementation is decentralized with respect to what decisions get made -- each store independently arrives at which of its replicas need to get moved around, upreplicated or downreplicated -- it's difficult to glean what the system-wide effects of an allocator change could be. Our decentralized scheme, for example, forces us to consider decision thrashing, which (a) is not easily verifiable at the level of unit tests, and (b) is cumbersome to craft and maintain full roachtests for.

We typically validate allocator changes by running a few standard workloads against real clusters and confirming the intended effects. This makes for a pretty slow/painful iteration cycle and limits our ability to comprehensively assess the stability of the change. #65379 is a recent example of the above.

Describe the solution you'd like

Our allocator as is lends itself pretty well to being simulated. We even had a simulator at some point -- let's go do that again. We should be able to simulate non-trivial cluster sizes over large time horizons cheaply and quickly, define stochastic workloads to apply against them (not running anything real! just plumbing in various replica QPS numbers varying over time), and simulate the actions themselves, all the while generating plots of the various things we care about (leaseholder thrashing, rebalancing activity, replica count by store and by locality). We also want to model cluster membership -- what happens when nodes get added/removed, when they get drained, when they're experiencing full/gray failures, etc. (We've often observed how disruptive node additions/removals can be; the allocator is often a contributor.)
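To make the shape of such a simulator concrete, here is a minimal sketch of a discrete event loop. It is not the existing allocator or any real CockroachDB code: the `Simulator` class, the event kinds, the uniform QPS workload, and the "move the hottest replica to the least-loaded store when more than 10% above mean" rule are all assumptions chosen for illustration. Each store runs its rebalance check independently on its own timer, which is the decentralized property described above.

```python
import heapq
import random
from collections import Counter

LOAD, REBALANCE = "load", "rebalance"  # hypothetical event kinds


class Simulator:
    """Toy discrete event simulator; every name and rule here is illustrative."""

    def __init__(self, num_stores, num_replicas, seed=0, threshold=0.1):
        self.rng = random.Random(seed)  # seeded so runs are reproducible
        self.now = 0.0
        self.num_stores = num_stores
        self.threshold = threshold
        self.events = []  # min-heap of (time, seq, kind, store)
        self.seq = 0      # unique tie-breaker, keeps heap order deterministic
        # Deliberately skewed start: every replica lives on store 0.
        self.store_of = {r: 0 for r in range(num_replicas)}
        self.qps = {r: 0.0 for r in range(num_replicas)}
        self.moves = 0     # total rebalancing actions (a crude thrashing signal)
        self.history = []  # (time, replica count per store) samples for plotting

    def schedule(self, delay, kind, store=None):
        heapq.heappush(self.events, (self.now + delay, self.seq, kind, store))
        self.seq += 1

    def counts(self):
        c = Counter(self.store_of.values())
        return [c.get(s, 0) for s in range(self.num_stores)]

    def store_load(self):
        load = [0.0] * self.num_stores
        for r, s in self.store_of.items():
            load[s] += self.qps[r]
        return load

    def rebalance(self, store):
        # One store's independent decision, based only on current load numbers.
        load = self.store_load()
        mean = sum(load) / self.num_stores
        mine = [r for r, s in self.store_of.items() if s == store]
        if not mine or load[store] <= mean * (1 + self.threshold):
            return
        target = min(range(self.num_stores), key=lambda s: load[s])
        hottest = max(mine, key=lambda r: self.qps[r])
        self.store_of[hottest] = target
        self.moves += 1

    def run(self, until, load_every=1.0, rebalance_every=10.0):
        self.schedule(0.0, LOAD)
        for s in range(self.num_stores):
            # Jitter the first check so stores do not act in lockstep.
            self.schedule(self.rng.uniform(0, rebalance_every), REBALANCE, s)
        while self.events and self.events[0][0] <= until:
            self.now, _, kind, store = heapq.heappop(self.events)
            if kind == LOAD:
                # Stochastic workload: no real traffic, just new QPS numbers.
                for r in self.qps:
                    self.qps[r] = self.rng.uniform(50, 150)
                self.history.append((self.now, self.counts()))
                self.schedule(load_every, LOAD)
            else:
                self.rebalance(store)
                self.schedule(rebalance_every, REBALANCE, store)
        return self.moves


if __name__ == "__main__":
    sim = Simulator(num_stores=3, num_replicas=30, seed=1)
    print("moves:", sim.run(until=300.0), "final counts:", sim.counts())
```

Because simulated time only advances when an event is popped, long horizons cost no wall-clock waiting, and `history` and `moves` are the kind of series one would plot. Membership changes (adding, draining, or failing a store) could plausibly be modeled as further event kinds in the same queue.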

This simulator-driven approach came in handy when developing our distributed token bucket; there we were able to tweak various knobs (for the workload and the token bucket) and observe how the system as a whole behaved. The same principles apply here.

Describe alternatives you've considered

Keep the (unfortunate) status quo. As a bonus, it'd be cool if we could generate meaningful traces from actual clusters/workloads and replay them in tests. We could capture the per-node inputs to the allocator periodically (perhaps to a custom logging channel?), and see if proposed changes improve the metrics we care about -- a pattern used in various other systems.
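The capture-and-replay idea could look roughly like the sketch below. Everything in it is hypothetical: the JSON-lines trace format, the snapshot fields (`ts`, `store`, `store_qps`, `mean_qps`), and the threshold policy are stand-ins for whatever the real per-node allocator inputs and decision logic would be. The point is only that the same recorded inputs can be fed to two candidate policies and their outcomes compared.

```python
import io
import json


def capture(snapshots, out):
    """Write one JSON object per line (an assumed trace format)."""
    for snap in snapshots:
        out.write(json.dumps(snap, sort_keys=True) + "\n")


def replay(trace, policy):
    """Feed each recorded snapshot to `policy`; count how often it wants a move."""
    moves = 0
    for line in trace:
        line = line.strip()
        if not line:
            continue
        if policy(json.loads(line)):
            moves += 1
    return moves


def threshold_policy(threshold):
    """Hypothetical policy: rebalance when a store exceeds mean QPS by `threshold`."""
    def policy(snap):
        return snap["store_qps"] > snap["mean_qps"] * (1 + threshold)
    return policy


if __name__ == "__main__":
    # Made-up snapshots standing in for inputs recorded from a real cluster.
    recorded = [
        {"ts": 0, "store": 1, "store_qps": 120.0, "mean_qps": 100.0},
        {"ts": 10, "store": 1, "store_qps": 160.0, "mean_qps": 100.0},
        {"ts": 20, "store": 2, "store_qps": 90.0, "mean_qps": 100.0},
    ]
    buf = io.StringIO()
    capture(recorded, buf)
    for t in (0.1, 0.5):
        buf.seek(0)
        print("threshold", t, "moves", replay(buf, threshold_policy(t)))
```

A replay like this only evaluates decisions against fixed inputs; it does not model how a decision would have changed later inputs, which is where the simulator proper would come in.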

Additional context

A more rigorous, easy-to-use testing harness could help pave the way for future allocator improvements we're interested in (using copysets, using solvers, centralizing the whole thing). It would also just help us understand our current implementation better.

Jira issue: CRDB-10114

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. A-testing Testing tools and infrastructure A-kv Anything in KV that doesn't belong in a more specific category. labels Sep 22, 2021
irfansharif (Contributor, Author) commented

@kvoli's done this.
