allocator: write a discrete event simulator #70552
Labels
A-kv
Anything in KV that doesn't belong in a more specific category.
A-kv-distribution
Relating to rebalancing and leasing.
A-testing
Testing tools and infrastructure
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Is your feature request related to a problem? Please describe.
It's difficult to make any changes to the allocator with confidence that they're not breaking something or other. Our (unit) test coverage mostly checks that individual actions are correctly generated ("given X, will we decide to move R to B?"). Given that our implementation is decentralized with respect to what decisions get made -- each store independently decides which of its replicas need to get moved around, upreplicated, or downreplicated -- it's difficult to glean what the system-wide effects of an allocator change could be. Our decentralized scheme, for example, forces us to consider decision thrashing, which (a) is not so easily verifiable at the level of unit tests, and (b) is cumbersome to craft and maintain full roachtests for.
We typically validate allocator changes by running a few standard workloads against real clusters and confirming the intended effects. This makes for a pretty slow/painful iteration cycle and limits our ability to comprehensively assess the stability of the change. #65379 is a recent example of the above.
Describe the solution you'd like
Our allocator as-is lends itself pretty well to being simulated. We even had a simulator at some point -- let's go do that again. We should be able to simulate non-trivial cluster sizes over large time horizons cheaply+quickly, define stochastic workloads to be applied against (not running anything real! just plumbing in various replica QPS numbers varying over time), and simulate the allocator actions themselves, all the while generating plots of the various things we care about (leaseholder thrashing, rebalancing activity, replica count by store and by locality). We also want to model cluster membership -- what happens when nodes get added/removed, when they get drained, when they're experiencing full/gray failures, etc. (We've often observed how disruptive node additions/removals can be, and the allocator is often a contributor.)
This simulator-driven approach came in handy when developing our distributed token bucket; there we were able to tweak various knobs (for the workload and the token bucket) and observe how the system as a whole behaved. The same principles apply here.
Describe alternatives you've considered
Keep the (unfortunate) status quo. As a bonus, it'd be cool if we could generate meaningful traces from actual clusters/workloads to then replay from in tests. We could capture the per-node inputs to the allocator periodically (perhaps to a custom logging channel?), and see if proposed changes improve the metrics we care about -- a pattern used in various other systems.
Additional context
A more rigorous, easy-to-use testing harness could help pave the way for future allocator improvements we're interested in (using copysets, using solvers, centralizing the whole thing). It would also just help us understand our current implementation better.
Jira issue: CRDB-10114