
bulk: increased split-after size can result in replica imbalance #75664

Closed

nicktrav opened this issue Jan 28, 2022 · 6 comments
Labels: A-disaster-recovery, C-bug, T-disaster-recovery

Comments

nicktrav (Collaborator) commented Jan 28, 2022

Describe the problem

Spin-off from #68303, provided here as a distilled version; context on the original is in that issue.

It appears that e12c9e6 causes replica imbalance, which doesn't play nicely with large imports.

With that commit, on clearrange/checks=true, the import fails as a number of nodes run out of disk due to the imbalance:

[Screenshot: 2022-01-28, 9:37 AM]

Without that commit the import succeeds:

[Screenshot: 2022-01-28, 10:47 AM]

Jira issue: CRDB-12773
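
One quick way to eyeball the kind of per-store imbalance described above is to query the `crdb_internal.kv_store_status` virtual table. The following is a minimal sketch (not part of the original report), assuming a locally reachable node and that the table exposes `node_id`, `store_id`, `range_count`, `used`, and `capacity` columns; verify the column names against the running version.

```go
// imbalance_check.go: polls per-store replica counts and disk usage to spot
// replica imbalance during an import. Connection URL and column names are
// assumptions; adjust for your cluster.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT node_id, store_id, range_count, used, capacity
		FROM crdb_internal.kv_store_status
		ORDER BY range_count DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var nodeID, storeID, rangeCount, used, capacity int64
		if err := rows.Scan(&nodeID, &storeID, &rangeCount, &used, &capacity); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("n%d/s%d: %d replicas, %.1f%% disk used\n",
			nodeID, storeID, rangeCount, 100*float64(used)/float64(capacity))
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```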

nicktrav added the C-bug label on Jan 28, 2022
nicktrav (Collaborator, Author) commented:

cc: @dt

blathers-crl bot commented Jan 28, 2022

cc @cockroachdb/bulk-io

dt (Member) commented Feb 2, 2022

Reverting in #75882 for now. I'll try to find some time next week to poke at this and see why the closer-to-default split size does so badly, and what the extremely low split size was doing that was better for the allocator.

Maybe the bigger range counts were covering for scatter's known padding issues? Might need @aayushshah15 to help me poke at this when I get to it.
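
To make the range-count point concrete, here is a back-of-envelope sketch (illustrative, not measurements from this issue): the split-after size directly controls how many ranges an import of a given size produces, and therefore how granular the allocator's and scatter's placement decisions can be. The 48 MB and 384 MB figures come from the discussion in this thread; the 2 TiB import size is an assumed example.

```go
// split_math.go: illustrates how the split-after size changes the number of
// ranges an import creates, and how much data each misplaced range represents.
package main

import "fmt"

func main() {
	const importBytes = 2 << 40 // assumed example: a 2 TiB import

	for _, splitAfter := range []int64{48 << 20, 384 << 20} {
		ranges := importBytes / splitAfter
		fmt.Printf("split-after %4d MiB -> ~%6d ranges (each misplaced replica is ~%d MiB of skew)\n",
			splitAfter>>20, ranges, splitAfter>>20)
	}
	// With ~8x fewer, ~8x larger ranges, the same number of badly scattered
	// ranges translates into roughly 8x more disk-usage skew, which is
	// consistent with nodes filling up during the import.
}
```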

aayushshah15 (Contributor) commented:

> Maybe the bigger range counts were covering for scatter's known padding issues? Might need @aayushshah15 to help me poke at this when I get to it.

I'd still expect the allocator to fix straight-up replica count divergence though. Very curious why this was happening. Feel free to put something on my calendar if you want to look together, but I can also just try repro-ing on my own with the split size bumped.

aayushshah15 (Contributor) commented Feb 3, 2022

I am playing with this a little (i.e. running with a 384 MB split-after size plus our patch for AdminScatter) on a TPC-C import. The repro steps are identical to what I've described in that PR, except for the increased split-after size.

The cluster is at http://104.196.120.245:26258/#/metrics/replication/cluster

The main thing I've noticed so far is that, with 384 MB, we essentially stop scattering after the very beginning of the import. My sense is that this runs contrary to our understanding (and it also wasn't happening with 48 MB). Yellow is rebalances.
[Screenshot: rebalance activity during the import; yellow is rebalances]

Additionally, as far as I can tell, the allocator is continuously rebalancing. The rate at which we're converging is just slower than the rate at which newer splits are being created. After the import is done, I'd expect this to fully converge.

        {
          "time": "2022-02-03T15:12:34.421667145Z",
          "message": "kv/kvserver/allocator_scorer.go:186 [n7,status] s7: should-rebalance(ranges-overfull): rangeCount=1909, mean=1291.00, overfull-threshold=1356"
        },
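
For anyone reading along, the overfull-threshold in that log line lines up with the allocator's default 5% value of the `kv.allocator.range_rebalance_threshold` cluster setting. A minimal sketch of the arithmetic (my addition, not from the comment), assuming the threshold is roughly the mean plus max(5% of the mean, a small constant floor); the exact rounding in the real allocator may differ slightly:

```go
// overfull_threshold.go: reproduces the "should-rebalance(ranges-overfull)"
// arithmetic seen in the log line above. The 0.05 threshold and the floor of
// 2 are assumptions based on default settings.
package main

import (
	"fmt"
	"math"
)

func overfullThreshold(mean, rebalanceThreshold, minPadding float64) float64 {
	return mean + math.Max(mean*rebalanceThreshold, minPadding)
}

func main() {
	const (
		mean       = 1291.0 // cluster-wide mean replica count from the log
		rangeCount = 1909   // s7's replica count from the log
	)
	threshold := overfullThreshold(mean, 0.05, 2)
	fmt.Printf("overfull threshold ~= %.0f (log shows 1356)\n", math.Ceil(threshold))
	fmt.Printf("s7 rangeCount=%d exceeds the threshold, so it keeps being flagged for rebalancing\n", rangeCount)
}
```

This supports the observation above: the allocator does keep flagging overfull stores; convergence is simply slower than the rate at which the import creates new splits.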

[Screenshot]

dt (Member) commented Mar 26, 2022

Fixed by #77588.
