Node-local core assignment: core count decrease #20312
Conversation
4a37528 to 7c48853
If the number of cores was reduced, we need some way to access the kvstores of the extra cores. Allow constructing a kvstore for shard ids >= the number of cores to achieve that.
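As a rough sketch of the idea (names here are hypothetical, not the actual Redpanda kvstore API): the kvstore data is keyed by shard id on disk, so a store for a shard id that no longer corresponds to a running core can still be opened and drained.

```cpp
// Hypothetical sketch only; real kvstore construction takes more configuration.
#include <cassert>
#include <string>

struct extra_kvstore {
    explicit extra_kvstore(unsigned shard_id)
      : data_dir("kvstore/" + std::to_string(shard_id)) {}
    std::string data_dir; // per-shard directory survives the core count change
};

extra_kvstore open_extra_kvstore(unsigned extra_shard_id, unsigned core_count) {
    // Only meaningful for shards left over after a core count decrease.
    assert(extra_shard_id >= core_count);
    return extra_kvstore(extra_shard_id);
}
```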
Since kvstore operations can in theory fail, copying everything and then removing it (after the copy has fully succeeded) is better than moving pieces of kvstore state one by one (in practice a move is still a piecewise copy-then-remove anyway). A second reason: we need separate remove helpers to clean up garbage and obsolete kvstore data.
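A minimal illustration of the copy-then-remove approach, using a simple in-memory stand-in for the kvstore (try_put and the map type are placeholders, not the real API):

```cpp
#include <map>
#include <string>

using kv_map = std::map<std::string, std::string>;

// try_put stands in for a kvstore write that may fail.
bool try_put(kv_map& store, const std::string& key, const std::string& value) {
    store[key] = value;
    return true;
}

// Copy every key first; remove the source only after the whole copy
// succeeded. If any write fails, the source is still complete and the
// operation can simply be retried from scratch.
bool copy_then_remove(kv_map& source, kv_map& destination) {
    for (const auto& [key, value] : source) {
        if (!try_put(destination, key, value)) {
            return false; // safe to retry: source untouched
        }
    }
    source.clear(); // piecewise removal is fine once the copy is durable
    return true;
}
```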
Sometimes a partition should still exist on this node, but its kvstore state is no longer relevant (e.g. it was transferred to a different shard but hadn't been deleted yet). Handle this case in shard_placement_table and controller_backend.
…d transfers Previously, if a cross-shard transfer failed, we couldn't really tell on the source shard whether we should retry or not (we may have failed to remove obsolete state after a successful transfer, in which case retrying is dangerous). Mark the state on the source shard obsolete immediately after a successful transfer to fix that. Also introduce more detailed failure conditions in prepare_transfer() - are we waiting for the source or the destination shard? This will come in handy when we implement moving data from extra shards, because we'll have to clean the destination ourselves.
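One way to picture the more detailed prepare_transfer() outcomes described above (this enum is illustrative; the real return type may differ):

```cpp
// Illustrative only: distinguish who the caller is waiting on, so that when
// the source is an extra shard (with no live destination instance to clean
// itself up), the caller knows it must clear the obsolete destination state.
enum class prepare_transfer_status {
    ready,                   // transfer can proceed
    waiting_for_source,      // source shard hasn't finished producing its state
    waiting_for_destination, // destination shard still holds obsolete state
};
```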
No functional changes.
Pass the current number of kvstore shards to the start method and move existing partitions on extra shards to one of the current shards if possible.
Calculate the maximum allowed number of partition replicas with the new core count and reject the core count decrease if the total number of partition replicas is greater.
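Roughly, the rejection condition amounts to the following comparison (partitions_per_core here stands in for the node's per-core partition capacity; the actual validation may account for more):

```cpp
// Illustrative check only: a core count decrease is rejected when the
// replicas already hosted on the node would not fit on the remaining cores.
#include <cstdint>

bool core_count_decrease_allowed(
  uint32_t new_core_count,
  uint64_t partition_replicas_on_node,
  uint64_t partitions_per_core) {
    const uint64_t max_allowed
      = static_cast<uint64_t>(new_core_count) * partitions_per_core;
    return partition_replicas_on_node <= max_allowed;
}
```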
Now that shard_balancer will copy partition data from extra kvstore shards, we can relax the check in validate_configuration_invariants.
7c48853 to a2a27c6
@@ -365,6 +370,9 @@ ss::future<> shard_placement_table::initialize_from_kvstore(
      [&ntp2init_data](shard_placement_table& spt) {
          return spt.scatter_init_data(ntp2init_data);
      });
      for (auto& spt : extra_spts) {
          co_await spt->scatter_init_data(ntp2init_data);
      }
any reason why we process existing shard data concurrently, but extra shards one by one?
No particular reason, but scatter_init_data is CPU-bound, so no benefit in doing it concurrently either.
Implement copying partition data from extra kvstore shards (i.e. kvstore shards with ids >= current shard count) and use it to allow decreasing core count.
Backports Required
Release Notes
Features