HBASE-28513 The StochasticLoadBalancer should support discrete evaluations #6543
base: master
Conversation
Still cleaning this up with the help of the build logs. Will mark this as a draft for now. I believe the code is working quite well though, so please feel free to review the proposal and the meat of the changes. I'm still deciding whether it's necessary to create a balancer candidate generator for the replica conditional.
This is working really well in my testing, and I'm not convinced that it's necessary to add a replica distribution candidate generator. Typically, each region replica has many acceptable destinations (n-r+1, where n is the number of servers and r is the number of replicas) and many acceptable swap candidates (any region that does not represent the same data). This is different from, say, a table isolation conditional, where we really want to drain virtually all regions from a single RegionServer and no swaps are appropriate. This is probably work for a separate PR, but I think it would be nice to support pluggable candidate generators to pair with any custom conditionals that users write.
```
@@ -28,6 +28,8 @@ enum Type {
  ASSIGN_REGION,
  MOVE_REGION,
  SWAP_REGIONS,
  ISOLATE_TABLE,
  MOVE_BATCH,
```
Our conditional candidate generators can frequently see pretty far into the future if they've gone to the trouble of deriving one very opinionated move anyway. So this is a nice way to represent multiple region moves triggered by one candidate generation iteration.
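As a rough illustration of the idea (the class and field names below are hypothetical, not necessarily what this patch uses), a batch action can be as simple as an ordered list of single moves that one generation pass hands back to the balancer:

```java
import java.util.List;

// Illustrative sketch only: a batch action is just an ordered list of single
// moves, so one candidate-generation pass can hand the balancer several
// coordinated moves at once. Names are hypothetical.
final class BatchMoveSketch {
  // One region move expressed as balancer indices (region, fromServer, toServer).
  record Move(int region, int fromServer, int toServer) {
  }

  private final List<Move> moves;

  BatchMoveSketch(List<Move> moves) {
    this.moves = List.copyOf(moves);
  }

  List<Move> getMoves() {
    return moves;
  }
}
```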
```
@@ -705,7 +709,41 @@ enum LocalityType {
  RACK
}

public void doAction(BalanceAction action) {
public List<RegionPlan> convertActionToPlans(BalanceAction action) {
```
RegionPlans are a more straightforward interface than BalanceActions, because you don't have to do all of this switch nonsense. So the new RegionPlanConditional interface isn't concerned with BalanceActions — it's just working with RegionInfo and RegionPlan objects, for example:
isViolatingServer(RegionPlan regionPlan, Set<RegionInfo> destinationRegions)
All this to say, this was a nice method to introduce so that we could convert BalanceActions to RegionPlans as necessary for conditional evaluations, without altering the current BalancerClusterState in-place via doAction.
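For a sense of what the conversion produces, here is a hedged sketch (the helper class is illustrative, and it only relies on the public RegionPlan constructor): a single proposed move simply becomes one RegionPlan that a conditional can inspect without any state mutation.

```java
import java.util.List;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.master.RegionPlan;

// Hypothetical helper illustrating the conversion idea: express a proposed
// move as RegionPlan objects that a conditional can evaluate, without mutating
// any balancer state.
final class ActionToPlanSketch {
  static List<RegionPlan> singleMove(RegionInfo region, ServerName from, ServerName to) {
    return List.of(new RegionPlan(region, from, to));
  }
}
```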
```java
if (newSize < 0) {
  throw new IllegalStateException(
    "Region indices mismatch: more regions to remove than in the regions array");
}
```
These methods make it easier to add/remove many indices from the BCS. This is nice for the MOVE_BATCH balance action, which I've justified for candidate generation performance reasons in another comment here.
Also, the nicer error messaging here is a good win imo. Previously you'd just hit ArrayIndexOutOfBoundsExceptions, or worse, erroneous moves, when you fumbled the state management of your mutable BalancerClusterState. This should help anyone down the road if they're debugging a custom conditional implementation.
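A minimal sketch of the batch-removal idea, assuming a hypothetical helper (not the patch's exact code) that keeps the explicit mismatch check from the excerpt above:

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical sketch: drop a set of region indices from a server's region
// array in one pass, failing loudly on a mismatch instead of letting an
// ArrayIndexOutOfBoundsException (or a bad move) surface later.
final class RegionIndexSketch {
  static int[] removeRegions(int[] regions, Set<Integer> regionsToRemove) {
    int newSize = regions.length - regionsToRemove.size();
    if (newSize < 0) {
      throw new IllegalStateException(
        "Region indices mismatch: more regions to remove than in the regions array");
    }
    int[] result = new int[newSize];
    int idx = 0;
    for (int region : regions) {
      if (!regionsToRemove.contains(region)) {
        result[idx++] = region;
      }
    }
    if (idx != newSize) {
      throw new IllegalStateException(
        "Region indices mismatch: some regions to remove were not present in " + Arrays.toString(regions));
    }
    return result;
  }
}
```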
```java
 * make it easy to define that system tables will ideally be isolated on their own RegionServer.
 */
@InterfaceAudience.Private
public final class BalancerConditionals {
```
This class basically acts as a simplifying wrapper around our conditional implementations. Often we'll want to infer something, like "is this class instantiated?", or we'll want to do something against every conditional, like re-weight them or validate a move against them. This class gives us a single place to do all of that.
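Conceptually, the wrapper looks something like the sketch below (interface and method names are hypothetical, just to show the shape, not the exact API introduced by this PR):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.master.RegionPlan;

// Illustrative-only sketch of the wrapper idea: hold every configured
// conditional in one place and answer questions about them collectively.
interface ConditionalSketch {
  boolean isViolating(RegionPlan plan);
}

final class ConditionalsWrapperSketch {
  private final List<ConditionalSketch> conditionals = new ArrayList<>();

  void register(ConditionalSketch conditional) {
    conditionals.add(conditional);
  }

  boolean anyConditionalsActive() {
    return !conditionals.isEmpty();
  }

  // A proposed move is acceptable only if no configured conditional objects to it.
  boolean isViolating(RegionPlan plan) {
    return conditionals.stream().anyMatch(c -> c.isViolating(plan));
  }
}
```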
```java
protected Map<Class<? extends CandidateGenerator>, CandidateGenerator>
  createCandidateGenerators() {
```
The list/ordinal management in balancers was pretty ugly imo. Instead, this should be a mapping of class to generator object.
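In other words, something along these lines (placeholder types only; the real balancer registers its own CandidateGenerator subclasses this way):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a class-keyed registry: look a generator up by its class instead
// of tracking list ordinals. The generator types here are placeholders.
abstract class GeneratorSketch {
}

final class LoadGeneratorSketch extends GeneratorSketch {
}

final class GeneratorRegistrySketch {
  private final Map<Class<? extends GeneratorSketch>, GeneratorSketch> generators = new HashMap<>();

  GeneratorRegistrySketch() {
    generators.put(LoadGeneratorSketch.class, new LoadGeneratorSketch());
  }

  GeneratorSketch get(Class<? extends GeneratorSketch> clazz) {
    return generators.get(clazz);
  }
}
```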
```java
Map<Class<? extends CandidateGenerator>, Long> generatorToStepCount = new HashMap<>();
Map<Class<? extends CandidateGenerator>, Long> generatorToApprovedActionCount = new HashMap<>();
```
These aren't really necessary, but they make debugging easier because you can understand which candidate generators led to the given balancer plan. If we had this logging, I think we would've realized the flaws in getRandomGenerator earlier (it being prone to just picking the last generator in the list).
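The bookkeeping itself is simple; here is a hedged sketch of the idea (hypothetical class, not the patch's code):

```java
import java.util.HashMap;
import java.util.Map;

// Per-generator step and approved-action counters, so the final balancer plan
// can be attributed to specific generators in the logs.
final class GeneratorStatsSketch {
  private final Map<Class<?>, Long> stepsByGenerator = new HashMap<>();
  private final Map<Class<?>, Long> approvedByGenerator = new HashMap<>();

  void record(Class<?> generator, boolean actionApproved) {
    stepsByGenerator.merge(generator, 1L, Long::sum);
    if (actionApproved) {
      approvedByGenerator.merge(generator, 1L, Long::sum);
    }
  }

  @Override
  public String toString() {
    return "steps=" + stepsByGenerator + ", approvedActions=" + approvedByGenerator;
  }
}
```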
```java
public boolean isViolatingServer(RegionPlan regionPlan, Set<RegionInfo> serverRegions) {
  RegionInfo regionBeingMoved = regionPlan.getRegionInfo();
  boolean shouldIsolateMovingRegion = isRegionToIsolate(regionBeingMoved);
  for (RegionInfo destinationRegion : serverRegions) {
    if (destinationRegion.getEncodedName().equals(regionBeingMoved.getEncodedName())) {
      // Skip the region being moved
      continue;
    }
    if (shouldIsolateMovingRegion && !isRegionToIsolate(destinationRegion)) {
      // Ensure every destination region is also a region to isolate
      return true;
    } else if (!shouldIsolateMovingRegion && isRegionToIsolate(destinationRegion)) {
      // Ensure no destination region is a region to isolate
      return true;
    }
  }
  return false;
```
I can dedupe a lot of this logic with the MetaTable conditional; will do sometime soon.
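One possible shape for that dedupe (purely illustrative, not the patch's code): extract the shared walk over destination regions and parameterize it with an "is this a region to isolate" predicate that each conditional supplies.

```java
import java.util.Set;
import java.util.function.Predicate;
import org.apache.hadoop.hbase.client.RegionInfo;

// Hypothetical shared helper for the isolation check used by both the
// system-table and meta-table conditionals.
final class IsolationCheckSketch {
  static boolean isViolatingServer(RegionInfo regionBeingMoved, Set<RegionInfo> serverRegions,
    Predicate<RegionInfo> isRegionToIsolate) {
    boolean shouldIsolateMovingRegion = isRegionToIsolate.test(regionBeingMoved);
    for (RegionInfo destinationRegion : serverRegions) {
      if (destinationRegion.getEncodedName().equals(regionBeingMoved.getEncodedName())) {
        continue; // skip the region being moved
      }
      // Isolated regions may only share a server with other isolated regions,
      // and non-isolated regions may not land next to isolated ones.
      if (shouldIsolateMovingRegion != isRegionToIsolate.test(destinationRegion)) {
        return true;
      }
    }
    return false;
  }
}
```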
```java
@Category(MediumTests.class)
public class TestLargeClusterBalancingMetaTableIsolation {
```
These tests are really solid imo. They make it easy to set up balancer scenarios with tens of thousands of regions and thousands of servers, and to ensure that the balancer can find its way out of the situation in a reasonable amount of time. These tests all pass reliably in 30s-3min on my local machine, often on the faster end, though it's a little dependent on luck, i.e. on how hairy the randomly generated edge cases turn out to be.
```java
// todo should there be logic to consolidate isolated regions on as few servers as
// conditionals allow? This gets complicated with replicas, etc
```
I think we will want to do this in the v1 balancer conditionals implementation, so I will do so shortly. But this current build is working very well, so I wanted to push.
```java
TEST_UTIL.getConfiguration()
  .setLong("hbase.master.balancer.stochastic.regionReplicaHostCostKey", 0);

TEST_UTIL.startMiniCluster(NUM_SERVERS);
```
In addition to the StochasticLoadBalancer tests in hbase-balancer that can test huge scales of conditional balancing, I wrote these mini cluster tests to smoke test that the balancer changes work in a "real" environment. This is also nice for testing the edge cases at smaller scales, e.g. "I have 3 replicas and 3 servers, please distribute them!"
Gonna be a ton of build issues to work through, I'm sure; will tackle those.
```java
    .flatMap(Optional::stream).forEach(RegionPlanConditionalCandidateGenerator::clearWeightCache);
}

void loadConf(Configuration conf) {
```
You can have BalancerConditionals implement Configurable or BaseConfigurable to do this in a more consistent way.
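For reference, a minimal sketch of that suggestion using Hadoop's Configurable interface (the field and method bodies are illustrative only):

```java
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

// Sketch of the suggestion above: implement Hadoop's Configurable so the
// conditionals pick up configuration the same way other components do, rather
// than through a bespoke loadConf method.
public class BalancerConditionalsSketch implements Configurable {
  private Configuration conf;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // re-read any conditional-related settings from conf here
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}
```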
See my design doc here
To sum it up, the current load balancer isn't great for what it's supposed to do now, and it won't support all of the things that we'd like it to do in a perfect world.
Right now: primary replica balancing squashes all other considerations. The default weight for one of the several cost functions that factor into primary replica balancing is 100,000. Meanwhile the default read request cost is 5. The result is that the load balancer, OOTB, basically doesn't care about balancing actual load. To solve this, you can either set primary replica balancing costs to zero, which is fine if you don't use read replicas, or — if you do use read replicas — maybe you can produce a magic incantation of configurations that work just right, until your needs change.
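For concreteness, the zero-cost workaround amounts to something like the following sketch. It uses the same host-level replica cost key that appears in this PR's minicluster tests; zeroing it is only appropriate if you aren't relying on cost functions for replica placement.

```java
import org.apache.hadoop.conf.Configuration;

public class ZeroReplicaCostExample {
  public static void main(String[] args) {
    // Disable the host-level region replica cost function by setting its weight to 0.
    Configuration conf = new Configuration();
    conf.setLong("hbase.master.balancer.stochastic.regionReplicaHostCostKey", 0);
    System.out.println(conf.getLong("hbase.master.balancer.stochastic.regionReplicaHostCostKey", -1));
  }
}
```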
In the future: we'd like a lot more out of the balancer. Think system table isolation, meta table isolation, and colocation of regions based on start key prefix similarity (a very rough idea at the moment, and not touched in the scope of this PR). Supporting all of these features with either cost functions or RS groups would be a real burden. I think what I'm proposing here will be a much, much easier path for HBase operators.
New features
This PR introduces some new features: balancer conditionals, including system table isolation, meta table isolation, and conditional-driven region replica distribution. These can be controlled via configuration.
Testing
I wrote a lot of unit tests to validate the functionality here — both lightweight and some minicluster tests. Even in the most extreme cases (like, system table isolation + meta table isolation enabled on a 3 node cluster, or the number of read replicas == the number of servers) the balancer does what we'd expect.
Replica Distribution Improvements
Not only does this PR offer an alternative means of distributing replicas, but it's actually a massive improvement on the existing approach.
See the Replica Distribution testing section of my design doc. Cost functions never successfully balance 3 replicas across 3 servers OOTB — but balancer conditionals do so expeditiously.
To summarize the testing, we have replicated_table, a table with 3 region replicas. The 3 regions of a given replica set share a color, and there are also 3 RegionServers in the cluster. We expect the balancer to evenly distribute one replica per server across the 3 RegionServers.
Cost functions don't work:
….omitting the meaningless snapshots between 4 and 27…
At this point, I just exited the test because it was clear that our existing balancer would never achieve true replica distribution.
But balancer conditionals do work:
New Features: Table Isolation Working as Designed
See below where we ran a new unit test, TestLargerClusterBalancerConditionals, and tracked the locations of regions for 3 tables across 18 RegionServers:
All regions began on a single RegionServer, and within 4 balancer iterations we had a well balanced cluster, and isolation of key system tables. It achieved this in about 2min on my local machine, where most of that time was spent bootstrapping the mini cluster.
cc @ndimiduk @charlesconnell @ksravista @aalhour