UP node check needs to be made more resilient #360

spilchen · 2023-03-24T11:43:00Z

This fixes an issue with subcluster creation. We weren't running rebalance_shards() to add subscriptions to the new subcluster. To hit this, two specific conditions had to be met:

- you had to use a v11 server
- the database had to be migrated from enterprise to eon

A server issue during migration, to be fixed separately, set incorrect state for certain tables. Some tables were identified as being "shared", which to the server means active shard subscriptions are needed to query them. Without subscriptions, the query would fail with "ERROR 9099: Cannot find participating nodes to run the query".

This affected the queries the operator runs against catalog tables. Key for this fix: it affected any query to the nodes table. Because of this, the operator deemed the newly scaled-out nodes to be down, so it never ran the rebalance.

The fix is to align the UP node check with the livenessProbe by checking whether the vertica process is running. We still query the node state from the nodes table, but it is now used for informational purposes only.
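As a rough illustration of the idea (not the operator's actual code), the check can be sketched like this. The `PodFact` type, its fields, and the helper names are hypothetical and exist only for this example. The point is that the up/down decision depends only on whether the process is running, while the state read from the nodes table is carried along for logging.

```go
package main

import "fmt"

// PodFact is a hypothetical, trimmed-down stand-in for the facts gathered
// about a pod. The field names are assumptions made for this sketch.
type PodFact struct {
	Name           string
	ProcessRunning bool   // result of a process check, similar in spirit to the livenessProbe
	NodeState      string // state read from the nodes table; empty if the query failed
}

// isUp treats a node as up purely on the basis of the process check.
// The queried node state is deliberately ignored here.
func isUp(p PodFact) bool {
	return p.ProcessRunning
}

// describe builds an informational line that includes the queried state,
// falling back to "unknown" when the query returned nothing.
func describe(p PodFact) string {
	state := p.NodeState
	if state == "" {
		state = "unknown"
	}
	return fmt.Sprintf("%s: up=%t (reported state: %s)", p.Name, isUp(p), state)
}

func main() {
	// A newly added pod: the process is running, but the nodes-table query
	// failed (for example with ERROR 9099), so no state was read.
	p := PodFact{Name: "pod-0", ProcessRunning: true}
	fmt.Println(describe(p))
}
```

With this shape, a failed catalog query no longer makes a running node look down, so the later rebalance step still gets a chance to run.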

Closes #355

roypaulin

Looks good!

roypaulin · 2023-03-24T15:36:00Z

pkg/controllers/vdb/verticadb_controller.go

@@ -209,7 +209,7 @@ func (r *VerticaDBReconciler) constructActors(log logr.Logger, vdb *vapi.Vertica
 		// status updates after both of them.
 		MakeStatusReconciler(r.Client, r.Scheme, log, vdb, pfacts),
 		// Update the labels in pods so that Services route to nodes to them.
-		MakeClientRoutingLabelReconciler(r, vdb, pfacts, AddNodeApplyMethod, ""),
+		MakeClientRoutingLabelReconciler(r, vdb, pfacts, PodRescheduleApplyMethod, ""),


Just curious what are these *ApplyMethod?

Good question. When I was testing this out I ran into an issue here. When we add a label for client routing and use AddNodeApplyMethod, we don't do anything if the node we need to update doesn't have any subscriptions. But the reconciler for the rebalance runs after this one, so it was a bit of a chicken-and-egg problem. Using PodRescheduleApplyMethod got me around that.
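A simplified model of that ordering problem, for illustration only. The two method names come from the diff, but the `ApplyMethod` type, its string values, the `Pod` struct, and `shouldLabel` are hypothetical, and the real reconciler may apply further conditions that are not modeled here.

```go
package main

import "fmt"

// ApplyMethod is a hypothetical type for this sketch; the constant names
// mirror the diff, but their values are assumptions.
type ApplyMethod string

const (
	AddNodeApplyMethod       ApplyMethod = "AddNode"
	PodRescheduleApplyMethod ApplyMethod = "PodReschedule"
)

// Pod is a hypothetical, minimal view of a pod for this example.
type Pod struct {
	Name          string
	Subscriptions int // number of shard subscriptions the node currently has
}

// shouldLabel models the behavior described in the comment above:
// with AddNodeApplyMethod a pod without subscriptions is skipped, while
// PodRescheduleApplyMethod has no such requirement in this simplified model.
func shouldLabel(m ApplyMethod, p Pod) bool {
	if m == AddNodeApplyMethod {
		return p.Subscriptions > 0
	}
	return true
}

func main() {
	// A pod in a new subcluster, before the rebalance has added subscriptions.
	p := Pod{Name: "pod-0", Subscriptions: 0}
	fmt.Println(shouldLabel(AddNodeApplyMethod, p))       // skipped: rebalance has not run yet
	fmt.Println(shouldLabel(PodRescheduleApplyMethod, p)) // labeled regardless of subscriptions
}
```

In this model the label step with AddNodeApplyMethod waits on subscriptions that only the later rebalance step creates, which is the circular dependency the change avoids.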

spilchen requested a review from roypaulin March 24, 2023 11:43

spilchen self-assigned this Mar 24, 2023

roypaulin approved these changes Mar 24, 2023

spilchen merged commit e7548af into vertica:main Mar 24, 2023

spilchen deleted the handle-no-subscriptions branch March 24, 2023 16:11
