Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UP node check needs to be made more resilient #360

Merged
merged 1 commit into from
Mar 24, 2023

Conversation

spilchen
Copy link
Collaborator

This fixes an issue with subcluster creation. We weren't running the rebalance_shards() to add subscriptions to the new subcluster. To hit this you had to have a few specific things:

  • had to use a v11 server
  • the database has to be migrated from enterprise to eon

There was a server issue during migration, to be fixed separately, that set some wrong state for certain tables. Some tables were identified as being "shared", which to the server means we need active shard subscriptions to query them. If no subscriptions, then the query would fail with "ERROR 9099: Cannot find participating nodes to run the query".

This was affecting queries the operator does to catalog tables -- key for this fix was it affected any query to the nodes table. Because of this, the operator deemed the new scaled out nodes as down. So, it would never run the rebalance.

The fix is to align the UP node check with the livenessProbe by looking to see if the vertica process is running. We still query out the node state from node, but it is used for information purposes now.

Closes #355

This fixes an issue with subcluster creation. We weren't running the
rebalance_shards() to add subscriptions to the new subcluster. To hit
this you had to have a few specific things:
- had to use a v11 server
- the database has to be migrated from enterprise to eon

There was a server issue during migration, to be fixed separately, that
set some wrong state for certain tables. Some tables were identified as
being "shared", which to the server means we need active shard
subscriptions to query them. If no subscriptions, then the query would
fail with "ERROR 9099:  Cannot find participating nodes to run the
query".

This was affecting queries the operator does to catalog tables -- key
for this fix was it affected any query to the nodes table. Because of
this, the operator deemed the new scaled out nodes as down. So, it would
never run the rebalance.

The fix is to align the UP node check with the livenessProbe by looking
to see if the vertica process is running. We still query out the node
state from node, but it is used for information purposes now.
@spilchen spilchen requested a review from roypaulin March 24, 2023 11:43
@spilchen spilchen self-assigned this Mar 24, 2023
Copy link
Collaborator

@roypaulin roypaulin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@@ -209,7 +209,7 @@ func (r *VerticaDBReconciler) constructActors(log logr.Logger, vdb *vapi.Vertica
// status updates after both of them.
MakeStatusReconciler(r.Client, r.Scheme, log, vdb, pfacts),
// Update the labels in pods so that Services route to nodes to them.
MakeClientRoutingLabelReconciler(r, vdb, pfacts, AddNodeApplyMethod, ""),
MakeClientRoutingLabelReconciler(r, vdb, pfacts, PodRescheduleApplyMethod, ""),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious what are these *ApplyMethod?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, when I was testing this out I ran into an issue here. When we add a label for client routing and use AddNodeApplyMethod we don't do anything if the node we need to update doesn't have any subscriptions. But the reconciler for the rebalance happens after this. So, it was a bit of a chicken/egg problem. Using PodScheduleApplyMethod got me around that.

@spilchen spilchen merged commit e7548af into vertica:main Mar 24, 2023
@spilchen spilchen deleted the handle-no-subscriptions branch March 24, 2023 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New subcluster is added incorrectly
2 participants