Introducing `piraeus.io/evacuate` Taint for Node Draining Support #530

boedy · 2023-09-20T07:14:49Z

I'm pleased to share my first contribution, focused on enhancing node draining in Linstor/Piraeus clusters. Feedback welcome on areas that could be further improved / optimised or things I might have overlooked.

Note: This PR is not yet complete. While the core functionality is in place, documentation should be added once the initial review is done and any suggested changes are incorporated.

Summary:

This pull request introduces a new annotation, linstor.linbit.com/prepare-for-removal, for Kubernetes nodes. When a node is annotated with this label, the LinstorSatelliteReconciler prepares the node for draining by auto placing additional replicas for all its resources.

The new annotation serves a crucial role in scenarios where a resource does not already have replicas. By creating additional replicas, it ensures that data remains available even when the original node is offline and allows the workload to be migrated in cases where allowRemoteVolumeAccess: false is defined. This feature is also particularly useful as Piraeus currently lacks (please correct me if i'm wrong) an auto-healing functionality to maintain the replica count automatically in case a node is lost.

The annotation provides a simple and declarative way to prepare nodes for maintenance or decommissioning while enhancing data availability.

Required Components:

For this feature to fully function and allow Kubernetes to reschedule the workload to a new node, the Persistent Volume (PV) needs to be updated. This is handled by the linstor-affinity-controller, which must be installed on the cluster. The controller updates the PV to enable the Pod to move to a new node, complementing the node draining and undo-draining functionalities introduced by this pull request.

Changes:

New Annotation:
- linstor.linbit.com/prepare-for-removal: When added to a node, it triggers the prepareNodeForDraining method. When removed it will trigger the undoNodeDrainingPreparation method to reverse the changes.
prepareNodeForDraining Method:
- Fetches the list of resources on the annotated node.
- Autoplaces an additional replica for each resource.
- Updates node properties to mark it for removal and disable new resource placements.
undoNodeDrainingPreparation Method:
- Counterpart to prepareNodeForDraining.
- Deletes either the newest replica or the one on the annotated node, depending on which resource is currently in use.
- Removes the node properties set during the preparation for draining.

Additional Information:

The current implementation necessitates an extra API call and logic to ascertain the location of the newly-created replica. This is because the lc.Resources.Autoplace method does not return this specific metadata. As per the Swagger documentation for the Linstor API, the obj_refs field could potentially contain the required information. To streamline this process, the Golinstor library would need to be updated to accommodate these changes.

WanzenBug

Thanks for the contribution! I'll need to think about it some more, I think some of the logic around "draining" could be simplified.

WanzenBug · 2023-10-11T11:59:15Z

Having thought some more about it: Would it make sense to use a custom taint instead of a node label? So piraeus.io/node-removal: NoSchedule? That way we would also automatically prevent kubernetes from trying to create new resources on the node.

Other things I noticed: I think the Aux/PrepareForRemoval property can be removed. With each reconcile attempt, we should check that a replacement resource is ready, by checking for a matching Aux/CopiedOverFromNode resource.
We would then also need to wait for the resource to actually fully sync up before reporting success.

Then finally, we need to remove the Aux/CopiedOverFromNode once the original node is gone, so we don't accidentally remove the resource at a later point because of it.

boedy · 2023-10-13T13:08:49Z

Hey @WanzenBug,

Thanks for your feedback. I'm on board with the idea of using a taint over an annotation. It does seem to fit the context better, especially considering the eventual node deletion.

Reflecting on it, I think you're spot on about relying exclusively on the Aux/CopiedOverFromNode property. I had initial reservations about potential race conditions, but given the synchronous nature of the reconciliation process, those concerns seem unfounded.

Lastly, taking your suggestions into consideration, I've mapped out a path for the cleanup once everything's in sync and the original node's been decommissioned. I'll try to make some time in the upcoming days and get these changes integrated into the PR.

Signed-off-by: boedy <boudewijng@gmail.com>

…annotation which triggers the operator to autoplace an additional replica to maintain data availability when the node is taken down. Signed-off-by: boedy <boudewijng@gmail.com>

Signed-off-by: boedy <boudewijng@gmail.com>

boedy · 2023-10-28T17:12:02Z

Hey @WanzenBug,

I've just pushed and update with my latest changes, taking into account the feedback you've provided. I must admit, there were a few tricky edge cases that really made me scratch my head, particularly around handling nodes that housed resources from another node which had previously been tainted. After some iterations, I'm confident in the solution I've landed on.

To provide a high-level overview of the process in such scenarios:

Consider a three node cluster with nodes A, B & C
Node A is tainted, starting the evacuation process of node A
All resources on Node A are individually copied to either Node B or Node C
Once all resources from Node A are successfully synchronised to the other nodes, Node A is marked with a 'success' state, indicating it's ready to be drained / deleted
At any point after Node A, Node B is also tainted
All resources from Node B are copied to Node C. This includes any resources that were originally on Node A and got replicated to Node B in the earlier step.
As soon as a resource (originally from Node A) is fully synchronised to Node C, the resource on Node B is automatically deleted. Additionally, the reference, Aux/EvacuatedFromNode, is updated on Node C to indicate the original source of the resource (Node A in this case).

One thing to note: the current approach does result in a higher number of API calls during each reconciliation cycle, which is something you might want to revisit and optimise down the line.

Let me know what you think.

WanzenBug

On first glace looks good. I will need (again) need some time to think about all the corner cases 😄

- findMatchingEvacuatedResource should only consider Diskfull resources - Nil pointer check: can be nil when node is offline

boedy · 2023-10-31T08:15:41Z

I just pushed some bugfixes which I ran into and overlooked at first glance.

One think that isn't yet handled gracefully is the actual deletion of the node / satellite. After I delete the k8s node. The operator spams the following error on every reconciling loop:

ERROR Reconciler error {"controller": "linstorsatellite", "controllerGroup": "piraeus.io", "controllerKind": "LinstorSatellite", "LinstorSatellite": {"name":"h-fsn-ded3"}, "namespace": "", "name": "h-fsn-ded3", "reconcileID": "bb2b3d68-4b98-4a24-a715-0b46b1b3fa68", "error": "Message: 'Node: h-fsn-ded3, Resource: pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93 preparing for deletion.'; Details: 'Node: h-fsn-ded3, Resource: pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93 UUID is: 6a1d14e7-658d-4a89-9d3d-b4810f0d40cb' next error: Message: 'No connection to satellite 'h-fsn-ded3'' next error: Message: 'Preparing deletion of resource on 'h-fsn-ded2'' next error: Message: 'Preparing deletion of resource on 'h-fsn-ded5'' next error: Message: 'Deletion of resource 'pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93' on node 'h-fsn-ded3' failed due to an unknown exception.'; Details: 'Node: h-fsn-ded3, Resource: pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93'; Reports: '[653FE239-00000-000018]'"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/Users/boedy/projects/go/me/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/Users/boedy/projects/go/me/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/Users/boedy/projects/go/me/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2023-10-30T18:19:25+01:00 INFO Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler {"controller": "linstorsatellite", "controllerGroup": "piraeus.io", "controllerKind": "LinstorSatellite", "LinstorSatellite": {"name":"h-fsn-ded3"}, "namespace": "", "name": "h-fsn-ded3", "reconcileID": "201d80fc-f00b-4a8d-bf52-6e77d07f4c20"}

When draining the node the satellite pod also gets evicted. As the node is tainted it won't get re-spawned, in turn not allowing the operator to gracefully remove the satellite. How should this be resolved?

Lastly as AutoplaceTarget is always reset in the undo method. Evacuating multiple nodes isn't a pleasant experience.

WanzenBug · 2023-10-31T09:41:22Z

I think the satellite needs to tolerate it's own taint.

WanzenBug · 2024-02-13T11:23:59Z

Ok, sorry for the long delay. I played around with this PR a bit more, but I wasn't really happy with the general workflow.

There is some improvement coming on the LINSTOR front, namely LINSTOR will no longer remove the resource on the evacuating node while it is still in use. What that means is that we can simply use Evacuate() instead of having to move volumes around manually. At least once the new feature is released.

I do plan to revisit this PR when that feature is available, and also work on making the behaviour configurable. Some users had trouble with the automatic evacuation when removing a node from k8s, so we should take a look at these processes in general. I envision two new settings

nodeDrainPolicy, triggered when unschedulable is set on the node.
- Ignore: just ignore it, satellite stays online
- NoSchedule: do not place new resources, but keep existing
- Evacuate: trigger node evacuation (what this PR tries to do)
nodeDeletePolicy, triggered when the matching k8s node gets deleted
- Orphan: just delete the satellite resource, but leave the registered node in LINSTOR alone
- Evict: try to evacuate/evict the node, placing the resources on other nodes.

boedy · 2024-04-01T20:47:36Z

Hey @WanzenBug, Thanks for the update. I think the two settings make sense.

There is some improvement coming on the LINSTOR front, namely LINSTOR will no longer remove the resource on the evacuating node while it is still in use. What that means is that we can simply use Evacuate() instead of having to move volumes around manually. At least once the new feature is released.

Regarding above: How would volumes be handled which only have 1 replica? Would they still be migrated?

WanzenBug · 2024-04-02T06:34:13Z

Regarding above: How would volumes be handled which only have 1 replica? Would they still be migrated?

Yes. When the evacuation is triggered, such a resource would have 2 replicas:

on the node to evacuate
on the replacement node
As long as the resource is Primary on the node to evacuate, it will not be deleted.

In the meantime the affinity controller should have updated the PV, k8s should have evicted the Pod, so the resource can become secondary, at which point it is removed by LINSTOR and you are again left with just one replica.

boedy force-pushed the feat-safe-drain branch 3 times, most recently from 516d51a to 9f16929 Compare September 20, 2023 07:47

WanzenBug reviewed Sep 25, 2023

View reviewed changes

WanzenBug added enhancement New feature or request v2 This affects only Operator v2 labels Sep 25, 2023

boedy added 2 commits October 28, 2023 18:08

Fix make process on Mac

c49f0fb

Signed-off-by: boedy <boudewijng@gmail.com>

Feat: Adds support for linstor.linbit.com/prepare-for-removal node …

c05dff0

…annotation which triggers the operator to autoplace an additional replica to maintain data availability when the node is taken down. Signed-off-by: boedy <boudewijng@gmail.com>

boedy force-pushed the feat-safe-drain branch from 9f16929 to 6873870 Compare October 28, 2023 16:11

boedy changed the title ~~Introducing linstor.linbit.com/prepare-for-removal Annotation for Node Draining Support~~ Introducing piraeus.io/evacuate Taint for Node Draining Support Oct 28, 2023

boedy force-pushed the feat-safe-drain branch from 6873870 to f500bc0 Compare October 28, 2023 17:07

Rework Node Evacuation

670dc7d

Signed-off-by: boedy <boudewijng@gmail.com>

boedy force-pushed the feat-safe-drain branch from f500bc0 to 670dc7d Compare October 28, 2023 17:10

WanzenBug reviewed Oct 30, 2023

View reviewed changes

boedy force-pushed the feat-safe-drain branch from 06a3a00 to b870d94 Compare October 30, 2023 21:33

BugFixes:

2261105

- findMatchingEvacuatedResource should only consider Diskfull resources - Nil pointer check: can be nil when node is offline

boedy force-pushed the feat-safe-drain branch from b870d94 to 2261105 Compare October 30, 2023 22:38

WanzenBug mentioned this pull request Jan 31, 2024

Release 2.4 #597

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introducing `piraeus.io/evacuate` Taint for Node Draining Support #530

Introducing `piraeus.io/evacuate` Taint for Node Draining Support #530

boedy commented Sep 20, 2023

WanzenBug left a comment

WanzenBug commented Oct 11, 2023

boedy commented Oct 13, 2023 •

edited

Loading

boedy commented Oct 28, 2023

WanzenBug left a comment

boedy commented Oct 31, 2023

WanzenBug commented Oct 31, 2023

WanzenBug commented Feb 13, 2024

boedy commented Apr 1, 2024

WanzenBug commented Apr 2, 2024

Introducing piraeus.io/evacuate Taint for Node Draining Support #530

Are you sure you want to change the base?

Introducing piraeus.io/evacuate Taint for Node Draining Support #530

Conversation

boedy commented Sep 20, 2023

Summary:

Required Components:

Changes:

Additional Information:

WanzenBug left a comment

Choose a reason for hiding this comment

WanzenBug commented Oct 11, 2023

boedy commented Oct 13, 2023 • edited Loading

boedy commented Oct 28, 2023

WanzenBug left a comment

Choose a reason for hiding this comment

boedy commented Oct 31, 2023

WanzenBug commented Oct 31, 2023

WanzenBug commented Feb 13, 2024

boedy commented Apr 1, 2024

WanzenBug commented Apr 2, 2024

Introducing `piraeus.io/evacuate` Taint for Node Draining Support #530

Introducing `piraeus.io/evacuate` Taint for Node Draining Support #530

boedy commented Oct 13, 2023 •

edited

Loading