fix: VolumeAttachments WIP #1

Closed
wants to merge 3 commits into from

@AndrewSirenko AndrewSirenko commented May 30, 2024

Fixes #N/A

Description

For a stateful pod to migrate smoothly from a terminating node to a new node, the following must happen in order:

  1. Consolidation event starts
  2. Stateful pods must terminate
  3. EBS CSI Node pod must unmount all filesystems (NodeUnpublish & NodeUnstage RPCs)
  4. EBS CSI Controller pod must detach all volumes from instance
  5. Karpenter terminates EC2 Instance
  6. Karpenter ensures Node object deleted from Kubernetes

Problems:
A. If step 2 doesn't happen, stateful pod migration is currently delayed by 6+ minutes, because Kubernetes cannot be sure the volume is no longer attached and mounted to the instance.
B. If step 3 doesn't happen, the new stateful pod can't start until the consolidated instance is terminated, which auto-detaches its volumes (1+ minute delay).

Solutions:

  1. [Scope Medium] We can increase the likelihood of solving both A and B by having Karpenter wait for the node's VolumeAttachment objects to disappear between steps 3 and 4.
  2. [Scope Small] We can fully solve A (the 6-minute, soon-to-be-infinite delay) by applying the node.kubernetes.io/out-of-service=nodeshutdown:NoExecute taint to the node between steps 4 and 5.
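
For illustration only, solution 2 might look roughly like the sketch below. It is not the code in this PR: the `Taint` struct is a local stand-in that mirrors the `Key`, `Value`, and `Effect` fields of the Kubernetes core/v1 `Taint` type, and `withTaint` and `example.com/disrupting` are hypothetical names chosen for this example.

```go
package main

import "fmt"

// Taint is a local stand-in for the Kubernetes core/v1 Taint type.
// Only the fields needed for this sketch are modeled.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// outOfServiceTaint is the taint named in solution 2.
var outOfServiceTaint = Taint{
	Key:    "node.kubernetes.io/out-of-service",
	Value:  "nodeshutdown",
	Effect: "NoExecute",
}

// String renders the taint in the key=value:effect form used by kubectl.
func (t Taint) String() string {
	if t.Value == "" {
		return fmt.Sprintf("%s:%s", t.Key, t.Effect)
	}
	return fmt.Sprintf("%s=%s:%s", t.Key, t.Value, t.Effect)
}

// withTaint (hypothetical helper) returns taints with t appended, unless a
// taint with the same key and effect is already present.
func withTaint(taints []Taint, t Taint) []Taint {
	for _, existing := range taints {
		if existing.Key == t.Key && existing.Effect == t.Effect {
			return taints
		}
	}
	out := make([]Taint, 0, len(taints)+1)
	out = append(out, taints...)
	return append(out, t)
}

func main() {
	// A node that already carries some unrelated (made-up) taint.
	taints := []Taint{{Key: "example.com/disrupting", Effect: "NoSchedule"}}

	taints = withTaint(taints, outOfServiceTaint)
	taints = withTaint(taints, outOfServiceTaint) // second call is a no-op

	for _, t := range taints {
		fmt.Println(t)
	}
}
```

Keeping the helper idempotent matters because a termination reconcile loop may run many times for the same node before the instance is gone.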

How was this change tested?

Manual testing (WIP)

The following rules also need to be added to clusterrole.yaml in karpenter-provider-aws:

  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments"]
    verbs: ["get", "list", "watch"]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


coveralls commented Jun 4, 2024

Pull Request Test Coverage Report for Build 9370530437

Details

  • 49 of 67 (73.13%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.04%) to 77.922%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controllers/node/termination/terminator/terminator.go | 18 | 20 | 90.0% |
| pkg/operator/operator.go | 0 | 3 | 0.0% |
| pkg/utils/node/node.go | 7 | 12 | 58.33% |
| pkg/controllers/node/termination/controller.go | 15 | 23 | 65.22% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 9369893853: | -0.04% |
| Covered Lines: | 8319 |
| Relevant Lines: | 10676 |

💛 - Coveralls
