docs: RFC for disrupted EBS-Backed StatefulSet delays #6336

Open
wants to merge 6 commits into base: main

Conversation

AndrewSirenko (Member)

Fixes #N/A

Description
Create RFC for disrupted EBS-Backed StatefulSet delays

How was this change tested?
N/A

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AndrewSirenko requested a review from a team as a code owner on June 7, 2024 at 22:27

netlify bot commented Jun 7, 2024

Deploy Preview for karpenter-docs-prod canceled.

  • Latest commit: 95c2945
  • Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/667b4d1fc653f40008a99ca7

@coveralls commented Jun 7, 2024

Pull Request Test Coverage Report for Build 9423842569

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 82.474%

Totals:
  • Change from base Build 9422632438: 0.0%
  • Covered Lines: 5553
  • Relevant Lines: 6733

💛 - Coveralls

@AndrewSirenko changed the title from "docs: Create RFC for disrupted EBS-Backed StatefulSet delays" to "docs: RFC for disrupted EBS-Backed StatefulSet delays" on Jun 7, 2024
@AndrewSirenko (Member Author)

Thank you @FernandoMiguel for the edits!

@coveralls commented Jun 12, 2024

Pull Request Test Coverage Report for Build 9488442511

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 82.467%

Files with Coverage Reduction:
  • pkg/providers/amifamily/ami.go: 1 new missed line (90.56%)

Totals:
  • Change from base Build 9488317383: -0.01%
  • Covered Lines: 5550
  • Relevant Lines: 6730

💛 - Coveralls

@coveralls commented Jun 12, 2024

Pull Request Test Coverage Report for Build 9488799347

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 82.467%

Files with Coverage Reduction:
  • pkg/providers/amifamily/ami.go: 1 new missed line (90.56%)

Totals:
  • Change from base Build 9488317383: -0.01%
  • Covered Lines: 5550
  • Relevant Lines: 6730

💛 - Coveralls

@coveralls commented Jun 12, 2024

Pull Request Test Coverage Report for Build 9489219507

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.02%) to 82.496%

Files with Coverage Reduction:
  • pkg/providers/amifamily/ami.go: 1 new missed line (90.56%)

Totals:
  • Change from base Build 9488317383: 0.02%
  • Covered Lines: 5552
  • Relevant Lines: 6730

💛 - Coveralls

@coveralls commented Jun 13, 2024

Pull Request Test Coverage Report for Build 9500606241

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.02%) to 82.496%

Files with Coverage Reduction:
  • pkg/providers/amifamily/ami.go: 1 new missed line (90.56%)

Totals:
  • Change from base Build 9488317383: 0.02%
  • Covered Lines: 5552
  • Relevant Lines: 6730

💛 - Coveralls

@coveralls commented Jun 18, 2024

Pull Request Test Coverage Report for Build 9572946601

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 82.481%

Totals:
  • Change from base Build 9488317383: 0.0%
  • Covered Lines: 5551
  • Relevant Lines: 6730

💛 - Coveralls

@coveralls commented Jun 18, 2024

Pull Request Test Coverage Report for Build 9573214398

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 82.467%

Files with Coverage Reduction:
  • pkg/providers/amifamily/ami.go: 1 new missed line (90.56%)

Totals:
  • Change from base Build 9569831660: -0.01%
  • Covered Lines: 5550
  • Relevant Lines: 6730

💛 - Coveralls


**Problem B. If step 3 doesn't happen before step 4, there will be a 1+ minute delay**

If Karpenter calls EC2 TerminateInstance **before** the EC2 DetachVolume calls from the EBS CSI Driver controller pod finish, the volumes won't be detached **until the old instance terminates**. This delay depends on how long the underlying instance takes to enter the `terminated` state, which varies by instance type: typically about 1 minute for `m5a.large`, and up to 15 minutes for certain Metal/GPU/Windows instances. See [appendix D1](#d1-ec2-termination--ec2-detachvolume-relationship-) for more context.
AndrewSirenko (Member Author):

Mention "From my brief experiments, typically 1 minute for m5a.large..."

@cartermckinnon (Member) commented Jun 19, 2024

This variability is really unfortunate. I wonder if you see more consistent behavior with ec2:StopInstances? I assume you can detach a volume once an instance is stopped; if so, stopping prior to termination would at least put a more reasonable upper bound on the delay in this case.

edit: I see in the appendix:

The guest OS can take a long time to complete shutting down.

But that sounds like it's worth digging into; I'm not sure how or why the OS shutdown sequence could take 15 minutes.

@AndrewSirenko (Member Author) commented Jun 25, 2024

Thanks for this @cartermckinnon, two notes for you.

I assume you can detach a volume when an instance is stopped? If so, maybe stopping prior to termination would at least put a more reasonable upper bound on the delay in this case

  1. Great idea on ec2:StopInstances, but I'm not sure the extra step is worth it, considering that one can only detach a volume after the instance reaches the stopped state (which would take an extra ~8 seconds). I measured the difference between stopping and terminating durations for a couple of different instance types here. It looks like stopping is 10-40 seconds faster, depending on the instance type or EC2 quirks.

  2. My original document was misleading about GPU/Windows instance termination times. Most GPU/Windows instance types are closer to Linux termination times than metal instances are. I have added an instance termination timings section where I tested a few instance types.

The guest OS can take a long time to complete shutting down.

I can try digging into this with the folks at EC2.
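
For reference on the timing question above, here is a minimal Go sketch (standalone, not part of Karpenter; the instance ID and the 20-minute waiter ceiling are placeholder assumptions) that measures how long an instance takes to reach the `terminated` state after `TerminateInstances`, using the AWS SDK for Go v2:

```go
// Hypothetical standalone helper (not part of Karpenter): measure how long an
// instance takes to reach the "terminated" state after TerminateInstances.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)
	instanceID := "i-0123456789abcdef0" // placeholder instance ID

	start := time.Now()
	if _, err := client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
		InstanceIds: []string{instanceID},
	}); err != nil {
		log.Fatal(err)
	}

	// Poll DescribeInstances until the instance reports the "terminated" state.
	waiter := ec2.NewInstanceTerminatedWaiter(client)
	if err := waiter.Wait(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{instanceID},
	}, 20*time.Minute); err != nil { // generous ceiling for metal instances
		log.Fatal(err)
	}
	fmt.Printf("%s reached terminated state after %s\n", instanceID, time.Since(start).Round(time.Second))
}
```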

- Configure Kubelet for Graceful Node Shutdown
- Enable Karpenter Spot Instance interruption handling
- Use EBS CSI Driver ≥ `v1.x` in order to use the [PreStop Lifecycle Hook](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#what-is-the-prestop-lifecycle-hook)
- Set `.node.tolerateAllTaints=false` when deploying the EBS CSI Driver
Contributor:

If graceful node shutdown is configured, wouldn't it be fine for the driver to tolerate Karpenter's disruption taint?

AndrewSirenko (Member Author):

For 6+ min delays: In theory yes, in practice I have needed to turn tolerateAllTaints off in order to not see 6+min delays. We can trade notes on this.


See [a proof-of-concept implementation of A2 & B1 in PR #1294](https://github.com/kubernetes-sigs/karpenter/pull/1294).

Finally, we should add the following EBS x Karpenter end-to-end test in karpenter-provider-aws to catch regressions between releases of Karpenter or EBS CSI Driver:
Contributor:

Nice! Really like that we are scoping in testing to this RFC! This is definitely something that we should be actively monitoring!
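
As a rough illustration of the signal such an end-to-end test would assert on (this is not the test proposed in the RFC), the sketch below uses client-go to time how long a disrupted StatefulSet pod takes to come back Running and Ready; the package, helper name, label selector argument, and poll interval are assumptions:

```go
// Hypothetical e2e helper: time how long a disrupted StatefulSet pod takes to
// come back Running and Ready. A real test would create the StatefulSet with an
// EBS-backed PVC, disrupt the node, then assert this duration stays under a threshold.
package e2e

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// WaitForStatefulPodReady polls until a pod matching selector is Running and Ready,
// returning how long the recovery took.
func WaitForStatefulPodReady(ctx context.Context, c kubernetes.Interface, namespace, selector string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	err := wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		pods, err := c.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		for _, pod := range pods.Items {
			if pod.Status.Phase != corev1.PodRunning {
				continue
			}
			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
					return true, nil // replacement pod is Ready again
				}
			}
		}
		return false, nil
	})
	if err != nil {
		return 0, fmt.Errorf("pod matching %q never became Ready: %w", selector, err)
	}
	return time.Since(start), nil
}
```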


This means that our sequence of events will match the ideal diagram from the section [Ideal Graceful Shutdown for Stateful Workloads](#ideal-graceful-shutdown-for-stateful-workloads).

We can use similar logic to [today's proof-of-concept implementation](https://github.com/kubernetes-sigs/karpenter/pull/1294), but move it to karpenter-provider-aws and check `node.Status.VolumesInUse` instead of listing VolumeAttachment objects. A 20-second max wait was sufficient to prevent delays with the m5a instance type, but further testing is needed to ensure it is enough for Windows/GPU instance types.
Contributor:

Is there any kind of signal that the EBS CSI driver could give us to know that everything is detached before terminating? One option is that Karpenter does some work from its neutral code; another option is that the EBS CSI driver injects a finalizer and only removes it when it knows that it has finished the work it needs to do. This, of course, requires us to create a finalizer that we wait on before instance termination. There are similar-ish requirements from ALB, though, with waiting on instance termination before the target group is deregistered, and I'm wondering how much overlap we have between these two problems.
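
For concreteness, here is a minimal sketch of the `node.Status.VolumesInUse` check under discussion (hypothetical package and helper names; the 20-second budget is the value mentioned in the RFC and may need tuning for Windows/GPU instance types):

```go
// Hypothetical sketch: before terminating the instance, wait (bounded) until the
// kubelet reports no attachable volumes in use on the node.
package termination

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// maxVolumeDetachWait is the 20-second budget discussed in the RFC; Windows/GPU
// instance types may need a larger value.
const maxVolumeDetachWait = 20 * time.Second

func waitForVolumesUnused(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, maxVolumeDetachWait, true, func(ctx context.Context) (bool, error) {
		node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// VolumesInUse lists attachable volumes still mounted on the node; once it is
		// empty, the node-side unmount work is done and detach can proceed promptly.
		return len(node.Status.VolumesInUse) == 0, nil
	})
}
```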


Wait for volumes to detach before terminating the instance.

We can do this by waiting for all volumes of drain-able nodes to be marked as neither in use nor attached before terminating the node in `c.cloudProvider.Delete` (up to a maximum of 20 seconds). See [Appendix D3 for the implementation details of this wait](#d).
Contributor:

This is kind of weird IMO. This basically means that the Delete call is going to return an error until we are good to actually delete the instance. I wonder if there are other ways we could hook into the deletion flow without having to hack the cloudprovider Delete call so that it orchestrates the flow we're looking for.

Contributor:

The problem today is that we only call Delete once we get a success, which means you are either going to have to make this synchronous, or you are going to have to do some kind of weird erroring so that we back off until we actually do the termination.
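
A rough sketch of that "error until ready, then back off" pattern, using the proof-of-concept's approach of listing `VolumeAttachment` objects (hypothetical package and helper names, not the actual PoC code):

```go
// Hypothetical sketch of the back-off pattern: refuse to terminate while any
// VolumeAttachment still references the node, returning an error so the caller
// requeues and retries instead of calling EC2 TerminateInstances too early.
package cloudprovider

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func volumesStillAttached(ctx context.Context, c kubernetes.Interface, nodeName string) (bool, error) {
	attachments, err := c.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, va := range attachments.Items {
		if va.Spec.NodeName == nodeName {
			return true, nil // at least one volume is still attached to this node
		}
	}
	return false, nil
}

// ensureVolumesDetached is a gate before instance termination: a non-nil error
// signals the termination controller to back off and retry later.
func ensureVolumesDetached(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	attached, err := volumesStillAttached(ctx, c, nodeName)
	if err != nil {
		return err
	}
	if attached {
		return fmt.Errorf("volumes still attached to node %s; retrying before termination", nodeName)
	}
	return nil
}
```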

@cartermckinnon (Member)

xref: kubernetes/enhancements#4213

@coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9670727891

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 82.481%

Totals:
  • Change from base Build 9569831660: 0.0%
  • Covered Lines: 5551
  • Relevant Lines: 6730

💛 - Coveralls

@coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9670778619

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 82.481%

Totals:
  • Change from base Build 9569831660: 0.0%
  • Covered Lines: 5551
  • Relevant Lines: 6730

💛 - Coveralls

Contributor:

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@jmdeal (Contributor) commented Jul 16, 2024

/remove-lifecycle stale

Labels: none yet
Projects: none yet
7 participants