Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: removing nic after vm creation fails #68

Merged
merged 16 commits into from
Jan 10, 2024
Merged

Conversation

Bryce-Soghigian
Copy link
Collaborator

@Bryce-Soghigian Bryce-Soghigian commented Dec 15, 2023

Fixes #67

Description
When we issue a VM.Create call, and the vm fails to create (happens frequently at the beginning of karpenter when its trying to find unavailable offerings), we don't delete the Network Interface.

We also do not delete Nic on the InstanceProvider.Delete() call. We should also be attempting to delete nics on the Delete of the Nodeclaim. Since we will get into a scenario where we return ICE errors, then the launch reconciler will delete the nodeclaim. In the case of an ICE Error for TotalRegionalCoresQuota, we will only create the nic for that nodeclaim. Then if we delete that nodeclaim, our provider will only attempt to delete the VM.

This PR adds a new cleanup function p.cleanupAzureResources that we will use to delete all associated azure resources.

How was this change tested?

  • make az-all, paired with a scale up that hit quota errors.

TODO

  • Explore async deletion
    Does this change impact docs?
  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

fix: removing nic after vm creation fails

Copy link
Collaborator Author

@Bryce-Soghigian Bryce-Soghigian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/test

@Bryce-Soghigian Bryce-Soghigian marked this pull request as ready for review December 29, 2023 03:01
Copy link
Contributor

@rakechill rakechill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks really good and I like your approach here. Just have a few questions about your unit tests.

Also wondering if you've been able to test this in a cluster where you've previously seen NICs hanging around to demonstrate that this change is effective at preventing that.

pkg/providers/instancetype/suite_test.go Outdated Show resolved Hide resolved
pkg/providers/instancetype/suite_test.go Show resolved Hide resolved
@Bryce-Soghigian
Copy link
Collaborator Author

Bryce-Soghigian commented Dec 29, 2023

@rakechill i did a scale to 750 nodes, and a scale down, and nics were mostly cleaned up. We do see 4 nics that were not and all of them failed due to the same error NicReservedForAnotherVm. I have a doc i am working on for this. Its a bit complex because we need to consider retries, non blocking deletion/creation etc.

So wanted to cut in this pr first to get us out of the 100s of remaining ghost nics

@Bryce-Soghigian
Copy link
Collaborator Author

Did another run, scaled to 300 or so then deleted all the nodes, we are left with one nic that remains due to NicReservedForAnotherVm

@Bryce-Soghigian Bryce-Soghigian merged commit f9f9498 into main Jan 10, 2024
8 checks passed
@Bryce-Soghigian Bryce-Soghigian deleted the bsoghigian/nic-fix branch January 10, 2024 22:28
@@ -25,6 +25,8 @@ import (
"github.com/Azure/karpenter/pkg/auth"
)

var MaxRetries = 20
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere?

@tallaxes tallaxes added kind/bug Categorizes issue or PR as related to a bug. area/networking Issues or PRs related to networking labels Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/networking Issues or PRs related to networking kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

if NodeClaims are not satisfied, Karpenter still create the NIC and does no delete it
3 participants