-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-15255: Add terminated as a handled state so terminated instances don't get stuck #83
base: main
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@racheljpg: This pull request references Jira Issue OCPBUGS-15255, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any unit tests that we can be updating to prove out this behaviour?
pkg/actuators/machine/reconciler.go
Outdated
if len(existingInstances) == 0 || r.checkIfInstanceTerminated(existingInstances) { | ||
if r.machine.Spec.ProviderID != nil && *r.machine.Spec.ProviderID != "" && len(r.machine.Status.Addresses) == 0 && (r.machine.Status.LastUpdated == nil || r.machine.Status.LastUpdated.Add(requeueAfterSeconds*time.Second).After(time.Now())) { | ||
klog.Infof("%s: Possible eventual-consistency discrepancy; returning an error to requeue", r.machine.Name) | ||
return false, &machinecontroller.RequeueAfterError{RequeueAfter: requeueAfterSeconds * time.Second} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be what we want. I think there's probably a case where the instance is terminated, but the provider ID is set, without the addresses in place, which makes me think we can probably hit this eventual consistency issue.
Given the last updated check this is probably ok since we wait circa 20 seconds before giving up, but, I'd like to see this tested if we can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review Joel, from our testing we could see that it works how we hope. Will look into unit tests next
26dfed8
to
00d0049
Compare
Hello @huali9, you are down as the QA contact for this bug :) Would you mind running this through the premerge tests? Thank you. |
/retest |
1 similar comment
/retest |
Hi @racheljpg I tried to premerge test this today.
3.Scale the machineset to replicas=5, I saw that the four new machines all Failed.
4.Scale the machineset to replicas=15, I saw that some new machines goto Failed, some new machines stuck in Provisioning. I checked on AWS console, all the Failed and Provisioning machines shows Terminated with this message "Client.VolumeLimitExceeded: Volume limit exceeded. You have exceeded the maximum gp2 storage limit of 30720 GiB in this location for your account. Please contact AWS Support for more information."
Then I tried to premerge test this. Still some machines goto Failed and some machines stuck in Provisioning, which is the same as the result without this PR. So seems the PR doesn't work!
4.Scale the machineset to replicas=10, I saw that some new machines goto Failed, some new machines stuck in Provisioning. I checked on AWS console, all the Failed and Provisioning machines shows Terminated with this message "Client.VolumeLimitExceeded: Volume limit exceeded. You have exceeded the maximum gp2 storage limit of 30720 GiB in this location for your account. Please contact AWS Support for more information."
Screenshot on AWS console: https://drive.google.com/file/d/1cQktyGOCMRof14VtU59a2-UNxpkOiCx2/view?usp=sharing |
2a7e423
to
f571a1c
Compare
/retest |
f571a1c
to
b87739f
Compare
/retest |
2 similar comments
/retest |
/retest |
@racheljpg: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Hello @huali9, thank you very much for running the pre-merge tests. I'm afraid in between the time I pinged you and the tests, I had broken the code a bit, so if you wouldn't mind running them again, I'm hoping they should pass now? I know in the original bug was mentioned as something that happened on a local zone cluster also, so maybe this could be why in my testing it seemed to have fixed the issue but in your tests it hadn't. If you wouldn't mind running them again, though, to test my new changes, that would be great. Thank you! |
Hi @racheljpg I tried to premerge test this today, but get the same result. Some new machines goto Failed, some new machines stuck in Provisioning.
3.Copy a default local zone machineset, and change volumeSize to 16384, then create the new machineset. The new machine goto Provisioned.
4.Scale the machineset to replicas=10, I saw that some new machines goto Failed, some new machines stuck in Provisioning. I checked on AWS console, all the Failed and Provisioning machines shows Terminated with this message "Client.VolumeLimitExceeded: Volume limit exceeded. You have exceeded the maximum gp2 storage limit of 30720 GiB in this location for your account. Please contact AWS Support for more information."
Screenshot on AWS console: https://drive.google.com/file/d/12NnwnuUq2Mwy8Jw9tctXc6tJBCT2ZJX9/view?usp=sharing |
Thank you, @huali9. I will look into it further |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
Hello, adding a comment to remove this stale label as it's something I still want to get back to - but further investigation has to be done on why this problem exists on local zone clusters specifically. Thanks! |
/remove-lifecycle rotten |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
/remove-lifecycle rotten |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
/remove-lifecycle stale |
/jira refresh |
@racheljpg: This pull request references Jira Issue OCPBUGS-15255, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/jira refresh |
@racheljpg: This pull request references Jira Issue OCPBUGS-15255, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
This is a PR to add terminated as a handled state, so if AWS terminates instances before they are ready the instance should no longer get stuck. Creating it as a draft PR to get input/test my changes.