-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🌱 E2E: Increase timeouts to stabilize CI #1724
🌱 E2E: Increase timeouts to stabilize CI #1724
Conversation
/test metal3-bmo-e2e-test-pull |
/test metal3-bmo-e2e-test-optional-pull |
We are hitting the timeout set in the pipeline file. See metal3-io/project-infra#746 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/cc @kashifest |
metal3-io/project-infra#746 is merged. Let's see if it goes better now. |
/test metal3-bmo-e2e-test-optional-pull |
Git credential ID fixed in metal3-io/project-infra#749 |
/test metal3-bmo-e2e-test-optional-pull |
I think we should make use of WaitForNamespaceDeleted in our defer cleanups. I'll try to add that tomorrow if I have the time. |
/test metal3-bmo-e2e-test-optional-pull |
test/e2e/upgrade_test.go
Outdated
@@ -413,5 +413,10 @@ var _ = Describe("Upgrade", Label("optional", "upgrade"), func() { | |||
|
|||
AfterEach(func() { | |||
cleanup(ctx, upgradeClusterProxy, namespace, cancelWatches, e2eConfig.GetIntervals("default", "wait-namespace-deleted")...) | |||
// The BMO/Ironic namespace is deleted after each test, but we need to ensure that it actually gets deleted. | |||
WaitForNamespaceDeleted(ctx, WaitForNamespaceDeletedInput{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this's gonna work :-? Ironic and BMO are deleted in DeferCleanup()
, and those run after AfterEach
. We should put this WaitForNamespaceDeleted
in another DeferCleanup()
, which should be put right after where the namespace is created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe even before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh this is getting too ugly. The problem is that the namespace is part of both the ironic and bmo kustomizations. It will be deleted when we delete either of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it still be there unless all of them are deleted?
Anyway, IMO it's not really an issue, if the namespace has been deleted already when the DeferCleanup()
is called, then the WaitForNamespaceDeleted()
function should just succeed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's not an issue in that regard, it is just very ugly that we create and delete the namespace in both these.
The namespace definitely is deleted with both BMO and Ironic since both the ironic and BMO overlays include the namespace. If you check the logs you will also see tons of errors about "no such object" for all of those that have already been deleted.
In practice, we always delete BMO when we delete Ironic and vice versa. It is not great and there is a chance that this can cause issues. For example, deleting BMO before all BMHs are gone will make them stuck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I don't think the namespace would really be cleaned when we delete BMO, if ironic still lives in it, but I'm not 100% sure.
Anyway, do you have a suggested solution here? Maybe we should just remove the namespace from both the overlays, and explicitly create the namespace before installation?
In practice, we always delete BMO when we delete Ironic and vice versa. It is not great and there is a chance that this can cause issues. For example, deleting BMO before all BMHs are gone will make them stuck.
That's a valid concern, but I don't think either BMO or Ironic deletion will ever happen after BMH cleanup. Both of the deletion are in DeferCleanup, which should happen after the cleanup phase in AfterEach()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am 100 % sure that the namespace is deleted when we delete BMO. It also deletes Ironic. The equivalent of kubectl delete namespace baremetal-operator-system
is executed when we delete BMO, which will delete everything in the namespace first and then the namespace itself.
Anyway, do you have a suggested solution here? Maybe we should just remove the namespace from both the overlays, and explicitly create the namespace before installation?
There are a few options:
- always bundle BMO and Ironic so they are always installed as one unit (which is impractical)
- give them separate namespaces
- be more fine grained so we leave the namespace alone (this could include an operator that takes care of all operations)
- handle the namespace separately, a bit annoying and would require changes in multiple places
That's a valid concern, but I don't think either BMO or Ironic deletion will ever happen after BMH cleanup.
That may be true now and for these tests but not in general. We could easily have a test where we try deleting and re-creating Ironic in the middle of a test to see that it can recover. That would easily break if BMO was also deleted. (If CRDs are deleted then the BMHs also go.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. So you are mentioning this issue for general use as well, not just in BMO e2e tests? If that's the case, I have no idea, every option seems to have some side effect.
bbd366d
to
03aae3c
Compare
/test metal3-bmo-e2e-test-optional-pull |
We are changing CI infrastructure. An unfortunate side-effect is that some steps are taking longer than they used to. Because of this we now see some timeouts. This commit is for increasing the relevant timeouts to make the CI stable again. Signed-off-by: Lennart Jern <lennart.jern@est.tech>
Signed-off-by: Lennart Jern <lennart.jern@est.tech>
03aae3c
to
975bdc3
Compare
We are hitting a timeout sometimes when fetching content through kustomize. The default is set here and it can be overridden using query parameters. |
/test metal3-bmo-e2e-test-optional-pull |
🙁 |
This adds a function for retrying function calls that are flaky and makes use of it for kustomizations. Signed-off-by: Lennart Jern <lennart.jern@est.tech>
/test metal3-bmo-e2e-test-optional-pull |
/test metal3-bmo-e2e-test-optional-pull |
1 similar comment
/test metal3-bmo-e2e-test-optional-pull |
Ok we have at least one successful job here, which is better than the periodic jobs that doesn't have these changes. |
/test metal3-bmo-e2e-test-pull |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big SAD to have put such in, but LGTM to get it stabilize in the new env. I don't think this will fix it completely, since the worst cases of performance degradation are so massive, but if it fixes say 80% of them, its a win.
Let's also make a flake ticket that we eventually remember to remove these.
/lgtm
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kashifest The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
What this PR does / why we need it:
We are changing CI infrastructure. An unfortunate side-effect is that some steps are taking longer than they used to. Because of this we now see some timeouts.
This commit is for increasing the relevant timeouts to make the CI stable again.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #