🌱 E2E: Increase timeouts to stabilize CI #1724

Merged
metal3-io-bot merged 3 commits into metal3-io:main from lentzi90/e2e-increase-timeout
May 15, 2024

Conversation

lentzi90
Member

What this PR does / why we need it:

We are changing CI infrastructure. An unfortunate side-effect is that some steps are taking longer than they used to. Because of this we now see some timeouts.
This commit is for increasing the relevant timeouts to make the CI stable again.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@metal3-io-bot metal3-io-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 10, 2024
@lentzi90
Member Author

/test metal3-bmo-e2e-test-pull

@lentzi90
Member Author

/test metal3-bmo-e2e-test-optional-pull

@lentzi90
Member Author

We are hitting the timeout set in the pipeline file. See metal3-io/project-infra#746

Member

@Sunnatillo Sunnatillo left a comment

LGTM

@Sunnatillo
Member

/cc @kashifest

@metal3-io-bot metal3-io-bot requested a review from kashifest May 10, 2024 11:06
@lentzi90
Member Author

metal3-io/project-infra#746 is merged. Let's see if it goes better now.
I think we may still need to adjust more things, so I'm putting this on hold to avoid merging prematurely.
/hold
/test metal3-bmo-e2e-test-optional-pull

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 10, 2024
@Sunnatillo
Member

/test metal3-bmo-e2e-test-optional-pull

@lentzi90
Member Author

Git credential ID fixed in metal3-io/project-infra#749
Let's see if it helps with the checkout
/test metal3-bmo-e2e-test-optional-pull

@lentzi90
Member Author

/test metal3-bmo-e2e-test-optional-pull

@lentzi90
Member Author

[2024-05-13T11:00:23.401Z] Warning: Detected changes to resource baremetal-operator-system which is currently being deleted.
[2024-05-13T11:00:23.401Z] Error from server (Forbidden): error when creating "STDIN": deployments.apps "ironic" is forbidden: unable to create new content in namespace baremetal-operator-system because it is being terminated

I think we should make use of WaitForNamespaceDeleted in our defer cleanups. I'll try to add that tomorrow if I have the time.

@lentzi90
Member Author

/test metal3-bmo-e2e-test-optional-pull

@@ -413,5 +413,10 @@ var _ = Describe("Upgrade", Label("optional", "upgrade"), func() {

	AfterEach(func() {
		cleanup(ctx, upgradeClusterProxy, namespace, cancelWatches, e2eConfig.GetIntervals("default", "wait-namespace-deleted")...)
		// The BMO/Ironic namespace is deleted after each test, but we need to ensure that it actually gets deleted.
		WaitForNamespaceDeleted(ctx, WaitForNamespaceDeletedInput{
Member

@mquhuy mquhuy May 14, 2024

I don't think this is going to work. Ironic and BMO are deleted in DeferCleanup(), and those callbacks run after AfterEach. We should put this WaitForNamespaceDeleted in another DeferCleanup(), registered right after the namespace is created.

Member

Or maybe even before.

Member Author

Ugh this is getting too ugly. The problem is that the namespace is part of both the ironic and bmo kustomizations. It will be deleted when we delete either of them.

Member

Shouldn't it still be there unless all of them are deleted?

Anyway, IMO it's not really an issue, if the namespace has been deleted already when the DeferCleanup() is called, then the WaitForNamespaceDeleted() function should just succeed, right?

Member Author

Yeah it's not an issue in that regard, it is just very ugly that we create and delete the namespace in both these.
The namespace definitely is deleted with both BMO and Ironic since both the ironic and BMO overlays include the namespace. If you check the logs you will also see tons of errors about "no such object" for all of those that have already been deleted.
In practice, we always delete BMO when we delete Ironic and vice versa. It is not great and there is a chance that this can cause issues. For example, deleting BMO before all BMHs are gone will make them stuck.

Member

@mquhuy mquhuy May 14, 2024

Hmm. I don't think the namespace would really be cleaned when we delete BMO, if ironic still lives in it, but I'm not 100% sure.
Anyway, do you have a suggested solution here? Maybe we should just remove the namespace from both the overlays, and explicitly create the namespace before installation?

> In practice, we always delete BMO when we delete Ironic and vice versa. It is not great and there is a chance that this can cause issues. For example, deleting BMO before all BMHs are gone will make them stuck.

That's a valid concern, but I don't think either BMO or Ironic deletion will ever happen after BMH cleanup. Both deletions are in DeferCleanup, which should happen after the cleanup phase in AfterEach().

Member Author

I am 100 % sure that the namespace is deleted when we delete BMO. It also deletes Ironic. The equivalent of kubectl delete namespace baremetal-operator-system is executed when we delete BMO, which will delete everything in the namespace first and then the namespace itself.

> Anyway, do you have a suggested solution here? Maybe we should just remove the namespace from both the overlays, and explicitly create the namespace before installation?

There are a few options:

  1. always bundle BMO and Ironic so they are always installed as one unit (which is impractical)
  2. give them separate namespaces
  3. be more fine grained so we leave the namespace alone (this could include an operator that takes care of all operations)
  4. handle the namespace separately, a bit annoying and would require changes in multiple places

> That's a valid concern, but I don't think either BMO or Ironic deletion will ever happen after BMH cleanup.

That may be true now and for these tests but not in general. We could easily have a test where we try deleting and re-creating Ironic in the middle of a test to see that it can recover. That would easily break if BMO was also deleted. (If CRDs are deleted then the BMHs also go.)

Member

I see. So you are mentioning this issue for general use as well, not just in BMO e2e tests? If that's the case, I have no idea, every option seems to have some side effect.

@lentzi90 lentzi90 force-pushed the lentzi90/e2e-increase-timeout branch from bbd366d to 03aae3c Compare May 14, 2024 07:10
@metal3-io-bot metal3-io-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 14, 2024
@lentzi90
Member Author

/test metal3-bmo-e2e-test-optional-pull

lentzi90 added 2 commits May 14, 2024 12:35
We are changing CI infrastructure. An unfortunate side-effect is that
some steps are taking longer than they used to. Because of this we now
see some timeouts.
This commit is for increasing the relevant timeouts to make the CI
stable again.

Signed-off-by: Lennart Jern <lennart.jern@est.tech>
Signed-off-by: Lennart Jern <lennart.jern@est.tech>
@lentzi90 lentzi90 force-pushed the lentzi90/e2e-increase-timeout branch from 03aae3c to 975bdc3 Compare May 14, 2024 09:36
@metal3-io-bot metal3-io-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 14, 2024
@lentzi90
Member Author

We are hitting a timeout sometimes when fetching content through kustomize. The default is set here and it can be overridden using query parameters.
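
As a rough sketch of what such an override could look like: to my knowledge kustomize reads a `timeout` query parameter on remote targets, next to `ref`. The helper name withTimeout, the repository path, and the 120s value below are illustrative assumptions, not what this PR uses.

```go
package main

import (
	"fmt"
	"net/url"
)

// withTimeout returns rawURL with the "timeout" query parameter set to the
// given value, keeping any parameters (such as ref) that are already present.
func withTimeout(rawURL, timeout string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", fmt.Errorf("parsing %q: %w", rawURL, err)
	}
	q := u.Query()
	q.Set("timeout", timeout)
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	// Illustrative remote target; the path is an assumption for the example.
	target := "https://github.com/metal3-io/baremetal-operator/config/namespaced?ref=release-0.6"
	out, err := withTimeout(target, "120s")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

A longer timeout only helps when the fetch is slow; it does nothing for connections that fail outright.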

@lentzi90
Member Author

/test metal3-bmo-e2e-test-optional-pull

@lentzi90
Member Author

accumulating components: loader.New "failed to run '/usr/bin/git fetch --depth=1 https://github.com/metal3-io/baremetal-operator release-0.6': error: RPC failed; curl 56 GnuTLS recv error (-9): Error decoding the received TLS packet.\nerror: 5987 bytes of body are still expected\nfetch-pack: unexpected disconnect while reading sideband packet\nfatal: early EOF\nfatal: fetch-pack: invalid index-pack output\n: exit status 128"

🙁
Not much to do about that. We can try retries or fixing the infra...

This adds a function for retrying function calls that are flaky and
makes use of it for kustomizations.

Signed-off-by: Lennart Jern <lennart.jern@est.tech>
@metal3-io-bot metal3-io-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 14, 2024
@lentzi90
Member Author

/test metal3-bmo-e2e-test-optional-pull

@Sunnatillo
Member

/test metal3-bmo-e2e-test-optional-pull

1 similar comment
@lentzi90
Member Author

/test metal3-bmo-e2e-test-optional-pull

@lentzi90
Member Author

OK, we have at least one successful job here, which is better than the periodic jobs that don't have these changes.
What do you think? Should we merge?

@lentzi90
Member Author

/test metal3-bmo-e2e-test-pull

Member

@tuminoid tuminoid left a comment

It is sad to have to put this in, but LGTM to get it stabilized in the new env. I don't think this will fix it completely, since the worst cases of performance degradation are so massive, but if it fixes, say, 80% of them, it's a win.

Let's also make a flake ticket so that we eventually remember to remove these.

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label May 15, 2024
@kashifest
Member

/approve

@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kashifest

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2024
@lentzi90
Member Author

/hold cancel
This does not solve everything but it does increase the chances of successful jobs.

@metal3-io-bot metal3-io-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 15, 2024
@metal3-io-bot metal3-io-bot merged commit 2eb3da0 into metal3-io:main May 15, 2024
17 checks passed
@metal3-io-bot metal3-io-bot deleted the lentzi90/e2e-increase-timeout branch May 15, 2024 12:09
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
6 participants