test flakes tracker #579

Open
cgwalters opened this issue Jun 1, 2024 · 10 comments
Labels
area/ci Issues related to our own CI

Comments

@cgwalters (Collaborator) commented Jun 1, 2024

Parsing layer blob: Broken pipe

stderr: "\e[31mERROR\e[0m Switching: Pulling: Importing: Parsing layer blob sha256:4367367aae6325ce7351edb720e7e6929a7f369205b38fa88d140b7e3d0a274f: Broken pipe (os error 32)"

This one is my nemesis! I have a tracker for it over at coreos/rpm-ostree#4567 too.

@henrywang (Contributor)

But anyway, I think the larger problem pointed out by the AWS error message is that the script hardcodes a security group in a specific AZ, when it could really be targeting any AZ, right?

There's only one zone we can use because RHEL needs internal network access to install podman to run the bootc install command. IT only configured one subnet in one zone.

We already fetch the available zone for the non-RHEL tests: https://gitlab.com/fedora/bootc/tests/bootc-workflow-test/-/blob/2bebcdd18f4e0ff9639aff59e2fdfdfcec70f450/playbooks/deploy-aws.yaml#L55.
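
For illustration, a rough AWS CLI equivalent of that dynamic lookup would look something like this (a sketch only; VPC_ID is a placeholder, and the linked playbook does the real lookup in ansible):

```bash
# Sketch: discover a usable subnet/AZ at runtime instead of hardcoding one.
# VPC_ID is a placeholder, not a value from our setup.
subnet_id=$(aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=${VPC_ID}" "Name=state,Values=available" \
  --query 'Subnets[0].SubnetId' --output text)
az=$(aws ec2 describe-subnets --subnet-ids "${subnet_id}" \
  --query 'Subnets[0].AvailabilityZone' --output text)
echo "provisioning in ${az} via subnet ${subnet_id}"
```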

A few things on this. First, it seems like a lot of this script is basic "provision an EC2 instance" code that could probably be shared and live outside this repo? Maybe we could fetch this stuff from a container or a distinct repo?

Those are the things I'd like to talk with you about in Monday's QE sync meeting.

@cgwalters (Collaborator, Author)

There's only one zone we can use because RHEL needs internal network access to install podman to run the bootc install command. IT only configured one subnet in one zone.

OK, got it. Well...per the other discussion, what if we focused only on fedora:40 and centos:stream9 for PR testing by default, and did RHEL integration testing both post-merge (I'll get the -dev images re-spun up, which build relevant things from git main) and also as part of dist-git merges to https://gitlab.com/redhat/centos-stream/rpms/bootc/?

@henrywang (Contributor)

OK, got it. Well...per the other discussion, what if we focused only on fedora:40 and centos:stream9 for PR testing by default

I agree.

and did RHEL integration testing both post-merge (I'll get the -dev images re-spun up, which build relevant things from git main) and also as part of dist-git merges to https://gitlab.com/redhat/centos-stream/rpms/bootc/?

As you mentioned above, a rhel-bootc-dev repo can be added just like centos-bootc-dev, and the -dev image can be saved in a GitLab repo (repos under https://gitlab.com/redhat/rhel/bifrost should be private?). I can add a test job in this repo without adding test code, only running a pipeline with the https://gitlab.com/fedora/bootc/tests/bootc-workflow-test code. The -dev image can be built daily, and the tests will run daily as well.
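
A daily job like that could be as simple as kicking the test pipeline via the GitLab trigger API (hypothetical sketch; PROJECT_ID, TRIGGER_TOKEN and the TEST_OS variable are placeholders, not values from this thread):

```bash
# Hypothetical daily cron job: trigger the bootc-workflow-test pipeline
# against the freshly built -dev image. All values below are placeholders.
curl --request POST \
  --form "token=${TRIGGER_TOKEN}" \
  --form "ref=main" \
  --form "variables[TEST_OS]=rhel" \
  "https://gitlab.com/api/v4/projects/${PROJECT_ID}/trigger/pipeline"
```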

I'd suggest not adding testing in https://gitlab.com/redhat/centos-stream/rpms/bootc/, to avoid blocking releases. From my perspective, all tests should run before release, not at release.

@cgwalters added the area/ci label Jun 4, 2024
@henrywang (Contributor)

Recently, say in the last week, this error has shown up more often. Automation added a 3-retry workaround in the ansible playbook. Let's see what happens with the retries.
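
For illustration only, the pattern is roughly this as a plain shell loop (the actual workaround uses ansible retries/until, and the image reference here is a placeholder):

```bash
# Retry the flaky step up to 3 times before giving up.
# quay.io/example/bootc:latest is a placeholder image, not our real target.
for attempt in 1 2 3; do
    if bootc switch quay.io/example/bootc:latest; then
        break
    fi
    echo "attempt ${attempt} hit the broken-pipe flake; retrying" >&2
    sleep 10
done
```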

@cgwalters (Collaborator, Author)

In a different run, we somehow ended up with

Creating root filesystem (xfs) on device /dev/loop0p2 (size=512M)

Which seems related but different from the other one:

Creating root filesystem (xfs) on device /dev/loop0p1 (size=1M)

Actually, having it be 1M sometimes and 512M others looks very much like we're getting partitions swapped.
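
If anyone wants to catch this in the act, a debugging sketch would be to dump the partition layout right before mkfs (the loop device name and expected sizes are taken from the logs above; the "which partition should be which" reading is an assumption):

```bash
# Debugging sketch for the suspected p1/p2 swap: settle udev, then show
# the loop partition sizes before the filesystem gets created.
udevadm settle
lsblk --bytes --output NAME,SIZE,TYPE /dev/loop0
# Expectation (assumed from the logs): loop0p1 is the tiny ~1M partition
# and loop0p2 is the ~512M root target; reversed sizes would confirm a race.
```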

@henrywang (Contributor) commented Jul 3, 2024

The test has recently been hitting Installing to filesystem: Creating ostree deployment: Performing deployment: Importing: Parsing layer blob sha256:9536e521dd6b076e09fa076feb4428e4b94e5330c6d6b3ab1e235a54be3d88b7: Failed to invoke skopeo proxy method FinishPipe: remote error: write |1: broken pipe when running bootc install to-existing-root.

@cgwalters (Collaborator, Author)

@henrywang, anything we can do to fix/improve this?

[13:43:01] [E] [CentOS-Stream-9:x86_64:/plans/e2e/to-disk] guest provisioning failed: Guest couldn't be provisioned: Artemis resource ended in 'error' state
As seen on e.g. https://artifacts.dev.testing-farm.io/4fec6905-15b7-49d6-aff5-2bad9d78a12e/

Having basically permanently-red CI is a mental overhead: each time we have to check which specific jobs are failing.

@henrywang (Contributor)

Yes, we have https://issues.redhat.com/browse/TFT-2691 to track it.

@cgwalters (Collaborator, Author)

Actually, having it be 1M sometimes and 512M others looks very much like we're getting partitions swapped.

I didn't try to stress test this much, but I think #698 is going to help. At the very least, if we are still racing somehow, we'll get a clearer error message.

@cgwalters (Collaborator, Author)

I didn't try to stress test this much, but I think #698 is going to help. At the very least, if we are still racing somehow, we'll get a clearer error message.

I think that fixed the install flake; I haven't seen it since.
