kola reprovision tests are failing on ppc64le #1489
Comments
I was able to get these tests to pass successfully by pinning |
They are failing right now. See coreos/fedora-coreos-tracker#1489. Note that we can't really deny these tests without hitting a bug (feature?): if all tests in a kola run are denylisted, kola will exit with an error. Therefore we add the reprovision label to a random test (`ext.config.boot.grub2-install`) so we'll have at least one test run.
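The behavior described in that commit message can be illustrated with a small sketch. This is not kola's implementation. The function `select_tests`, the tag name, and the individual `root-reprovision` test names are hypothetical, chosen only to show why one extra tagged test keeps the run from ending up empty.

```python
from fnmatch import fnmatchcase

def select_tests(tests, tag, denylist):
    """Return the test names carrying `tag` that match no denylist glob.

    Raises RuntimeError when nothing is left to run, mimicking the
    "all tests denylisted" error described above (hypothetical model).
    """
    tagged = [name for name, tags in tests.items() if tag in tags]
    selected = [
        name for name in tagged
        if not any(fnmatchcase(name, pattern) for pattern in denylist)
    ]
    if not selected:
        raise RuntimeError("no tests left to run after applying the denylist")
    return selected

# Hypothetical test inventory: name -> set of tags.
tests = {
    "ext.config.root-reprovision.luks": {"reprovision"},
    "ext.config.root-reprovision.raid1": {"reprovision"},
    "ext.config.boot.grub2-install": {"reprovision"},  # the extra tagged test
}
denylist = ["ext.config.root-reprovision.*"]

print(select_tests(tests, "reprovision", denylist))
```

Without the `reprovision` tag on `ext.config.boot.grub2-install`, every tagged test would match the denylist and the sketch would raise instead of returning a list.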
coreos/fedora-coreos-config#2413 includes a commit that hackily works around coreos/coreos-assembler#3464 to enable denylisting these tests on ppc64le |
This is still happening as of today. |
I don't always see github mentions, sorry. So this: is kind of interesting - "error 74" is EBADMSG - and on XFS we use that when a bad CRC is detected:
Normally this would indicate a storage problem. |
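For readers who want to check the errno mapping mentioned above, here is a minimal sketch using only the Python standard library. The helper name `describe_errno` is made up for illustration, and the numeric value of `EBADMSG` is platform-dependent (74 is the Linux value referred to in the comment).

```python
import errno
import os

def describe_errno(code: int) -> tuple[str, str]:
    """Return the symbolic name and the OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# Look up EBADMSG by its symbolic constant rather than hard-coding 74,
# because the number differs between operating systems.
name, message = describe_errno(errno.EBADMSG)
print(f"{errno.EBADMSG}: {name} ({message})")
```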
I would say that you aren't expected to see them all. I was just taking a shot in the dark.
I can try to run these tests on a different piece of ppc64le hardware to see if I get the same results. |
Yeah it fails on other hardware too (console.txt).
Though now that I look closely that XFS message you pasted above is for a zram (memory backed) device so probably not hardware disk related... and... it makes sense that this would fail only on our root reprovision tests because we copy data into memory and then back out. So it's some sort of interaction here that's not working. @jlebon any ideas? It feels weird to me that we're only seeing this issue on ppc64le. |
Not sure offhand. One thing worth trying is bumping the minimum memory just to check if it helps (note that on ppc64le and aarch64, kola already doubles the Otherwise, I think trying to get a zram SME to look at this might help. |
I ran a test with |
The snooze on these tests expired and they are still failing as of today. I went ahead and bumped the snooze in coreos/fedora-coreos-config#2472 and I'll work to get a bug filed upstream for this issue. |
+1 - let's open it against the kernel and we'll try to see if we can point other people from different subsystems at the bug. |
I opened a BZ with the kernel team about this issue: |
These tests are still failing in rawhide `ppc64le` builds. See: coreos/fedora-coreos-tracker#1489
These tests are still failing as of today. I opened a PR to bump the snooze while we wait for a response on the BZ. |
The kernel version in testing-devel was just upgraded to |
So here is where things start to get dangerous. In this case one of the tests that is failing is the This is a case where we should consider pinning the kernel instead and aggressively chasing down a solution. |
Makes sense. Thanks for the info. I mentioned in backlog refinement that this needs to be prioritized, so I'll work on investigating this further. |
We decided to pin the kernel in testing-devel instead of snoozing the reprovision tests while we track down a solution to the issue. xref: coreos/fedora-coreos-tracker#1489 (comment) This reverts commit 17ec963.
testing-devel saw the kernel upgraded from: I opened coreos/fedora-coreos-config#2529 to pin |
Temporarily add `manifest-lock.overrides.ppc64le.yaml` to pin the kernel on `ppc64le` only for the next FCOS release cycle while we wait for a fix on the issue: coreos/fedora-coreos-tracker#1489 See: coreos/fedora-coreos-tracker#1489 (comment)
unpin the kernel in testing-devel so we can ship our prod streams with the latest packages in the next release cycle. In the community meeting, we decided to pin the kernel only on `ppc64le` for this cycle. See: coreos/fedora-coreos-tracker#1489 (comment)
These tests are still failing in rawhide `ppc64le` builds. Extend the snooze while we wait on a fix for: coreos/fedora-coreos-tracker#1489
I posted this on the mailing lists: A fix was proposed and has been staged for merging soon: |
This issue was fixed in 95848dcb9. It has made it into the latest rawhide kernel (latest as of today is ( |
This one has the fix for the zram corruption on ppc64le (64K page size). We can also remove the rawhide denial for ext.config.root-reprovision.* because the kernel in rawhide has the fix already too. Closes coreos/fedora-coreos-tracker#1489
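Since the fix concerns systems with a 64K page size, a quick way to see which page size a host uses is sketched below. This is only an illustrative check, not something taken from the thread. The helper names `page_size_kib` and `uses_64k_pages` are hypothetical, and a 64K page size alone does not prove a kernel is affected.

```python
import mmap

def page_size_kib() -> int:
    """Return the memory page size of the running system in KiB."""
    return mmap.PAGESIZE // 1024

def uses_64k_pages() -> bool:
    """True when the running kernel uses 64 KiB pages."""
    return mmap.PAGESIZE == 64 * 1024

print(f"page size: {page_size_kib()} KiB, 64K pages: {uses_64k_pages()}")
```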
The kola reprovision tests are failing on the latest rawhide `ppc64le` builds. They are all failing with:
Looking into the console logs of one of these tests shows:
These reprovision tests began failing with FCOS version `39.20230429.91.0`, which saw these package updates:
I was able to reproduce these failures locally using the debug pod on the ppc64le remote builder.
Here's the kola reprovision output from one of the latest rawhide ppc64le builds:
kola-reprovision-9fa07871.zip