kola reprovision tests are failing on ppc64le #1489

Closed
marmijo opened this issue May 9, 2023 · 30 comments
Labels
jira (for syncing to jira)

Comments

@marmijo
Member

marmijo commented May 9, 2023

The kola reprovision tests are failing on the latest rawhide ppc64le builds.

  • ext.config.root-reprovision.linear
  • ext.config.root-reprovision.autosave-xfs
  • ext.config.root-reprovision.filesystem-only
  • ext.config.root-reprovision.raid1
  • ext.config.root-reprovision.luks.autosave-xfs
  • ext.config.root-reprovision.swap-before-root
  • ext.config.root-reprovision.luks

are all failing with:

harness.go:1680: mach.Start() failed: machine 6f5298de-88f3-433d-8f05-dd6914c33063 entered emergency.target in initramfs

Looking into the console logs of one of these tests shows:

[    4.876842] XFS (zram0): metadata I/O error in "xfs_read_agf+0xb4/0x180 [xfs]" at daddr 0xa00008 len 8 error 74
[    4.877038] XFS (zram0): Error -117 reserving per-AG metadata reserve pool.
[    4.877104] XFS (zram0): Corruption of in-memory data (0x8) detected at xfs_fs_reserve_ag_blocks+0x1e0/0x220 [xfs] (fs/xfs/xfs_fsops.c:575).  Shutting down filesystem.
[    4.877308] XFS (zram0): Please unmount the filesystem and rectify the problem(s)
[    4.877382] XFS (zram0): Ending clean mount
[    4.877432] XFS (zram0): Error -5 reserving per-AG metadata reserve pool.

These reprovision tests began failing with FCOS version 39.20230429.91.0 which saw these package updates:

- device-mapper-persistent-data 0.9.0-10.fc38.x86_64 → 1.0.4-1.fc39.x86_64
- fedora-release-common 39-0.9.noarch → 39-0.11.noarch
- fedora-release-coreos 39-0.9.noarch → 39-0.11.noarch
- fedora-release-identity-coreos 39-0.9.noarch → 39-0.11.noarch
- kernel 6.4.0-0.rc0.20230427git6e98b09da931.5.fc39.x86_64 → 6.4.0-0.rc0.20230428git33afd4b76393.7.fc39.x86_64
- kernel-core 6.4.0-0.rc0.20230427git6e98b09da931.5.fc39.x86_64 → 6.4.0-0.rc0.20230428git33afd4b76393.7.fc39.x86_64
- kernel-modules 6.4.0-0.rc0.20230427git6e98b09da931.5.fc39.x86_64 → 6.4.0-0.rc0.20230428git33afd4b76393.7.fc39.x86_64
- kernel-modules-core 6.4.0-0.rc0.20230427git6e98b09da931.5.fc39.x86_64 → 6.4.0-0.rc0.20230428git33afd4b76393.7.fc39.x86_64
- runc 2:1.1.6-1.fc39.x86_64 → 2:1.1.7-1.fc39.x86_64

I was able to reproduce these failures locally using the debug pod on the ppc64le remote builder.

Here's the kola reprovision output from one of the latest rawhide ppc64le builds:
kola-reprovision-9fa07871.zip

@marmijo
Member Author

marmijo commented May 9, 2023

I was able to get these tests to pass successfully by pinning kernel 6.4.0-0.rc0.20230427git6e98b09da931.5.fc39 and the associated packages in a local build on the ppc64le remote builder.

@dustymabe
Member

@sandeen may be aware of upstream XFS kernel regressions. Looks like this one is ppc64le specific.

I also wonder if this is a side-effect of #1475 (which still needs some investigation).

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue May 9, 2023
They are failing right now. See
coreos/fedora-coreos-tracker#1489

Note that we can't really deny these tests without hitting a bug
(feature?) in that if all tests in a kola run are denylisted kola
will exit with an error. Therefore we add the reprovision label to
a random test (`ext.config.boot.grub2-install`) so we'll have at
least one test run.
@dustymabe
Member

coreos/fedora-coreos-config#2413 includes a commit that hackily works around coreos/coreos-assembler#3464 to enable denylisting these tests on ppc64le
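To illustrate why the workaround tags one extra test, here is a minimal sketch of the situation the commit message describes. It is not kola's code: `fnmatchcase` stands in for whatever pattern matching kola actually does, and the `remaining` helper is hypothetical. The test names and the `ext.config.root-reprovision.*` pattern are taken from this thread.

```python
from fnmatch import fnmatchcase

# Tests selected by the "reprovision" tag: the seven failing tests from this
# issue, plus the extra test the workaround commit labels with "reprovision".
selected = [
    "ext.config.root-reprovision.linear",
    "ext.config.root-reprovision.autosave-xfs",
    "ext.config.root-reprovision.filesystem-only",
    "ext.config.root-reprovision.raid1",
    "ext.config.root-reprovision.luks.autosave-xfs",
    "ext.config.root-reprovision.swap-before-root",
    "ext.config.root-reprovision.luks",
    "ext.config.boot.grub2-install",
]

# Denylist pattern covering all of the failing tests.
denylist = ["ext.config.root-reprovision.*"]


def remaining(tests, patterns):
    """Hypothetical helper: return the tests that no denylist pattern matches."""
    return [t for t in tests if not any(fnmatchcase(t, p) for p in patterns)]


left = remaining(selected, denylist)
print(left)
```

Without the extra test, the list of remaining tests would be empty, which is the case where (per the commit message) kola exits with an error.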

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue May 10, 2023
jlebon pushed a commit to coreos/fedora-coreos-config that referenced this issue May 10, 2023
c4rt0 pushed a commit to c4rt0/fedora-coreos-config that referenced this issue May 17, 2023
@dustymabe
Member

This is still happening as of today.

@sandeen

sandeen commented May 23, 2023

I don't always see github mentions, sorry.

So this:
[ 4.876842] XFS (zram0): metadata I/O error in "xfs_read_agf+0xb4/0x180 [xfs]" at daddr 0xa00008 len 8 error 74

is kind of interesting - "error 74" is EBADMSG - and on XFS we use that when a bad CRC is detected:

#define EFSBADCRC EBADMSG /* Bad CRC detected */

Normally this would indicate a storage problem.
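For readers decoding the other numbers in the log, here is a small sketch that maps the three error codes to their names. The errno values are hardcoded Linux values rather than read from the running system, and the `EFSCORRUPTED` alias is an assumption from memory of the XFS headers (only `EFSBADCRC` is quoted in this thread).

```python
# Linux errno numbers, hardcoded so the example behaves the same on any platform.
LINUX_ERRNO = {5: "EIO", 74: "EBADMSG", 117: "EUCLEAN"}

# XFS aliases. EFSBADCRC is quoted in the comment above; EFSCORRUPTED is an
# assumption based on the same XFS header file.
XFS_ALIAS = {
    "EBADMSG": "EFSBADCRC (bad CRC detected)",
    "EUCLEAN": "EFSCORRUPTED (filesystem is corrupted)",
}


def decode(code: int) -> str:
    """Translate an error number from the log (sign ignored) into a name."""
    name = LINUX_ERRNO.get(abs(code), "unknown")
    alias = XFS_ALIAS.get(name)
    return f"{name} / {alias}" if alias else name


# The three codes that appear in the console log.
for code in (74, -117, -5):
    print(code, "->", decode(code))
```

Read this way, the log shows a CRC failure on the AGF read, which XFS then treats as corruption and finally reports as a plain I/O error once the filesystem has shut down.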

@dustymabe
Member

dustymabe commented May 23, 2023

> I don't always see github mentions, sorry.

I would say that you aren't expected to see them all. I was just taking a shot in the dark.

> Normally this would indicate a storage problem.

I can try to run these tests on a different piece of ppc64le hardware to see if I get the same results.

@dustymabe
Member

Yeah it fails on other hardware too (console.txt).

[    4.231472] ignition-ostree-transposefs[814]: Detected partition replacement in fetched Ignition config: /run/ignition.json
[    4.350328] zram: Added device: zram0
[    4.350824] zram0: detected capacity change from 0 to 20971520
[    4.656735] SGI XFS with ACLs, security attributes, realtime, scrub, quota, no debug enabled
[    4.658488] XFS (zram0): Mounting V5 Filesystem daaec3a5-b686-45c6-b4e3-d30bec4999c5
[    4.660172] XFS (zram0): Metadata CRC error detected at xfs_agf_read_verify+0x108/0x150 [xfs], xfs_agf block 0xa00008
[    4.660399] XFS (zram0): Unmount and run xfs_repair
[    4.660427] XFS (zram0): First 128 bytes of corrupted metadata buffer:
[    4.660463] 00000000: fe ed ba be 00 00 00 00 00 00 00 02 00 00 00 00  ................
[    4.660505] 00000010: 00 00 00 00 00 00 00 10 00 00 00 01 00 00 00 00  ................
[    4.660545] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.660585] 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.660625] 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.660665] 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.660705] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.660745] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.660786] XFS (zram0): metadata I/O error in "xfs_read_agf+0xb4/0x180 [xfs]" at daddr 0xa00008 len 8 error 74
[    4.660939] XFS (zram0): Error -117 reserving per-AG metadata reserve pool.
[    4.660978] XFS (zram0): Corruption of in-memory data (0x8) detected at xfs_fs_reserve_ag_blocks+0x1e0/0x220 [xfs] (fs/xfs/xfs_fsops.c:575).  Shutting down filesystem.
[    4.661125] XFS (zram0): Please unmount the filesystem and rectify the problem(s)
[    4.661166] XFS (zram0): Ending clean mount
[    4.661200] XFS (zram0): Error -5 reserving per-AG metadata reserve pool.
[    4.696510] ignition-ostree-transposefs[883]: mount: /run/ignition-ostree-transposefs: can't read superblock on /dev/zram0.
[    4.696680] ignition-ostree-transposefs[883]:        dmesg(1) may have more information after failed mount system call.
[    4.726331] systemd[1]: ignition-ostree-transposefs-detect.service: Main process exited, code=exited, status=32/n/a
[    4.726555] systemd[1]: ignition-ostree-transposefs-detect.service: Failed with result 'exit-code'.
[    4.726699] systemd[1]: Failed to start ignition-ostree-transposefs-detect.service - Ignition OSTree: Detect Partition Transposition.

Though now that I look closely, the XFS message you pasted above is for a zram (memory-backed) device, so it's probably not related to the hardware disk. It also makes sense that this would fail only on our root reprovision tests, because there we copy data into memory and then back out.

So it's some sort of interaction here that's not working. @jlebon any ideas? It feels weird to me that we're only seeing this issue on ppc64le.
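The numbers in the console log can be unpacked a little further. The sketch below converts the reported sector counts to bytes and checks the first four bytes of the dumped buffer against two XFS magic numbers. The two magic constants come from memory of the kernel's XFS headers and are assumptions here, as is the reading that a log-header magic in an AGF block suggests the read returned data from the wrong place; nothing in the thread states this.

```python
import struct

SECTOR = 512  # daddr values and the reported capacity are in 512-byte units

# Values copied from the console log above.
capacity_sectors = 20971520              # "detected capacity change from 0 to 20971520"
daddr, length = 0xA00008, 8              # "at daddr 0xa00008 len 8"
first_bytes = bytes.fromhex("feedbabe")  # first four bytes of the dumped buffer

# Assumed constants (from memory of the XFS on-disk format headers).
XFS_AGF_MAGIC = 0x58414746               # "XAGF", what a valid AGF block starts with
XLOG_HEADER_MAGIC_NUM = 0xFEEDBABE       # magic of an XFS log record header

# XFS on-disk fields are big-endian.
(magic,) = struct.unpack(">I", first_bytes)

print(f"device size : {capacity_sectors * SECTOR / 2**30:.0f} GiB")
print(f"read offset : {daddr * SECTOR} bytes, {length * SECTOR} bytes long")
print(f"magic       : {magic:#010x}")
print(f"looks like an AGF block : {magic == XFS_AGF_MAGIC}")
print(f"looks like a log header : {magic == XLOG_HEADER_MAGIC_NUM}")
```

Under those assumptions, the failing read is a 4 KiB read at 5 GiB + 4 KiB into a 10 GiB zram device, and the buffer that came back does not start with the AGF magic.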

@jlebon
Member

jlebon commented May 25, 2023

Not sure offhand. One thing worth trying is bumping the minimum memory just to check if it helps (note that on ppc64le and aarch64, kola already doubles the minMemory request due to hitting memory limits more easily there).

Otherwise, I think trying to get a zram SME to look at this might help.

@dustymabe
Member

I ran a test with `cosa kola run --tag=reprovision --qemu-memory=4096` and it failed in the same way.

@marmijo
Member Author

marmijo commented Jun 21, 2023

The snooze on these tests expired and they are still failing as of today.

I went ahead and bumped the snooze in coreos/fedora-coreos-config#2472 and I'll work to get a bug filed upstream for this issue.

@dustymabe
Member

> I'll work to get a bug filed upstream for this issue.

+1 - let's open it against the kernel and we'll try to see if we can point other people from different subsystems at the bug.

@marmijo
Member Author

marmijo commented Jul 7, 2023

I opened a BZ with the kernel team about this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2221314

marmijo added a commit to marmijo/fedora-coreos-config that referenced this issue Jul 14, 2023
These tests are still failing in rawhide `ppc64le` builds.

See: coreos/fedora-coreos-tracker#1489
@marmijo
Member Author

marmijo commented Jul 14, 2023

These tests are still failing as of today. I opened a PR to bump the snooze while we wait for a response on the BZ.
coreos/fedora-coreos-config#2507

marmijo added a commit to coreos/fedora-coreos-config that referenced this issue Jul 14, 2023
@marmijo
Member Author

marmijo commented Jul 24, 2023

The kernel version in testing-devel was just upgraded to kernel-6.4.4-200.fc38.
These tests are now failing in testing-devel so I opened a PR to snooze these there as well.
coreos/fedora-coreos-config#2524

marmijo changed the title from "rawhide: kola reprovision tests are failing on ppc64le" to "kola reprovision tests are failing on ppc64le" on Jul 24, 2023
@dustymabe
Member

> The kernel version in testing-devel was just upgraded to kernel-6.4.4-200.fc38. These tests are now failing in testing-devel so I opened a PR to snooze these there as well. coreos/fedora-coreos-config#2524

So here is where things start to get dangerous. testing-devel is the development stream for our production stream(s). When we allow for a test to be snoozed or denylisted there we need to consider the consequences of letting that failure (regression?) make it to end users.

In this case one of the failing tests is the autosave-xfs test. IIUC the path it exercises is taken on any install onto a disk >100G, since that ends up with a rootfs >100G. So most likely a very high percentage of ppc64le installs would be affected.

This is a case where we should consider pinning the kernel instead and aggressively chasing down a solution.

dustymabe added the meeting (topics for meetings) label Jul 27, 2023
@marmijo
Member Author

marmijo commented Jul 27, 2023

Makes sense. Thanks for the info. I mentioned in backlog refinement that this needs to be prioritized, so I'll work on investigating this further.

marmijo added a commit to marmijo/fedora-coreos-config that referenced this issue Jul 28, 2023
We decided to pin the kernel in testing-devel instead of
snoozing the reprovision tests while we track down a
solution to the issue. xref:
coreos/fedora-coreos-tracker#1489 (comment)

This reverts commit 17ec963.
@marmijo
Member Author

marmijo commented Jul 28, 2023

testing-devel saw the kernel upgraded from: kernel 6.3.12-200.fc38.x86_64 → 6.4.4-200.fc38.x86_64

I opened coreos/fedora-coreos-config#2529 to pin kernel 6.3.12-200.fc38.x86_64 as well as to remove the snooze for these tests in testing-devel.

marmijo added a commit to marmijo/fedora-coreos-config that referenced this issue Aug 3, 2023
Temporarily add `manifest-lock.overrides.ppc64le.yaml` to pin the kernel
on `ppc64le` only for the next FCOS release cycle while we wait for a
fix on the issue: coreos/fedora-coreos-tracker#1489

See: coreos/fedora-coreos-tracker#1489 (comment)
@marmijo
Member Author

marmijo commented Aug 3, 2023

I'll get this all sorted out today:

  • unpin kernel in testing-devel
  • add manifest-lock.overrides.ppc64le.yaml to pin kernel on ppc64le only

TCO in coreos/fedora-coreos-config#2537

marmijo added a commit to coreos/fedora-coreos-config that referenced this issue Aug 3, 2023
unpin the kernel in testing-devel so we can ship our prod streams
with the latest packages in the next release cycle. In the
community meeting, we decided to pin the kernel only on
`ppc64le` for this cycle.

See: coreos/fedora-coreos-tracker#1489 (comment)
marmijo added a commit to coreos/fedora-coreos-config that referenced this issue Aug 3, 2023
marmijo added a commit to marmijo/fedora-coreos-config that referenced this issue Aug 4, 2023
These tests are still failing in rawhide `ppc64le` builds.
Extend the snooze while we wait on a fix for:
coreos/fedora-coreos-tracker#1489
marmijo added a commit to coreos/fedora-coreos-config that referenced this issue Aug 4, 2023
@dustymabe
Member

I posted this on the mailing lists:

A fix was proposed and has been staged for merging soon:

dustymabe removed the meeting (topics for meetings) label Aug 9, 2023
@dustymabe
Member

This issue was fixed in 95848dcb9

It has made it into the latest rawhide kernel (as of today, kernel-6.5.0-0.rc6.43.fc40). IIUC it has not yet made it into an F38 kernel.

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Aug 16, 2023
This one has the fix for the zram corruption on ppc64le (64K page size).

We can also remove the rawhide denial for ext.config.root-reprovision.*
because the kernel in rawhide has the fix already too.

Closes coreos/fedora-coreos-tracker#1489
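Since the commit message ties the corruption to the 64K page size, a quick way to see how a given host compares is to query its page size. This is a minimal sketch for Unix-like systems; the connection it draws between the 4 KiB metadata read in the log ("len 8" sectors) and partial-page I/O is an assumption, not something stated in this thread.

```python
import os

# Bytes per memory page on the host running this script.
page_size = os.sysconf("SC_PAGE_SIZE")

# The failing XFS metadata read from the log: "len 8" sectors of 512 bytes.
read_size = 8 * 512

print(f"page size: {page_size} bytes")
if read_size < page_size:
    print("a 4 KiB read covers only part of one page on this host")
else:
    print("a 4 KiB read covers at least one whole page on this host")
```

On a typical x86_64 host this reports 4096 bytes, while a ppc64le kernel built with 64K pages reports 65536, which would be consistent with the failures showing up only on ppc64le.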
@dustymabe
Member

kernel-6.4.11-200.fc38 has now been built and proposed for f38

HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
They are failing right now. See
coreos/fedora-coreos-tracker#1489

Note that we can't really deny these tests without hitting a bug
(feature?) in that if all tests in a kola run are denylisted kola
will exit with an error. Therefore we add the reprovision label to
a random test (`ext.config.boot.grub2-install`) so we'll have at
least one test run.
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
These tests are still failing in rawhide `ppc64le` builds.

See: coreos/fedora-coreos-tracker#1489
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
We decided to pin the kernel in testing-devel instead of
snoozing the reprovision tests while we track down a
solution to the issue. xref:
coreos/fedora-coreos-tracker#1489 (comment)

This reverts commit 17ec963.
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
unpin the kernel in testing-devel so we can ship our prod streams
with the latest packages in the next release cycle. In the
community meeting, we decided to pin the kernel only on
`ppc64le` for this cycle.

See: coreos/fedora-coreos-tracker#1489 (comment)
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
Temporarily add `manifest-lock.overrides.ppc64le.yaml` to pin the kernel
on `ppc64le` only for the next FCOS release cycle while we wait for a
fix on the issue: coreos/fedora-coreos-tracker#1489

See: coreos/fedora-coreos-tracker#1489 (comment)
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
These tests are still failing in rawhide `ppc64le` builds.
Extend the snooze while we wait on a fix for:
coreos/fedora-coreos-tracker#1489
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
This one has the fix for the zram corruption on ppc64le (64K page size).

We can also remove the rawhide denial for ext.config.root-reprovision.*
because the kernel in rawhide has the fix already too.

Closes coreos/fedora-coreos-tracker#1489