Random hangs of booted qemu machines #1511

Closed
dustymabe opened this issue Jun 15, 2023 · 8 comments

Comments

@dustymabe
Member

We've noticed a fair number of random timeouts in CI where the machines never fully come up. In some cases our --allow-rerun-success criteria (improved in coreos/coreos-assembler@24df92f) have allowed the tests to pass, since the failure isn't consistent, but that only applies to some tests. Our testISO tests don't benefit from that enhancement and have continued to fail randomly. Not often, but often enough to be annoying. The machines just stop during boot at:

[    0.133215] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[    0.134029] Last level iTLB entries: 4KB 512, 2MB 255, 4MB 127
[    0.134963] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0
[    0.135545] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.135965] Spectre V2 : Mitigation: Retpolines
[    0.136374] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    0.136963] Spectre V2 : Spectre v2 / SpectreRSB : Filling RSB on VMEXIT
[    0.137963] Spectre V2 : Enabling Speculation Barrier for firmware calls
[    0.138563] RETBleed: Mitigation: untrained return thunk
[    0.138965] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    0.139965] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
[    0.155145] Freeing SMP alternatives memory: 48K
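
For triaging CI failures, the signature above can be checked for mechanically. The following is a minimal sketch, assuming the captured console log is available as text and that a hang shows up as the log simply ending on the "Freeing SMP alternatives memory" line. The function name and the timestamp regex are illustrative, not part of any existing tooling, and a log truncated for another reason would also match.

```python
import re

# Last kernel message seen before the hang, per the console output above.
HANG_SIGNATURE = "Freeing SMP alternatives memory"

# Matches the "[    0.155145] " timestamp prefix on kernel log lines.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def looks_like_smp_alternatives_hang(console_text: str) -> bool:
    """Heuristic: True if the console log ends at the hang signature.

    This only inspects the final non-empty line, so it cannot tell a
    real hang apart from a log that was cut off at the same point.
    """
    lines = [ln.strip() for ln in console_text.splitlines() if ln.strip()]
    if not lines:
        return False
    last = _TIMESTAMP.sub("", lines[-1])
    return last.startswith(HANG_SIGNATURE)
```

A hypothetical use would be to run this over a failed test's console.txt and label the failure as a likely instance of this issue when it returns True.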

We think we may finally understand why. As reported by Richard Jones in https://gitlab.com/qemu-project/qemu/-/issues/1696 and on LKML, other teams have also observed and investigated random hangs in upstream kernels. Richard points to f31dcb1 as the offending commit and to a posted patch that is believed to be the fix. That patch mentions that it fixes e9523a0, which is in v6.3.2, v6.2.15, v6.1.28, etc.

Once the proposed patch lands we can then get this back into Fedora/FCOS and hopefully our CI will be happier and any users that may have hit this rare boot issue will be happier too.

@dustymabe
Member Author

dustymabe commented Jun 15, 2023

One such example of this failure is https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/1429/ (requires sign in) in which the miniso-install.bios test failed. Here is the console.txt.

@rwmjones

I would say this is definitely the same bug and the posted patch which I tested will fix it.

@travier
Member

travier commented Jun 16, 2023

@rwmjones Thanks a lot for the investigation on this one!

@dustymabe
Member Author

Reportedly fixed by 13bb06f.

@dustymabe
Member Author

This is fixed in v6.3.10 in b84d064.

There is a kernel update available: https://bodhi.fedoraproject.org/updates/FEDORA-2023-5fdf0dd9fe

I've fast-tracked this into Fedora CoreOS (coreos/fedora-coreos-config#2489), tagged it into the f38-coreos-continuous tag, and am doing a FORCE build of COSA to pick up the new kernel (so that guestfs manipulations of disks in our build pipeline won't randomly hang).
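
To check whether a given 6.3-series kernel falls in the affected window, something like the following could be used. It is a rough sketch that assumes the regression entered the 6.3 series at v6.3.2 (where e9523a0 landed) and was fixed in v6.3.10, as stated in this thread. The thread does not give the fixed versions for the 6.2 or 6.1 series, so the sketch returns None for anything outside 6.3 rather than guessing. The function names are illustrative.

```python
import re
from typing import Optional, Tuple

# Bounds taken from this thread; both are assumptions about the 6.3 series only.
FIRST_AFFECTED_6_3 = (6, 3, 2)   # e9523a0 is in v6.3.2
FIRST_FIXED_6_3 = (6, 3, 10)     # fix reported in v6.3.10


def parse_kernel_version(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '6.3.8-200.fc38.x86_64'."""
    m = re.match(r"v?(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(m.group(1)), int(m.group(2)), int(m.group(3)))


def in_affected_6_3_range(release: str) -> Optional[bool]:
    """True/False for 6.3.y kernels; None when the series is not 6.3."""
    version = parse_kernel_version(release)
    if version[:2] != (6, 3):
        return None  # fixed version for other series is not stated in this thread
    return FIRST_AFFECTED_6_3 <= version < FIRST_FIXED_6_3
```

The input is the kind of string `uname -r` prints; distribution suffixes after the patch number are ignored.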

@dustymabe removed the status/pending-upstream-release label (Fixed upstream. Waiting on an upstream component source code release.) Jun 29, 2023
@dustymabe changed the title from "Random hangs of booted machines in CI" to "Random hangs of booted machines" Jun 29, 2023
@dustymabe changed the title from "Random hangs of booted machines" to "Random hangs of booted qemu machines" Jun 29, 2023
@dustymabe added the status/pending-testing-release (Fixed upstream. Waiting on a testing release.) and status/pending-next-release (Fixed upstream. Waiting on a next release.) labels Jun 29, 2023
@dustymabe
Member Author

The fix for this went into next stream release 38.20230709.1.1. Please try out the new release and report issues.

@dustymabe
Member Author

The fix for this went into testing stream release 38.20230709.2.0. Please try out the new release and report issues.

@dustymabe added the status/pending-stable-release label (Fixed upstream and in testing. Waiting on stable release.) and removed the status/pending-testing-release (Fixed upstream. Waiting on a testing release.) and status/pending-next-release (Fixed upstream. Waiting on a next release.) labels Jul 13, 2023
@dustymabe
Member Author

The fix for this went into stable stream release 38.20230709.3.0.

@dustymabe removed the status/pending-stable-release label (Fixed upstream and in testing. Waiting on stable release.) Aug 9, 2023