Pin kernel to 5.15.x for now #152

Merged
merged 1 commit into from
Apr 4, 2023

Member

@cole-h cole-h commented Apr 4, 2023

The 6.1.x series of kernels has busted networking on the community box, so we pin to 5.15.x, which works fine. I'll look around to see what the problem may be, and see if I can either bisect the issue or find something in the kernel archives (somewhere) about it, hopefully including a patch.

[  291.298247] ------------[ cut here ]------------
[  291.302859] NETDEV WATCHDOG: eth3 (mlx5_core): transmit queue 30 timed out
[  291.309738] WARNING: CPU: 77 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x278/0x280
[  291.318081] Modules linked in: cfg80211 rfkill mlx5_ib ib_uverbs ib_core acpi_ipmi crct10dif_ce mlx5_core polyval_ce ipmi_ssif polyval_generic arm_spe_pmu ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper mlxfw ipmi_devintf psample pci_hyperv_intf ipmi_msghandler arm_cmn arm_dmc620_pmu xgene_hwmon cppc_cpufreq arm_dsu_pmu acpi_tad ip6_tables xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat sch_fq_codel nf_tables libcrc32c nfnetlink bonding tls tap macvlan bridge stp llc fuse drm dmi_sysfs ip_tables x_tables nvme nvme_core xhci_pci xhci_pci_renesas dm_mod dax zfs(PO) zunicode(PO) zzstd(O) zlua(O) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) overlay
[  291.383581] CPU: 77 PID: 0 Comm: swapper/77 Tainted: P           O       6.1.21 #1-NixOS
[  291.391658] Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS F17a (SCP: 1.07.20210713) 07/22/2021
[  291.400950] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  291.407899] pc : dev_watchdog+0x278/0x280
[  291.411895] lr : dev_watchdog+0x278/0x280
[  291.415891] sp : ffff80000826bdd0
[  291.419192] x29: ffff80000826bdd0 x28: ffffcbd179f87000 x27: ffff80000826bee0
[  291.426315] x26: ffffcbd17962f008 x25: 0000000000000000 x24: ffffcbd179f8ea58
[  291.433437] x23: 0000000000000100 x22: ffffcbd179f87000 x21: 000000000000001e
[  291.440560] x20: ffff07ff9b6c0000 x19: ffff07ff9b6c0488 x18: 0000000000000006
[  291.447682] x17: ffff3c6ce6977000 x16: ffff80000826c000 x15: ffff80000826b910
[  291.454804] x14: 0000000000000000 x13: 74756f2064656d69 x12: 7420303320657565
[  291.461926] x11: 00000000ffffbfff x10: ffff083f5fec3bc0 x9 : ffffcbd1770095cc
[  291.469048] x8 : 000000000005ffe8 x7 : c0000000ffffbfff x6 : 0000000000000000
[  291.476170] x5 : ffff083e5ffa8b50 x4 : ffff083e5ffa8b50 x3 : ffff083e5ffb4cb0
[  291.483292] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff07ff81582dc0
[  291.490414] Call trace:
[  291.492848]  dev_watchdog+0x278/0x280
[  291.496497]  call_timer_fn+0x3c/0x15c
[  291.500149]  __run_timers+0x2e8/0x3a0
[  291.503799]  run_timer_softirq+0x28/0x50
[  291.507709]  __do_softirq+0x128/0x368
[  291.511359]  ____do_softirq+0x18/0x24
[  291.515009]  call_on_irq_stack+0x2c/0x60
[  291.518919]  do_softirq_own_stack+0x24/0x3c
[  291.523089]  __irq_exit_rcu+0x148/0x150
[  291.526913]  irq_exit_rcu+0x18/0x24
[  291.530388]  el1_interrupt+0x38/0x54
[  291.533953]  el1h_64_irq_handler+0x18/0x2c
[  291.538036]  el1h_64_irq+0x64/0x68
[  291.541425]  cpuidle_enter_state+0xbc/0x440
[  291.545598]  cpuidle_enter+0x40/0x60
[  291.549162]  do_idle+0x234/0x2c0
[  291.552378]  cpu_startup_entry+0x30/0x3c
[  291.556288]  secondary_start_kernel+0x130/0x154
[  291.560807]  __secondary_switched+0xb0/0xb4
[  291.564978] ---[ end trace 0000000000000000 ]---

Member Author

cole-h commented Apr 4, 2023

Drafted because I haven't yet tested this exact change (so far I've only tested by rolling back to a nixos-unstable commit from before the default kernel switched to 6.1). Will undraft and merge once the machine comes back up in a good state (which I expect it to, but kernel issues are always fun).

@cole-h cole-h marked this pull request as ready for review April 4, 2023 19:28
Member Author

cole-h commented Apr 4, 2023

It worked.
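
Once the machine is back up, the pin can be confirmed from the running kernel's release string. A minimal sketch; `kernel_series` is a hypothetical helper name, and it assumes the release string starts with `major.minor.patch` as in the `6.1.21` seen in the log above:

```sh
kernel_series() {
    # Keep only "major.minor" from a release string such as "5.15.96".
    printf '%s\n' "$1" | cut -d. -f1,2
}
```

For example, `[ "$(kernel_series "$(uname -r)")" = "5.15" ] && echo "pinned series is running"` checks the booted kernel against the pinned series.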

Member

mweinelt commented Apr 4, 2023

Bisecting this won't be fun, but probably worthwhile doing before the 23.05 release.

@cole-h cole-h merged commit 9286c08 into master Apr 4, 2023
@cole-h cole-h deleted the try-to-fix-the-box branch April 4, 2023 19:33
Member Author

cole-h commented Apr 4, 2023

Yeah, I'll probably spend my day doing that tomorrow...

Note to self:

git bisect start
# status: waiting for both good and bad commits
# bad: [17d99ea98b6238e7e483fba27e8f7a7842d0f345] Linux 6.1.10
git bisect bad 17d99ea98b6238e7e483fba27e8f7a7842d0f345
# status: waiting for good commit(s), bad commit known
# good: [d383d0f28ecac0f3375bdfb9a0c4bfac979f6f8f] Linux 5.15.96
git bisect good d383d0f28ecac0f3375bdfb9a0c4bfac979f6f8f
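
Each bisect step here means building and booting a kernel, so it may help to script the good/bad verdict. A minimal sketch, assuming the kernel log of the test boot has been saved to a file (for example with `dmesg > boot.log`). The function name `classify_dmesg` and the file name are hypothetical, and the two patterns are only the symptoms quoted in this thread:

```sh
# classify_dmesg FILE: print "bad" if the saved kernel log shows either
# symptom from this thread, otherwise print "good".
classify_dmesg() {
    # -q: no output, only the exit status; -E: extended regex, so "|" is alternation.
    if grep -qE 'NETDEV WATCHDOG: .* transmit queue [0-9]+ timed out|rcu_sched self-detected stall' "$1"; then
        echo bad
    else
        echo good
    fi
}
```

The output can be fed straight back into the bisect, e.g. `git bisect "$(classify_dmesg boot.log)"`. Note that "good" only means neither symptom had appeared by the time the log was captured, so the machine needs to run long enough for a stall to show up.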

Member Author

cole-h commented Apr 6, 2023

Progress update:

I'm fairly certain I've found the problematic commit after bisecting. A kernel with that commit reliably stalled, and one with it reverted did not (at least not within any meaningful amount of time).

I'll probably be drafting an email to the kernel mailing list about this tomorrow (which list specifically, though? I don't know yet :).

[   60.692387] rcu: INFO: rcu_sched self-detected stall on CPU
[   60.697952] rcu:     32-....: (5247 ticks this GP) idle=8ad4/1/0x4000000000000000 softirq=730/730 fqs=1050

git bisect log:

git bisect start
# bad: [17d99ea98b6238e7e483fba27e8f7a7842d0f345] Linux 6.1.10
git bisect bad 17d99ea98b6238e7e483fba27e8f7a7842d0f345
# good: [d383d0f28ecac0f3375bdfb9a0c4bfac979f6f8f] Linux 5.15.96
git bisect good d383d0f28ecac0f3375bdfb9a0c4bfac979f6f8f
# skip: [17d99ea98b6238e7e483fba27e8f7a7842d0f345] Linux 6.1.10
git bisect skip 17d99ea98b6238e7e483fba27e8f7a7842d0f345
# good: [8bb7eca972ad531c9b149c0a51ab43a417385813] Linux 5.15
git bisect good 8bb7eca972ad531c9b149c0a51ab43a417385813
# good: [827060261cf3c7b79ee7185d5aa61c851beb9403] Merge tag 'media/v5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect good 827060261cf3c7b79ee7185d5aa61c851beb9403
# good: [746fc76b820dce8cbb17a1e5e70a1558db4d7406] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect good 746fc76b820dce8cbb17a1e5e70a1558db4d7406
# good: [ff6862c23d2e83d12d1759bf4337d41248fb4dc8] Merge tag 'arm-drivers-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good ff6862c23d2e83d12d1759bf4337d41248fb4dc8
# bad: [f311d498be8f1aa49d5cfca0b18d6db4f77845b7] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
git bisect bad f311d498be8f1aa49d5cfca0b18d6db4f77845b7
# good: [a09476668e3016ea4a7b0a7ebd02f44e0546c12c] Merge tag 'char-misc-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect good a09476668e3016ea4a7b0a7ebd02f44e0546c12c
# bad: [3604a7f568d3f67be8c13736201411ee83b210a1] Merge tag 'v6.1-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect bad 3604a7f568d3f67be8c13736201411ee83b210a1
# bad: [2e64066dab157ffcd0e9ec2ff631862e6e222876] Merge tag 'riscv-for-linus-6.1-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
git bisect bad 2e64066dab157ffcd0e9ec2ff631862e6e222876
# good: [f01603979a4afaad7504a728918b678d572cda9e] Merge tag 'gpio-updates-for-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
git bisect good f01603979a4afaad7504a728918b678d572cda9e
# good: [a64b79c01c2836ddd8e1eb7c8173b44c3e66f999] Merge branches 'clk-samsung', 'clk-mtk', 'clk-rm', 'clk-ast' and 'clk-qcom' into clk-next
git bisect good a64b79c01c2836ddd8e1eb7c8173b44c3e66f999
# bad: [0e470763d84dcad27284067647dfb4b1a94dfce0] Merge tag 'efi-next-for-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
git bisect bad 0e470763d84dcad27284067647dfb4b1a94dfce0
# good: [b7f257ceb3c88ee3e2c6b0d1db703c818d3971f1] Merge branches 'clk-fixed-rate', 'clk-spreadtrum', 'clk-pxa' and 'clk-ti' into clk-next
git bisect good b7f257ceb3c88ee3e2c6b0d1db703c818d3971f1
# good: [a6afa4199d3d038fbfdff5511f7523b0e30cb774] Merge tag 'mailbox-v6.1' of git://git.linaro.org/landing-teams/working/fujitsu/integration
git bisect good a6afa4199d3d038fbfdff5511f7523b0e30cb774
# good: [a241d94bb532dcfb7ef3f723e6a0a0e7cf8f10ea] efi: libstub: fix type confusion for load_options_size
git bisect good a241d94bb532dcfb7ef3f723e6a0a0e7cf8f10ea
# good: [40cd01a9c324bd238e107d9d5ecb6824146a7836] efi/loongarch: libstub: remove dependency on flattened DT
git bisect good 40cd01a9c324bd238e107d9d5ecb6824146a7836
# good: [69e377b289376147c84cfd09bab1ad0328a0ecc6] efi/arm: libstub: move ARM specific code out of generic routines
git bisect good 69e377b289376147c84cfd09bab1ad0328a0ecc6
# good: [3c6edd9034240ce9582be3392112321336bd25bb] efi: zboot: create MemoryMapped() device path for the parent if needed
git bisect good 3c6edd9034240ce9582be3392112321336bd25bb
# bad: [d3549a938b73f203ef522562ae9f2d38aa43d234] efi/arm64: libstub: avoid SetVirtualAddressMap() when possible
git bisect bad d3549a938b73f203ef522562ae9f2d38aa43d234
# first bad commit: [d3549a938b73f203ef522562ae9f2d38aa43d234] efi/arm64: libstub: avoid SetVirtualAddressMap() when possible
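
A saved log like the one above can be reused: `git bisect replay <file>` re-applies the recorded good/bad marks in another checkout, and the culprit hash can be pulled out of the log text for a follow-up revert test. A minimal sketch, assuming the log was saved to a file; `first_bad_commit` is a hypothetical helper name:

```sh
first_bad_commit() {
    # Print the hash from the "# first bad commit: [<hash>] <subject>" line.
    sed -n 's/^# first bad commit: \[\([0-9a-f]*\)\].*/\1/p' "$1"
}
```

For example, after `git bisect log > bisect.log`, running `git show "$(first_bad_commit bisect.log)"` displays the offending change, and `git revert` with the same argument produces a tree for the "reverted" test boot.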

Member

mweinelt commented Apr 6, 2023

❯ ./scripts/get_maintainer.pl d3549a938b73f203ef522562ae9f2d38aa43d234.patch
Ard Biesheuvel <ardb@kernel.org> (maintainer:EXTENSIBLE FIRMWARE INTERFACE (EFI),commit_signer:7/8=88%,authored:6/8=75%,added_lines:40/48=83%,removed_lines:19/24=79%,commit_signer:8/8=100%,added_lines:84/171=49%,removed_lines:84/127=66%)
Mark Brown <broonie@kernel.org> (commit_signer:1/8=12%,authored:1/8=12%,removed_lines:2/24=8%)
Catalin Marinas <catalin.marinas@arm.com> (commit_signer:1/8=12%)
Kristina Martsenko <kristina.martsenko@arm.com> (commit_signer:1/8=12%)
Greg Kroah-Hartman <gregkh@linuxfoundation.org> (commit_signer:1/8=12%)
Darren Hart <darren@os.amperecomputing.com> (authored:1/8=12%,added_lines:6/48=12%,removed_lines:3/24=12%)
Ilias Apalodimas <ilias.apalodimas@linaro.org> (commit_signer:2/8=25%,authored:2/8=25%,added_lines:87/171=51%,removed_lines:43/127=34%)
linux-efi@vger.kernel.org (open list:EXTENSIBLE FIRMWARE INTERFACE (EFI))
linux-kernel@vger.kernel.org (open list)

Member Author

cole-h commented Apr 7, 2023

Did a little bit of searching on the kernel archives and noticed these two patches that are related:

I don't know how likely it is this gets backported because it doesn't apply cleanly to 6.1.23 (drivers/firmware/efi/libstub/arm64.c doesn't exist on that tag, which is where most of the magic is implemented)...
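
Whether a path exists on a given tag can be checked without checking that tag out. A minimal sketch using `git cat-file -e`, which exits non-zero when the named object is missing; `path_exists_at_rev` is a hypothetical helper, and it assumes it is run inside a kernel clone that has the tag in question:

```sh
path_exists_at_rev() {
    # $1 = revision (tag, branch, or commit), $2 = path relative to the repo root.
    # Succeeds only if that path exists in the tree of that revision.
    git cat-file -e "$1:$2" 2>/dev/null
}
```

For example, `path_exists_at_rev v6.1.23 drivers/firmware/efi/libstub/arm64.c && echo present || echo missing` repeats the check described above for the 6.1.23 tag.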

Contributor

misuzu commented Nov 7, 2023

Is this still an issue?

Member Author

cole-h commented Nov 7, 2023

I haven't tested any more recent kernels, so I don't know. It looks like Nixpkgs' linuxPackages is 6.1.61 right now, which still doesn't have drivers/firmware/efi/libstub/arm64.c, so at least that stall would likely still apply.

@cole-h cole-h mentioned this pull request Nov 10, 2023