Unset CONFIG_THERMAL_STATISTICS to prevent kernel crash #199

Merged: 1 commit merged into sonic-net:master on Mar 10, 2021


allas-nvidia
Contributor

@allas-nvidia allas-nvidia commented Mar 9, 2021

Fix sonic-net/sonic-buildimage#6866

Unset CONFIG_THERMAL_STATISTICS.
Reason:
Binding kernel thermal zones to the cooling device with CONFIG_THERMAL_STATISTICS=y crashes the kernel because of an out-of-bounds access:
trans_table is a two-dimensional table sized by the maximum cooling state (10).
If statistics are enabled, thermal_cooling_device_stats_update() is called and tries to update an entry beyond the bounds of the table:
stats->trans_table[stats->state * stats->max_states + new_state]++
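To illustrate the failure mode, here is a minimal user-space sketch in C of the flat index computed by the increment quoted above, together with a bounds-checked variant. The struct `cooling_stats` and the helpers `transition_index` and `record_transition` are hypothetical names introduced for illustration. They model only the fields named in this description and are not the kernel's actual code or the eventual upstream fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's cooling-device statistics;
 * only the fields named in the description are modeled. */
struct cooling_stats {
    unsigned int state;        /* current cooling state (row of the table) */
    unsigned int max_states;   /* number of states the table was sized for */
    unsigned int *trans_table; /* max_states * max_states counters, row-major */
};

/* Flat index for the transition state -> new_state, or -1 if either
 * state lies outside the max_states x max_states table. */
static long transition_index(const struct cooling_stats *stats,
                             unsigned int new_state)
{
    if (stats->state >= stats->max_states || new_state >= stats->max_states)
        return -1;
    return (long)stats->state * (long)stats->max_states + (long)new_state;
}

/* Bounds-checked variant of the increment.
 * Returns 0 when the counter was updated, -1 when the update was rejected. */
static int record_transition(struct cooling_stats *stats,
                             unsigned int new_state)
{
    long idx;

    assert(stats->trans_table != NULL);
    idx = transition_index(stats, new_state);
    if (idx < 0)
        return -1;
    stats->trans_table[idx]++;
    stats->state = new_state;
    return 0;
}
```

With `max_states` set to 10 the table holds 100 counters (indices 0 to 99). Starting from state 9, a transition to state 3 maps to index 93 and is counted. A requested state of 12 would map to index 102, past the end of the table, so the sketch rejects it instead of writing there.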

The kernel crashes with the following stack trace:

[  269.474092] watchdog: watchdog1: watchdog did not stop!
[  269.533625] list_del corruption. prev->next should be ffff9e136bd57418, but was 677ac660ffffffff

[  269.543482] kernel BUG at lib/list_debug.c:53!
[  269.548458] invalid opcode: 0000 [#1] SMP PTI
[  269.553326] CPU: 1 PID: 8890 Comm: kexec Tainted: G           OE     4.19.0-9-2-amd64 #1 Debian 4.19.118-2+deb10u1
[  269.564891] Hardware name: Mellanox Technologies Ltd. MSN4700/VMOD0010, BIOS 5.11 11/03/2020
[  269.574323] RIP: 0010:__list_del_entry_valid.cold.1+0x34/0x4c
[  269.580740] Code: 9f 29 a5 e8 68 7a d0 ff 0f 0b 48 c7 c7 20 a0 29 a5 e8 5a 7a d0 ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 e0 9f 29 a5 e8 46 7a d0 ff <0f> 0b 48 89 fe 48 c7 c7 a8 9f 29 a5 e8 35 7a d0 ff 0f 0b 90 90 90
[  269.601726] RSP: 0018:ffffaddb83b5fdc0 EFLAGS: 00010246
[  269.607561] RAX: 0000000000000054 RBX: ffff9e136bd57418 RCX: 0000000000000000
[  269.615531] RDX: 0000000000000000 RSI: ffff9e136fa566b8 RDI: ffff9e136fa566b8
[  269.623500] RBP: ffff9e1364bd5070 R08: 00000000000005ce R09: 0000000000000004
[  269.631470] R10: 0000000000000766 R11: ffffffffa59f66ad R12: ffff9e136bd57400
[  269.639440] R13: ffffffffa52c6a12 R14: ffff9e1364bd30d0 R15: 0000000000000000
[  269.647410] FS:  00007f97227af740(0000) GS:ffff9e136fa40000(0000) knlGS:0000000000000000
[  269.656441] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  269.662857] CR2: 000055cfdb69e158 CR3: 00000004677f6001 CR4: 00000000003606e0
[  269.670820] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  269.678790] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  269.686760] Call Trace:
[  269.689489]  device_shutdown+0xc1/0x210
[  269.693773]  kernel_kexec+0x51/0x96
[  269.697666]  __do_sys_reboot+0x1be/0x210
[  269.702045]  ? kmem_cache_free+0x1aa/0x1d0
[  269.706618]  ? __dentry_kill+0x121/0x170
[  269.710998]  ? _cond_resched+0x15/0x30
[  269.715181]  ? dentry_kill+0x4d/0x190
[  269.719260]  ? _cond_resched+0x15/0x30
[  269.723444]  do_syscall_64+0x53/0x110
[  269.727531]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  269.733172] RIP: 0033:0x7f97228a3373
[  269.737161] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 e9 9a 0c 00 f7 d8
[  269.758147] RSP: 002b:00007ffe11d30fa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
[  269.766602] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f97228a3373
[  269.774572] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[  269.782541] RBP: 0000000000000002 R08: 0000000000000004 R09: 000055cfdb69e160
[  269.790511] R10: fffffffffffffb8e R11: 0000000000000202 R12: 00007ffe11d31238
[  269.798482] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000ffffffff
[  269.806443] Modules linked in: nft_chain_route_ipv4(E) xt_TCPMSS(E) sx_bfd(OE) sx_netdev(OE) psample(E) dummy(E) sx_core(OE) 8021q(E) garp(E) mrp(E) mst_pciconf(OE) mst_pci(OE) xt_hl(E) xt_tcpudp(E) ip6_tables(E) nft_compat(E) nft_counter(E) xt_conntrack(E) nf_nat(E) nf_conntrack_netlink(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xfrm_user(E) xfrm_algo(E) intel_rapl(E) mlxsw_minimal(E) sb_edac(E) mlxsw_i2c(E) x86_pkg_temp_thermal(E) mlxsw_core(E) intel_powerclamp(E) devlink(E) kvm_intel(E) bonding(E) kvm(E) i2c_mux_reg(E) i2c_mux(E) mlxreg_hotplug(E) mlxreg_io(E) leds_mlxreg(E) i2c_mlxcpld(E) mlxreg_fan(E) mxm_wmi(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) evdev(E) mlx_platform(E) ghash_clmulni_intel(E) intel_cstate(E) sg(E) intel_uncore(E) iTCO_wdt(E) pcspkr(E)
[  269.885239]  intel_rapl_perf(E) ioatdma(E) iTCO_vendor_support(E) pcc_cpufreq(E) wmi(E) ebt_vlan(E) ebtable_broute(E) bridge(E) stp(E) llc(E) ebtable_nat(E) nf_tables(E) button(E) nfnetlink(E) ebtable_filter(E) ebtables(E) xdpe12284(E) at24(E) ledtrig_timer(E) tmp102(E) lm75(E) coretemp(E) max1363(E) industrialio_triggered_buffer(E) kfifo_buf(E) industrialio(E) tps53679(E) pmbus(E) pmbus_core(E) i2c_dev(E) ip_tables(E) x_tables(E) autofs4(E) loop(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32c_generic(E) fscrypto(E) ecb(E) sd_mod(E) nvme(E) nvme_core(E) nls_utf8(E) nls_cp437(E) nls_ascii(E) vfat(E) fat(E) overlay(E) squashfs(E) zstd_decompress(E) xxhash(E) crc32c_intel(E) gpio_ich(E) ahci(E) aesni_intel(E) libahci(E) aes_x86_64(E) crypto_simd(E) xhci_pci(E) ehci_pci(E) libata(E) igb(E) ehci_hcd(E)
[  269.964036]  xhci_hcd(E) cryptd(E) glue_helper(E) scsi_mod(E) i2c_algo_bit(E) i2c_i801(E) lpc_ich(E) dca(E) mfd_core(E) usbcore(E) usb_common(E)
[  269.978536] ---[ end trace 8f56c678b52f9aee ]---
[  269.983698] RIP: 0010:__list_del_entry_valid.cold.1+0x34/0x4c
[  269.990123] Code: 9f 29 a5 e8 68 7a d0 ff 0f 0b 48 c7 c7 20 a0 29 a5 e8 5a 7a d0 ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 e0 9f 29 a5 e8 46 7a d0 ff <0f> 0b 48 89 fe 48 c7 c7 a8 9f 29 a5 e8 35 7a d0 ff 0f 0b 90 90 90
[  270.011117] RSP: 0018:ffffaddb83b5fdc0 EFLAGS: 00010246
[  270.016958] RAX: 0000000000000054 RBX: ffff9e136bd57418 RCX: 0000000000000000
[  270.024935] RDX: 0000000000000000 RSI: ffff9e136fa566b8 RDI: ffff9e136fa566b8
[  270.032912] RBP: ffff9e1364bd5070 R08: 00000000000005ce R09: 0000000000000004
[  270.040890] R10: 0000000000000766 R11: ffffffffa59f66ad R12: ffff9e136bd57400
[  270.048866] R13: ffffffffa52c6a12 R14: ffff9e1364bd30d0 R15: 0000000000000000
[  270.056844] FS:  00007f97227af740(0000) GS:ffff9e136fa40000(0000) knlGS:0000000000000000
[  270.065889] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.072312] CR2: 000055cfdb69e158 CR3: 00000004677f6001 CR4: 00000000003606e0
[  270.080289] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  270.088268] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

As a temporary solution, this config option is disabled while we work with the Linux community on a fix.
The fix requires a fan driver update, which is not trivial and will take some time to become available on net-next before it can be backported to the SONiC linux-kernel.

It was tested on:
HwSKU: ACS-MSN2410
HwSKU: Mellanox-SN2700

@liat-grozovik
Collaborator

@allas-nvidia please fix the typo in the title: ecxlude should be exclude.

@liat-grozovik
Collaborator

@lguohan can you please take it for 202012 as well?

@paulmenzel
Contributor

Set CONFIG_THERMAL_STATISTICS to no.

Although I know what you mean, I think the Kconfig language actually uses the term unset (CONFIG_THERMAL_STATISTICS is not set).

A temporary solution is to disable this config and to work with the linux community on fixing it.

It’d be great if you pasted the URL to the bug report to the maintainers.

The solutions requires fan driver update which is not trivial and will take some time to have it available on next-net before can be backported to SONiC linux-kernel

  1. The solution (so no plural s?)
  2. Please add a dot/period to the end of sentences.

Please add which device you tested this on, and maybe also add the Linux backtrace to the commit message, so people searching for the error have a chance of finding this merge/pull request.
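For reference, the Kconfig convention mentioned above records a disabled boolean option in a generated .config as a comment line of the form `# CONFIG_FOO is not set`, while an enabled option appears as `CONFIG_FOO=y`. The following is a minimal C sketch that classifies a single line accordingly. The enum `option_state` and the helper `classify_config_line` are hypothetical names used for illustration only and are not part of any kernel tooling.

```c
#include <assert.h>
#include <string.h>

enum option_state {
    OPTION_ABSENT,   /* the line does not mention the option */
    OPTION_UNSET,    /* "# NAME is not set" */
    OPTION_ASSIGNED  /* "NAME=<value>", for example "=y" or "=m" */
};

/* Classify one .config line with respect to `name`, for example
 * "CONFIG_THERMAL_STATISTICS". Only the line prefix is inspected,
 * so a trailing newline is tolerated. */
static enum option_state classify_config_line(const char *line,
                                              const char *name)
{
    size_t n;

    assert(line != NULL && name != NULL);
    n = strlen(name);

    /* Disabled options are written as a comment. */
    if (strncmp(line, "# ", 2) == 0 &&
        strncmp(line + 2, name, n) == 0 &&
        strncmp(line + 2 + n, " is not set", 11) == 0)
        return OPTION_UNSET;

    /* The '=' check keeps longer option names with the same prefix
     * from matching. */
    if (strncmp(line, name, n) == 0 && line[n] == '=')
        return OPTION_ASSIGNED;

    return OPTION_ABSENT;
}
```

Under these assumptions, a kernel built with this change would be expected to carry the line `# CONFIG_THERMAL_STATISTICS is not set` in its .config, which the helper reports as `OPTION_UNSET`.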

@paulmenzel
Contributor

And please reference issues in the commit message:

Resolves: sonic-net/sonic-buildimage#6866

@allas-nvidia allas-nvidia changed the title from "ecxlude CONFIG_THERMAL_STATISTICS" to "Unset CONFIG_THERMAL_STATISTICS to prevent kernel crash :" Mar 10, 2021
@allas-nvidia allas-nvidia changed the title from "Unset CONFIG_THERMAL_STATISTICS to prevent kernel crash :" to "Unset CONFIG_THERMAL_STATISTICS to prevent kernel crash" Mar 10, 2021
@allas-nvidia allas-nvidia marked this pull request as ready for review March 10, 2021 09:53
@liat-grozovik
Collaborator

@allas-nvidia can you please also add a reference to the Linux kernel bug tracking entry in the description?

@dprital
Collaborator

dprital commented Mar 10, 2021

@lguohan - who can merge it to master and to 202012?

@lguohan lguohan merged commit 30b9a59 into sonic-net:master Mar 10, 2021
@dprital
Collaborator

dprital commented Mar 10, 2021

@daall - Can you please merge to 202012? Thanks.

daall pushed a commit that referenced this pull request Mar 10, 2021
Junchao-Mellanox pushed a commit to Junchao-Mellanox/sonic-linux-kernel that referenced this pull request Mar 16, 2021
Successfully merging this pull request may close these issues.

Kernel crash observed during kexec
6 participants