Kernel crash at __sys_recvfrom() on NVMe TLS connections during port toggles #71

Open
hreinecke opened this issue Jul 18, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@hreinecke
Contributor

(mirrored from our bugzilla)
A partner of ours has seen a crash in __sys_recvfrom during NVMe-oF port toggles:

[ 2455.760904] RIP: 0010:__sys_recvfrom+0x94/0x110
[ 2455.760920] Code: 64 48 8d 54 24 04 44 89 ef 48 89 e6 e8 55 cd ff ff 48 85 c0 49 89 c5 74 49 48 8b 50 10 89 d8 48 8d 74 24 18 83 c8 40 4c 89 ef 42 41 08 0f 45 d8 89 da e8 2e d8 ff ff 48 85 ed 89 04 24 74 1a
[ 2455.760932] RSP: 0018:ffffa523cbb1bd00 EFLAGS: 00010202
[ 2455.760939] RAX: 0000000000000040 RBX: 0000000000000000 RCX: 0000000000000000
[ 2455.760946] RDX: 0000000000000000 RSI: ffffa523cbb1bd18 RDI: ffff926f7dfca700
[ 2455.760952] RBP: 0000000000000000 R08: 0000000000000006 R09: 0000000000000000
[ 2455.760957] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa523cbb1bd80
[ 2455.760963] R13: ffff926f7dfca700 R14: 0000000000000000 R15: 0000000000000000
[ 2455.760970] FS: 00007f1aa080f940(0000) GS:ffff927327c00000(0000) knlGS:0000000000000000
[ 2455.760977] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2455.760984] CR2: 0000000000000041 CR3: 00000004d7dce003 CR4: 00000000003706e0
[ 2455.760990] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2455.760996] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2455.761003] Call Trace:
[ 2455.761009]
[ 2455.761033] ? __die_body+0x1a/0x60
[ 2455.761043] ? page_fault_oops+0x131/0x510
[ 2455.761053] ? ip_output+0x5d/0xf0
[ 2455.761064] ? ip_output+0x5d/0xf0
[ 2455.761070] ? exc_page_fault+0x69/0x150
[ 2455.761079] ? asm_exc_page_fault+0x22/0x30
[ 2455.761089] ? __sys_recvfrom+0x94/0x110
[ 2455.761095] ? fsnotify_destroy_marks+0x24/0x160
[ 2455.761104] ? __call_rcu_common.constprop.76+0x114/0x7f0
[ 2455.761111] ? __rseq_handle_notify_resume+0xab/0x4d0
[ 2455.761120] __x64_sys_recvfrom+0x24/0x30
[ 2455.761125] do_syscall_64+0x5b/0x80
[ 2455.761130] ? switch_fpu_return+0x4c/0xd0
[ 2455.761135] ? exit_to_user_mode_prepare+0x142/0x220
[ 2455.761141] ? syscall_exit_to_user_mode+0x1e/0x40
[ 2455.761147] ? do_syscall_64+0x67/0x80
[ 2455.761150] ? do_user_addr_fault+0x446/0x890
[ 2455.761157] ? exc_page_fault+0x69/0x150
[ 2455.761162] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 2455.761167] RIP: 0033:0x7f1a9fb30ef9
[ 2455.761203] Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d 39 cb 0d 00 00 41 89 ca 74 1c 45 31 c9 45 31 c0 b8 2d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 67 c3 66 0f 1f 44 00 00 55 48 83 ec 20 48 89
[ 2455.761210] RSP: 002b:00007ffff80abe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
[ 2455.761215] RAX: ffffffffffffffda RBX: 00000000011521b0 RCX: 00007f1a9fb30ef9
[ 2455.761218] RDX: 0000000000000005 RSI: 0000000001153e9b RDI: 0000000000000006
[ 2455.761222] RBP: 0000000001153e40 R08: 0000000000000000 R09: 0000000000000000
[ 2455.761225] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 2455.761228] R13: 0000000000000005 R14: 0000000000000005 R15: 00007ffff80abf7c
[ 2455.761233]
[ 2455.761235] Modules linked in: nvme_tcp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfsv3 nfs_acl nfs lockd grace fscache netfs rpcrdma sunrpc rdma_ucm ib_umad ib_iser rdma_cm iw_cm ib_ipoib libiscsi scsi_transport_iscsi ib_cm af_packet iscsi_ibft iscsi_boot_sysfs rfkill mlx5_ib ib_uverbs macsec ib_core ipmi_ssif intel_rapl_msr intel_rapl_common mlx5_core intel_uncore_frequency intel_uncore_frequency_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt mlxfw intel_pmc_bxt acpi_ipmi psample iTCO_vendor_support ipmi_si kvm tls joydev lpc_ich i2c_i801 mei_me ipmi_devintf pci_hyperv_intf(X) i2c_smbus mei ipmi_msghandler be2net pcspkr mfd_core irqbypass button ac dm_multipath dm_mod fuse dmi_sysfs ip_tables x_tables hid_generic usbhid lpfc nvmet_fc nvmet nvme_keyring configfs nvme_fc ahci nvme_fabrics libahci nvme_core libata megaraid_sas crc32_pclmul nvme_auth scsi_transport_fc sd_mod scsi_dh_emc ghash_clmulni_intel scsi_dh_rdac sha512_ssse3 scsi_dh_alua t10_pi sha256_ssse3 xhci_pci
[ 2455.761326] sha1_ssse3 xhci_pci_renesas xhci_hcd crc64_rocksoft_generic ehci_pci crc64_rocksoft ehci_hcd sg aesni_intel usbcore crypto_simd scsi_mod cryptd mgag200 i2c_algo_bit crc64 wmi btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq msr
[ 2455.761371] Supported: No, Unreleased kernel
[ 2455.761376] CR2: 0000000000000041

@hreinecke
Contributor Author

This looks like tlshd issuing recvfrom() on an invalid file handle, but I'm at a loss as to how that could happen.
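
For reference, here is a minimal user-space sketch (not part of the original report, and not tlshd code) of what normally happens when recvfrom() is called on a descriptor that has already been closed: the kernel's descriptor lookup fails and the call returns EBADF instead of faulting. The helper name `closed_fd_errno` and the use of a UNIX socketpair are assumptions made only for illustration.

```python
import errno
import os
import socket


def closed_fd_errno() -> int:
    """Call recvfrom() on a closed descriptor and return the resulting errno.

    Hypothetical helper for illustration only; it does not reproduce the
    kernel crash, it only shows the expected failure mode for a stale fd.
    """
    a, b = socket.socketpair()
    fd = a.fileno()
    # Close the descriptor behind the socket object's back, the way a
    # stale or prematurely released handle would look to the caller.
    os.close(fd)
    try:
        a.recvfrom(5)
        return 0  # not expected: the descriptor no longer exists
    except OSError as exc:
        return exc.errno
    finally:
        a.detach()  # forget the stale fd so Python does not close it again
        b.close()


if __name__ == "__main__":
    print(errno.errorcode[closed_fd_errno()])
```

Under these assumptions a plainly invalid descriptor is rejected before any socket code runs, so the oops suggests the descriptor still resolved to a socket whose internal state had already been torn down.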

@chucklever
Member

The first thing that comes to mind is that the kernel code is releasing the socket during the handshake.

@hreinecke
Contributor Author

Isn't that protected by a reference count?

@hreinecke
Contributor Author

I bet it's the blasted 'sock' vs 'sock->file' duality, where a release on one kills the other.

@chucklever
Member

It is protected by a reference count, but there are a half-dozen ways that reference count can be screwed up or defeated.

@chucklever chucklever added the bug Something isn't working label Jul 19, 2024