be2net update for UEK4 qu7 #8

selvintxavier · 2019-03-30T06:06:31Z

Includes update of be2net driver to 12.0.0.0 version.
Couple of patches are for ocrdma because of some dependency to be2net patch.

Thanks

Orabug: 29475071 Dispatch only port event to IB stack when port state changes. Don't explicitly modify qps to error. Let application listen to port events on async event queue or let QP fail with retry-exceeded completion error. Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@avagotech.com> Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Doug Ledford <dledford@redhat.com>

Orabug: 29475071 Recently Dough Ledford reported a deadlock happening between ocrdma-load sequence and NetworkManager service issuing "open" on be2net interface. The deadlock happens when any be2net hook (e.g. open/close) is called in parallel to insmod ocrdma.ko. A. be2net is sending administrative open/close event to ocrdma holding device_list_mutex. It does this from ndo_open/ndo_stop hooks of be2net. So sequence of locks is rtnl_lock---> device_list lock B. When new ocrdma roce device gets registered, infiniband stack now takes rtnl_lock in ib_register_device() in GID initialization routines. So sequence of locks in this path is device_list lock ---> rtnl_lock. This improper locking sequence causes deadlock. With this patch we stop using administrative open and close events injected by be2net driver. These events were used to dispatch PORT_ACTIVE and PORT_ERROR events to the IB-stack. This patch implements a logic to receive async-link-events generated from CNA whenever link-state-change is detected. Now on, these async-events will be used to dispatch PORT_ACTIVE and PORT_ERROR events to IB-stack. Depending on async-events from CNA removes the need to hold device-list-mutex and thus breaks the busy-wait scenario. Reported-by: Doug Ledford <dledford@redhat.com> CC: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@avagotech.com> Signed-off-by: Selvin Xavier <selvin.xavier@avagotech.com> Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Doug Ledford <dledford@redhat.com>

Orabug: 29475071 Recently Dough Ledford reported a deadlock happening between ocrdma-load sequence and NetworkManager service issueing "open" on be2net interface. The deadlock happens when any be2net hook (e.g. open/close) is called in parallel to insmod ocrdma.ko. A. be2net is sending administrative open/close event to ocrdma holding device_list_mutex. It does this from ndo_open/ndo_stop hooks of be2net. So sequence of locks is rtnl_lock---> device_list lock B. When new ocrdma roce device gets registered, infiniband stack now takes rtnl_lock in ib_register_device() in GID initialization routines. So sequence of locks in this path is device_list lock ---> rtnl_lock. This improper locking sequence causes deadlock. In order to resolve the above deadlock condition, ocrdma intorduced a patch to stop listening to administrative open/close events generated from be2net driver. It now depends on link-state-change async-event generated from CNA. This change leaves behind dead code which used to generate administrative open/close events. This patch cleans-up all that dead code from be2net. Reported-by: Doug Ledford <dledford@redhat.com> CC: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@avagotech.com> Signed-off-by: Selvin Xavier <selvin.xavier@avagotech.com> Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Doug Ledford <dledford@redhat.com>

Orabug: 29475071 warning: variable ‘netdev’ set but not used Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 Note: This is compile only tested as I have no access to the hw. No benefit gained except for some self-documenting. add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0) Function old new delta Total: Before=2757703, After=2757703, chg +0.00% Signed-off-by: Hernán Gonzalez <hernan@vanguardiasur.com.ar> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 Prefer the direct use of octal for permissions. Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace and some typing. Miscellanea: o Whitespace neatening around these conversions. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 Check for 0xE00 (RECOVERABLE_ERR) along with ARMFW UE (0x0) in be_detect_error() to know whether the error is valid error or not Fixes: 673c96e ("be2net: Fix UE detection logic for BE3") Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 The commit 2632baf ("be2net: fix adaptive interrupt coalescing") introduced a separate struct be_aic_obj to hold AIC information but unfortunately left the old stuff in be_eq_obj. So remove it. Fixes: 2632baf ("be2net: fix adaptive interrupt coalescing") Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 The commit fb6113e ("be2net: get rid of custom busy poll code") replaced custom busy-poll code by the generic one but left several macros and fields in struct be_eq_obj that are currently unused. Remove this stuff. Fixes: fb6113e ("be2net: get rid of custom busy poll code") Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 The event queue description (be_eq_obj.desc) field is used only to format string for IRQ name and it is not really needed to hold this value. Remove it and use local variable to format string for IRQ name. Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 Re-order fields in struct be_eq_obj to ensure that .napi field begins at start of cache-line. Also the .adapter field is moved to the first cache-line next to .q field and 3 fields (idx,msi_idx,spurious_intr) and the 4-bytes hole to 3rd cache-line. Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 Before patch: struct be_tx_obj { u32 db_offset; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ struct be_queue_info q; /* 8 56 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct be_queue_info cq; /* 64 56 */ struct be_tx_compl_info txcp; /* 120 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 2 boundary (128 bytes) --- */ struct sk_buff * sent_skb_list[2048]; /* 128 16384 */ ... }: After patch: struct be_tx_obj { u32 db_offset; /* 0 4 */ struct be_tx_compl_info txcp; /* 4 4 */ struct be_queue_info q; /* 8 56 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct be_queue_info cq; /* 64 56 */ struct sk_buff * sent_skb_list[2048]; /* 120 16384 */ ... }; Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 - Unionize two u8 fields where only one of them is used depending on NIC chipset. - Move recovery_supported field after that union These changes eliminate 7-bytes hole in the struct and makes it smaller by 8 bytes. Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 The current position of .rss_flags field in struct rss_info causes that fields .rsstable and .rssqueue (both 128 bytes long) crosses cache-line boundaries. Moving it at the end properly align all fields. Before patch: struct rss_info { u64 rss_flags; /* 0 8 */ u8 rsstable[128]; /* 8 128 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ u8 rss_queue[128]; /* 136 128 */ /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */ u8 rss_hkey[40]; /* 264 40 */ }; After patch: struct rss_info { u8 rsstable[128]; /* 0 128 */ /* --- cacheline 2 boundary (128 bytes) --- */ u8 rss_queue[128]; /* 128 128 */ /* --- cacheline 4 boundary (256 bytes) --- */ u8 rss_hkey[40]; /* 256 40 */ u64 rss_flags; /* 296 8 */ }; Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>

…-timeout Orabug: 29475071 This patch handles a TX-timeout as follows: 1) This patch gathers and prints the following info that can help in diagnosing the cause of a TX-timeout. a) TX queue and completion queue entries. b) SKB and TCP/UDP header details. 2) For Lancer NICs (TX-timeout recovery is not supported for BE3/Skyhawk-R NICs), it recovers from the TX timeout as follows: a) On a TX-timeout, driver sets the PHYSDEV_CONTROL_FW_RESET_MASK bit in the PHYSDEV_CONTROL register. Lancer firmware goes into an error state and indicates this back to the driver via a bit in a doorbell register. b) Driver detects this and calls be_err_recover(). DMA is disabled, all pending TX skbs are unmapped and freed (be_close()). All rings are destroyed (be_clear()). c) The driver waits for the FW to re-initialize and re-creates all rings along with other data structs (be_resume()) Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 Trivial fix to spelling mistake in dev_info message. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Addresses-Coverity-ID: 114787 ("Missing break in switch") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 Add flags to enable/disable supported chips in be2net. With disable support are removed coresponding PCI IDs and also codepaths with [BE2|BE3|BEx|lancer|skyhawk]_chip checks. Disable chip will reduce module size by: BE2 ~2kb BE3 ~3kb Lancer ~10kb Skyhawk ~9kb When enable skyhawk only it will reduce module size by ~20kb New help style in Kconfig Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Petr Oros <poros@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 DMA allocated memory is lost in be_cmd_get_profile_config() when we call it with non-NULL port_res parameter. Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 the be2net implementation of .ndo_tunnel_{add,del}() changes the value of NETIF_F_GSO_UDP_TUNNEL bit in 'features' and 'hw_features', but it forgets to call netdev_features_change(). Moreover, ethtool setting for that bit can potentially be reverted after a tunnel is added or removed. GSO already does software segmentation when 'hw_enc_features' is 0, even if VXLAN offload is turned on. In addition, commit 096de2f ("benet: stricter vxlan offloading check in be_features_check") avoids hardware segmentation of non-VXLAN tunneled packets, or VXLAN packets having wrong destination port. So, it's safe to avoid flipping the above feature on addition/deletion of VXLAN tunnels. Fixes: 630f4b7 ("be2net: Export tunnel offloads only when a VxLAN tunnel is created") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 The mentioned commit needs to be reverted because we cannot pass string allocated on stack to request_irq(). This function stores uses this pointer for later use (e.g. /proc/interrupts) so we need to keep this string persistently. Fixes: d6d9704 ("be2net: remove desc field from be_eq_obj") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>

Orabug: 29475071 is_broadcast_packet() expands to compare_ether_addr() which doesn't exist since commit 7367d0b ("drivers/net: Convert uses of compare_ether_addr to ether_addr_equal"). It turns out it's actually not used. Signed-off-by: Lubomir Rintel <lkundrak@v3.sk> Signed-off-by: David S. Miller <davem@davemloft.net>

This work around should be reverted when upstream commit (d8b91dd Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip) is available in uek. Issue appear is fixed in upstream tag 4.16.0 . Tag 4.15.0 still has this issue. The lask known commit on the perf topic branch that solve this issue is this: (c19d084 (tag: perf-core-for-mingo-4.16-20180125) perf trace beauty flock: Move to separate object file). Without this commit the perf topic branch has the below issue. With this commit the branch does not have the issue. Issue is that the above commit does not fix the issue on top of upstream tag 4.15.0. So the issue is probably fixed by this commit and some additional commits on the perf topic branch *or/and* on master branch below the point that the perf branch was branched. Also this specific commit is not a fix and the only possible relation to this bug is that it touches the 'flock' code which is used by bash/scripts to synchronize. To find the additional commits via git bisect I need to re-order the commits so that the above commit will be *below* the other commits that solve this issue. To do that I need to know what's the lowest commit that relate to this fix. I do not know and have no way to know that. Attempt to merge the perf topic on top of uek5 produce ~20k commits and tons of merge conflicts as uek5 is way behind the upstream. So can't even know if the topic branch with it's ~270 commits fix this issue for uek5. So I chose to work-around the issue and wait for the upstream topic merge to obsolite this commit. When issue occuer: Serial is flooded with messages: [71266.680745] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.682740] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.685738] bondib0: link status up for interface ib0, enabling it in 0 ms Then panic occur: [71266.695757] INFO: task NetworkManager:5837 blocked for more than 120 seconds. [71266.695759] Not tainted 4.14.35-1902.0.6.el7uek.x86_64 #2 [71266.695760] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [71266.695761] NetworkManager D 0 5837 1 0x00000082 [71266.695765] Call Trace: [71266.695778] __schedule+0x2bc/0x8da [71266.695782] schedule+0x36/0x7c [71266.695785] schedule_preempt_disabled+0xe/0x10 [71266.695788] __mutex_lock.isra.5+0x20c/0x634 [71266.695792] __mutex_lock_slowpath+0x13/0x15 [71266.695794] mutex_lock+0x2f/0x3a [71266.695800] rtnetlink_rcv_msg+0x1d0/0x289 [71266.695806] ? __skb_try_recv_datagram+0xca/0x174 [71266.695809] ? rtnl_calcit.isra.25+0x110/0x103 [71266.695812] netlink_rcv_skb+0xdf/0x111 [71266.695816] rtnetlink_rcv+0x15/0x17 [71266.695818] netlink_unicast+0x18d/0x255 [71266.695820] netlink_sendmsg+0x2df/0x3cc [71266.695825] sock_sendmsg+0x3e/0x4a [71266.695828] ___sys_sendmsg+0x2b5/0x2c6 [71266.695832] ? arch_tlb_finish_mmu+0x1b/0xcb [71266.695835] ? tlb_finish_mmu+0x23/0x30 [71266.695838] ? unmap_region+0xf4/0x12d [71266.695844] ? lockref_put_or_lock+0x44/0x72 [71266.695846] ? __vma_rb_erase+0x10f/0x1f4 [71266.695850] __sys_sendmsg+0x54/0x8d [71266.695854] SyS_sendmsg+0x12/0x1c [71266.695860] do_syscall_64+0x79/0x1ae [71266.695864] entry_SYSCALL_64_after_hwframe+0x151/0x0 [71266.695866] RIP: 0033:0x7f16f2553c5d [71266.695867] RSP: 002b:00007ffff7a493f0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [71266.695870] RAX: ffffffffffffffda RBX: 00005570a5026380 RCX: 00007f16f2553c5d [71266.695874] RDX: 0000000000000000 RSI: 00007ffff7a49420 RDI: 0000000000000007 [71266.695875] RBP: 00007ffff7a49420 R08: 0000000000000001 R09: 0000000000000000 [71266.695876] R10: 0000000000000808 R11: 0000000000000293 R12: 00005570a5026380 [71266.695876] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f16d4004b70 Issue analysis: The ip process is hung in addrconf_notify while trying to print to serial one of the below messages: "ADDRCONF(NETDEV_UP): %s: link is not ready\n" "ADDRCONF(NETDEV_CHANGE): %s: link becomes ready\n" The ip process hold the rtnl_lock while network-manager process try to grab this lock in 1 msec loop and every time it fail to grab the lock, the network-manager send additional line to the serial log as seen in the dmesg: "bondib0: link status up for interface ib0, enabling it in 0 ms" So the bond device flood the serial while waiting for the rtnl_lock while ip hold the rtnl_lock while waiting for the serial. Offending stack trace from vmcore is this: PID: 30063 TASK: ffff909c3f675a00 CPU: 7 COMMAND: "ip" #0 [fffffe000013ce38] crash_nmi_callback at ffffffff8e059ba7 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/paravirt.h: 99 #1 [fffffe000013ce48] nmi_handle at ffffffff8e032748 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 137 #2 [fffffe000013cea0] default_do_nmi at ffffffff8e032c96 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 336 #3 [fffffe000013cec8] do_nmi at ffffffff8e032e76 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 521 #4 [fffffe000013cef0] end_repeat_nmi at ffffffff8ea0436f /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 1750 [exception RIP: delay_tsc+51] RIP: ffffffff8e8558f3 RSP: ffff9f63c6c07390 RFLAGS: 00000046 RAX: 0000000016d23977 RBX: ffffffff903fbc00 RCX: 00009b7616d23038 RDX: 0000000000009b76 RSI: 0000000000000007 RDI: 000000000000095a RBP: ffff9f63c6c07390 R8: 00000000fffffffe R9: 0000000000000000 R10: 0000000000000005 R11: 0000000000020503 R12: 000000000000261f R13: 0000000000000020 R14: ffffffff8f96de2f R15: ffffffff903fbc00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 <NMI exception stack> #5 [ffff9f63c6c07390] delay_tsc at ffffffff8e8558f3 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 #6 [ffff9f63c6c07398] __const_udelay at ffffffff8e855838 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/lib/delay.c: 176 #7 [ffff9f63c6c073a8] wait_for_xmitr at ffffffff8e510dcc /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/nmi.h: 126 #8 [ffff9f63c6c073d0] serial8250_console_putchar at ffffffff8e510e6c /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/serial_core.h: 265 #9 [ffff9f63c6c073f0] uart_console_write at ffffffff8e509573 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/serial_core.c: 1886 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_port.c: 3256 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_core.c: 598 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1574 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1766 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1808 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk_safe.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1842 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/ipv6/addrconf.c: 3532 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 95 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1682 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1697 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 6903 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2072 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2624 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4255 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 2433 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4268 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1287 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1877 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 646 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2061 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/file.h: 26 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2102 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/common.c: 295 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 247 RIP: 00007faf75ccafd0 RSP: 00007ffc710a9368 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 000000005c65f66d RCX: 00007faf75ccafd0 RDX: 0000000000000000 RSI: 00007ffc710a93b0 RDI: 0000000000000003 RBP: 00007ffc710a93b0 R8: 0000000000000000 R9: 0000000000000008 R10: 00007ffc710a8f30 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000066a440 R14: 00007ffc710a9458 R15: 00007ffc710a9b88 ORIG_RAX: 000000000000002e CS: 0033 SS: 002b Orabug: 29357838 Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Aron Silverton <aron.silverton@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com>

[ Upstream commit 001e465 ] A network device stack with multiple layers of bonding devices can trigger a false positive lockdep warning. Adding lockdep nest levels fixes this. Update the level on both enslave and unlink, to avoid the following series of events .. ip netns add test ip netns exec test bash ip link set dev lo addr 00:11:22:33:44:55 ip link set dev lo down ip link add dev bond1 type bond ip link add dev bond2 type bond ip link set dev lo master bond1 ip link set dev bond1 master bond2 ip link set dev bond1 nomaster ip link set dev bond2 master bond1 .. from still generating a splat: [ 193.652127] ====================================================== [ 193.658231] WARNING: possible circular locking dependency detected [ 193.664350] 4.20.0 #8 Not tainted [ 193.668310] ------------------------------------------------------ [ 193.674417] ip/15577 is trying to acquire lock: [ 193.678897] 00000000a40e3b69 (&(&bond->stats_lock)->rlock#3/3){+.+.}, at: bond_get_stats+0x58/0x290 [ 193.687851] but task is already holding lock: [ 193.693625] 00000000807b9d9f (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0x58/0x290 [..] [ 193.851092] lock_acquire+0xa7/0x190 [ 193.855138] _raw_spin_lock_nested+0x2d/0x40 [ 193.859878] bond_get_stats+0x58/0x290 [ 193.864093] dev_get_stats+0x5a/0xc0 [ 193.868140] bond_get_stats+0x105/0x290 [ 193.872444] dev_get_stats+0x5a/0xc0 [ 193.876493] rtnl_fill_stats+0x40/0x130 [ 193.880797] rtnl_fill_ifinfo+0x6c5/0xdc0 [ 193.885271] rtmsg_ifinfo_build_skb+0x86/0xe0 [ 193.890091] rtnetlink_event+0x5b/0xa0 [ 193.894320] raw_notifier_call_chain+0x43/0x60 [ 193.899225] netdev_change_features+0x50/0xa0 [ 193.904044] bond_compute_features.isra.46+0x1ab/0x270 [ 193.909640] bond_enslave+0x141d/0x15b0 [ 193.913946] do_set_master+0x89/0xa0 [ 193.918016] do_setlink+0x37c/0xda0 [ 193.921980] __rtnl_newlink+0x499/0x890 [ 193.926281] rtnl_newlink+0x48/0x70 [ 193.930238] rtnetlink_rcv_msg+0x171/0x4b0 [ 193.934801] netlink_rcv_skb+0xd1/0x110 [ 193.939103] rtnetlink_rcv+0x15/0x20 [ 193.943151] netlink_unicast+0x3b5/0x520 [ 193.947544] netlink_sendmsg+0x2fd/0x3f0 [ 193.951942] sock_sendmsg+0x38/0x50 [ 193.955899] ___sys_sendmsg+0x2ba/0x2d0 [ 193.960205] __x64_sys_sendmsg+0xad/0x100 [ 193.964687] do_syscall_64+0x5a/0x460 [ 193.968823] entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 7e2556e ("bonding: avoid lockdep confusion in bond_get_stats()") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

This work around should be reverted when upstream commit (d8b91dd Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip) is available in uek. Issue appear is fixed in upstream tag 4.16.0 . Tag 4.15.0 still has this issue. The lask known commit on the perf topic branch that solve this issue is this: (c19d084 (tag: perf-core-for-mingo-4.16-20180125) perf trace beauty flock: Move to separate object file). Without this commit the perf topic branch has the below issue. With this commit the branch does not have the issue. Issue is that the above commit does not fix the issue on top of upstream tag 4.15.0. So the issue is probably fixed by this commit and some additional commits on the perf topic branch *or/and* on master branch below the point that the perf branch was branched. Also this specific commit is not a fix and the only possible relation to this bug is that it touches the 'flock' code which is used by bash/scripts to synchronize. To find the additional commits via git bisect I need to re-order the commits so that the above commit will be *below* the other commits that solve this issue. To do that I need to know what's the lowest commit that relate to this fix. I do not know and have no way to know that. Attempt to merge the perf topic on top of uek5 produce ~20k commits and tons of merge conflicts as uek5 is way behind the upstream. So can't even know if the topic branch with it's ~270 commits fix this issue for uek5. So I chose to work-around the issue and wait for the upstream topic merge to obsolite this commit. When issue occuer: Serial is flooded with messages: [71266.680745] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.682740] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.685738] bondib0: link status up for interface ib0, enabling it in 0 ms Then panic occur: [71266.695757] INFO: task NetworkManager:5837 blocked for more than 120 seconds. [71266.695759] Not tainted 4.14.35-1902.0.6.el7uek.x86_64 #2 [71266.695760] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [71266.695761] NetworkManager D 0 5837 1 0x00000082 [71266.695765] Call Trace: [71266.695778] __schedule+0x2bc/0x8da [71266.695782] schedule+0x36/0x7c [71266.695785] schedule_preempt_disabled+0xe/0x10 [71266.695788] __mutex_lock.isra.5+0x20c/0x634 [71266.695792] __mutex_lock_slowpath+0x13/0x15 [71266.695794] mutex_lock+0x2f/0x3a [71266.695800] rtnetlink_rcv_msg+0x1d0/0x289 [71266.695806] ? __skb_try_recv_datagram+0xca/0x174 [71266.695809] ? rtnl_calcit.isra.25+0x110/0x103 [71266.695812] netlink_rcv_skb+0xdf/0x111 [71266.695816] rtnetlink_rcv+0x15/0x17 [71266.695818] netlink_unicast+0x18d/0x255 [71266.695820] netlink_sendmsg+0x2df/0x3cc [71266.695825] sock_sendmsg+0x3e/0x4a [71266.695828] ___sys_sendmsg+0x2b5/0x2c6 [71266.695832] ? arch_tlb_finish_mmu+0x1b/0xcb [71266.695835] ? tlb_finish_mmu+0x23/0x30 [71266.695838] ? unmap_region+0xf4/0x12d [71266.695844] ? lockref_put_or_lock+0x44/0x72 [71266.695846] ? __vma_rb_erase+0x10f/0x1f4 [71266.695850] __sys_sendmsg+0x54/0x8d [71266.695854] SyS_sendmsg+0x12/0x1c [71266.695860] do_syscall_64+0x79/0x1ae [71266.695864] entry_SYSCALL_64_after_hwframe+0x151/0x0 [71266.695866] RIP: 0033:0x7f16f2553c5d [71266.695867] RSP: 002b:00007ffff7a493f0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [71266.695870] RAX: ffffffffffffffda RBX: 00005570a5026380 RCX: 00007f16f2553c5d [71266.695874] RDX: 0000000000000000 RSI: 00007ffff7a49420 RDI: 0000000000000007 [71266.695875] RBP: 00007ffff7a49420 R08: 0000000000000001 R09: 0000000000000000 [71266.695876] R10: 0000000000000808 R11: 0000000000000293 R12: 00005570a5026380 [71266.695876] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f16d4004b70 Issue analysis: The ip process is hung in addrconf_notify while trying to print to serial one of the below messages: "ADDRCONF(NETDEV_UP): %s: link is not ready\n" "ADDRCONF(NETDEV_CHANGE): %s: link becomes ready\n" The ip process hold the rtnl_lock while network-manager process try to grab this lock in 1 msec loop and every time it fail to grab the lock, the network-manager send additional line to the serial log as seen in the dmesg: "bondib0: link status up for interface ib0, enabling it in 0 ms" So the bond device flood the serial while waiting for the rtnl_lock while ip hold the rtnl_lock while waiting for the serial. Offending stack trace from vmcore is this: PID: 30063 TASK: ffff909c3f675a00 CPU: 7 COMMAND: "ip" #0 [fffffe000013ce38] crash_nmi_callback at ffffffff8e059ba7 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/paravirt.h: 99 #1 [fffffe000013ce48] nmi_handle at ffffffff8e032748 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 137 #2 [fffffe000013cea0] default_do_nmi at ffffffff8e032c96 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 336 #3 [fffffe000013cec8] do_nmi at ffffffff8e032e76 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 521 #4 [fffffe000013cef0] end_repeat_nmi at ffffffff8ea0436f /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 1750 [exception RIP: delay_tsc+51] RIP: ffffffff8e8558f3 RSP: ffff9f63c6c07390 RFLAGS: 00000046 RAX: 0000000016d23977 RBX: ffffffff903fbc00 RCX: 00009b7616d23038 RDX: 0000000000009b76 RSI: 0000000000000007 RDI: 000000000000095a RBP: ffff9f63c6c07390 R8: 00000000fffffffe R9: 0000000000000000 R10: 0000000000000005 R11: 0000000000020503 R12: 000000000000261f R13: 0000000000000020 R14: ffffffff8f96de2f R15: ffffffff903fbc00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 <NMI exception stack> #5 [ffff9f63c6c07390] delay_tsc at ffffffff8e8558f3 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 #6 [ffff9f63c6c07398] __const_udelay at ffffffff8e855838 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/lib/delay.c: 176 #7 [ffff9f63c6c073a8] wait_for_xmitr at ffffffff8e510dcc /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/nmi.h: 126 #8 [ffff9f63c6c073d0] serial8250_console_putchar at ffffffff8e510e6c /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/serial_core.h: 265 #9 [ffff9f63c6c073f0] uart_console_write at ffffffff8e509573 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/serial_core.c: 1886 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_port.c: 3256 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_core.c: 598 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1574 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1766 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1808 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk_safe.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1842 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/ipv6/addrconf.c: 3532 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 95 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1682 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1697 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 6903 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2072 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2624 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4255 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 2433 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4268 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1287 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1877 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 646 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2061 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/file.h: 26 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2102 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/common.c: 295 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 247 RIP: 00007faf75ccafd0 RSP: 00007ffc710a9368 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 000000005c65f66d RCX: 00007faf75ccafd0 RDX: 0000000000000000 RSI: 00007ffc710a93b0 RDI: 0000000000000003 RBP: 00007ffc710a93b0 R8: 0000000000000000 R9: 0000000000000008 R10: 00007ffc710a8f30 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000066a440 R14: 00007ffc710a9458 R15: 00007ffc710a9b88 ORIG_RAX: 000000000000002e CS: 0033 SS: 002b Orabug: 29016284 Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com>

This work around should be reverted when upstream commit (d8b91dd Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip) is available in uek. Issue appear is fixed in upstream tag 4.16.0 . Tag 4.15.0 still has this issue. The lask known commit on the perf topic branch that solve this issue is this: (c19d084 (tag: perf-core-for-mingo-4.16-20180125) perf trace beauty flock: Move to separate object file). Without this commit the perf topic branch has the below issue. With this commit the branch does not have the issue. Issue is that the above commit does not fix the issue on top of upstream tag 4.15.0. So the issue is probably fixed by this commit and some additional commits on the perf topic branch *or/and* on master branch below the point that the perf branch was branched. Also this specific commit is not a fix and the only possible relation to this bug is that it touches the 'flock' code which is used by bash/scripts to synchronize. To find the additional commits via git bisect I need to re-order the commits so that the above commit will be *below* the other commits that solve this issue. To do that I need to know what's the lowest commit that relate to this fix. I do not know and have no way to know that. Attempt to merge the perf topic on top of uek5 produce ~20k commits and tons of merge conflicts as uek5 is way behind the upstream. So can't even know if the topic branch with it's ~270 commits fix this issue for uek5. So I chose to work-around the issue and wait for the upstream topic merge to obsolite this commit. When issue occuer: Serial is flooded with messages: [71266.680745] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.682740] bondib0: link status up for interface ib0, enabling it in 0 ms [71266.685738] bondib0: link status up for interface ib0, enabling it in 0 ms Then panic occur: [71266.695757] INFO: task NetworkManager:5837 blocked for more than 120 seconds. [71266.695759] Not tainted 4.14.35-1902.0.6.el7uek.x86_64 #2 [71266.695760] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [71266.695761] NetworkManager D 0 5837 1 0x00000082 [71266.695765] Call Trace: [71266.695778] __schedule+0x2bc/0x8da [71266.695782] schedule+0x36/0x7c [71266.695785] schedule_preempt_disabled+0xe/0x10 [71266.695788] __mutex_lock.isra.5+0x20c/0x634 [71266.695792] __mutex_lock_slowpath+0x13/0x15 [71266.695794] mutex_lock+0x2f/0x3a [71266.695800] rtnetlink_rcv_msg+0x1d0/0x289 [71266.695806] ? __skb_try_recv_datagram+0xca/0x174 [71266.695809] ? rtnl_calcit.isra.25+0x110/0x103 [71266.695812] netlink_rcv_skb+0xdf/0x111 [71266.695816] rtnetlink_rcv+0x15/0x17 [71266.695818] netlink_unicast+0x18d/0x255 [71266.695820] netlink_sendmsg+0x2df/0x3cc [71266.695825] sock_sendmsg+0x3e/0x4a [71266.695828] ___sys_sendmsg+0x2b5/0x2c6 [71266.695832] ? arch_tlb_finish_mmu+0x1b/0xcb [71266.695835] ? tlb_finish_mmu+0x23/0x30 [71266.695838] ? unmap_region+0xf4/0x12d [71266.695844] ? lockref_put_or_lock+0x44/0x72 [71266.695846] ? __vma_rb_erase+0x10f/0x1f4 [71266.695850] __sys_sendmsg+0x54/0x8d [71266.695854] SyS_sendmsg+0x12/0x1c [71266.695860] do_syscall_64+0x79/0x1ae [71266.695864] entry_SYSCALL_64_after_hwframe+0x151/0x0 [71266.695866] RIP: 0033:0x7f16f2553c5d [71266.695867] RSP: 002b:00007ffff7a493f0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [71266.695870] RAX: ffffffffffffffda RBX: 00005570a5026380 RCX: 00007f16f2553c5d [71266.695874] RDX: 0000000000000000 RSI: 00007ffff7a49420 RDI: 0000000000000007 [71266.695875] RBP: 00007ffff7a49420 R08: 0000000000000001 R09: 0000000000000000 [71266.695876] R10: 0000000000000808 R11: 0000000000000293 R12: 00005570a5026380 [71266.695876] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f16d4004b70 Issue analysis: The ip process is hung in addrconf_notify while trying to print to serial one of the below messages: "ADDRCONF(NETDEV_UP): %s: link is not ready\n" "ADDRCONF(NETDEV_CHANGE): %s: link becomes ready\n" The ip process hold the rtnl_lock while network-manager process try to grab this lock in 1 msec loop and every time it fail to grab the lock, the network-manager send additional line to the serial log as seen in the dmesg: "bondib0: link status up for interface ib0, enabling it in 0 ms" So the bond device flood the serial while waiting for the rtnl_lock while ip hold the rtnl_lock while waiting for the serial. Offending stack trace from vmcore is this: PID: 30063 TASK: ffff909c3f675a00 CPU: 7 COMMAND: "ip" #0 [fffffe000013ce38] crash_nmi_callback at ffffffff8e059ba7 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/paravirt.h: 99 #1 [fffffe000013ce48] nmi_handle at ffffffff8e032748 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 137 #2 [fffffe000013cea0] default_do_nmi at ffffffff8e032c96 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 336 #3 [fffffe000013cec8] do_nmi at ffffffff8e032e76 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/kernel/nmi.c: 521 #4 [fffffe000013cef0] end_repeat_nmi at ffffffff8ea0436f /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 1750 [exception RIP: delay_tsc+51] RIP: ffffffff8e8558f3 RSP: ffff9f63c6c07390 RFLAGS: 00000046 RAX: 0000000016d23977 RBX: ffffffff903fbc00 RCX: 00009b7616d23038 RDX: 0000000000009b76 RSI: 0000000000000007 RDI: 000000000000095a RBP: ffff9f63c6c07390 R8: 00000000fffffffe R9: 0000000000000000 R10: 0000000000000005 R11: 0000000000020503 R12: 000000000000261f R13: 0000000000000020 R14: ffffffff8f96de2f R15: ffffffff903fbc00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 <NMI exception stack> #5 [ffff9f63c6c07390] delay_tsc at ffffffff8e8558f3 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/include/asm/msr.h: 193 #6 [ffff9f63c6c07398] __const_udelay at ffffffff8e855838 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/lib/delay.c: 176 #7 [ffff9f63c6c073a8] wait_for_xmitr at ffffffff8e510dcc /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/nmi.h: 126 #8 [ffff9f63c6c073d0] serial8250_console_putchar at ffffffff8e510e6c /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/serial_core.h: 265 #9 [ffff9f63c6c073f0] uart_console_write at ffffffff8e509573 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/serial_core.c: 1886 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_port.c: 3256 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/drivers/tty/serial/8250/8250_core.c: 598 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1574 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1766 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1808 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk_safe.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/printk/printk.c: 1842 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/ipv6/addrconf.c: 3532 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 95 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/kernel/notifier.c: 402 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1682 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 1697 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/dev.c: 6903 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2072 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 2624 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4255 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 2433 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/core/rtnetlink.c: 4268 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1287 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/netlink/af_netlink.c: 1877 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 646 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2061 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/include/linux/file.h: 26 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/net/socket.c: 2102 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/common.c: 295 /usr/src/debug/kernel-4.14.35/linux-4.14.35-1902.0.6.el7uek/arch/x86/entry/entry_64.S: 247 RIP: 00007faf75ccafd0 RSP: 00007ffc710a9368 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 000000005c65f66d RCX: 00007faf75ccafd0 RDX: 0000000000000000 RSI: 00007ffc710a93b0 RDI: 0000000000000003 RBP: 00007ffc710a93b0 R8: 0000000000000000 R9: 0000000000000008 R10: 00007ffc710a8f30 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000066a440 R14: 00007ffc710a9458 R15: 00007ffc710a9b88 ORIG_RAX: 000000000000002e CS: 0033 SS: 002b Orabug: 29631452 Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Aron Silverton <aron.silverton@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com>

…_map [ Upstream commit 39df730 ] Detected via gcc's ASan: Direct leak of 2048 byte(s) in 64 object(s) allocated from: 6 #0 0x7f606512e370 in __interceptor_realloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xee370) 7 #1 0x556b0f1d7ddd in thread_map__realloc util/thread_map.c:43 8 #2 0x556b0f1d84c7 in thread_map__new_by_tid util/thread_map.c:85 9 #3 0x556b0f0e045e in is_event_supported util/parse-events.c:2250 10 #4 0x556b0f0e1aa1 in print_hwcache_events util/parse-events.c:2382 11 #5 0x556b0f0e3231 in print_events util/parse-events.c:2514 12 #6 0x556b0ee0a66e in cmd_list /home/changbin/work/linux/tools/perf/builtin-list.c:58 13 #7 0x556b0f01e0ae in run_builtin /home/changbin/work/linux/tools/perf/perf.c:302 14 #8 0x556b0f01e859 in handle_internal_command /home/changbin/work/linux/tools/perf/perf.c:354 15 #9 0x556b0f01edc8 in run_argv /home/changbin/work/linux/tools/perf/perf.c:398 16 #10 0x556b0f01f71f in main /home/changbin/work/linux/tools/perf/perf.c:520 17 #11 0x7f6062ccf09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) Signed-off-by: Changbin Du <changbin.du@gmail.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Fixes: 8989605 ("perf tools: Do not put a variable sized type not at the end of a struct") Link: http://lkml.kernel.org/r/20190316080556.3075-3-changbin.du@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit ffbf4fb ] When CONFIG_DEBUG_BUGVERBOSE=n, we fail to add necessary padding bytes to bug_table entries, and as a result the last entry in a bug table will be ignored, potentially leading to an unexpected panic(). All prior entries in the table will be handled correctly. The arm64 ABI requires that struct fields of up to 8 bytes are naturally-aligned, with padding added within a struct such that struct are suitably aligned within arrays. When CONFIG_DEBUG_BUGVERPOSE=y, the layout of a bug_entry is: struct bug_entry { signed int bug_addr_disp; // 4 bytes signed int file_disp; // 4 bytes unsigned short line; // 2 bytes unsigned short flags; // 2 bytes } ... with 12 bytes total, requiring 4-byte alignment. When CONFIG_DEBUG_BUGVERBOSE=n, the layout of a bug_entry is: struct bug_entry { signed int bug_addr_disp; // 4 bytes unsigned short flags; // 2 bytes < implicit padding > // 2 bytes } ... with 8 bytes total, with 6 bytes of data and 2 bytes of trailing padding, requiring 4-byte alginment. When we create a bug_entry in assembly, we align the start of the entry to 4 bytes, which implicitly handles padding for any prior entries. However, we do not align the end of the entry, and so when CONFIG_DEBUG_BUGVERBOSE=n, the final entry lacks the trailing padding bytes. For the main kernel image this is not a problem as find_bug() doesn't depend on the trailing padding bytes when searching for entries: for (bug = __start___bug_table; bug < __stop___bug_table; ++bug) if (bugaddr == bug_addr(bug)) return bug; However for modules, module_bug_finalize() depends on the trailing bytes when calculating the number of entries: mod->num_bugs = sechdrs[i].sh_size / sizeof(struct bug_entry); ... and as the last bug_entry lacks the necessary padding bytes, this entry will not be counted, e.g. in the case of a single entry: sechdrs[i].sh_size == 6 sizeof(struct bug_entry) == 8; sechdrs[i].sh_size / sizeof(struct bug_entry) == 0; Consequently module_find_bug() will miss the last bug_entry when it does: for (i = 0; i < mod->num_bugs; ++i, ++bug) if (bugaddr == bug_addr(bug)) goto out; ... which can lead to a kenrel panic due to an unhandled bug. This can be demonstrated with the following module: static int __init buginit(void) { WARN(1, "hello\n"); return 0; } static void __exit bugexit(void) { } module_init(buginit); module_exit(bugexit); MODULE_LICENSE("GPL"); ... which will trigger a kernel panic when loaded: ------------[ cut here ]------------ hello Unexpected kernel BRK exception at EL1 Internal error: BRK handler: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: hello(O+) CPU: 0 PID: 50 Comm: insmod Tainted: G O 6.9.1 #8 Hardware name: linux,dummy-virt (DT) pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : buginit+0x18/0x1000 [hello] lr : buginit+0x18/0x1000 [hello] sp : ffff800080533ae0 x29: ffff800080533ae0 x28: 0000000000000000 x27: 0000000000000000 x26: ffffaba8c4e70510 x25: ffff800080533c30 x24: ffffaba8c4a28a58 x23: 0000000000000000 x22: 0000000000000000 x21: ffff3947c0eab3c0 x20: ffffaba8c4e3f000 x19: ffffaba846464000 x18: 0000000000000006 x17: 0000000000000000 x16: ffffaba8c2492834 x15: 0720072007200720 x14: 0720072007200720 x13: ffffaba8c49b27c8 x12: 0000000000000312 x11: 0000000000000106 x10: ffffaba8c4a0a7c8 x9 : ffffaba8c49b27c8 x8 : 00000000ffffefff x7 : ffffaba8c4a0a7c8 x6 : 80000000fffff000 x5 : 0000000000000107 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff3947c0eab3c0 Call trace: buginit+0x18/0x1000 [hello] do_one_initcall+0x80/0x1c8 do_init_module+0x60/0x218 load_module+0x1ba4/0x1d70 __do_sys_init_module+0x198/0x1d0 __arm64_sys_init_module+0x1c/0x28 invoke_syscall+0x48/0x114 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x34/0xd8 el0t_64_sync_handler+0x120/0x12c el0t_64_sync+0x190/0x194 Code: d0ffffe0 910003fd 91000000 9400000b (d4210000) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: BRK handler: Fatal exception Fix this by always aligning the end of a bug_entry to 4 bytes, which is correct regardless of CONFIG_DEBUG_BUGVERBOSE. Fixes: 9fb7410 ("arm64/BUG: Use BRK instruction for generic BUG traps") Signed-off-by: Yuanbin Xie <xieyuanbin1@huawei.com> Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/1716212077-43826-1-git-send-email-xiaojiangfeng@huawei.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit f221bd58db0f6ca087ac0392284f6bce21f4f8ea) Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> (cherry picked from commit 0374a08aa5206c6200a88e0a122500dfd1d195f0) Signed-off-by: Yifei Liu <yifei.l.liu@oracle.com>

commit 1723f04 upstream. Currently if we request a feature that is not set in the Kernel config we fail silently and return all the available features. However, the man page indicates we should return an EINVAL. We need to fix this issue since we can end up with a Kernel warning should a program request the feature UFFD_FEATURE_WP_UNPOPULATED on a kernel with the config not set with this feature. [ 200.812896] WARNING: CPU: 91 PID: 13634 at mm/memory.c:1660 zap_pte_range+0x43d/0x660 [ 200.820738] Modules linked in: [ 200.869387] CPU: 91 PID: 13634 Comm: userfaultfd Kdump: loaded Not tainted 6.9.0-rc5+ #8 [ 200.877477] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.7.3 03/30/2022 [ 200.885052] RIP: 0010:zap_pte_range+0x43d/0x660 Link: https://lkml.kernel.org/r/20240626130513.120193-1-audra@redhat.com Fixes: e06f1e1 ("userfaultfd: wp: enabled write protection in userfaultfd API") Signed-off-by: Audra Mitchell <audra@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 519547760f16eae7803d2658d9524bc5ba7a20a7) Signed-off-by: Vijayendra Suman <vijayendra.suman@oracle.com>

[ Upstream commit a699781 ] A sysfs reader can race with a device reset or removal, attempting to read device state when the device is not actually present. eg: [exception RIP: qed_get_current_link+17] #8 [ffffb9e4f2907c48] qede_get_link_ksettings at ffffffffc07a994a [qede] #9 [ffffb9e4f2907cd8] __rh_call_get_link_ksettings at ffffffff992b01a3 #10 [ffffb9e4f2907d38] __ethtool_get_link_ksettings at ffffffff992b04e4 #11 [ffffb9e4f2907d90] duplex_show at ffffffff99260300 #12 [ffffb9e4f2907e38] dev_attr_show at ffffffff9905a01c #13 [ffffb9e4f2907e50] sysfs_kf_seq_show at ffffffff98e0145b #14 [ffffb9e4f2907e68] seq_read at ffffffff98d902e3 #15 [ffffb9e4f2907ec8] vfs_read at ffffffff98d657d1 #16 [ffffb9e4f2907f00] ksys_read at ffffffff98d65c3f #17 [ffffb9e4f2907f38] do_syscall_64 at ffffffff98a052fb crash> struct net_device.state ffff9a9d21336000 state = 5, state 5 is __LINK_STATE_START (0b1) and __LINK_STATE_NOCARRIER (0b100). The device is not present, note lack of __LINK_STATE_PRESENT (0b10). This is the same sort of panic as observed in commit 4224cfd ("net-sysfs: add check for netdevice being present to speed_show"). There are many other callers of __ethtool_get_link_ksettings() which don't have a device presence check. Move this check into ethtool to protect all callers. Fixes: d519e17 ("net: export device speed and duplex via sysfs") Fixes: 4224cfd ("net-sysfs: add check for netdevice being present to speed_show") Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com> Link: https://patch.msgid.link/8bae218864beaa44ed01628140475b9bf641c5b0.1724393671.git.jamie.bainbridge@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit ec7b4f7f644018ac293cb1b02528a40a32917e62) Signed-off-by: Sherry Yang <sherry.yang@oracle.com>

Scenario: 1. Port down and do fail over 2. Ap do rds_bind syscall PID: 47039 TASK: ffff89887e2fe640 CPU: 47 COMMAND: "kworker/u:6" #0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9 #1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3 #2 [ffff898e35f15b30] oops_end at ffffffff8150f518 #3 [ffff898e35f15b60] no_context at ffffffff8104854c #4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675 #5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3 #6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8 #7 [ffff898e35f15d10] page_fault at ffffffff8150ea95 [exception RIP: unknown or invalid address] RIP: 0000000000000000 RSP: ffff898e35f15dc8 RFLAGS: 00010282 RAX: 00000000fffffffe RBX: ffff889b77f6fc00 RCX:ffffffff81c99d88 RDX: 0000000000000000 RSI: ffff896019ee08e8 RDI:ffff889b77f6fc00 RBP: ffff898e35f15df0 R8: ffff896019ee08c8 R9:0000000000000000 R10: 0000000000000400 R11: 0000000000000000 R12:ffff896019ee08c0 R13: ffff889b77f6fe68 R14: ffffffff81c99d80 R15: ffffffffa022a1e0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm] #9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6 #10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0 #11 [ffff898e35f15ee8] kthread at ffffffff81090fe6 PID: 45659 TASK: ffff880d313d2500 CPU: 31 COMMAND: "oracle_45659_ap" #0 [ffff881024ccfc98] __schedule at ffffffff8150bac4 #1 [ffff881024ccfd40] schedule at ffffffff8150c2cf #2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7 #3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb #4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm] #5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma] #6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds] #7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds] #8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670 PID: 45659 PID: 47039 rds_ib_laddr_check /* create id_priv with a null event_handler */ rdma_create_id rdma_bind_addr cma_acquire_dev /* add id_priv to cma_dev->id_list */ cma_attach_to_dev cma_ndev_work_handler /* event_hanlder is null */ id_priv->id.event_handler Orabug: 27530931 Signed-off-by: Guanglei Li <guanglei.li@oracle.com> Signed-off-by: Honglei Wang <honglei.wang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Doug Ledford <dledford@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 2c0aa08) Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 39e0939) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit 7d342f8) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

The customer hit this crash few times. PID: 31556 TASK: ffff880f823caa00 CPU: 1 COMMAND: "cellsrv" #0 [ffff880f823db850] machine_kexec at ffffffff8105d93c #1 [ffff880f823db8b0] crash_kexec at ffffffff811103b3 #2 [ffff880f823db980] oops_end at ffffffff8101a788 #3 [ffff880f823db9b0] no_context at ffffffff8106b9cf #4 [ffff880f823dba20] __bad_area_nosemaphore at ffffffff8106bc9d #5 [ffff880f823dba70] bad_area at ffffffff8106be97 #6 [ffff880f823dbaa0] __do_page_fault at ffffffff8106c71e #7 [ffff880f823dbb00] do_page_fault at ffffffff8106c81f #8 [ffff880f823dbb40] page_fault at ffffffff816b5a9f [exception RIP: rds_ib_inc_copy_to_user+104] RIP: ffffffffa04607b8 RSP: ffff880f823dbbf8 RFLAGS: 00010287 RAX: 0000000000000340 RBX: 0000000000001000 RCX: 0000000000004000 RDX: 0000000000001000 RSI: ffff88176cea2000 RDI: ffff8817d291f520 RBP: ffff880f823dbc48 R8: 0000000000001340 R9: 0000000000001000 R10: 0000000000001200 R11: ffff880f823dc000 R12: ffff880f823dbed0 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000001000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffff880f823dbc50] rds_recvmsg at ffffffffa041d837 [rds] int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to) ... ... ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item); len = be32_to_cpu(inc->i_hdr.h_len); sg = frag->f_sg; while (iov_iter_count(to) && copied < len) { to_copy = min_t(unsigned long, iov_iter_count(to), sg->length - frag_off); ... sg is NULL and it crashes accessing sg->length above. The cause looks like is due to ic->i_frag_sz returning incorrect value. 16KB when 4KB was expected. if (copied % ic->i_frag_sz == 0) { frag = list_entry(frag->f_item.next, struct rds_page_frag, f_item); frag_off = 0; sg = frag->f_sg; } The other end is using 4KB RDS fragsize (Solaris Super Cluster). This end is UEK4 (4.1.12-94.8.4.el6uek.x86_64). The message being copied arrived over 4KB RDS frag size connection. But during the above check ic->i_frag_sz is 16KB. This can happen during a reconnect at the connection setup phase. We start off with ic->i_frag_sz as 16KB. Then settle down at 4KB. Failing this check if (copied % ic->i_frag_sz == 0) { can result in sg not getting set correctly. Say, "copied" = 4KB but ic->i_frag_sz is 16KB when it should be 4KB. During race condition with a reconnect, ic->i_frag_sz can be 16KB even though once the connection is set up it settled down to 4KB. It can change from 4KB to 16KB and back to 4KB during connection setup due to reconnect. We started seeing this crash after bug 26848749. But prior to that the same scenario could result in data copied to user from incorrect "sg" resulting in data corruption. Orabug: 28748008 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 14858a3) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit e86878f) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

…error The sequence that leads to this state is as follows. 1) First we see CQ error logged. Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core 0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2 vendor_error_syndrome=0x0 2) That is followed by the drop of the associated RDS connection. Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection <192.168.54.43,192.168.54.1,0> dropped due to 'qp event' 3) We don't get the WR_FLUSH_ERRs for the posted receive buffers after that. 4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection. crash64> bt 62577 PID: 62577 TASK: ffff88143f045400 CPU: 4 COMMAND: "kworker/u224:1" #0 [ffff8813663bbb58] __schedule at ffffffff816ab68b #1 [ffff8813663bbbb0] schedule at ffffffff816abca7 #2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71 #3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma] #4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds] #5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds] #6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1 #7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b #8 [ffff8813663bbec0] kthread at ffffffff810a304b #9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752 crash64> It was stuck here in rds_ib_conn_shutdown for ever: /* quiesce tx and rx completion before tearing down */ while (!wait_event_timeout(rds_ib_ring_empty_wait, rds_ib_ring_empty(&ic->i_recv_ring) && (atomic_read(&ic->i_signaled_sends) == 0), msecs_to_jiffies(5000))) { /* Try to reap pending RX completions every 5 secs */ if (!rds_ib_ring_empty(&ic->i_recv_ring)) { spin_lock_bh(&ic->i_rx_lock); rds_ib_rx(ic); spin_unlock_bh(&ic->i_rx_lock); } } The recv ring was not empty. w_alloc_ptr = 560 w_free_ptr = 256 This is what Mellanox had to say: When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will generate Async event to notify this error, also the QPs that tries to access this CQ will be put to error state but will not be flushed since we must not post CQEs to a broken CQ. The QP that tries to access will also issue an Async catas event. In summary we cannot wait for any more WR_FLUSH_ERRs in that state. Orabug: 29180452 Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Orabug: 33590097 UEK6 => UEK7 (cherry picked from commit 964cad6) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Orabug: 33590087 UEK7 => LUCI (cherry picked from commit e40c8e4) cherry-pick-repo=UEK/production/linux-uek.git Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com>

One of our customers reported the following stack. crash-7.3.0> bt PID: 250515 TASK: ffff888189482f80 CPU: 1 COMMAND: "vmbackup" #0 [ffffc90025017878] die at ffffffff81033c22 #1 [ffffc900250178a8] do_trap at ffffffff81030990 #2 [ffffc900250178f8] do_error_trap at ffffffff810311d7 #3 [ffffc900250179c0] do_invalid_op at ffffffff81031310 #4 [ffffc900250179d0] invalid_op at ffffffff81a01f2a [exception RIP: ocfs2_truncate_rec+1914] RIP: ffffffffc1e73b4a RSP: ffffc90025017a80 RFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000053a75 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8882d385be08 RDI: ffff8882d385be08 RBP: ffffc90025017b10 R8: 0000000000000000 R9: 0000000000005900 R10: 0000000000000001 R11: 0000000000aaaaaa R12: 0000000000000001 R13: ffff88829e5a9900 R14: ffffc90025017cf0 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: e030 SS: e02b #5 [ffffc90025017b18] ocfs2_remove_extent at ffffffffc1e73e6c [ocfs2] #6 [ffffc90025017bc8] ocfs2_remove_btree_range at ffffffffc1e745f2 [ocfs2] #7 [ffffc90025017c60] ocfs2_commit_truncate at ffffffffc1e75b1f [ocfs2] #8 [ffffc90025017d68] __dta_ocfs2_wipe_inode_606 at ffffffffc1e9a3e0 [ocfs2] #9 [ffffc90025017dd8] ocfs2_evict_inode at ffffffffc1e9ac10 [ocfs2] RIP: 00007f9b26ec8307 RSP: 00007ffc5a193f68 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000ddd0a0 RCX: 00007f9b26ec8307 RDX: 0000000000000001 RSI: 00007f9b2719e770 RDI: 0000000001010400 RBP: 0000000001263d80 R8: 0000000000000000 R9: 00000000012146a0 R10: 000000000000000d R11: 0000000000000246 R12: 0000000000ddd0a0 R13: 00007f9b27ba9595 R14: 00007f9b27ca4a50 R15: 00000000ffffffff ORIG_RAX: 0000000000000057 CS: 0033 SS: 002b crash-7.3.0> This crash resulted due to invalid extent record selected for truncate. At the top of the function ocfs2_truncate_rec(), the code checks if the first extent record at the leaf extent list corresponding to the input path is still empty. In that case the tree is rotated left to get rid of the empty extent record but this rotation did not happen. But the function ocfs2_truncate_rec() assumes that the top level call to ocfs2_rotate_tree_left() to get rid of the empty extent always succeeds and hence it decrements the input "index" value. This results in selection of a wrong record for truncate that causes to hit a call to BUG() with the message "Owner %llu: Invalid record truncate: (%u, %u) ". The stack above is the panic stack caused due to hitting BUG(). Though the function ocfs2_rotate_tree_left() was intended to get rid of the first empty record in the extent block, it did not call the function ocfs2_rotate_rightmost_leaf_left() as it did not find h_next_leaf_blk in the extentleaf block to be zero, instead, it proceeded to call __ocfs2_rotate_tree_left(). However the input "index" value was indeed pointing to the last extent record in the leaf block. The macro path_leaf_bh() was returning rightmost extent block as per the tree-depth. and the function ocfs2_find_cpos_for_right_leaf() also found out that the extent block in question is indeed the rightmost and hence there is nothing to rotate at the last extent record pointed by the input "index" value. Hence the extent tree in the leaf block was not totated at all. Hence, the real reason for the above panic is that the value of the field h_next_leaf_blk in the right most leaf block was non-zero that caused the tree not to rotate left resulting in selection of invalid record for truncate. The reason why h_next_leaf_blk was not cleared for the last extent block is still not known and the code changes here is a workaround to avoid the panic by verifying that the extent block in question is indeed the rightmost leaf block in the tree and then correcting the invalid h_next_leaf_blk value. These changes have been verified by the customer by running the provided rpm in their env. Orabug: 34393593 Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>

Add a check to mlx5e_xmit() for shorter frames. A corrupted/malformed packet, with shorter length can eventually cause system panic further down in the code path. Avoid it by validating the length and dropping it at the earliest. Following is seen in our env with shorter skb->len crash> bt PID: 76981 TASK: ff19828cfe508000 CPU: 106 COMMAND: "vhost-76942" #0 [ff2d20159b39f2c8] machine_kexec at ffffffffad884801 #1 [ff2d20159b39f328] __crash_kexec at ffffffffad976142 #2 [ff2d20159b39f3f8] panic at ffffffffad8b3640 #3 [ff2d20159b39f4a0] no_context at ffffffffad8954e1 #4 [ff2d20159b39f518] __bad_area_nosemaphore at ffffffffad8958de #5 [ff2d20159b39f578] bad_area_nosemaphore at ffffffffad895a96 #6 [ff2d20159b39f588] do_kern_addr_fault at ffffffffad89688e #7 [ff2d20159b39f5b0] __do_page_fault at ffffffffad896b30 #8 [ff2d20159b39f618] do_page_fault at ffffffffad896db6 #9 [ff2d20159b39f650] page_fault at ffffffffae402acd [exception RIP: memcpy_erms+6] RIP: ffffffffae261ab6 RSP: ff2d20159b39f700 RFLAGS: 00010293 RAX: ff198291741ecf2e RBX: ff19828e70d6a100 RCX: fffffffffea1af2b RDX: fffffffffffffffd RSI: ff19828eba6d7e5e RDI: ff198291757d2000 RBP: ff2d20159b39f760 R8: ff198291741ecf00 R9: 000000000000037c R10: 000000000000003c R11: ff19828ffe953940 R12: ff198291741ecf20 R13: ff198267dcb1b600 R14: ff19828eeebb09c0 R15: ff198291741ecf00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ff2d20159b39f700] mlx5e_sq_xmit_wqe at ffffffffc05c162e [mlx5_core] #11 [ff2d20159b39f768] mlx5e_xmit at ffffffffc05c1ca3 [mlx5_core] #12 [ff2d20159b39f800] dev_hard_start_xmit at ffffffffae083766 #13 [ff2d20159b39f860] sch_direct_xmit at ffffffffae0e2564 #14 [ff2d20159b39f8b0] __qdisc_run at ffffffffae0e294e #15 [ff2d20159b39f928] __dev_queue_xmit at ffffffffae083eee #16 [ff2d20159b39f9a8] dev_queue_xmit at ffffffffae084370 #17 [ff2d20159b39f9b8] vlan_dev_hard_start_xmit at ffffffffc2fb6fec [8021q] #18 [ff2d20159b39f9d8] dev_hard_start_xmit at ffffffffae083766 #19 [ff2d20159b39fa38] __dev_queue_xmit at ffffffffae08416a #20 [ff2d20159b39fab8] dev_queue_xmit_accel at ffffffffae08438e #21 [ff2d20159b39fac8] macvlan_start_xmit at ffffffffc2fc18d9 [macvlan] #22 [ff2d20159b39faf0] dev_hard_start_xmit at ffffffffae083766 #23 [ff2d20159b39fb50] sch_direct_xmit at ffffffffae0e2564 #24 [ff2d20159b39fba0] __qdisc_run at ffffffffae0e294e #25 [ff2d20159b39fc18] __dev_queue_xmit at ffffffffae083c81 #26 [ff2d20159b39fc90] dev_queue_xmit at ffffffffae084370 #27 [ff2d20159b39fca0] tap_sendmsg at ffffffffc07206ed [tap] #28 [ff2d20159b39fd20] vhost_tx_batch at ffffffffc2fd6590 [vhost_net] #29 [ff2d20159b39fd68] handle_tx_copy at ffffffffc2fd70f3 [vhost_net] #30 [ff2d20159b39fe80] handle_tx at ffffffffc2fd7651 [vhost_net] #31 [ff2d20159b39feb0] handle_tx_kick at ffffffffc2fd76b5 [vhost_net] #32 [ff2d20159b39fec0] vhost_worker at ffffffffc12a5be8 [vhost] #33 [ff2d20159b39ff08] kthread at ffffffffad8dbfe5 #34 [ff2d20159b39ff50] ret_from_fork at ffffffffae400364 This change was discussed with Nvidia and they are in agreement. Orabug: 36879156 CVE: CVE-2024-41090 CVE: CVE-2024-41091 Fixes: e4cf27b ("net/mlx5e: Re-eanble client vlan TX acceleration") Reported-and-tested-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> (cherry picked from commit 0dd4b99) Orabug: 36879126 CVE: CVE-2024-41090 CVE: CVE-2024-41091 Signed-off-by: Harshvardhan Jha <harshvardhan.j.jha@oracle.com> Reviewed-by: Vijayendra Suman <vijayendra.suman@oracle.com>

…ress ACPICA commit c14708336bd18552b28643575de7b5beb9b864e9 Before this change we see the following UBSAN stack trace in Fuchsia: #0 0x0000220c98288eba in acpi_rs_get_address_common(struct acpi_resource*, union aml_resource*) ../../third_party/acpica/source/components/resources/rsaddr.c:331 <platform-bus-x86.so>+0x8f6eba #1.2 0x000023625f46077f in ubsan_get_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:41 <libclang_rt.asan.so>+0x3d77f #1.1 0x000023625f46077f in maybe_print_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:51 <libclang_rt.asan.so>+0x3d77f #1 0x000023625f46077f in ~scoped_report() compiler-rt/lib/ubsan/ubsan_diag.cpp:387 <libclang_rt.asan.so>+0x3d77f #2 0x000023625f461385 in handletype_mismatch_impl() compiler-rt/lib/ubsan/ubsan_handlers.cpp:137 <libclang_rt.asan.so>+0x3e385 #3 0x000023625f460ead in compiler-rt/lib/ubsan/ubsan_handlers.cpp:142 <libclang_rt.asan.so>+0x3dead #4 0x0000220c98288eba in acpi_rs_get_address_common(struct acpi_resource*, union aml_resource*) ../../third_party/acpica/source/components/resources/rsaddr.c:331 <platform-bus-x86.so>+0x8f6eba #5 0x0000220c9828ea57 in acpi_rs_convert_aml_to_resource(struct acpi_resource*, union aml_resource*, struct acpi_rsconvert_info*) ../../third_party/acpica/source/components/resources/rsmisc.c:352 <platform-bus-x86.so>+0x8fca57 #6 0x0000220c9828992c in acpi_rs_convert_aml_to_resources(u8*, u32, u32, u8, void**) ../../third_party/acpica/source/components/resources/rslist.c:132 <platform-bus-x86.so>+0x8f792c #7 0x0000220c982d1cfc in acpi_ut_walk_aml_resources(struct acpi_walk_state*, u8*, acpi_size, acpi_walk_aml_callback, void**) ../../third_party/acpica/source/components/utilities/utresrc.c:234 <platform-bus-x86.so>+0x93fcfc #8 0x0000220c98281e46 in acpi_rs_create_resource_list(union acpi_operand_object*, struct acpi_buffer*) ../../third_party/acpica/source/components/resources/rscreate.c:199 <platform-bus-x86.so>+0x8efe46 #9 0x0000220c98293b51 in acpi_rs_get_method_data(acpi_handle, const char*, struct acpi_buffer*) ../../third_party/acpica/source/components/resources/rsutils.c:770 <platform-bus-x86.so>+0x901b51 #10 0x0000220c9829438d in acpi_walk_resources(acpi_handle, char*, acpi_walk_resource_callback, void*) ../../third_party/acpica/source/components/resources/rsxface.c:731 <platform-bus-x86.so>+0x90238d #11 0x0000220c97db272b in acpi::acpi_impl::walk_resources(acpi::acpi_impl*, acpi_handle, const char*, acpi::Acpi::resources_callable) ../../src/devices/board/lib/acpi/acpi-impl.cc:41 <platform-bus-x86.so>+0x42072b #12 0x0000220c97dcec59 in acpi::device_builder::gather_resources(acpi::device_builder*, acpi::Acpi*, fidl::any_arena&, acpi::Manager*, acpi::device_builder::gather_resources_callback) ../../src/devices/board/lib/acpi/device-builder.cc:52 <platform-bus-x86.so>+0x43cc59 #13 0x0000220c97f94a3f in acpi::Manager::configure_discovered_devices(acpi::Manager*) ../../src/devices/board/lib/acpi/manager.cc:75 <platform-bus-x86.so>+0x602a3f #14 0x0000220c97c642c7 in publish_acpi_devices(acpi::Manager*, zx_device_t*, zx_device_t*) ../../src/devices/board/drivers/x86/acpi-nswalk.cc:102 <platform-bus-x86.so>+0x2d22c7 #15 0x0000220c97caf3e6 in x86::X86::do_init(x86::X86*) ../../src/devices/board/drivers/x86/x86.cc:65 <platform-bus-x86.so>+0x31d3e6 #16 0x0000220c97cd72ae in λ(x86::X86::ddk_init::(anon class)*) ../../src/devices/board/drivers/x86/x86.cc:82 <platform-bus-x86.so>+0x3452ae #17 0x0000220c97cd7223 in fit::internal::target<(lambda at../../src/devices/board/drivers/x86/x86.cc:81:19), false, false, void>::invoke(void*) ../../sdk/lib/fit/include/lib/fit/internal/function.h:181 <platform-bus-x86.so>+0x345223 #18 0x0000220c97f48eb0 in fit::internal::function_base<16UL, false, void()>::invoke(const fit::internal::function_base<16UL, false, void ()>*) ../../sdk/lib/fit/include/lib/fit/internal/function.h:505 <platform-bus-x86.so>+0x5b6eb0 #19 0x0000220c97f48d2a in fit::function_impl<16UL, false, void()>::operator()(const fit::function_impl<16UL, false, void ()>*) ../../sdk/lib/fit/include/lib/fit/function.h:300 <platform-bus-x86.so>+0x5b6d2a #20 0x0000220c982f9245 in async::internal::retained_task::Handler(async_dispatcher_t*, async_task_t*, zx_status_t) ../../zircon/system/ulib/async/task.cc:25 <platform-bus-x86.so>+0x967245 #21 0x000022e2aa1cd91e in λ(const driver_runtime::Dispatcher::post_task::(anon class)*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, zx_status_t) ../../src/devices/bin/driver_runtime/dispatcher.cc:715 <libdriver_runtime.so>+0xed91e #22 0x000022e2aa1cd621 in fit::internal::target<(lambda at../../src/devices/bin/driver_runtime/dispatcher.cc:714:7), true, false, void, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int>::invoke(void*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/internal/function.h:128 <libdriver_runtime.so>+0xed621 #23 0x000022e2aa1a8482 in fit::internal::function_base<24UL, true, void(std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int)>::invoke(const fit::internal::function_base<24UL, true, void (std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int)>*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/internal/function.h:505 <libdriver_runtime.so>+0xc8482 #24 0x000022e2aa1a80f8 in fit::callback_impl<24UL, true, void(std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int)>::operator()(fit::callback_impl<24UL, true, void (std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int)>*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/function.h:451 <libdriver_runtime.so>+0xc80f8 #25 0x000022e2aa17fc76 in driver_runtime::callback_request::Call(driver_runtime::callback_request*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, zx_status_t) ../../src/devices/bin/driver_runtime/callback_request.h:67 <libdriver_runtime.so>+0x9fc76 #26 0x000022e2aa18c7ef in driver_runtime::Dispatcher::dispatch_callback(driver_runtime::Dispatcher*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >) ../../src/devices/bin/driver_runtime/dispatcher.cc:1093 <libdriver_runtime.so>+0xac7ef #27 0x000022e2aa18fd67 in driver_runtime::Dispatcher::dispatch_callbacks(driver_runtime::Dispatcher*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.cc:1169 <libdriver_runtime.so>+0xafd67 #28 0x000022e2aa1bc9a2 in λ(const driver_runtime::Dispatcher::create_with_adder::(anon class)*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.cc:338 <libdriver_runtime.so>+0xdc9a2 #29 0x000022e2aa1bc6d2 in fit::internal::target<(lambda at../../src/devices/bin/driver_runtime/dispatcher.cc:337:7), true, false, void, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>>::invoke(void*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/internal/function.h:128 <libdriver_runtime.so>+0xdc6d2 #30 0x000022e2aa1aa1e5 in fit::internal::function_base<8UL, true, void(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>)>::invoke(const fit::internal::function_base<8UL, true, void (std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>)>*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/internal/function.h:505 <libdriver_runtime.so>+0xca1e5 #31 0x000022e2aa1a9e32 in fit::function_impl<8UL, true, void(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>)>::operator()(const fit::function_impl<8UL, true, void (std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>)>*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/function.h:300 <libdriver_runtime.so>+0xc9e32 #32 0x000022e2aa193444 in driver_runtime::Dispatcher::event_waiter::invoke_callback(driver_runtime::Dispatcher::event_waiter*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.h:299 <libdriver_runtime.so>+0xb3444 #33 0x000022e2aa192feb in driver_runtime::Dispatcher::event_waiter::handle_event(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, async_dispatcher_t*, async::wait_base*, zx_status_t, zx_packet_signal_t const*) ../../src/devices/bin/driver_runtime/dispatcher.cc:1259 <libdriver_runtime.so>+0xb2feb #34 0x000022e2aa1bcf74 in async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>::handle_event(async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>*, async_dispatcher_t*, async::wait_base*, zx_status_t, zx_packet_signal_t const*) ../../src/devices/bin/driver_runtime/async_loop_owned_event_handler.h:59 <libdriver_runtime.so>+0xdcf74 #35 0x000022e2aa1bd1cb in async::wait_method<async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>, &async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>::handle_event>::call_handler(async_dispatcher_t*, async_wait_t*, zx_status_t, zx_packet_signal_t const*) ../../zircon/system/ulib/async/include/lib/async/cpp/wait.h:201 <libdriver_runtime.so>+0xdd1cb #36 0x000022e2aa2303a9 in async_loop_dispatch_wait(async_loop_t*, async_wait_t*, zx_status_t, zx_packet_signal_t const*) ../../zircon/system/ulib/async-loop/loop.c:381 <libdriver_runtime.so>+0x1503a9 #37 0x000022e2aa229a82 in async_loop_run_once(async_loop_t*, zx_time_t) ../../zircon/system/ulib/async-loop/loop.c:330 <libdriver_runtime.so>+0x149a82 #38 0x000022e2aa229102 in async_loop_run(async_loop_t*, zx_time_t, _Bool) ../../zircon/system/ulib/async-loop/loop.c:288 <libdriver_runtime.so>+0x149102 #39 0x000022e2aa22aeb7 in async_loop_run_thread(void*) ../../zircon/system/ulib/async-loop/loop.c:840 <libdriver_runtime.so>+0x14aeb7 #40 0x000041a874980f1c in start_c11(void*) ../../zircon/third_party/ulib/musl/pthread/pthread_create.c:55 <libc.so>+0xd7f1c #41 0x000041a874aabe8d in thread_trampoline(uintptr_t, uintptr_t) ../../zircon/system/ulib/runtime/thread.cc:100 <libc.so>+0x202e8d Link: acpica/acpica@c1470833 Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit 24d9609) Orabug: 37311726 Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Reviewed-by: Thomas Tai <thomas.tai@oracle.com> Signed-off-by: Vijayendra Suman <vijayendra.suman@oracle.com>

[ Upstream commit 9ca3144 ] The referenced commits introduced a two-step process for deleting FTEs: - Lock the FTE, delete it from hardware, set the hardware deletion function to NULL and unlock the FTE. - Lock the parent flow group, delete the software copy of the FTE, and remove it from the xarray. However, this approach encounters a race condition if a rule with the same match value is added simultaneously. In this scenario, fs_core may set the hardware deletion function to NULL prematurely, causing a panic during subsequent rule deletions. To prevent this, ensure the active flag of the FTE is checked under a lock, which will prevent the fs_core layer from attaching a new steering rule to an FTE that is in the process of deletion. [ 438.967589] MOSHE: 2496 mlx5_del_flow_rules del_hw_func [ 438.968205] ------------[ cut here ]------------ [ 438.968654] refcount_t: decrement hit 0; leaking memory. [ 438.969249] WARNING: CPU: 0 PID: 8957 at lib/refcount.c:31 refcount_warn_saturate+0xfb/0x110 [ 438.970054] Modules linked in: act_mirred cls_flower act_gact sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core zram zsmalloc fuse [last unloaded: cls_flower] [ 438.973288] CPU: 0 UID: 0 PID: 8957 Comm: tc Not tainted 6.12.0-rc1+ #8 [ 438.973888] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 438.974874] RIP: 0010:refcount_warn_saturate+0xfb/0x110 [ 438.975363] Code: 40 66 3b 82 c6 05 16 e9 4d 01 01 e8 1f 7c a0 ff 0f 0b c3 cc cc cc cc 48 c7 c7 10 66 3b 82 c6 05 fd e8 4d 01 01 e8 05 7c a0 ff <0f> 0b c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 [ 438.976947] RSP: 0018:ffff888124a53610 EFLAGS: 00010286 [ 438.977446] RAX: 0000000000000000 RBX: ffff888119d56de0 RCX: 0000000000000000 [ 438.978090] RDX: ffff88852c828700 RSI: ffff88852c81b3c0 RDI: ffff88852c81b3c0 [ 438.978721] RBP: ffff888120fa0e88 R08: 0000000000000000 R09: ffff888124a534b0 [ 438.979353] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888119d56de0 [ 438.979979] R13: ffff888120fa0ec0 R14: ffff888120fa0ee8 R15: ffff888119d56de0 [ 438.980607] FS: 00007fe6dcc0f800(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 [ 438.983984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 438.984544] CR2: 00000000004275e0 CR3: 0000000186982001 CR4: 0000000000372eb0 [ 438.985205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 438.985842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 438.986507] Call Trace: [ 438.986799] <TASK> [ 438.987070] ? __warn+0x7d/0x110 [ 438.987426] ? refcount_warn_saturate+0xfb/0x110 [ 438.987877] ? report_bug+0x17d/0x190 [ 438.988261] ? prb_read_valid+0x17/0x20 [ 438.988659] ? handle_bug+0x53/0x90 [ 438.989054] ? exc_invalid_op+0x14/0x70 [ 438.989458] ? asm_exc_invalid_op+0x16/0x20 [ 438.989883] ? refcount_warn_saturate+0xfb/0x110 [ 438.990348] mlx5_del_flow_rules+0x2f7/0x340 [mlx5_core] [ 438.990932] __mlx5_eswitch_del_rule+0x49/0x170 [mlx5_core] [ 438.991519] ? mlx5_lag_is_sriov+0x3c/0x50 [mlx5_core] [ 438.992054] ? xas_load+0x9/0xb0 [ 438.992407] mlx5e_tc_rule_unoffload+0x45/0xe0 [mlx5_core] [ 438.993037] mlx5e_tc_del_fdb_flow+0x2a6/0x2e0 [mlx5_core] [ 438.993623] mlx5e_flow_put+0x29/0x60 [mlx5_core] [ 438.994161] mlx5e_delete_flower+0x261/0x390 [mlx5_core] [ 438.994728] tc_setup_cb_destroy+0xb9/0x190 [ 438.995150] fl_hw_destroy_filter+0x94/0xc0 [cls_flower] [ 438.995650] fl_change+0x11a4/0x13c0 [cls_flower] [ 438.996105] tc_new_tfilter+0x347/0xbc0 [ 438.996503] ? ___slab_alloc+0x70/0x8c0 [ 438.996929] rtnetlink_rcv_msg+0xf9/0x3e0 [ 438.997339] ? __netlink_sendskb+0x4c/0x70 [ 438.997751] ? netlink_unicast+0x286/0x2d0 [ 438.998171] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 438.998625] netlink_rcv_skb+0x54/0x100 [ 438.999020] netlink_unicast+0x203/0x2d0 [ 438.999421] netlink_sendmsg+0x1e4/0x420 [ 438.999820] __sock_sendmsg+0xa1/0xb0 [ 439.000203] ____sys_sendmsg+0x207/0x2a0 [ 439.000600] ? copy_msghdr_from_user+0x6d/0xa0 [ 439.001072] ___sys_sendmsg+0x80/0xc0 [ 439.001459] ? ___sys_recvmsg+0x8b/0xc0 [ 439.001848] ? generic_update_time+0x4d/0x60 [ 439.002282] __sys_sendmsg+0x51/0x90 [ 439.002658] do_syscall_64+0x50/0x110 [ 439.003040] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 718ce4d ("net/mlx5: Consolidate update FTE for all removal changes") Fixes: cefc235 ("net/mlx5: Fix FTE cleanup") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit 5b47c2f47c2fe921681f4a4fe2790375e6c04cdd) Signed-off-by: Vijayendra Suman <vijayendra.suman@oracle.com>

[ Upstream commit 9ca3144 ] The referenced commits introduced a two-step process for deleting FTEs: - Lock the FTE, delete it from hardware, set the hardware deletion function to NULL and unlock the FTE. - Lock the parent flow group, delete the software copy of the FTE, and remove it from the xarray. However, this approach encounters a race condition if a rule with the same match value is added simultaneously. In this scenario, fs_core may set the hardware deletion function to NULL prematurely, causing a panic during subsequent rule deletions. To prevent this, ensure the active flag of the FTE is checked under a lock, which will prevent the fs_core layer from attaching a new steering rule to an FTE that is in the process of deletion. [ 438.967589] MOSHE: 2496 mlx5_del_flow_rules del_hw_func [ 438.968205] ------------[ cut here ]------------ [ 438.968654] refcount_t: decrement hit 0; leaking memory. [ 438.969249] WARNING: CPU: 0 PID: 8957 at lib/refcount.c:31 refcount_warn_saturate+0xfb/0x110 [ 438.970054] Modules linked in: act_mirred cls_flower act_gact sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core zram zsmalloc fuse [last unloaded: cls_flower] [ 438.973288] CPU: 0 UID: 0 PID: 8957 Comm: tc Not tainted 6.12.0-rc1+ #8 [ 438.973888] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 438.974874] RIP: 0010:refcount_warn_saturate+0xfb/0x110 [ 438.975363] Code: 40 66 3b 82 c6 05 16 e9 4d 01 01 e8 1f 7c a0 ff 0f 0b c3 cc cc cc cc 48 c7 c7 10 66 3b 82 c6 05 fd e8 4d 01 01 e8 05 7c a0 ff <0f> 0b c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 [ 438.976947] RSP: 0018:ffff888124a53610 EFLAGS: 00010286 [ 438.977446] RAX: 0000000000000000 RBX: ffff888119d56de0 RCX: 0000000000000000 [ 438.978090] RDX: ffff88852c828700 RSI: ffff88852c81b3c0 RDI: ffff88852c81b3c0 [ 438.978721] RBP: ffff888120fa0e88 R08: 0000000000000000 R09: ffff888124a534b0 [ 438.979353] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888119d56de0 [ 438.979979] R13: ffff888120fa0ec0 R14: ffff888120fa0ee8 R15: ffff888119d56de0 [ 438.980607] FS: 00007fe6dcc0f800(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 [ 438.983984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 438.984544] CR2: 00000000004275e0 CR3: 0000000186982001 CR4: 0000000000372eb0 [ 438.985205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 438.985842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 438.986507] Call Trace: [ 438.986799] <TASK> [ 438.987070] ? __warn+0x7d/0x110 [ 438.987426] ? refcount_warn_saturate+0xfb/0x110 [ 438.987877] ? report_bug+0x17d/0x190 [ 438.988261] ? prb_read_valid+0x17/0x20 [ 438.988659] ? handle_bug+0x53/0x90 [ 438.989054] ? exc_invalid_op+0x14/0x70 [ 438.989458] ? asm_exc_invalid_op+0x16/0x20 [ 438.989883] ? refcount_warn_saturate+0xfb/0x110 [ 438.990348] mlx5_del_flow_rules+0x2f7/0x340 [mlx5_core] [ 438.990932] __mlx5_eswitch_del_rule+0x49/0x170 [mlx5_core] [ 438.991519] ? mlx5_lag_is_sriov+0x3c/0x50 [mlx5_core] [ 438.992054] ? xas_load+0x9/0xb0 [ 438.992407] mlx5e_tc_rule_unoffload+0x45/0xe0 [mlx5_core] [ 438.993037] mlx5e_tc_del_fdb_flow+0x2a6/0x2e0 [mlx5_core] [ 438.993623] mlx5e_flow_put+0x29/0x60 [mlx5_core] [ 438.994161] mlx5e_delete_flower+0x261/0x390 [mlx5_core] [ 438.994728] tc_setup_cb_destroy+0xb9/0x190 [ 438.995150] fl_hw_destroy_filter+0x94/0xc0 [cls_flower] [ 438.995650] fl_change+0x11a4/0x13c0 [cls_flower] [ 438.996105] tc_new_tfilter+0x347/0xbc0 [ 438.996503] ? ___slab_alloc+0x70/0x8c0 [ 438.996929] rtnetlink_rcv_msg+0xf9/0x3e0 [ 438.997339] ? __netlink_sendskb+0x4c/0x70 [ 438.997751] ? netlink_unicast+0x286/0x2d0 [ 438.998171] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 438.998625] netlink_rcv_skb+0x54/0x100 [ 438.999020] netlink_unicast+0x203/0x2d0 [ 438.999421] netlink_sendmsg+0x1e4/0x420 [ 438.999820] __sock_sendmsg+0xa1/0xb0 [ 439.000203] ____sys_sendmsg+0x207/0x2a0 [ 439.000600] ? copy_msghdr_from_user+0x6d/0xa0 [ 439.001072] ___sys_sendmsg+0x80/0xc0 [ 439.001459] ? ___sys_recvmsg+0x8b/0xc0 [ 439.001848] ? generic_update_time+0x4d/0x60 [ 439.002282] __sys_sendmsg+0x51/0x90 [ 439.002658] do_syscall_64+0x50/0x110 [ 439.003040] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 718ce4d ("net/mlx5: Consolidate update FTE for all removal changes") Fixes: cefc235 ("net/mlx5: Fix FTE cleanup") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit 0d568258f99f2076ab02e9234cbabbd43e12f30e) Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>

…le_direct_reclaim() commit 6aaced5abd32e2a57cd94fd64f824514d0361da8 upstream. The task sometimes continues looping in throttle_direct_reclaim() because allow_direct_reclaim(pgdat) keeps returning false. #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c #2 [ffff80002cb6f990] schedule at ffff800008abc50c #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4 At this point, the pgdat contains the following two zones: NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32" SIZE: 20480 MIN/LOW/HIGH: 11/28/45 VM_STAT: NR_FREE_PAGES: 359 NR_ZONE_INACTIVE_ANON: 18813 NR_ZONE_ACTIVE_ANON: 0 NR_ZONE_INACTIVE_FILE: 50 NR_ZONE_ACTIVE_FILE: 0 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal" SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264 VM_STAT: NR_FREE_PAGES: 146 NR_ZONE_INACTIVE_ANON: 94668 NR_ZONE_ACTIVE_ANON: 3 NR_ZONE_INACTIVE_FILE: 735 NR_ZONE_ACTIVE_FILE: 78 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of inactive/active file-backed pages calculated in zone_reclaimable_pages() based on the result of zone_page_state_snapshot() is zero. Additionally, since this system lacks swap, the calculation of inactive/ active anonymous pages is skipped. crash> p nr_swap_pages nr_swap_pages = $1937 = { counter = 0 } As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having free pages significantly exceeding the high watermark. The problem is that the pgdat->kswapd_failures hasn't been incremented. crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures $1935 = 0x0 This is because the node deemed balanced. The node balancing logic in balance_pgdat() evaluates all zones collectively. If one or more zones (e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the entire node is deemed balanced. This causes balance_pgdat() to exit early before incrementing the kswapd_failures, as it considers the overall memory state acceptable, even though some zones (like ZONE_NORMAL) remain under significant pressure. The patch ensures that zone_reclaimable_pages() includes free pages (NR_FREE_PAGES) in its calculation when no other reclaimable pages are available (e.g., file-backed or anonymous pages). This change prevents zones like ZONE_DMA32, which have sufficient free pages, from being mistakenly deemed unreclaimable. By doing so, the patch ensures proper node balancing, avoids masking pressure on other zones like ZONE_NORMAL, and prevents infinite loops in throttle_direct_reclaim() caused by allow_direct_reclaim(pgdat) repeatedly returning false. The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused by a node being incorrectly deemed balanced despite pressure in certain zones, such as ZONE_NORMAL. This issue arises from zone_reclaimable_pages() returning 0 for zones without reclaimable file- backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient free pages to be skipped. The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored during reclaim, masking pressure in other zones. Consequently, pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback mechanisms in allow_direct_reclaim() from being triggered, leading to an infinite loop in throttle_direct_reclaim(). This patch modifies zone_reclaimable_pages() to account for free pages (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones with sufficient free pages are not skipped, enabling proper balancing and reclaim behavior. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations") Signed-off-by: Seiji Nishikawa <snishika@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 63eac98d6f0898229f515cb62fe4e4db2430e99c) Signed-off-by: Vijayendra Suman <vijayendra.suman@oracle.com>

The cited commit adds a compeletion to remove dependency on rtnl lock. But it causes a deadlock for multiple encapsulations: crash> bt ffff8aece8a64000 PID: 1514557 TASK: ffff8aece8a64000 CPU: 3 COMMAND: "tc" #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418 #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898 #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8 #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core] #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core] #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core] #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core] #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core] #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core] #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core] #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8 #13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower] #14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower] #15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047 #16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31 #17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853 #18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835 #19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27 #20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245 #21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482 #22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a #23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2 #24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2 #25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f #26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8 #27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c crash> bt 0xffff8aeb07544000 PID: 1110766 TASK: ffff8aeb07544000 CPU: 0 COMMAND: "kworker/u20:9" #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418 #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88 #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core] #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core] #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core] #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012 #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f After the first encap is attached, flow will be added to encap entry's flows list. If neigh update is running at this time, the following encaps of the flow can't hold the encap_tbl_lock and sleep. If neigh update thread is waiting for that flow's init_done, deadlock happens. Fix it by holding lock outside of the for loop. If neigh update is running, prevent encap flows from offloading. Since the lock is held outside of the for loop, concurrent creation of encap entries is not allowed. So remove unnecessary wait_for_completion call for res_ready. Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Orabug: 35383105 (cherry picked from commit 37c3b9f) cherry-pick-repo=kernel/git/torvalds/linux.git unmodified-from-upstream: 37c3b9f Signed-off-by: Mikhael Goikhman <migo@nvidia.com> Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>

The cited commit holds encap tbl lock unconditionally when setting up dests. But it may cause the following deadlock: PID: 1063722 TASK: ffffa062ca5d0000 CPU: 13 COMMAND: "handler8" #0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91 #1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb #2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528 #3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb #4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb #5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core] #6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core] #7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core] #8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core] #9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core] #10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core] #11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core] #12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core] #13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8 #14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower] #15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower] #16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0 [1031218.028143] wait_for_completion+0x24/0x30 [1031218.028589] mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core] [1031218.029256] mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core] [1031218.029885] process_one_work+0x24e/0x510 Actually no need to hold encap tbl lock if there is no encap action. Fix it by checking if encap action exists or not before holding encap tbl lock. Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Orabug: 35622106 (cherry picked from commit 93a3319) cherry-pick-repo=kernel/git/torvalds/linux.git unmodified-from-upstream: 93a3319 Signed-off-by: Mikhael Goikhman <migo@nvidia.com> Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>