-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NVIDIA: SAUCE: iommu/arm-smmu-v3: Allow default substream bypass with… #22
Closed
jamieNguyenNVIDIA
wants to merge
128
commits into
NVIDIA:24.04_linux-nvidia
from
jamieNguyenNVIDIA:jamien/SAUCE_pasid
Closed
NVIDIA: SAUCE: iommu/arm-smmu-v3: Allow default substream bypass with… #22
jamieNguyenNVIDIA
wants to merge
128
commits into
NVIDIA:24.04_linux-nvidia
from
jamieNguyenNVIDIA:jamien/SAUCE_pasid
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Ian May <ian.may@canonical.com>
…dversion" This reverts commit 47d27f2. We need to revert this to avoid regressing any modules used in Jammy. Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2038972 Properties: no-test-build Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
Ignore: yes Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2046137 Properties: no-test-build Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com>
Ignore: yes Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com>
Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com>
Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com>
Ignore: yes Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Ignore: yes Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Ignore: yes Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2055128 Properties: no-test-build Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
…ain/d2024.02.07) BugLink: https://bugs.launchpad.net/bugs/1786013 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Ignore: yes Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2069777 Currently, the driver has two behaviors to deal with new & unsupported performance blocks reported by the firmware: 1. For register and unknown block types, the driver will fail to load with the following error message: [ 4510.956369] mlxbf-pmc: probe of MLNXBFD2:00 failed with error -22 2. For counter and crspace blocks, the driver will load and sysfs files will be created but getting the contents of event_list or trying to setup the counter will fail Instead, let's ignore and log unsupported blocks. This means the driver will always load and unsupported blocks will never show up in sysfs. Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/f8e2e6210b43e825b69824b420c801cd513d401d.1708635408.git.luizcap@redhat.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> (cherry picked from commit c0459ee) Signed-off-by: David Thompson <davthompson@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Noah Wager <noah.wager@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2069777 These need to be signed for the error handling to work. The mlxbf_pmc_get_event_num() function returns int so int type is correct. Fixes: 1ae9ffd ("platform/mellanox: mlxbf-pmc: Cleanup signed/unsigned mix-up") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/a4af764e-990b-4ebd-b342-852844374032@moroto.mountain Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> (cherry picked from commit 7c8772f) Signed-off-by: David Thompson <davthompson@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Noah Wager <noah.wager@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2071654 We enumerate devices by attempting config reads to the Vendor ID of each possible device. On conventional PCI, if no device responds, the read terminates with a Master Abort (PCI r3.0, sec 6.1). On PCIe, the config read is terminated as an Unsupported Request (PCIe r6.0, sec 2.3.2, 7.5.1.3.7). In either case, if the read addressed a device below a bridge, it is logged by setting "Received Master Abort" in the bridge Secondary Status register. Clear any errors logged in the Secondary Status register after enumeration. Link: https://lore.kernel.org/r/20240116143258.483235-1-vidyas@nvidia.com Signed-off-by: Vidya Sagar <vidyas@nvidia.com> [bhelgaas: simplify commit log] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit 7bf9d2a) Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Noah Wager <noah.wager@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2071655 Start switching iomap_copy routines over to use #define and arch provided inline/macro functions instead of weak symbols. Inline functions allow more compiler optimization and this is often a driver hot path. x86 has the only weak implementation for __iowrite32_copy(), so replace it with a static inline containing the same single instruction inline assembly. The compiler will generate the "mov edx,ecx" in a more optimal way. Remove iomap_copy_64.S Link: https://lore.kernel.org/r/1-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 20516d6) Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Noah Wager <noah.wager@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2071655 It is trivial to implement an inline to do this, so provide it in the s390 headers. Like the 64 bit version it should just invoke zpci_memcpy_toio() with the correct size. Link: https://lore.kernel.org/r/2-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com Acked-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 6ae798c) Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Noah Wager <noah.wager@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2071655 Complete switching the __iowriteXX_copy() routines over to use #define and arch provided inline/macro functions instead of weak symbols. S390 has an implementation that simply calls another memcpy function. Inline this so the callers don't have to do two jumps. Link: https://lore.kernel.org/r/3-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com Acked-by: Niklas Schnelle <schnelle@linux.ibm.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit e7bc47b) Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Noah Wager <noah.wager@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2071655 The kernel provides driver support for using write combining IO memory through the __iowriteXX_copy() API which is commonly used as an optional optimization to generate 16/32/64 byte MemWr TLPs in a PCIe environment. iomap_copy.c provides a generic implementation as a simple 4/8 byte at a time copy loop that has worked well with past ARM64 CPUs, giving a high frequency of large TLPs being successfully formed. However modern ARM64 CPUs are quite sensitive to how the write combining CPU HW is operated and a compiler generated loop with intermixed load/store is not sufficient to frequently generate a large TLP. The CPUs would like to see the entire TLP generated by consecutive store instructions from registers. Compilers like gcc tend to intermix loads and stores and have poor code generation, in part, due to the ARM64 situation that writeq() does not codegen anything other than "[xN]". However even with that resolved compilers like clang still do not have good code generation. This means on modern ARM64 CPUs the rate at which __iowriteXX_copy() successfully generates large TLPs is very small (less than 1 in 10,000) tries), to the point that the use of WC is pointless. Implement __iowrite32/64_copy() specifically for ARM64 and use inline assembly to build consecutive blocks of STR instructions. Provide direct support for 64/32/16 large TLP generation in this manner. Optimize for common constant lengths so that the compiler can directly inline the store blocks. This brings the frequency of large TLP generation up to a high level that is comparable with older CPU generations. As the __iowriteXX_copy() family of APIs is intended for use with WC incorporate the DGH hint directly into the function. Link: https://lore.kernel.org/r/4-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit ead7911) Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Noah Wager <noah.wager@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2071655 Now that the ARM64 arch implementation does the DGH as part of __iowrite64_copy() there is no reason to open code this in drivers. Link: https://lore.kernel.org/r/5-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com Reviewed-by: Jijie Shao<shaojijie@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 2b7a5e1) Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Noah Wager <noah.wager@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
Ignore: yes Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2072186 Properties: no-test-build Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2073811 PCIe ACS settings control the level of isolation and the possible P2P paths between devices. With greater isolation the kernel will create smaller iommu_groups and with less isolation there is more HW that can achieve P2P transfers. From a virtualization perspective all devices in the same iommu_group must be assigned to the same VM as they lack security isolation. There is no way for the kernel to automatically know the correct ACS settings for any given system and workload. Existing command line options (e.g., disable_acs_redir) allow only for large scale change, disabling all isolation, but this is not sufficient for more complex cases. Add a kernel command-line option 'config_acs' to directly control all the ACS bits for specific devices, which allows the operator to setup the right level of isolation to achieve the desired P2P configuration. The definition is future proof; when new ACS bits are added to the spec the open syntax can be extended. ACS needs to be setup early in the kernel boot as the ACS settings affect how iommu_groups are formed. iommu_group formation is a one time event during initial device discovery, so changing ACS bits after kernel boot can result in an inaccurate view of the iommu_groups compared to the current isolation configuration. ACS applies to PCIe Downstream Ports and multi-function devices. The default ACS settings are strict and deny any direct traffic between two functions. This results in the smallest iommu_group the HW can support. Frequently these values result in slow or non-working P2PDMA. ACS offers a range of security choices controlling how traffic is allowed to go directly between two devices. Some popular choices: - Full prevention - Translated requests can be direct, with various options - Asymmetric direct traffic, A can reach B but not the reverse - All traffic can be direct Along with some other less common ones for special topologies. The intention is that this option would be used with expert knowledge of the HW capability and workload to achieve the desired configuration. Link: https://lore.kernel.org/r/20240625153150.159310-1-vidyas@nvidia.com Signed-off-by: Vidya Sagar <vidyas@nvidia.com> [bhelgaas: add example, tidy printk formats] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit 47c8846) Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Acked-by: Noah Wager <noah.wager@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2075396 Commit 3bd786f ("mm: convert do_set_pte() to set_pte_range()") replaced do_set_pte() with set_pte_range() and that introduced a regression in the following faulting path of non-anonymous vmas which caused the PTE for the faulting address to be marked as old instead of young. handle_pte_fault() do_pte_missing() do_fault() do_read_fault() || do_cow_fault() || do_shared_fault() finish_fault() set_pte_range() The polarity of prefault calculation is incorrect. This leads to prefault being incorrectly set for the faulting address. The following check will incorrectly mark the PTE old rather than young. On some architectures this will cause a double fault to mark it young when the access is retried. if (prefault && arch_wants_old_prefaulted_pte()) entry = pte_mkold(entry); On a subsequent fault on the same address, the faulting path will see a non NULL vmf->pte and instead of reaching the do_pte_missing() path, PTE will then be correctly marked young in handle_pte_fault() itself. Due to this bug, performance degradation in the fault handling path will be observed due to unnecessary double faulting. Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com Fixes: 3bd786f ("mm: convert do_set_pte() to set_pte_range()") Signed-off-by: Ram Tummala <rtummala@nvidia.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (backported from commit 4cd7ba1) [context changes] Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Tested-by: Carol Soto <csoto@nvidia.com> Acked-by: Brad Figg <bfigg@nvidia.com> Acked-by: Noah Wager <noah.wager@canonical.com> Acked-by: Jacob Martin <jacob.martin@canonical.com> Signed-off-by: Brad Figg <bfigg@nvidia.com>
… a pasid support When an iommu_domain is set to IOMMU_DOMAIN_IDENTITY, the driver would skip the allocation of a CD table and set the CONFIG field of the STE to STRTAB_STE_0_CFG_BYPASS. This works well for devices that only have one substream, i.e. PASID disabled. However, there could be a use case, for a pasid capable device, that allows bypassing the translation at the default substream while still enabling the pasid feature, which means the driver should not skip the allocation of a CD table nor simply bypass the CONFIG field. Instead, the S1DSS field should be set to STRTAB_STE_1_S1DSS_BYPASS and the SHCFG field should be set to STRTAB_STE_1_SHCFG_INCOMING. Add s1dss in struct arm_smmu_s1_cfg, to allow a configuration in the finalise() to support this use case. Also, according to "13.5 Summary of attribute/permission configuration fields" in the reference manual, the SHCFG field value is irrelevant. So, set the SHCFG field of the STE always to STRTAB_STE_1_SHCFG_INCOMING for simplification. Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Pritesh Raithatha <praithatha@nvidia.com> Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Tested-by: Matt Ochs <mochs@nvidia.com>
|
nvidia-bfigg
force-pushed
the
24.04_linux-nvidia
branch
from
August 10, 2024 15:00
29686e1
to
3200be9
Compare
nvidia-bfigg
force-pushed
the
24.04_linux-nvidia
branch
from
August 17, 2024 15:00
3200be9
to
6ad2527
Compare
nvidia-bfigg
pushed a commit
that referenced
this pull request
Aug 17, 2024
BugLink: https://bugs.launchpad.net/bugs/2073603 [ Upstream commit 769e6a1 ] ui_browser__show() is capturing the input title that is stack allocated memory in hist_browser__run(). Avoid a use after return by strdup-ing the string. Committer notes: Further explanation from Ian Rogers: My command line using tui is: $ sudo bash -c 'rm /tmp/asan.log*; export ASAN_OPTIONS="log_path=/tmp/asan.log"; /tmp/perf/perf mem record -a sleep 1; /tmp/perf/perf mem report' I then go to the perf annotate view and quit. This triggers the asan error (from the log file): ``` ==1254591==ERROR: AddressSanitizer: stack-use-after-return on address 0x7f2813331920 at pc 0x7f28180 65991 bp 0x7fff0a21c750 sp 0x7fff0a21bf10 READ of size 80 at 0x7f2813331920 thread T0 #0 0x7f2818065990 in __interceptor_strlen ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:461 #1 0x7f2817698251 in SLsmg_write_wrapped_string (/lib/x86_64-linux-gnu/libslang.so.2+0x98251) #2 0x7f28176984b9 in SLsmg_write_nstring (/lib/x86_64-linux-gnu/libslang.so.2+0x984b9) #3 0x55c94045b365 in ui_browser__write_nstring ui/browser.c:60 #4 0x55c94045c558 in __ui_browser__show_title ui/browser.c:266 #5 0x55c94045c776 in ui_browser__show ui/browser.c:288 #6 0x55c94045c06d in ui_browser__handle_resize ui/browser.c:206 #7 0x55c94047979b in do_annotate ui/browsers/hists.c:2458 #8 0x55c94047fb17 in evsel__hists_browse ui/browsers/hists.c:3412 #9 0x55c940480a0c in perf_evsel_menu__run ui/browsers/hists.c:3527 #10 0x55c940481108 in __evlist__tui_browse_hists ui/browsers/hists.c:3613 #11 0x55c9404813f7 in evlist__tui_browse_hists ui/browsers/hists.c:3661 #12 0x55c93ffa253f in report__browse_hists tools/perf/builtin-report.c:671 #13 0x55c93ffa58ca in __cmd_report tools/perf/builtin-report.c:1141 #14 0x55c93ffaf159 in cmd_report tools/perf/builtin-report.c:1805 #15 0x55c94000c05c in report_events tools/perf/builtin-mem.c:374 #16 0x55c94000d96d in cmd_mem tools/perf/builtin-mem.c:516 #17 0x55c9400e44ee in run_builtin tools/perf/perf.c:350 #18 0x55c9400e4a5a in handle_internal_command tools/perf/perf.c:403 #19 0x55c9400e4e22 in run_argv tools/perf/perf.c:447 #20 0x55c9400e53ad in main tools/perf/perf.c:561 #21 0x7f28170456c9 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 #22 0x7f2817045784 in __libc_start_main_impl ../csu/libc-start.c:360 #23 0x55c93ff544c0 in _start (/tmp/perf/perf+0x19a4c0) (BuildId: 84899b0e8c7d3a3eaa67b2eb35e3d8b2f8cd4c93) Address 0x7f2813331920 is located in stack of thread T0 at offset 32 in frame #0 0x55c94046e85e in hist_browser__run ui/browsers/hists.c:746 This frame has 1 object(s): [32, 192) 'title' (line 747) <== Memory access at offset 32 is inside this variable HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork ``` hist_browser__run isn't on the stack so the asan error looks legit. There's no clean init/exit on struct ui_browser so I may be trading a use-after-return for a memory leak, but that seems look a good trade anyway. Fixes: 05e8b08 ("perf ui browser: Stop using 'self'") Signed-off-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Cc: Ben Gainey <ben.gainey@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Li Dong <lidong@vivo.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Paran Lee <p4ranlee@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Sun Haiyong <sunhaiyong@loongson.cn> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Yanteng Si <siyanteng@loongson.cn> Cc: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20240507183545.1236093-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Portia Stephens <portia.stephens@canonical.com> Signed-off-by: Roxana Nicolescu <roxana.nicolescu@canonical.com>
nvidia-bfigg
pushed a commit
that referenced
this pull request
Aug 17, 2024
…uddy pages BugLink: https://bugs.launchpad.net/bugs/2073788 commit 8cf360b upstream. When I did memory failure tests recently, below panic occurs: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page)) ------------[ cut here ]------------ kernel BUG at include/linux/page-flags.h:1009! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:__del_page_from_free_list+0x151/0x180 RSP: 0018:ffffa49c90437998 EFLAGS: 00000046 RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0 RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69 R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80 R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009 FS: 00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0 Call Trace: <TASK> __rmqueue_pcplist+0x23b/0x520 get_page_from_freelist+0x26b/0xe40 __alloc_pages_noprof+0x113/0x1120 __folio_alloc_noprof+0x11/0xb0 alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130 __alloc_fresh_hugetlb_folio+0xe7/0x140 alloc_pool_huge_folio+0x68/0x100 set_max_huge_pages+0x13d/0x340 hugetlb_sysctl_handler_common+0xe8/0x110 proc_sys_call_handler+0x194/0x280 vfs_write+0x387/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff916114887 RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887 RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003 RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0 R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004 R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00 </TASK> Modules linked in: mce_inject hwpoison_inject ---[ end trace 0000000000000000 ]--- And before the panic, there had an warning about bad page state: BUG: Bad page state in process page-types pfn:8cee00 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) page_type: 0xffffff7f(buddy) raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000 page dumped because: nonzero mapcount Modules linked in: mce_inject hwpoison_inject CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22 Call Trace: <TASK> dump_stack_lvl+0x83/0xa0 bad_page+0x63/0xf0 free_unref_page+0x36e/0x5c0 unpoison_memory+0x50b/0x630 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110 debugfs_attr_write+0x42/0x60 full_proxy_write+0x5b/0x80 vfs_write+0xcd/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f189a514887 RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887 RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003 RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8 R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040 </TASK> The root cause should be the below race: memory_failure try_memory_failure_hugetlb me_huge_page __page_handle_poison dissolve_free_hugetlb_folio drain_all_pages -- Buddy page can be isolated e.g. for compaction. take_page_off_buddy -- Failed as page is not in the buddy list. -- Page can be putback into buddy after compaction. page_ref_inc -- Leads to buddy page with refcnt = 1. Then unpoison_memory() can unpoison the page and send the buddy page back into buddy list again leading to the above bad page state warning. And bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to allocate this page. Fix this issue by only treating __page_handle_poison() as successful when it returns 1. Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com Fixes: ceaf8fb ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Portia Stephens <portia.stephens@canonical.com> Signed-off-by: Roxana Nicolescu <roxana.nicolescu@canonical.com>
nvidia-bfigg
force-pushed
the
24.04_linux-nvidia
branch
from
August 21, 2024 15:00
6ad2527
to
9b809a2
Compare
nvidia-bfigg
force-pushed
the
24.04_linux-nvidia
branch
from
September 3, 2024 15:01
9b809a2
to
7d88485
Compare
nvidia-bfigg
force-pushed
the
24.04_linux-nvidia
branch
3 times, most recently
from
October 8, 2024 15:00
d1895f3
to
b0724a3
Compare
nvidia-bfigg
force-pushed
the
24.04_linux-nvidia
branch
from
October 18, 2024 15:00
296a1e0
to
e688282
Compare
nvidia-bfigg
pushed a commit
that referenced
this pull request
Nov 30, 2024
When a binder reference is cleaned up, any freeze work queued in the associated process should also be removed. Otherwise, the reference is freed while its ref->freeze.work is still queued in proc->work leading to a use-after-free issue as shown by the following KASAN report: ================================================================== BUG: KASAN: slab-use-after-free in binder_release_work+0x398/0x3d0 Read of size 8 at addr ffff31600ee91488 by task kworker/5:1/211 CPU: 5 UID: 0 PID: 211 Comm: kworker/5:1 Not tainted 6.11.0-rc7-00382-gfc6c92196396 #22 Hardware name: linux,dummy-virt (DT) Workqueue: events binder_deferred_func Call trace: binder_release_work+0x398/0x3d0 binder_deferred_func+0xb60/0x109c process_one_work+0x51c/0xbd4 worker_thread+0x608/0xee8 Allocated by task 703: __kmalloc_cache_noprof+0x130/0x280 binder_thread_write+0xdb4/0x42a0 binder_ioctl+0x18f0/0x25ac __arm64_sys_ioctl+0x124/0x190 invoke_syscall+0x6c/0x254 Freed by task 211: kfree+0xc4/0x230 binder_deferred_func+0xae8/0x109c process_one_work+0x51c/0xbd4 worker_thread+0x608/0xee8 ================================================================== This commit fixes the issue by ensuring any queued freeze work is removed when cleaning up a binder reference. Fixes: d579b04 ("binder: frozen notification") Cc: stable@vger.kernel.org Acked-by: Todd Kjos <tkjos@android.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20240926233632.821189-4-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
nvidia-bfigg
pushed a commit
that referenced
this pull request
Dec 13, 2024
This updates iso_sock_accept to use nested locking for the parent socket, to avoid lockdep warnings caused because the parent and child sockets are locked by the same thread: [ 41.585683] ============================================ [ 41.585688] WARNING: possible recursive locking detected [ 41.585694] 6.12.0-rc6+ #22 Not tainted [ 41.585701] -------------------------------------------- [ 41.585705] iso-tester/3139 is trying to acquire lock: [ 41.585711] ffff988b29530a58 (sk_lock-AF_BLUETOOTH) at: bt_accept_dequeue+0xe3/0x280 [bluetooth] [ 41.585905] but task is already holding lock: [ 41.585909] ffff988b29533a58 (sk_lock-AF_BLUETOOTH) at: iso_sock_accept+0x61/0x2d0 [bluetooth] [ 41.586064] other info that might help us debug this: [ 41.586069] Possible unsafe locking scenario: [ 41.586072] CPU0 [ 41.586076] ---- [ 41.586079] lock(sk_lock-AF_BLUETOOTH); [ 41.586086] lock(sk_lock-AF_BLUETOOTH); [ 41.586093] *** DEADLOCK *** [ 41.586097] May be due to missing lock nesting notation [ 41.586101] 1 lock held by iso-tester/3139: [ 41.586107] #0: ffff988b29533a58 (sk_lock-AF_BLUETOOTH) at: iso_sock_accept+0x61/0x2d0 [bluetooth] Fixes: ccf74f2 ("Bluetooth: Add BTPROTO_ISO socket type") Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
nvidia-bfigg
pushed a commit
that referenced
this pull request
Dec 13, 2024
This fixes the circular locking dependency warning below, by releasing the socket lock before enterning iso_listen_bis, to avoid any potential deadlock with hdev lock. [ 75.307983] ====================================================== [ 75.307984] WARNING: possible circular locking dependency detected [ 75.307985] 6.12.0-rc6+ #22 Not tainted [ 75.307987] ------------------------------------------------------ [ 75.307987] kworker/u81:2/2623 is trying to acquire lock: [ 75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO) at: iso_connect_cfm+0x253/0x840 [bluetooth] [ 75.308021] but task is already holding lock: [ 75.308022] ffff8fdd61a10078 (&hdev->lock) at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth] [ 75.308053] which lock already depends on the new lock. [ 75.308054] the existing dependency chain (in reverse order) is: [ 75.308055] -> #1 (&hdev->lock){+.+.}-{3:3}: [ 75.308057] __mutex_lock+0xad/0xc50 [ 75.308061] mutex_lock_nested+0x1b/0x30 [ 75.308063] iso_sock_listen+0x143/0x5c0 [bluetooth] [ 75.308085] __sys_listen_socket+0x49/0x60 [ 75.308088] __x64_sys_listen+0x4c/0x90 [ 75.308090] x64_sys_call+0x2517/0x25f0 [ 75.308092] do_syscall_64+0x87/0x150 [ 75.308095] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 75.308098] -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}: [ 75.308100] __lock_acquire+0x155e/0x25f0 [ 75.308103] lock_acquire+0xc9/0x300 [ 75.308105] lock_sock_nested+0x32/0x90 [ 75.308107] iso_connect_cfm+0x253/0x840 [bluetooth] [ 75.308128] hci_connect_cfm+0x6c/0x190 [bluetooth] [ 75.308155] hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth] [ 75.308180] hci_le_meta_evt+0xe7/0x200 [bluetooth] [ 75.308206] hci_event_packet+0x21f/0x5c0 [bluetooth] [ 75.308230] hci_rx_work+0x3ae/0xb10 [bluetooth] [ 75.308254] process_one_work+0x212/0x740 [ 75.308256] worker_thread+0x1bd/0x3a0 [ 75.308258] kthread+0xe4/0x120 [ 75.308259] ret_from_fork+0x44/0x70 [ 75.308261] ret_from_fork_asm+0x1a/0x30 [ 75.308263] other info that might help us debug this: [ 75.308264] Possible unsafe locking scenario: [ 75.308264] CPU0 CPU1 [ 75.308265] ---- ---- [ 75.308265] lock(&hdev->lock); [ 75.308267] lock(sk_lock- AF_BLUETOOTH-BTPROTO_ISO); [ 75.308268] lock(&hdev->lock); [ 75.308269] lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO); [ 75.308270] *** DEADLOCK *** [ 75.308271] 4 locks held by kworker/u81:2/2623: [ 75.308272] #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work+0x443/0x740 [ 75.308276] #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)), at: process_one_work+0x1ce/0x740 [ 75.308280] #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3} at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth] [ 75.308304] #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2}, at: hci_connect_cfm+0x29/0x190 [bluetooth] Fixes: 02171da ("Bluetooth: ISO: Add hcon for listening bis sk") Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
nvidia-bfigg
pushed a commit
that referenced
this pull request
Jan 18, 2025
syzkaller reported such a BUG_ON(): ------------[ cut here ]------------ kernel BUG at mm/khugepaged.c:1835! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP ... CPU: 6 UID: 0 PID: 8009 Comm: syz.15.106 Kdump: loaded Tainted: G W 6.13.0-rc6 #22 Tainted: [W]=WARN Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : collapse_file+0xa44/0x1400 lr : collapse_file+0x88/0x1400 sp : ffff80008afe3a60 ... Call trace: collapse_file+0xa44/0x1400 (P) hpage_collapse_scan_file+0x278/0x400 madvise_collapse+0x1bc/0x678 madvise_vma_behavior+0x32c/0x448 madvise_walk_vmas.constprop.0+0xbc/0x140 do_madvise.part.0+0xdc/0x2c8 __arm64_sys_madvise+0x68/0x88 invoke_syscall+0x50/0x120 el0_svc_common.constprop.0+0xc8/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x34/0x128 el0t_64_sync_handler+0xc8/0xd0 el0t_64_sync+0x190/0x198 This indicates that the pgoff is unaligned. After analysis, I confirm the vma is mapped to /dev/zero. Such a vma certainly has vm_file, but it is set to anonymous by mmap_zero(). So even if it's mmapped by 2m-unaligned, it can pass the check in thp_vma_allowable_order() as it is an anonymous-mmap, but then be collapsed as a file-mmap. It seems the problem has existed for a long time, but actually, since we have khugepaged_max_ptes_none check before, we will skip collapse it as it is /dev/zero and so has no present page. But commit d8ea7cc limit the check for only khugepaged, so the BUG_ON() can be triggered by madvise_collapse(). Add vma_is_anonymous() check to make such vma be processed by hpage_collapse_scan_pmd(). Link: https://lkml.kernel.org/r/20250111034511.2223353-1-liushixin2@huawei.com Fixes: d8ea7cc ("mm/khugepaged: add flag to predicate khugepaged-only behavior") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mattew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
… a pasid support
When an iommu_domain is set to IOMMU_DOMAIN_IDENTITY, the driver would skip the allocation of a CD table and set the CONFIG field of the STE to STRTAB_STE_0_CFG_BYPASS. This works well for devices that only have one substream, i.e. PASID disabled.
However, there could be a use case, for a pasid capable device, that allows bypassing the translation at the default substream while still enabling the pasid feature, which means the driver should not skip the allocation of a CD table nor simply bypass the CONFIG field. Instead, the S1DSS field should be set to STRTAB_STE_1_S1DSS_BYPASS and the SHCFG field should be set to STRTAB_STE_1_SHCFG_INCOMING.
Add s1dss in struct arm_smmu_s1_cfg, to allow a configuration in the finalise() to support this use case.
Also, according to "13.5 Summary of attribute/permission configuration fields" in the reference manual, the SHCFG field value is irrelevant. So, set the SHCFG field of the STE always to STRTAB_STE_1_SHCFG_INCOMING for simplification.
Reviewed-by: Pritesh Raithatha praithatha@nvidia.com
Tested-by: Matt Ochs mochs@nvidia.com