Fix underflow in Triton's highestPowOf2Divisor function when the input is INT_MIN #11

Moerafaat · 2024-08-23T10:33:20Z

The `highestPowOf2Divisor` function defined at https://github.com/triton-lang/triton/blob/8e63999da7ef87fc6ac908f26a9b9c05ce85ab70/include/triton/Dialect/Triton/IR/Utility.h#L33 runs during analysis when the coalesce pass is invoked. It underflows when the input is `INT_MIN`, and therefore fails when run under ASAN.

This change handles this edge case and adds a minimal lit test to make sure the case is covered.
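For readers who want to see the edge case concretely, here is a minimal sketch of a guarded power-of-two-divisor helper. It is not the code from this PR: the name `highestPowOf2DivisorSketch`, the bit trick `n & ~(n - 1)`, and the fallback value returned for `0` and for the minimum value are all assumptions made for illustration.

```cpp
#include <cassert>
#include <climits>
#include <limits>
#include <type_traits>

// Hypothetical helper (not Triton's actual implementation): returns the
// largest power of two that divides n, for a signed integer type T.
template <typename T>
T highestPowOf2DivisorSketch(T n) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "this sketch only covers signed integer types");
  // Assumed fallback: a large power of two that is still representable in T.
  const T fallback = static_cast<T>(1) << (sizeof(T) * CHAR_BIT - 2);
  // n == 0 is divisible by every power of two. For n == min, the expression
  // (n - 1) below would underflow, and the true divisor (-n) does not fit
  // in T, so both cases return the fallback instead.
  if (n == 0 || n == std::numeric_limits<T>::min())
    return fallback;
  // Isolate the lowest set bit of n, which is its largest power-of-two divisor.
  return n & ~(n - 1);
}

// Small usage example exercising ordinary inputs and both edge cases.
inline void runHighestPowOf2DivisorExamples() {
  assert(highestPowOf2DivisorSketch(12) == 4);
  assert(highestPowOf2DivisorSketch(-8) == 8);
  const int fallback = 1 << (sizeof(int) * CHAR_BIT - 2);
  assert(highestPowOf2DivisorSketch(0) == fallback);
  assert(highestPowOf2DivisorSketch(std::numeric_limits<int>::min()) == fallback);
}
```

The guard is placed before the subtraction so that the underflowing expression is never evaluated for the minimum value.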

Not sure whether this is worth merging. I was annoyed by the long sha256-based cache directory names, mostly 64 chars, so I quickly added shorter base64-based cache directory names. Instead of fixing a dozen places that use `hashlib.sha256`, I patched the cache manager. The 64-char names are mostly reduced to 43-44 chars.

A comparison:

```
> % ls -l $TRITON_CACHE_DIR
total 0
drwxr-xr-x 1 minjang users  40 Aug 21 19:02 44ae4aee7ef0ee0dd54e860cf44627e3b6cedabe87a228ac75988301b8a6bf60
drwxr-xr-x 1 minjang users  26 Aug 21 19:02 82dc2c9a5508bf07c72e02353c1e751dc54aae85666f139b2867b0a1e95e0e7b
drwxr-xr-x 1 minjang users 226 Aug 21 19:02 b8e240968a85711ba57b17bf8450f1ffbc85a8de8cd1f47aa87b241b53f9bf60
drwxr-xr-x 1 minjang users  26 Aug 21 19:03 gtwsmlUIvwfHLgI1PB51HcVKroVmbxObKGewoeleDns
drwxr-xr-x 1 minjang users  40 Aug 21 19:03 RK5K7n7w7g3VToYM9EYn47bO2r6HoiisdZiDAbimv2A
drwxr-xr-x 1 minjang users 226 Aug 21 19:03 uOJAloqFcRulexe_hFDx_7yFqN6M0fR6qHskG1P5v2A
```

`test_core.py` runs without any errors, and the cache directory has all base64-based shorter names.
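The 64 versus 43-44 character figures follow from the encoding arithmetic. A short worked check, assuming the directory name encodes a raw 32-byte SHA-256 digest with base64 (the 43-character case being the variant with the trailing `=` padding stripped):

```math
\begin{aligned}
\text{digest size} &= 256 / 8 = 32 \text{ bytes} \\
\text{hex length} &= 2 \times 32 = 64 \text{ characters} \\
\text{base64 length (padded)} &= 4 \times \lceil 32 / 3 \rceil = 4 \times 11 = 44 \text{ characters} \\
\text{base64 length (unpadded)} &= \lceil (32 \times 8) / 6 \rceil = \lceil 42.67 \rceil = 43 \text{ characters}
\end{aligned}
```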

The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [ ] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `PR merely modifies the implementation of the function, removing the superfluous parts.`.
- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

…ng#4539) This is motivated by triton-lang#4509. The crux of the problem is that the Triton code generator needs to inspect a function's arguments / attributes / types in order to determine how it should be called. This meant that "implementation details" like whether a function is a builtin needed to be exposed in the "interface" `tl.extra.libdevice` module, instead of just residing in `tl.extra.cuda.libdevice`. Moreover, this meant that libdevice functions marked as `@core.extern` in the interface could not be implemented via JitFunctions. Allowing each backend to provide its own module map solves this problem as the code generator can inspect the actual function implementation.

Updated the firmware after upgrading the kernel-mode driver to the newer version 6.2.0. Ran test_randn and test_ptx_cast, the usual reproducers, for an hour and no longer saw any failures.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

chsigg

LGTM

1. Fixed the errors in the formulas in the comments.
2. When assert is enabled in DEBUG mode, using the list type reports an error. Additionally, the Tensor returned by the internal interface already guarantees that the shape is a list, so no check is required.

The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because it has no effect on working code.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

Co-authored-by: 谢双镱 <xieshuangyi@bytedance.com>

…ay escape out of the loop (triton-lang#4567) Merging 7c90081: '[SCF][PIPELINE] Handle the case when values from the peeled prologue may escape out of the loop' from the upstream LLVM into Triton Pipeline Expander "Previously the values in the peeled prologue that weren't treated with the predicateFn were passed to the loop body without any other predication. If those values are later used outside of the loop body, they may be incorrect if the num iterations is smaller than num stages - 1. We need similar masking for those, as is done in the main loop body, using already existing predicates."

…ng#4255) PyTorch has several ways to load the libamdhip64.so library, and the original logic has problems detecting this shared object file because it can be placed at any location via `venv`/`DT_RUNPATH`/`LD_LIBRARY_PATH`/`/etc/ld.so.conf.d/*.conf`/`LD_PRELOAD`. Notably, `RUNPATH` has become the preferred approach for multi-version ROCm installations. This patch lets Triton enumerate the address space with `dl_iterate_phdr` and choose the libamdhip64.so that is already loaded into the address space, unless `TRITON_LIBHIP_PATH` is already set.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
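The lookup order described above can be sketched roughly as follows. This is an illustration for Linux/glibc only, not the code from the patch: `listLoadedObjects` and `findLoadedLibrary` are hypothetical helper names, and matching on the file-name prefix is an assumption. Only `dl_iterate_phdr` and `struct dl_phdr_info` from `<link.h>` are real APIs.

```cpp
#include <link.h>  // dl_iterate_phdr, struct dl_phdr_info (Linux/glibc)

#include <cstddef>
#include <string>
#include <vector>

// Callback run once per object mapped into the process. It appends every
// non-empty object path to the vector passed through `data`.
static int collectObjectName(struct dl_phdr_info *info, size_t /*size*/,
                             void *data) {
  auto *names = static_cast<std::vector<std::string> *>(data);
  if (info->dlpi_name != nullptr && info->dlpi_name[0] != '\0')
    names->emplace_back(info->dlpi_name);
  return 0;  // returning 0 tells dl_iterate_phdr to keep iterating
}

// Hypothetical helper: paths of all shared objects currently loaded.
std::vector<std::string> listLoadedObjects() {
  std::vector<std::string> names;
  dl_iterate_phdr(collectObjectName, &names);
  return names;
}

// Hypothetical helper: an explicit override path wins; otherwise return the
// first loaded object whose file name starts with `fileName` (so that a
// versioned name such as "libfoo.so.6" also matches "libfoo.so").
// Returns an empty string when nothing matches.
std::string findLoadedLibrary(const std::string &fileName,
                              const char *overridePath) {
  if (overridePath != nullptr && overridePath[0] != '\0')
    return overridePath;
  for (const std::string &path : listLoadedObjects()) {
    const std::size_t slash = path.rfind('/');
    const std::string base =
        slash == std::string::npos ? path : path.substr(slash + 1);
    if (base.compare(0, fileName.size(), fileName) == 0)
      return path;
  }
  return "";
}
```

A caller could pass `std::getenv("TRITON_LIBHIP_PATH")` as the override and `"libamdhip64.so"` as the file name, falling back to another search strategy when the result is empty.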

Cast dot arguments from unsupported FP8 to supported FP16 in order to use MFMA instructions instead of FMA. This approach is expected to give better performance and be more stable than the FMA implementation.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

Proposing this change because we use these tests for the CPU backend here: https://github.com/microsoft/triton-shared/ and these clauses lead to wrong assumptions about supported features.

The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `this is a fix to tests`.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

Co-authored-by: Renat Idrisov <parsifal-47@users.noreply.github.com>

…ing input and output dimensions (triton-lang#4530) Before this patch, if `a * b = c`, `c.divideRight(b)` might return `nullopt` even if `a' * b = c`, where `a'` is the potential result of `divideRight`. This PR addresses the issue by conservatively removing input and output dimensions, ensuring that the division returns a non-nullopt result when a valid solution exists. However, it does not guarantee that `a` and `a'` will have identical input and output dimensions. In addition, this PR also fixes a bug in `TritonGPURemoveLayoutConversionsPass`. The backward slice should be continued when encountering a free conversion. This includes cases where `c.divideRight(b)` results in a layout that only permutes register values within individual threads. --------- Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>

… 1 (triton-lang#4579) If the subview for the mma operand is in the previous iteration of the loop and feeds directly into the loop yield, the code that looks for it would skip it. This PR fixes that issue, allowing the subview to be found and enabling pipelining in some cases that were previously skipped.

…ng#4582) When `other` is provided, we should use it to initialize the register before doing the load instead of initializing the register with 0. Note that this adds a scoreboard dependency between the definition of `other` and the load, but the user can remove it by using a select if `other` comes from a high-latency op.

The command in the README isn't correct. I eventually found the compile commands with `find python/build -name 'compile_commands.json' | xargs readlink -f`. Although this README improvement is trivial, I think it helps newcomers get through the environment setup more easily.

---------

Signed-off-by: jayzhan <jayzhan211@gmail.com>

) This PR adds verbosity to assembly code after LLVM backend passes. This adds references to the source code for both NV and AMD. Additionally, it adds `Kernel Info` at the end of the dump for AMD. For example:

```
; Kernel info:
; codeLenInByte = 7732
; NumSgprs: 24
; NumVgprs: 154
; NumAgprs: 128
; TotalNumVgprs: 284
...
```

…s8->bf16 conversions (triton-lang#4563) Hopper has very low throughput for conversion instructions, which causes these operations to quickly become an ALU bottleneck. Restating the conversion in terms of bitwise ops and SIMD bf16 instructions increases the throughput significantly and translates to meaningful speedups (e.g. 10% end-to-end on one matmul I was looking at).

Co-authored-by: Adam Paszke <adam.paszke@gmail.com>

minjang and others added 5 commits August 22, 2024 07:52

[TEST] Use device fixture for test_math_extern (triton-lang#4558)

f210090

Missed this one test the last time.

Moerafaat requested review from gflegar and chsigg August 23, 2024 10:33

chsigg reviewed Aug 23, 2024

Mwsxy and others added 21 commits August 23, 2024 13:30

[Proton] Add a better description when possibly importing incorrect h…

d14f59a

…atchet in Proton (triton-lang#4565)

[BACKEND] Add a knob to fall back to the legacy mma layout conversion…

8c03e46

… code (triton-lang#4561) The knob is only for debugging purposes and should be removed once we've figured out the issue

[FRONTEND] Print full file name when overriding kernel (triton-lang#4566

89ecee2

)

[Proton] Move additional hatchet import into try/except check (triton…

1b095f0

…-lang#4568)

[BACKEND] Continue the backward slice when finding free convert (trit…

2d38ffa

…on-lang#4571) Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>

[CODEGEN] Support CUDA 12.6 (triton-lang#4588)

f48dbc1

According to the [table](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-notes-ptx-release-history), both CUDA 12.5 and 12.6 use PTX ISA 8.5

[AMD] Support emit indices logic WMMAv2 layout (triton-lang#4518)

e88a7fe

- Supported linear layout for the 2nd version of WMMA
- Supported legacy emit indices related helper functions for now
- Removed remaining assertions related to WMMAv2

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>

[SWP] move schedulePrologueAndEpilogue to be before lowering (triton-…

b2c5d36

…lang#4584) Get ready to separate scheduling out of SWP, so we can move scheduleLoads and schedulePrologueAndEpilogue to a separate scheduling pass. Lowering will happen after inside SWP.

Emit remarks for SWP and vectorization failures (triton-lang#4350)

cf696d4

Dump warning if SWP fails in the inner loop and dump option is enabled in the CL. --------- Co-authored-by: Zeng Wu <zengwu@fb.com>

Fix underflow in Triton's highestPowOf2Divisor function when the inpu…

5d6033c

…t is INT_MIN

Moerafaat force-pushed the export_cl666302296 branch from 061c945 to 5d6033c Compare August 28, 2024 15:28

Merge branch 'main' into export_cl666302296

2c9072f

gflegar approved these changes Sep 2, 2024

Moerafaat closed this Sep 20, 2024
