forked from triton-lang/triton
-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix underflow in Triton's highestPowOf2Divisor function when the input is INT_MIN #11
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Not sure this is worthy to make it? I was annoyed by long sha256-based cache directory names, mostly 64 chars. So I quickly added base64-based shorter cache directory names. Instead of fixing a dozen places that use `hashlib.sha256`, I patched the cache manager. 64-char names are mostly reduced to 43-44 chars. A comparison: ``` > % ls -l $TRITON_CACHE_DIR total 0 drwxr-xr-x 1 minjang users 40 Aug 21 19:02 44ae4aee7ef0ee0dd54e860cf44627e3b6cedabe87a228ac75988301b8a6bf60 drwxr-xr-x 1 minjang users 26 Aug 21 19:02 82dc2c9a5508bf07c72e02353c1e751dc54aae85666f139b2867b0a1e95e0e7b drwxr-xr-x 1 minjang users 226 Aug 21 19:02 b8e240968a85711ba57b17bf8450f1ffbc85a8de8cd1f47aa87b241b53f9bf60 drwxr-xr-x 1 minjang users 26 Aug 21 19:03 gtwsmlUIvwfHLgI1PB51HcVKroVmbxObKGewoeleDns drwxr-xr-x 1 minjang users 40 Aug 21 19:03 RK5K7n7w7g3VToYM9EYn47bO2r6HoiisdZiDAbimv2A drwxr-xr-x 1 minjang users 226 Aug 21 19:03 uOJAloqFcRulexe_hFDx_7yFqN6M0fR6qHskG1P5v2A ``` `test_core.py` runs without any errors, and the cache directory has all base64-based shorter names.
The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. - [x ] I am not making a trivial change, such as fixing a typo in a comment. - [] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x ] This PR does not need a test because `PR merely modifies the implementation of the function, removing the superfluous parts.`. - Select one of the following. - [ ] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
…ng#4539) This is motivated by triton-lang#4509. The crux of the problem is that the Triton code generator needs to inspect a function's arguments / attributes / types in order to determine how it should be called. This meant that "implementation details" like whether a function is a builtin needed to be exposed in the "interface" `tl.extra.libdevice` module, instead of just residing in `tl.extra.cuda.libdevice`. Moreover, this meant that libdevice functions marked as `@core.extern` in the interface could not be implemented via JitFunctions. Allowing each backend to provide its own module map solves this problem as the code generator can inspect the actual function implementation.
Firmware updated after upgrading kernel-mode driver to the newer version 6.2.0. Tested test_randn and test_ptx_cast for an hour as a usual reproducer and no longer shows any failure. --------- Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Missed this one test the last time.
chsigg
reviewed
Aug 23, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1. Fixed the errors in the formulas in the comments. 2. When assert is enabled in DEBUG mode, using the list type will report an error. Additionally, the Tensor returned by the internal interface can ensure that the shape is a list and does not require checking. The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x] This PR does not need a test because it does not take any effect to working code. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Co-authored-by: 谢双镱 <xieshuangyi@bytedance.com>
… code (triton-lang#4561) The knob is only for debugging purposes and should be removed once we've figured out the issue
…ay escape out of the loop (triton-lang#4567) Merging 7c90081: '[SCF][PIPELINE] Handle the case when values from the peeled prologue may escape out of the loop' from the upstream LLVM into Triton Pipeline Expander "Previously the values in the peeled prologue that weren't treated with the predicateFn were passed to the loop body without any other predication. If those values are later used outside of the loop body, they may be incorrect if the num iterations is smaller than num stages - 1. We need similar masking for those, as is done in the main loop body, using already existing predicates."
…ng#4255) PyTorch have ways to load the libamdhip64.so library, and the original logic has problems detecting this shared object file because it can be placed at any location with `venv`/`DT_RUNPATH`/`LD_LIBRARY_PATH`/ `/etc/ld.so.conf.d/*.conf`/`LD_PRELOAD`. Notably the `RUNPATH` has become the preferred approach for multi-version ROCm installations. This patch let Triton enumerate the address space with `dl_iterate_phdr`, and choose the libamdhip64.so that is already loaded into address space unless `TRITON_LIBHIP_PATH` is already set. --------- Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Cast dot arguments from unsupported FP8 to supported FP16 in order to use MFMA instructions instead of FMA. This approach is expected to give better performance and be more stable compared to FMA implementation. --------- Co-authored-by: Lei Zhang <antiagainst@gmail.com>
…on-lang#4571) Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Proposing this change because we use these tests for CPU backend here: https://github.com/microsoft/triton-shared/ and these clauses lead to wrong assumptions about supported features. The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x] This PR does not need a test because `this is a fix to tests`. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Co-authored-by: Renat Idrisov <parsifal-47@users.noreply.github.com>
…ing input and output dimensions (triton-lang#4530) Before this patch, if `a * b = c`, `c.divideRight(b)` might return `nullopt` even if `a' * b = c`, where `a'` is the potential result of `divideRight`. This PR addresses the issue by conservatively removing input and output dimensions, ensuring that the division returns a non-nullopt result when a valid solution exists. However, it does not guarantee that `a` and `a'` will have identical input and output dimensions. In addition, this PR also fixes a bug in `TritonGPURemoveLayoutConversionsPass`. The backward slice should be continued when encountering a free conversion. This includes cases where `c.divideRight(b)` results in a layout that only permutes register values within individual threads. --------- Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
… 1 (triton-lang#4579) If the subview for the mma operand is in the previous iteration of the loop, it would be skipped by the code that looks for it, if it feeds directly into the loop yield. This PR fixes that issue allowing for them subview to be found and enabling the pipelining in some cases skipped previously.
…ng#4582) When `other` is there we should use it to initalize the reg before doing the load instead of initializing the reg with 0. Note that this does add a scoreboard dependency between the `other` def and the load but user can remove it by using a select if other comes from a high latency op.
The command in readme isn't correct, I finally come out to get the compile command with `find python/build -name 'compile_commands.json' | xargs readlink -f`. I think although the improvement in readme is trivial but it is helpful for other new comer go through the environment setup easier. --------- Signed-off-by: jayzhan <jayzhan211@gmail.com>
) This PR adds verbosity to assembly code after LLVM backend passes. This adds references to the source code for both NV and AMD. Additionally, it adds `Kernel Info` at the end of the dump for AMD. For example: ``` ; Kernel info: ; codeLenInByte = 7732 ; NumSgprs: 24 ; NumVgprs: 154 ; NumAgprs: 128 ; TotalNumVgprs: 284 ... ```
According to the [table](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-notes-ptx-release-history), both CUDA 12.5 and 12.6 use PTX ISA 8.5
- Supported linear layout for the 2nd version of WMMA - Supported legacy emit indices related helper functions for now - Removed remaining assertions related to WMMAv2 Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
…lang#4584) Get ready to separate scheduling out of SWP, so we can move scheduleLoads and schedulePrologueAndEpilogue to a separate scheduling pass. Lowering will happen after inside SWP.
Dump warning if SWP fails in the inner loop and dump option is enabled in the CL. --------- Co-authored-by: Zeng Wu <zengwu@fb.com>
…s8->bf16 conversions (triton-lang#4563) Hopper has very low throughput of conversion instructions that cause this operations to quickly become an ALU bottleneck. Restating it in terms of bitwise ops and SIMD bf16 instructions increases the throughput significantly and translates to meaningful speedups (e.g. 10% end-to-end on one matmul I was looking at). Co-authored-by: Adam Paszke <adam.paszke@gmail.com>
Moerafaat
force-pushed
the
export_cl666302296
branch
from
August 28, 2024 15:28
061c945
to
5d6033c
Compare
gflegar
approved these changes
Sep 2, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
highestPowOf2Divisor function here https://github.com/triton-lang/triton/blob/8e63999da7ef87fc6ac908f26a9b9c05ce85ab70/include/triton/Dialect/Triton/IR/Utility.h#L33 will run during analysis when coalesce pass is invoked. This function will underflow if the given input is INT_MIN and will fail when run under ASAN.
This change handles this edge case and adds a minimal lit test to make sure the case is covered.