Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix underflow in Triton's highestPowOf2Divisor function when the input is INT_MIN #11

Closed
wants to merge 27 commits into from

Conversation

Moerafaat
Copy link
Member

highestPowOf2Divisor function here https://github.com/triton-lang/triton/blob/8e63999da7ef87fc6ac908f26a9b9c05ce85ab70/include/triton/Dialect/Triton/IR/Utility.h#L33 will run during analysis when coalesce pass is invoked. This function will underflow if the given input is INT_MIN and will fail when run under ASAN.

This change handles this edge case and adds a minimal lit test to make sure the case is covered.

minjang and others added 5 commits August 22, 2024 07:52
Not sure this is worthy to make it? I was annoyed by long sha256-based
cache directory names, mostly 64 chars. So I quickly added base64-based
shorter cache directory names.

Instead of fixing a dozen places that use `hashlib.sha256`, I patched
the cache manager. 64-char names are mostly reduced to 43-44 chars.

A comparison:
```
> % ls -l $TRITON_CACHE_DIR
total 0
drwxr-xr-x 1 minjang users  40 Aug 21 19:02 44ae4aee7ef0ee0dd54e860cf44627e3b6cedabe87a228ac75988301b8a6bf60
drwxr-xr-x 1 minjang users  26 Aug 21 19:02 82dc2c9a5508bf07c72e02353c1e751dc54aae85666f139b2867b0a1e95e0e7b
drwxr-xr-x 1 minjang users 226 Aug 21 19:02 b8e240968a85711ba57b17bf8450f1ffbc85a8de8cd1f47aa87b241b53f9bf60
drwxr-xr-x 1 minjang users  26 Aug 21 19:03 gtwsmlUIvwfHLgI1PB51HcVKroVmbxObKGewoeleDns
drwxr-xr-x 1 minjang users  40 Aug 21 19:03 RK5K7n7w7g3VToYM9EYn47bO2r6HoiisdZiDAbimv2A
drwxr-xr-x 1 minjang users 226 Aug 21 19:03 uOJAloqFcRulexe_hFDx_7yFqN6M0fR6qHskG1P5v2A
```

`test_core.py` runs without any errors, and the cache directory has all
base64-based shorter names.
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.

- [x ] I am not making a trivial change, such as fixing a typo in a
comment.

- [] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x ] This PR does not need a test because `PR merely modifies the
implementation of the function, removing the superfluous parts.`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
…ng#4539)

This is motivated by triton-lang#4509. The crux of the problem is that the Triton
code generator needs to inspect a function's arguments / attributes /
types in order to determine how it should be called. This meant that
"implementation details" like whether a function is a builtin needed to
be exposed in the "interface" `tl.extra.libdevice` module, instead of
just residing in `tl.extra.cuda.libdevice`. Moreover, this meant that
libdevice functions marked as `@core.extern` in the interface could not
be implemented via JitFunctions.

Allowing each backend to provide its own module map solves this problem
as the code generator can inspect the actual function implementation.
Firmware updated after upgrading kernel-mode driver to
the newer version 6.2.0.
Tested test_randn and test_ptx_cast for an hour as a
usual reproducer and no longer shows any failure.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Copy link
Member

@chsigg chsigg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Mwsxy and others added 21 commits August 23, 2024 13:30
1. Fixed the errors in the formulas in the comments.
2. When assert is enabled in DEBUG mode, using the list type will report
an error. Additionally, the Tensor returned by the internal interface
can ensure that the shape is a list and does not require checking.


The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because it does not take any effect
to working code.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

Co-authored-by: 谢双镱 <xieshuangyi@bytedance.com>
… code (triton-lang#4561)

The knob is only for debugging purposes and should be removed once we've
figured out the issue
…ay escape out of the loop (triton-lang#4567)

Merging 7c90081: '[SCF][PIPELINE] Handle the case when values from the
peeled prologue may escape out of the loop' from the upstream LLVM into
Triton Pipeline Expander

"Previously the values in the peeled prologue that weren't treated with
the predicateFn were passed to the loop body without any other
predication. If those values are later used outside of the loop body,
they may be incorrect if the num iterations is smaller than num stages -
1. We need similar masking for those, as is done in the main loop body,
using already existing predicates."
…ng#4255)

PyTorch have ways to load the libamdhip64.so library, and the original
logic has problems detecting this shared object file because it can be
placed at any location with `venv`/`DT_RUNPATH`/`LD_LIBRARY_PATH`/
`/etc/ld.so.conf.d/*.conf`/`LD_PRELOAD`. Notably the `RUNPATH`
has become the preferred approach for multi-version ROCm installations.

This patch let Triton enumerate the address space with
`dl_iterate_phdr`, and choose the libamdhip64.so that is already loaded
into address space unless `TRITON_LIBHIP_PATH` is already set.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Cast dot arguments from unsupported FP8 to supported FP16 in order to
use MFMA instructions instead of FMA.
This approach is expected to give better performance and be more stable
compared to FMA implementation.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Proposing this change because we use these tests for CPU backend here:
https://github.com/microsoft/triton-shared/ and these clauses lead to
wrong assumptions about supported features.

The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `this is a fix to tests`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

Co-authored-by: Renat Idrisov <parsifal-47@users.noreply.github.com>
…ing input and output dimensions (triton-lang#4530)

Before this patch, if `a * b = c`, `c.divideRight(b)` might return
`nullopt` even if `a' * b = c`, where `a'` is the potential result of
`divideRight`.
This PR addresses the issue by conservatively removing input and output
dimensions, ensuring that the division returns a non-nullopt result when
a valid solution exists.
However, it does not guarantee that `a` and `a'` will have identical
input and output dimensions.

In addition, this PR also fixes a bug in
`TritonGPURemoveLayoutConversionsPass`. The backward slice should be
continued when encountering a free conversion. This includes cases where
`c.divideRight(b)` results in a layout that only permutes register
values within individual threads.

---------

Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
… 1 (triton-lang#4579)

If the subview for the mma operand is in the previous iteration of the
loop, it would be skipped by the code that looks for it, if it feeds
directly into the loop yield. This PR fixes that issue allowing for them
subview to be found and enabling the pipelining in some cases skipped
previously.
…ng#4582)

When `other` is there we should use it to initalize the reg before doing
the load instead of initializing the reg with 0.

Note that this does add a scoreboard dependency between the `other` def
and the load but user can remove it by using a select if other comes
from a high latency op.
The command in readme isn't correct, I finally come out to get the
compile command with `find python/build -name 'compile_commands.json' |
xargs readlink -f`. I think although the improvement in readme is
trivial but it is helpful for other new comer go through the environment
setup easier.

---------

Signed-off-by: jayzhan <jayzhan211@gmail.com>
)

This PR adds verbosity to assembly code after LLVM backend passes.

This adds references to the source code for both NV and AMD.
Additionally, it adds `Kernel Info` at the end of the dump for AMD.
For example:

```
; Kernel info:                                                                                                                                                                                                
; codeLenInByte = 7732                                                                                                                                                                                        
; NumSgprs: 24                                                                                                                                                                                                
; NumVgprs: 154                                                                                                                                                                                               
; NumAgprs: 128                                                                                                                                                                                               
; TotalNumVgprs: 284                                                                                                                                                                                          
...
```
- Supported linear layout for the 2nd version of WMMA
- Supported legacy emit indices related helper functions for now
- Removed remaining assertions related to WMMAv2

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
…lang#4584)

Get ready to separate scheduling out of SWP, so we can move
scheduleLoads and schedulePrologueAndEpilogue to a separate scheduling
pass. Lowering will happen after inside SWP.
Dump warning if SWP fails in the inner loop and dump option is enabled
in the CL.

---------

Co-authored-by: Zeng Wu <zengwu@fb.com>
…s8->bf16 conversions (triton-lang#4563)

Hopper has very low throughput of conversion instructions that cause
this operations to quickly become an ALU bottleneck. Restating it in
terms of bitwise ops and SIMD bf16 instructions increases the throughput
significantly and translates to meaningful speedups (e.g. 10% end-to-end
on one matmul I was looking at).

Co-authored-by: Adam Paszke <adam.paszke@gmail.com>
@Moerafaat Moerafaat closed this Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.