ttnn.repeat throws runtime memory error #6361
Comments
Updating this issue from chat: I think the args are this large because they're trying to repeat the tensor 2048 times, and repeat is implemented under the hood as concat. Concat works on a variable number of tensor inputs, but each tensor requires 4 args, so repeat is probably a naive wrapper that calls concat with 2048 copies of the same tensor, which results in 32 KB of args. An actual repeat op that works similarly to concat would only need 4 args plus a few others, instead of 4 * the number of tensors/repeats.
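For illustration, a small torch-only sketch of the two points above: repeating along a dim is functionally a concat of N copies, and at roughly 4 runtime args per copy, 2048 copies blow past the reported ~1 KB budget (the 4-args-per-copy and 4-byte-per-arg figures are assumptions taken from this comment, not measured values):

```python
import torch

# Functionally, repeating along the last dim equals concatenating n copies,
# which is why a naive repeat wrapper can simply call concat.
x = torch.randn(1, 1, 32, 16)
n = 2048
assert torch.equal(x.repeat(1, 1, 1, n), torch.cat([x] * n, dim=-1))

# Rough runtime-arg budget: ~4 uint32 args per input tensor times 2048 copies
# is ~32 KB, versus the ~1 KB limit reported in the error message.
args_per_tensor = 4
bytes_per_arg = 4
print(args_per_tensor * n * bytes_per_arg // 1024, "KB")  # -> 32 KB
```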
@umadevimcw, the repeat OP is required for a performant Mamba model implementation, which we are targeting by the end of this month.

Requirement
An equivalent OP for torch.repeat
Non-performant workaround
I have created a transformation matrix to do an equivalent repeat OP, but it is a big matmul: (1, 1, 32, 16) @ (1, 1, 16, 16*2048). We want an implementation that is more performant than this matmul.

Unit test for above workaround
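The original unit test is not preserved in this thread; a minimal torch-only sketch of the transformation-matrix workaround and its check against torch.repeat might look like the following (the helper name repeat_last_dim_via_matmul is illustrative, not from the original test):

```python
import torch

def repeat_last_dim_via_matmul(x: torch.Tensor, num_repeats: int) -> torch.Tensor:
    # Build a (d, d * num_repeats) block matrix of horizontally tiled
    # identities, so x @ transform tiles x along the last dimension.
    d = x.shape[-1]
    transform = torch.eye(d, dtype=x.dtype).repeat(1, num_repeats)
    return x @ transform

x = torch.randn(1, 1, 32, 16)
expected = x.repeat(1, 1, 1, 2048)            # (1, 1, 32, 16*2048)
actual = repeat_last_dim_via_matmul(x, 2048)  # same shape via one big matmul
assert torch.allclose(actual, expected)
```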
For the ttnn.repeat bug, I found the issues below:
Answering @tt-aho's comment
@KalaivaniMCW
@kpaigwar Hello, just to be sure, did you mean to tag @KalaivaniMCW?
Summarizing the requirements
The restriction on shapes being tile sized is because the concat op does not natively support RM layout: it pads/formats to tile layout, runs concat, and unpads, which means it can't concat on dim 2/3 if there is padding on the same dim (there is no restriction on concat along other dims). The restriction on the number of inputs/args is because we have a limit on the number of runtime args (I think 256?), which results in an assert if exceeded. However, I am not sure what would be causing the hang/segmentation fault; I would need to look into this.

As for support for this op, there are 2 steps I see for supporting the interleaved case.
This would add basic interleaved support for RM concat, and a more optimal repeat (there are better optimizations that could be done for repeat, since it can have more efficient read/write logic compared to concat). However, since the target is sharding, I don't think it's useful to look into optimizing the interleaved version beyond basic support for debug unless it's needed elsewhere. @kpaigwar, do you have the specs for sharding for this op? Also, I just want to check: is this issue/request just a workaround for a temporary broadcast (similar to #5769), or is this an actual needed standalone op?
@tt-aho both issues are similar, but they differ in how the broadcasting happens. In this particular issue, the last dim
Okay, I think this makes sense. We can try and move forward with my suggestions above on implementing interleaved repeat to unblock functionality/basic perf. Once you have the shard specs, we can review the requirements for sharded repeat; otherwise, we can look into more perf optimizations for interleaved repeat.
Thanks @jliangTT
Update: We found that the workaround matmul op for torch.repeat is the bottleneck for our performance, taking 0.39 ms (12% of device time). See row 177 in the attached profiling sheet. We are hoping this issue can be prioritized.
You can try branch
Thanks, we will try that and update on the sharding performance.
This is now merged to main. Please confirm whether it addresses the issue, and follow @tarafdarTT's suggestion of opening a new PR once you have perf requirements/specifications.
@tt-aho I tested this on main, and I am getting shape mismatch errors on my example as well as on test_repeat.py (modified to reproduce the error).

Test
Error
I see. I'm not entirely familiar with why ttnn does additional reshaping under the hood and need to look into it, but my test with the tt_lib version passes for your shape case, so it might just be something in ttnn that needs updating.
I have a fix for the ttnn version in PR #6526. Note that for your specific case with input shape
@tt-aho I tested with your fix, and it worked for cases where the tensor doesn't have any padding. However, it failed on the following tensors
I will close this issue and address the padding issue in your PR #6526.
Describe the bug
When trying to expand a tensor of shape [1, 1, 32, 16] to [1, 1, 32, 16*2048] using ttnn.repeat, the kernel fails with the following error:

FATAL | 32KB runtime args targeting kernel reader_concat_interleaved_start_id on (x=0,y=0) are too large. Cannot be written as they will run into memory region reserved for result. Max allowable size is 1 KB.

However, the same output tensor can be successfully created using the ttnn.repeat_interleave API.

To Reproduce
Run the test code below in your tt-metal environment.
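The original test code is not preserved in this thread; a reproduction sketch along the lines described above might look like the following (ttnn.open_device, ttnn.from_torch, ttnn.repeat, ttnn.repeat_interleave, and ttnn.Shape are assumed here, and exact argument forms may differ across tt-metal versions):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

torch_input = torch.randn(1, 1, 32, 16, dtype=torch.bfloat16)
tt_input = ttnn.from_torch(torch_input, layout=ttnn.TILE_LAYOUT, device=device)

# Expands (1, 1, 32, 16) -> (1, 1, 32, 16*2048) along the last dim.
# On the affected commit, this fails with the 32KB runtime-args error above.
tt_repeated = ttnn.repeat(tt_input, ttnn.Shape((1, 1, 1, 2048)))

# Producing the same output shape via repeat_interleave succeeds.
tt_interleaved = ttnn.repeat_interleave(tt_input, 2048, dim=3)

ttnn.close_device(device)
```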
Expected behavior
If ttnn.repeat_interleave passes, then ttnn.repeat is also expected to work.
Screenshots
(Screenshot attached: "Screenshot 2024-03-13 at 11 57 20 AM".)