Add couple configs into generator.py for mixed input MM #1350
Conversation
What do you mean by symmetric (the same)? The Tensor Core `math_instruction` shape for both upcast_a and upcast_b is 16816. The supported `tile_description` (more precisely, the tile shape) may need to be different for upcast_a vs. upcast_b.
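For concreteness, here is a rough, illustrative sketch (not the actual generator.py contents) of how upcast_a / upcast_b entries could share the same 16x8x16 math instruction while keeping independent tile lists. Class and enum names follow the cutlass_library Python module; the import path, constructor signatures, and the particular tile/stage choices are assumptions that may differ between CUTLASS versions.

```python
# Illustrative only -- not the actual generator.py contents. Import path and
# constructor signatures are assumed from the cutlass_library Python module
# and may differ between CUTLASS versions.
from cutlass_library.library import (
    DataType, MathInstruction, MathOperation, OpcodeClass, TileDescription,
)

min_cc, max_cc = 80, 1024

# Both variants use the same 16x8x16 Tensor Core instruction shape; only the
# operand that gets upcast (A for upcast_a, B for upcast_b) differs.
math_inst_upcast_a = MathInstruction(
    [16, 8, 16], DataType.s8, DataType.f16, DataType.f32,
    OpcodeClass.TensorOp, MathOperation.multiply_add_mixed_input_upcast)

math_inst_upcast_b = MathInstruction(
    [16, 8, 16], DataType.f16, DataType.s8, DataType.f32,
    OpcodeClass.TensorOp, MathOperation.multiply_add_mixed_input_upcast)

# The tile lists do not have to be identical between the two variants: a
# threadblock shape that compiles for one operand order may not for the other.
tile_descriptions_upcast_a = [
    TileDescription([128, 128, 64], 4, [2, 2, 1], math_inst_upcast_a, min_cc, max_cc),
    TileDescription([128, 64, 64], 4, [2, 2, 1], math_inst_upcast_a, min_cc, max_cc),
]
tile_descriptions_upcast_b = [
    TileDescription([128, 128, 64], 4, [2, 2, 1], math_inst_upcast_b, min_cc, max_cc),
    TileDescription([64, 128, 64], 4, [2, 2, 1], math_inst_upcast_b, min_cc, max_cc),
]
```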
By symmetry, I meant on […]. As far as […].
For […]. For […].
Thanks for the clarification. I've updated […] so that matching […] produces at least one line of profiling output, which should mean that the kernel compiled and ran successfully. As a matter of fact, in these tests of mine […]
Asking again: how do I properly run verification after my changes?
cmake --no-warn-unused-cli -DCMAKE_BUILD_TYPE:STRING=Release -DCUTLASS_NVCC_ARCHS:STRING=80 -DCUTLASS_NVCC_KEEP:STRING=OFF -DCUTLASS_ENABLE_F16C:STRING=ON -DCUTLASS_LIBRARY_KERNELS:STRING=f16_s16816gemm_f16_s8_128x128_64x*,f16_s16816gemm_s8_f16_128x128_64x*,f16_s16816gemm_u8_f16_128x128_64x*,f16_s16816gemm_f16_u8_128x128_64x*,bf16_s16816gemm_bf16_s8_128x128_64x*,bf16_s16816gemm_s8_bf16_128x128_64x*,bf16_s16816gemm_bf16_u8_128x128_64x*,bf16_s16816gemm_u8_bf16_128x128_64x*,f16_s16816gemm_f16_128x128_64x*_tn_align8,bf16_s16816gemm_bf16_128x128_64x*_tn_align8 -DCUTLASS_LIBRARY_IGNORE_KERNELS:STRING=gemm_grouped*,gemm_planar* -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=TRUE -DCMAKE_C_COMPILER:FILEPATH=/usr/bin/gcc -DCMAKE_CXX_COMPILER:FILEPATH=/usr/bin/g++ -S/mnt/disks/gcloud_workspace/repos/cutlass/cutlass_tree_2/cutlass -B/mnt/disks/gcloud_workspace/repos/cutlass/cutlass_tree_2/build -G Ninja

The cmake flags to play with are […]
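After configuring and building with a kernel filter like the one above (e.g. building the `cutlass_profiler` target), one way to run verification is to invoke the profiler on the filtered kernels. The sketch below is a hypothetical wrapper, not part of the PR: the build-tree path, the profiler flags (`--kernels`, `--providers`, `--verification-providers`, `--output`), and the output file naming are assumptions that may vary by CUTLASS version.

```python
# Hypothetical helper (not from the PR). The profiler path, flags, and output
# naming are assumptions; adjust to your build tree and CUTLASS version.
import subprocess

PROFILER = "build/tools/profiler/cutlass_profiler"  # assumed Ninja build location

def run_verification(kernel_filter: str, m: int = 1024, n: int = 1024, k: int = 1024) -> str:
    """Profile a kernel subset with cuBLAS verification and dump a CSV report."""
    cmd = [
        PROFILER,
        f"--kernels={kernel_filter}",
        f"--m={m}", f"--n={n}", f"--k={k}",
        "--providers=cutlass",
        "--verification-providers=cublas",
        "--output=mixed_input_verify",  # the profiler may add a suffix such as .gemm.csv
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

if __name__ == "__main__":
    # One of the mixed-input kernel name patterns from the cmake filter above.
    print(run_verification("f16_s16816gemm_s8_f16_128x128_64x*"))
```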
Apologies for the delayed response. I have been OOO for the last few weeks.
Thanks for the clarifications. The PR is updated with the changes suggested: added […] number of tests, so that it should all be consistent now between tests, […]. Script used to validate that: […]
This PR has been labeled […].
@hwu36: Thanks for the test fix! The problem with the configurations added in your commit is that they won't work: one could, for example, change exactly the test you fixed to one of these configurations (I'm attaching an example in the patch below) and then do […].
(The reason I removed some configurations from the generator in my commit is exactly that I tried them all, and removed those that won't compile.)
I did not change the unit test. The reason the profiler cannot do 128x32 is epilogue alignment. I fixed that, so 128x32 is fine now.
By the test fix, I meant fixing the test name :-)
Ah, I see - indeed it works this way. So, do you think this PR could be merged now? And, unrelated: any chance of having PR 1190 reviewed by someone from the team? It has been stalled for a very long time, and this functionality would still be very useful (at least for PyTorch).
I will merge this PR after the 3.5.1 PR is merged. @manishucsd is reviewing PR 1190; that one changed the mainloop, so it will take a while.
Thank you for the change. Overall it looks good. Can you do the following? Dump the output CSV here and make sure all the mixed input variants […].
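For the CSV check requested above, a small sketch along these lines could be used to confirm that every mixed-input row verifies. The column names (`Operation`, `Disposition`) and the `passed` disposition value follow the cutlass_profiler CSV format I am assuming here; they may differ between CUTLASS versions, as may the output file name.

```python
# Sketch only: column names and disposition values are assumptions about the
# cutlass_profiler CSV format and may differ between CUTLASS versions.
import csv

def all_mixed_input_passed(csv_path: str) -> bool:
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Heuristic filter: mixed-input kernel names pair an 8-bit integer operand
    # (s8/u8) with a half-precision one (f16/bf16).
    mixed = [r for r in rows
             if any(t in r["Operation"] for t in ("s8", "u8"))
             and any(t in r["Operation"] for t in ("f16", "bf16"))]
    failures = [r["Operation"] for r in mixed if r["Disposition"] != "passed"]
    for name in failures:
        print("FAILED:", name)
    return bool(mixed) and not failures

if __name__ == "__main__":
    print(all_mixed_input_passed("mixed_input_verify.gemm.csv"))  # assumed file name
```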
@manishucsd, do you mean this PR or 1190? I did all the testing for this PR myself.
I meant this one. However, this one is adding only unit tests and more instances to the […]. It will be good to know the following two items: […]
I've built and run the profiler according to the instructions you provided above, and the […]
@manishucsd: Maybe you could review PR 1413 before 1190? PR 1413 is the third, and last, of my PRs awaiting review; the changes closely follow the int8/float16 functionality you implemented initially, and the only thing that may need changes is the conversion. So it should be possible to review/update/merge that one relatively quickly, and this particular mixed data-type combination is also highly requested (well, at least for PyTorch). PR 1190 will need more time. It's not ready for merging; it just has a couple of commits providing a proof of concept in slightly different ways, each with its own disadvantages. That one may need input from other members of the team, will certainly take more time to review, and will require additional work by me to complete.
Here is my run: […]
test/unit/gemm/device/CMakeLists.txt (outdated)
# Following test could be created by making a copy of
# gemm_universal_u8t_f16n_f16t_mixed_input_tensor_op_f32_sm80.cu,
# and then replacing occurrences of "half_t" in the code with
# "bfloat16_t". Such a test would fail, but with only a single value
# differing from the reference.
#gemm_universal_u8t_bf16n_bf16t_mixed_input_tensor_op_f32_sm80.cu
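As a convenience, the copy-and-substitute step described in the comment above could be scripted roughly like this; the file names are taken from the comment, and the working directory (test/unit/gemm/device/) is an assumption.

```python
# Sketch of the copy-and-replace step described in the CMakeLists comment above.
# File names come from that comment; run from test/unit/gemm/device/ (assumed).
from pathlib import Path

src = Path("gemm_universal_u8t_f16n_f16t_mixed_input_tensor_op_f32_sm80.cu")
dst = Path("gemm_universal_u8t_bf16n_bf16t_mixed_input_tensor_op_f32_sm80.cu")

# Replace every occurrence of half_t with bfloat16_t, as the comment suggests.
dst.write_text(src.read_text().replace("half_t", "bfloat16_t"))
print(f"wrote {dst}")
```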
Thank you for sharing this. There is nothing in the instance that I could point to as causing the issue. Can you try a smaller range for initializing operand B of BF16? The current range is from -5 to 5; can you try -3 to 3?
Here is mine: […] Edit: I re-built, without rebasing the branch on the latest main, and all rows in the .csv file still have […]
LGTM. @hwu36 and CUTLASS team, can you please merge this? cc: @alexsamardzic
Checking the status on this reviewed PR. Is this already merged?
It doesn't seem to be merged yet.
* Add couple configs into generator.py for mixed input MM
* Change one unit test name; reenable 128x32 in the profiler
* Added U8/BF16 tests.

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
I'm adding (PR here) CUTLASS kernels as an auto-tune option for the PyTorch compiler, and it would be nice to have these additional configurations available. This is not urgent, and more changes of this kind may be desired later, so if it's OK to make changes like this, this PR could be kept open for a while and I'll make further additions to it as needed.
@manishucsd: Would it make sense for `GenerateSM80_TensorOp_16816_mixed_input_upcast_a` and `GenerateSM80_TensorOp_16816_mixed_input_upcast_b` to be symmetric w.r.t. `math_instructions` and `tile_descriptions`? I can change it through this PR too.