[RFC][Tracking Issue][AMP] Tracking Issue for Mixed Precision Pass #8296
Comments
cc @Lunderberg

cc @masahi
I've hit a nasty issue. On CPU targets, our sort-related ops are implemented in C++ (https://github.com/apache/tvm/blob/main/src/runtime/contrib/sort/sort.cc#L436), and they don't support fp16, so those ops break after conversion. Maybe we need to add a specialized CPU sort for fp16, or rewrite the CPU sort in TIR (the same issue would come up with int4, bfloat16, etc.). The former would not be hard, since we would just need to add a specialized comparison functor for fp16 like https://github.com/apache/tvm/blob/main/src/runtime/contrib/sort/sort.cc#L40-L43.
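Until the runtime is fixed, one workaround on the conversion side would be to mark sort-related ops so the pass leaves them in fp32. The sketch below is an untested illustration, not merged code: the op list, the `level=11` override, and the import paths are assumptions about the current registry API.

```python
# Sketch only: force sort-related ops to stay in fp32 so the C++ CPU sort
# kernels never see fp16 inputs. The op list and level=11 override are assumptions.
from tvm.relay.op import register_mixed_precision_conversion
from tvm.relay.transform.mixed_precision import MIXED_PRECISION_NEVER


def _keep_fp32(call_node, mixed_precision_type):
    # Never convert this op; keep accumulation and output dtypes in fp32.
    return [MIXED_PRECISION_NEVER, "float32", "float32"]


for op_name in ["argsort", "topk"]:  # illustrative list of sort-related ops
    register_mixed_precision_conversion(op_name, _keep_fp32, level=11)
```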
It looks like transformer-like models have many softmax ops, and the fact that softmax and the following cast to fp16 are not fused surprised me. This is because the op pattern for softmax is kOpaque (tvm/python/tvm/relay/op/nn/_nn.py, line 42 in 66ac470).
@yzhliu Is there a reason the softmax op pattern cannot be changed from kOpaque to a fusable pattern?
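For context, a hedged sketch of what the pattern change under discussion might look like; whether a fusable pattern is actually correct for softmax is exactly the open question, and overriding the built-in registration by re-registering at a higher level is an assumption.

```python
# Illustrative only: _nn.py currently registers softmax as OPAQUE, which blocks
# fusing the trailing cast. A hypothetical relaxation could look like this.
from tvm.relay.op import OpPattern, register_pattern

# Re-register at a higher level so it takes precedence over the built-in
# OPAQUE registration (override-by-level is assumed to work here).
register_pattern("nn.softmax", OpPattern.OUT_ELEMWISE_FUSABLE, level=11)
```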
@AndrewZhaoLuo What is our goal w.r.t. mixed-type accumulation? Assuming we do find cases where mixed accumulation is beneficial, how are we going to decide when to enable or disable it, given that currently we can only choose one or the other on a per-op basis? (tvm/python/tvm/relay/transform/mixed_precision.py, lines 167 to 168 in f4f525d)
Yeah, the issue with creating defaults is that we cannot create defaults that work best for every situation. This is especially true because whenever we want speed we trade accuracy, which can sometimes become a problem. For the defaults, I envision that most ops do not accumulate to FP32; for some ops, like the global pools and sums, we might turn it on. Really the best way to determine the criteria is to do a lot of the work you've been doing: trying out different models in different applications and seeing what needs to be turned on and off.

That being said, this is really designed to be a tool which sometimes requires the user to go back and modify the default values, to either get more speed if their model can afford it or more accuracy if they need it. It requires investigation, and I don't think we can hit all cases well. A tutorial here would help (which is on my long list of TODOs).

Finally, while things are done on a per-op basis, the mixed precision conversion function can look at parts of the Relay call, such as the node's attributes or the input tensor sizes. Therefore we can be smart about the conversion (e.g. for global pooling, only accumulate in fp32 if the input-to-output reduction is large enough). Again, a tutorial or example would help flesh this out.
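To make the "look at the call node" point concrete, here is a minimal sketch of that global pooling idea. It is not TVM's shipped default: the threshold, the assumed NCHW layout, the reliance on populated type information, and the level override are all assumptions.

```python
# Sketch of a per-call decision: accumulate nn.global_avg_pool2d in fp32 only
# when the spatial reduction is large enough. Threshold/layout/level are assumed.
from tvm.relay.op import register_mixed_precision_conversion
from tvm.relay.transform.mixed_precision import MIXED_PRECISION_ALWAYS


def _global_pool_conversion(call_node, mixed_precision_type):
    # Assumes type inference has run and shapes are static (NCHW layout).
    in_shape = call_node.args[0].checked_type.shape
    reduction = 1
    for dim in in_shape[2:]:  # spatial dimensions
        reduction *= int(dim)
    accum_dtype = "float32" if reduction >= 256 else mixed_precision_type
    return [MIXED_PRECISION_ALWAYS, accum_dtype, mixed_precision_type]


register_mixed_precision_conversion(
    "nn.global_avg_pool2d", _global_pool_conversion, level=11
)
```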
@AndrewZhaoLuo I briefly looked at bfloat16. While fp16 vs bf16 makes no difference for the conversion pass, it seems it is going to take a lot of effort to compile and run a bf16 model end to end, for at least two reasons:
Since Tensor Cores can natively run bf16 workloads at the same rate as fp16, and bf16 on x86 servers is becoming a thing, it would be nice to have good support for bf16 across the stack in the future.
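For what it's worth, invoking the conversion pass with bfloat16 is mechanically the same as with fp16. The sketch below uses a placeholder model with assumed shapes; it only exercises the Relay-level conversion, and compiling or running the result end to end is exactly the part the comment above says still needs work.

```python
# Sketch: run the mixed precision pass targeting bfloat16 on a toy conv2d.
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
out = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

mod = relay.transform.InferType()(mod)
mod = relay.transform.ToMixedPrecision(mixed_precision_type="bfloat16")(mod)
print(mod)  # casts and bf16 ops appear; downstream codegen support is separate
```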
This issue tracks work on supporting mixed precision within TVM.
RFC: apache/tvm-rfcs#6
Edge case ops:
Registering accumulation dtypes (out_dtypes) for the following types of ops:
Other discussions:
Tasks which may help:
Benchmarking improvements from pass: https://docs.google.com/spreadsheets/d/12lgyfuHaRS-X4uG-1iQOV8oAuPpuVAbspcmkOSPRFHQ/edit?usp=sharing