gpu: nvidia: ip: adjust benchdnn error threshold #2479
base: main
@@ -215,6 +215,14 @@ limitations when using Nvidia backend for eltwise primitive:

The inner product primitive is an implementation of matrix multiplication plus
bias activation. There are two implementations of inner product in the cuDNN
backend.

With `sum` post-op, the accumulation mode attribute affects behaviour as
follows:
- `relaxed`: Uses GEMM's beta parameter for a fused, optimised sum post-op but
  may reduce output precision for large `f16` inputs.
- `strict` (default): Converts the GEMM output to `f32`, performs the sum as a
  separate operation, then converts it back to the original type. This is more
  precise but less performant.

Review comment: American English spelling is preferred (e.g. `behaviour`,
`optimised`).
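The precision difference between the two modes comes down to double rounding: the beta-fused `relaxed` path rounds the GEMM result to `f16` before the sum, while `strict` carries `f32` through the sum and rounds once at the end. A minimal sketch, using Python's `struct` half-precision format to emulate `f16` rounding; the input values are hypothetical, chosen only to expose the effect:

```python
import struct

def to_f16(x: float) -> float:
    """Round x to the nearest IEEE binary16 value (round-half-to-even)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Hypothetical values, not taken from any real test case.
gemm_f32 = 1.000732421875   # exact GEMM result held in f32
prev_dst = 1.001953125      # existing dst value, exactly representable in f16

# strict: sum in f32, then a single final conversion to f16.
strict = to_f16(gemm_f32 + prev_dst)

# relaxed: GEMM writes its output as f16 (beta-fused sum), so the GEMM
# result is rounded to f16 *before* the addition.
relaxed = to_f16(to_f16(gemm_f32) + prev_dst)

print(strict, relaxed)  # 2.001953125 2.00390625
```

The two paths land on adjacent `f16` values: the extra intermediate rounding in the `relaxed` path is exactly the one-ulp divergence this PR relaxes the threshold for.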

#### Using GEMM

The default backend for inner product is the gemm backend using `cublasGemmEx`
@@ -278,7 +278,17 @@ void skip_invalid_prb(const prb_t *prb, res_t *res) {}

void setup_cmp(compare::compare_t &cmp, const prb_t *prb, data_kind_t kind,
        const args_t &ref_args) {
-    cmp.set_threshold(0.f);
+    // The nvidia implementation has different precision guarantees in some
+    // cases for large problems with post-op sum.
+    if (is_nvidia_gpu()
+            && prb->attr.post_ops.find(attr_t::post_ops_t::kind_t::SUM) != -1
+            && prb->dst_dt() == dnnl_f16 && (prb->dir & FLAG_FWD)
+            && prb->attr.acc_mode == dnnl_accumulation_mode_relaxed) {
+        const float trh = epsilon_dt(prb->dt[2]);
+        cmp.set_threshold(trh);
+    } else {
+        cmp.set_threshold(0.f);
+    }
Review thread:

Reviewer: Do you know why this difference? Is the sum post-op applied over the
f32 intermediate value or over f16 values for the NV backend?

Reviewer: I'd say this change can fly only in case the sum post-op is done
through a native cuDNN fusion (a single call) with f16 accumulation
internally; otherwise, the issue is likely inside the implementation, which
doesn't convert the output to f32 and accumulate pieces in f32.

sgeor255: The sum post-op is implemented through the […]. @dzarukin I
investigated whether there are any issues with the implementation but couldn't
find any. Also, I noticed that changing the input values makes the test pass,
e.g. when using whole numbers as the input (still in the f16 data type). To me
it seems to be some sort of precision/rounding issue. The expected values
computed by oneDNN are rounded down, while in the cuDNN case they are rounded
up, e.g. […]. The values in full precision in the above example are not
representable as f16 (e.g. https://float.exposed/0x641c), which makes me think
cublas is doing incorrect rounding? Also, I found this discussion where
someone is asking about how the scaling parameters in cublas work, but there
was no response.

dzarukin: @sgeor255, thanks for looking into implementation details, that's a
good start.

dzarukin: When changing the data addresses the issue, it always means
rounding/accumulation mechanics stands in its way. Smaller ranges usually lead
to situations where the final numbers remain exact, and conversion to f16/f32
and back doesn't change the number, so the check passes. When the expected
number is x.5, in reality it can be x.5002, which would be rounded towards
[…].

sgeor255: @dzarukin thanks for the suggestion, I tested doing the sum post-op
separately with […].

dzarukin: @sgeor255 thanks for checking it! Then it likely means a non-zero
[…].

sgeor255: @dzarukin updated the PR.
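The midpoint effect described in the thread is easy to reproduce with stdlib Python: above 2048, `f16` values are spaced 2 apart, so an exact x.5-style midpoint ties to even while a slightly perturbed higher-precision intermediate rounds the other way. A small sketch with illustrative values:

```python
import struct

def to_f16(x: float) -> float:
    """Round x to the nearest IEEE binary16 value (round-half-to-even)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# 2049.0 is an exact midpoint between the f16 values 2048 and 2050;
# round-half-to-even sends it *down* to 2048.
print(to_f16(2049.0))     # 2048.0

# A tiny perturbation in the f32 intermediate flips the rounding direction.
print(to_f16(2049.0002))  # 2050.0
```

This is consistent with the observation that whole-number inputs make the test pass: exact values avoid the midpoints where the two implementations' rounding can diverge.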
}

std::vector<int> supported_exec_args(dir_t dir) {