gpu: nvidia: ip: adjust benchdnn error threshold #2479

Open · wants to merge 1 commit into main

11 changes: 10 additions & 1 deletion tests/benchdnn/ip/ip.cpp
@@ -278,7 +278,16 @@ void skip_invalid_prb(const prb_t *prb, res_t *res) {}

 void setup_cmp(compare::compare_t &cmp, const prb_t *prb, data_kind_t kind,
         const args_t &ref_args) {
-    cmp.set_threshold(0.f);
+    // The nvidia implementation has precision issues in some cases

Contributor:

I wouldn't call it precision issues; wasn't our conclusion that this comes from differences in rounding modes that we can't change in cuDNN?

+    // for large problems with post-op sum
+    if (is_nvidia_gpu()
+            && prb->attr.post_ops.find(attr_t::post_ops_t::kind_t::SUM) != -1
+            && prb->dst_dt() == dnnl_f16) {
+        const float trh = epsilon_dt(prb->dt[2]);
+        cmp.set_threshold(trh);
+    } else {
+        cmp.set_threshold(0.f);
+    }
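
A hedged reading of the chosen threshold (assuming epsilon_dt(f16) returns the f16 machine epsilon):

$$
\varepsilon_{\mathrm{f16}} = 2^{-10} \approx 9.77 \times 10^{-4},
$$

which sits just above the relative differences reported for the failing cases further down (rdiff of roughly 9.4-9.6e-4), so single-ulp mismatches on f16 outputs would pass under the new threshold.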

Contributor:

Do you know why this difference exists? Is the sum post-op applied over f32 intermediate values or over f16 values in the NV backend?

Contributor:

I'd say this change can fly only if the sum post-op is done through a native cuDNN fusion (a single call) with f16 accumulation internally; otherwise, the issue is likely inside the implementation, which doesn't convert the output to f32 and accumulate the pieces in f32.

Contributor Author:

The sum post-op is implemented through the beta parameter of the cublas gemm call (see here). The compute type for the gemm is set to float32, but I couldn't find any details in the documentation on how that affects the alpha & beta scaling parameters, i.e. whether they are applied in f32 too or in f16 (which is the data type used in the failing cases).
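
For illustration, a minimal sketch of how a sum post-op maps onto the beta parameter of cublasGemmEx (the helper below and its f16/f32 type choices are assumptions made for the sketch, not the actual oneDNN code):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Hypothetical sketch: C already holds the "sum" source, so beta = 1 folds it
// into the GEMM result, C = alpha * A * B + beta * C, with f16 A/B/C and an
// f32 compute type.
void gemm_with_sum(cublasHandle_t handle, const __half *A, const __half *B,
        __half *C, int m, int n, int k) {
    const float alpha = 1.f;
    const float beta = 1.f; // beta != 0 realizes the sum post-op
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
            &alpha, A, CUDA_R_16F, m,
            B, CUDA_R_16F, k,
            &beta, C, CUDA_R_16F, m,
            CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    // The open question from the comment above: are alpha/beta applied in f32
    // (the compute type) or on the f16 values of C before/after conversion?
}
```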

@dzarukin I investigated whether there are any issues with the implementation but couldn't find any. Also, I noticed that changing the input values makes the test pass, e.g. when using whole numbers as inputs (still in the f16 data type).

To me it seems to be some sort of precision/rounding issue. The expected values computed by oneDNN round to one f16 neighbour, while the cuDNN results land on the other, e.g.:

[107536][DST][336:16] exp_f32:      1038.5 exp:        1038 got:        1039 diff:       1 rdiff:0.000963391
[108178][DST][338:18] exp_f32:      1051.5 exp:        1052 got:        1051 diff:       1 rdiff:0.00095057
[108499][DST][339:19] exp_f32:      1064.5 exp:        1064 got:        1065 diff:       1 rdiff:0.00093985

The full-precision values in the above example are not representable as f16 (e.g. https://float.exposed/0x641c), which makes me think cublas is doing incorrect rounding?
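
For reference, a minimal sketch of how the IEEE default round-to-nearest-even treats these half-way values (assuming the reference conversion follows it; in f16 the spacing of representable values in [1024, 2048) is 1.0, so an x.5 result sits exactly on a tie):

```cpp
#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
    // Round-to-nearest-even on the half-way values from the log above: ties go
    // to the even neighbour, matching the 'exp' column (1038, 1052, 1064).
    std::fesetround(FE_TONEAREST);
    for (float v : {1038.5f, 1051.5f, 1064.5f})
        std::printf("%.1f -> %.0f\n", v, std::nearbyintf(v));
    return 0;
}
```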

Also, I found this discussion where someone asks how the scaling parameters in cublas work, but there was no response.

Contributor:

@sgeor255, thanks for looking into the implementation details, that's a good start.
I suggest conducting a little experiment: make cublasGemmEx sum-less and return f32, then append the sum post-op (with the data copied to a dedicated buffer) in f32 as well (through the cudnnAddTensor call) to simulate the proper behavior.
If this experiment goes fine, it would prove that there's indeed something wrong with the cuda libraries, and I'll re-iterate on that check. The same check should probably exist (or appear) for matmul as well.
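
A rough sketch of that experiment as described (hypothetical helper; handle/descriptor setup, error checking, and the final f32-to-f16 conversion of the output are omitted):

```cpp
#include <cublas_v2.h>
#include <cudnn.h>
#include <cuda_fp16.h>

// Hypothetical experiment: run the GEMM without the sum (beta = 0) into an f32
// buffer, then apply the sum post-op entirely in f32 via cudnnAddTensor using
// an f32 copy of the original destination.
void gemm_then_f32_sum(cublasHandle_t blas, cudnnHandle_t dnn,
        const __half *A, const __half *B,
        float *gemm_out_f32,              // m x n scratch buffer, f32
        const float *dst_copy_f32,        // f32 copy of the original dst
        cudnnTensorDescriptor_t out_desc, // describes gemm_out_f32
        cudnnTensorDescriptor_t dst_desc, // describes dst_copy_f32
        int m, int n, int k) {
    const float alpha = 1.f, beta = 0.f; // sum-less GEMM
    cublasGemmEx(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
            &alpha, A, CUDA_R_16F, m,
            B, CUDA_R_16F, k,
            &beta, gemm_out_f32, CUDA_R_32F, m,
            CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    // gemm_out_f32 += 1.0 * dst_copy_f32, accumulated in f32.
    const float one = 1.f;
    cudnnAddTensor(dnn, &one, dst_desc, dst_copy_f32,
            &one, out_desc, gemm_out_f32);
}
```

If the f32-accumulated path matches the reference, that would point at the fused beta path inside the cuda libraries rather than at the oneDNN integration.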

Contributor:

When changing the data addresses the issue, it always means rounding/accumulation mechanics are getting in the way. Smaller ranges usually lead to situations where the final numbers stay exact, so converting to f16/f32 and back doesn't change the number and the check passes.

When the exp number is x.5, in reality it can be x.5002, which would be rounded towards x + 1, while the library might have x.4998, which would be rounded towards x. That library result can be a product of different effects (accumulation in lower data types, inexact division/multiplication, intermediate roundings, etc.) which can't be examined with cuda.
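
A minimal sketch of that tie sensitivity (the perturbed values are invented for illustration): a result that is mathematically x.5 rounds to a different neighbour depending on which side of the tie the computed value lands:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Values straddling the 1038.5 tie (illustrative, not measured): the exact
    // tie goes to the even neighbour, tiny perturbations pick a side.
    for (float v : {1038.4998f, 1038.5f, 1038.5002f})
        std::printf("%.4f -> %.0f\n", v, std::nearbyintf(v));
    // Prints: 1038.4998 -> 1038, 1038.5000 -> 1038, 1038.5002 -> 1039
    return 0;
}
```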

 }

 std::vector<int> supported_exec_args(dir_t dir) {