gpu: nvidia: ip: adjust benchdnn error threshold #2479

Open: wants to merge 1 commit into base: main
8 changes: 8 additions & 0 deletions src/gpu/nvidia/README.md
@@ -215,6 +215,14 @@ limitations when using Nvidia backend for eltwise primitive:
The inner product primitive is an implementation of matrix multiplication plus
bias activation. There are two implementations of inner product in the cuDNN backend.

With `sum` post-op, the accumulation mode attribute affects behaviour as

Contributor, suggested change: "behaviour" → "behavior" (American English is preferred).

follows:
- `relaxed`: Uses GEMM’s beta parameter for a fused, optimised sum post-op but

Contributor, suggested change: "optimised" → "optimized" (American English).

may reduce output precision for large `f16` inputs.
- `strict` (default): Converts GEMM output to `f32`, performs sum as a separate
operation, then converts it back to the original type. This is more precise
but less performant.
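
For reference, requesting the relaxed path from user code looks roughly like the
sketch below (assuming the public oneDNN 3.x C++ API; error handling and memory
setup are omitted, and the helper name `make_ip_pd` is illustrative only):

```cpp
#include "dnnl.hpp"
using namespace dnnl;

// Build an inner product primitive descriptor with a sum post-op and the
// relaxed accumulation mode, which lets the Nvidia backend fuse the sum via
// GEMM's beta parameter instead of taking the f32 round trip.
inner_product_forward::primitive_desc make_ip_pd(const engine &eng,
        const memory::desc &src_md, const memory::desc &wei_md,
        const memory::desc &bia_md, const memory::desc &dst_md) {
    post_ops po;
    po.append_sum(1.f); // dst := ip_result + 1.f * previous dst contents

    primitive_attr attr;
    attr.set_post_ops(po);
    attr.set_accumulation_mode(accumulation_mode::relaxed);

    return inner_product_forward::primitive_desc(eng,
            prop_kind::forward_inference, src_md, wei_md, bia_md, dst_md, attr);
}
```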

#### Using GEMM

The default backend for inner product is the gemm backend using `cublasGemmEx`
7 changes: 4 additions & 3 deletions src/gpu/nvidia/cudnn_conv_inner_product.hpp
@@ -91,7 +91,8 @@ struct cudnn_conv_inner_product_fwd_t : public cudnn_inner_product_fwd_t {
new cudnn_conv_inner_product_fwd_impl_t());

auto st = inner_product_impl_->init(engine, this, with_relu(),
with_eltwise(), with_sum(), use_fused_path_for_blocking);
with_eltwise(), with_sum(), use_fused_path_for_blocking,
false);
return st;
}
bool with_eltwise() const {
@@ -250,7 +251,7 @@ struct cudnn_conv_inner_product_bwd_data_t
new cudnn_conv_inner_product_bwd_data_impl_t());

return inner_product_impl_->init(
engine, this, false, false, false, false);
engine, this, false, false, false, false, false);
}

status_t set_default_params() {
@@ -341,7 +342,7 @@ struct cudnn_conv_inner_product_bwd_weights_t
new cudnn_conv_inner_product_bwd_weights_impl_t());

return inner_product_impl_->init(
engine, this, false, false, false, false);
engine, this, false, false, false, false, false);
}

status_t set_default_params() {
8 changes: 5 additions & 3 deletions src/gpu/nvidia/cudnn_conv_inner_product_impl.hpp
@@ -117,7 +117,7 @@ struct cudnn_conv_inner_product_fwd_impl_t
}
virtual status_t init(impl::engine_t *engine, inner_product_pd_t *pd,
bool with_relu, bool with_eltwise, bool with_sum,
bool use_fuse_path_for_blocking) override {
bool use_fuse_path_for_blocking, bool /* use_f32_sum */) override {
with_bias_ = pd->with_bias();
with_relu_ = with_relu;
with_eltwise_ = with_eltwise;
@@ -424,7 +424,8 @@ struct cudnn_conv_inner_product_bwd_data_impl_t
cudnnTensorFormat_t diff_source_format_;
virtual status_t init(impl::engine_t *engine, inner_product_pd_t *pd,
bool /*with_relu*/, bool /*with_eltwise*/, bool /*with_sum */,
bool /*using_fused_path_for_blocking*/) override {
bool /*using_fused_path_for_blocking*/,
bool /* use_f32_sum */) override {
// Pad out the dimensions to 4
if (pd->ndims() > CUDNN_DIM_MAX || pd->ndims() < 2) {
return status::invalid_arguments;
@@ -575,7 +576,8 @@ struct cudnn_conv_inner_product_bwd_weights_impl_t

virtual status_t init(impl::engine_t *engine, inner_product_pd_t *pd,
bool /*with_relu*/, bool /*with_eltwise*/, bool /*with_sum */,
bool /*using_fused_path_for_blocking*/) override {
bool /*using_fused_path_for_blocking*/,
bool /* use_f32_sum */) override {
// If any of the dimensions are 0 we should not continue with creating
// cudnn descriptors
with_bias_ = pd->with_bias();
10 changes: 7 additions & 3 deletions src/gpu/nvidia/cudnn_gemm_inner_product.hpp
@@ -222,10 +222,14 @@ struct cudnn_gemm_inner_product_fwd_t : public cudnn_inner_product_fwd_t {
&& (gemm_compatible || need_reorder);
if (!ok) return status::unimplemented;

const bool is_relaxed_acc_mode
= attr()->acc_mode_ == dnnl_accumulation_mode_relaxed;
const bool use_f32_sum = with_sum && !is_relaxed_acc_mode;

inner_product_impl_.reset(
new cudnn_gemm_inner_product_fwd_impl_t());
return inner_product_impl_->init(engine, this, with_eltwise,
with_eltwise, with_sum, need_reorder);
with_eltwise, with_sum, need_reorder, use_f32_sum);
}

status_t set_default_params() {
@@ -289,7 +293,7 @@ struct cudnn_gemm_inner_product_bwd_data_t
new cudnn_gemm_inner_product_bwd_data_impl_t());

return inner_product_impl_->init(
engine, this, false, false, false, need_reorder);
engine, this, false, false, false, need_reorder, false);
}

status_t set_default_params() {
@@ -345,7 +349,7 @@ struct cudnn_gemm_inner_product_bwd_weights_t
inner_product_impl_.reset(
new cudnn_gemm_inner_product_bwd_weights_impl_t());
return inner_product_impl_->init(
engine, this, false, false, false, need_reorder);
engine, this, false, false, false, need_reorder, false);
}

status_t set_default_params() {
47 changes: 38 additions & 9 deletions src/gpu/nvidia/cudnn_gemm_inner_product_impl.hpp
@@ -75,10 +75,12 @@ struct cudnn_gemm_inner_product_fwd_impl_t
bool use_acc_dst_;
cudnnTensorDescriptor_t y_acc_desc_;
bool need_reorder_;
cudnnTensorDescriptor_t bias_f32_desc_;
bool with_f32_sum_ = false;

virtual status_t init(impl::engine_t *, inner_product_pd_t *pd,
bool with_relu, bool with_eltwise, bool with_sum,
bool need_reorder) override {
bool with_relu, bool with_eltwise, bool with_sum, bool need_reorder,
bool use_f32_sum) override {
need_reorder_ = need_reorder;
// GEMM is column major, here the data is row major.
// By switching the weight and source we convert the row major to
@@ -121,8 +123,10 @@ struct cudnn_gemm_inner_product_fwd_impl_t
use_acc_dst_ = ((pd->dst_md()->data_type == data_type::s8)
|| (with_bias_
&& pd->weights_md(1)->data_type
!= pd->dst_md()->data_type));
!= pd->dst_md()->data_type)
|| use_f32_sum);
with_sum_ = with_sum;
with_f32_sum_ = use_f32_sum;
// scaling factor to add the previous destination value to the current
// computation. This is equivalent of
sum_scale_ = sum_scale(pd);
@@ -154,12 +158,23 @@ struct cudnn_gemm_inner_product_fwd_impl_t

if (with_bias_) {
CHECK(convert_data_type(pd->weights_md(1), &data_types_[io::bia]));

// format is always nchw
set_bias_dims(CUDNN_TENSOR_NCHW, ndims_, pd->OC());

CHECK(create_and_set_tensor_descriptor(&tensor_descs_[io::bia],
data_types_[io::bia], ndims_, dims_[io::bia],
strides_[io::bia]));

if (with_f32_sum_) {
pd->scratchpad_registry().registrar().book(
memory_tracking::names::key_iprod_bias_bf16_convert_wsp,
memory_desc_wrapper(pd->weights_md(1)).nelems(),
types::data_type_size(data_type::f32));
CHECK(create_and_set_tensor_descriptor(&bias_f32_desc_,
CUDNN_DATA_FLOAT, ndims_, dims_[io::bia],
strides_[io::bia]));
}
}
if (use_acc_dst_) {
pd->scratchpad_registry().registrar().book(
@@ -178,10 +193,10 @@ struct cudnn_gemm_inner_product_fwd_impl_t

void execute(cudnnHandle_t cudnn_handle, cublasHandle_t cublas_handle,
const std::vector<void *> &args) const override {
assert(args.size() == 9);
assert(args.size() == 10);
auto x = args[0], w = args[1], b = args[2], y = args[3],
workspace = args[4], src_scale = args[6], wei_scale = args[7],
dst_scale = args[8];
dst_scale = args[8], bias_f32 = args[9];
auto w_arg = w;
if (need_reorder_) {
void *transformed_w = args[5];
@@ -222,8 +237,18 @@ struct cudnn_gemm_inner_product_fwd_impl_t

if (with_bias_) {
float alpha = 1.0f;
CUDNN_EXECUTE_FUNC(cudnnAddTensor, cudnn_handle, &alpha,
tensor_descs_[io::bia], b, &alpha, y_acc_desc_, y_dst);
float beta = 0.f;
auto bias = b;
auto bias_desc = tensor_descs_[io::bia];
if (with_f32_sum_) {
cudnnTransformTensor(cudnn_handle, &alpha,
tensor_descs_[io::bia], b, &beta, bias_f32_desc_,
bias_f32);
bias = bias_f32;
bias_desc = bias_f32_desc_;
}
CUDNN_EXECUTE_FUNC(cudnnAddTensor, cudnn_handle, &alpha, bias_desc,
bias, &alpha, y_acc_desc_, y_dst);
}
if (with_eltwise_) {
CUDNN_EXECUTE_FUNC(cudnnActivationForward, cudnn_handle, act_desc_,
@@ -271,6 +296,10 @@ struct cudnn_gemm_inner_product_fwd_impl_t

return status::success;
}

~cudnn_gemm_inner_product_fwd_impl_t() {
if (with_f32_sum_) { cudnnDestroyTensorDescriptor(bias_f32_desc_); }
}
};

struct cudnn_gemm_inner_product_bwd_data_impl_t
@@ -281,7 +310,7 @@ struct cudnn_gemm_inner_product_bwd_data_impl_t

virtual status_t init(impl::engine_t *, inner_product_pd_t *pd,
bool /*with_relu*/, bool /*with_eltwise*/, bool /*with_sum */,
bool need_reorder) override {
bool need_reorder, bool /* use_f32_sum */) override {
need_reorder_ = need_reorder;

// GEMM is column major, here the data is row major.
@@ -365,7 +394,7 @@ struct cudnn_gemm_inner_product_bwd_weights_impl_t
}
virtual status_t init(impl::engine_t *engine, inner_product_pd_t *pd,
bool /*with_relu*/, bool /*with_eltwise*/, bool /*with_sum */,
bool need_reorder) override {
bool need_reorder, bool /* use_f32_sum */) override {
need_reorder_ = need_reorder;
with_bias_ = pd->with_bias();

3 changes: 3 additions & 0 deletions src/gpu/nvidia/cudnn_inner_product.cpp
@@ -49,6 +49,8 @@ status_t cudnn_inner_product_fwd_t::execute(const exec_ctx_t &ctx) const {
memory_tracking::names::key_iprod_int_dat_in_acc_dt);
auto arg_spacial_scratch
= CTX_SCRATCH_SYCL_MEMORY(memory_tracking::names::key_none);
auto arg_f32_bias_scratch = CTX_SCRATCH_SYCL_MEMORY(
memory_tracking::names::key_iprod_bias_bf16_convert_wsp);
compat::host_task(cgh, [=, this](const compat::interop_handle &ih) {
auto &sycl_engine = *utils::downcast<nvidia::engine_t *>(
cuda_stream->engine());
@@ -72,6 +74,7 @@ status_t cudnn_inner_product_fwd_t::execute(const exec_ctx_t &ctx) const {
args.push_back(arg_src_scale.get_native_pointer(ih));
args.push_back(arg_wei_scale.get_native_pointer(ih));
args.push_back(arg_dst_scale.get_native_pointer(ih));
args.push_back(arg_f32_bias_scratch.get_native_pointer(ih));

pd()->inner_product_impl_->execute(
cudnn_handle, cublas_handle, args);
4 changes: 3 additions & 1 deletion src/gpu/nvidia/cudnn_inner_product_impl.hpp
@@ -146,12 +146,14 @@ struct cudnn_inner_product_impl_base_t {
virtual status_t init(impl::engine_t * /*engine*/,
inner_product_pd_t * /*pd*/, bool /*with_relu*/,
bool /*with_eltwise*/, bool /*with_sum */,
bool /*using_fused_path_for_blocking*/)
bool /*using_fused_path_for_blocking*/, bool /* use_f32_sum */)
= 0;

virtual void execute(cudnnHandle_t /*handle*/,
cublasHandle_t /*cublas_handle*/,
const std::vector<void *> & /*args*/) const = 0;

virtual ~cudnn_inner_product_impl_base_t() = default;
};

struct cudnn_inner_product_fwd_base_t : public cudnn_inner_product_impl_base_t {
12 changes: 11 additions & 1 deletion tests/benchdnn/ip/ip.cpp
@@ -278,7 +278,17 @@ void skip_invalid_prb(const prb_t *prb, res_t *res) {}

void setup_cmp(compare::compare_t &cmp, const prb_t *prb, data_kind_t kind,
const args_t &ref_args) {
cmp.set_threshold(0.f);
// The nvidia implementation has different precision guarantees in some cases
// for large problems with post-op sum
if (is_nvidia_gpu()
&& prb->attr.post_ops.find(attr_t::post_ops_t::kind_t::SUM) != -1
&& prb->dst_dt() == dnnl_f16 && (prb->dir & FLAG_FWD)
&& prb->attr.acc_mode == dnnl_accumulation_mode_relaxed) {
const float trh = epsilon_dt(prb->dt[2]);
cmp.set_threshold(trh);
} else {
cmp.set_threshold(0.f);
}
Contributor
Do you know why this difference exists? Is the sum post-op applied over the f32 intermediate value or over f16 values in the NV backend?

Contributor
I'd say this change can fly only if the sum post-op is done through a native cuDNN fusion (a single call) with f16 accumulation internally; otherwise, the issue is likely inside the implementation, which doesn't convert the output to f32 and accumulate the pieces in f32.

Contributor Author
The sum post-op is implemented through the beta parameter of cuBLAS GEMM (see here). The compute type for the GEMM is set to float32, but I couldn't find any details in the documentation on how that affects the alpha & beta scaling parameters, i.e. whether they are applied in f32 too or in f16 (which is the data type used in the failing cases).
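
Roughly, the fused path boils down to a call of the following shape (a sketch with placeholder leading dimensions and a hypothetical wrapper name, not the exact oneDNN code): with f16 A/B/C and an f32 compute type, alpha and beta are f32 scalars, but C is still read and rewritten as f16.

```cpp
#include <cublas_v2.h>

// Sketch only: f16 GEMM with f32 compute type, where a non-zero beta
// implements the fused sum post-op directly on the f16 destination buffer.
cublasStatus_t fused_sum_gemm_f16(cublasHandle_t handle, int m, int n, int k,
        const void *A, int lda, const void *B, int ldb, void *C, int ldc,
        float sum_scale) {
    float alpha = 1.f;
    float beta = sum_scale; // scales the existing f16 contents of C
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
            &alpha, A, CUDA_R_16F, lda, B, CUDA_R_16F, ldb,
            &beta, C, CUDA_R_16F, ldc,
            CUBLAS_COMPUTE_32F,   // products are accumulated in f32 ...
            CUBLAS_GEMM_DEFAULT); // ... but C is read and written back as f16
}
```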

@dzarukin I investigated whether there are any issues with the implementation but couldn't find any. Also, I noticed that changing the input values makes the test pass, e.g. when using whole numbers as input (still in the f16 data type).

To me it seems to be some sort of precision/rounding issue. The expected values computed by oneDNN are rounded down, while in the cuDNN case they are rounded up, e.g.:

[107536][DST][336:16] exp_f32:      1038.5 exp:        1038 got:        1039 diff:       1 rdiff:0.000963391
[108178][DST][338:18] exp_f32:      1051.5 exp:        1052 got:        1051 diff:       1 rdiff:0.00095057
[108499][DST][339:19] exp_f32:      1064.5 exp:        1064 got:        1065 diff:       1 rdiff:0.00093985

The full-precision values in the above example are not representable in f16 (e.g. https://float.exposed/0x641c), which makes me think cuBLAS is doing incorrect rounding?

I also found this discussion where someone asks how the scaling parameters in cuBLAS work, but there was no response.

Contributor
@sgeor255, thanks for looking into the implementation details, that's a good start.
I suggest conducting a little experiment: make cublasGemmEx sum-less and return f32, then append the sum post-op (with the data copied to a dedicated buffer) in f32 as well (through the cudnnAddTensor call) to simulate the proper behavior.
If this experiment goes fine, it would prove that there's indeed something wrong with the CUDA libraries, and I'll re-iterate on that check. Probably the same check should exist (or appear) for matmul as well.
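
The experiment would look roughly like the sketch below (an assumption-level illustration: descriptors and scratch buffers are taken as already set up, the function name is hypothetical, and the GEMM itself is run separately with beta = 0 into `acc_f32`):

```cpp
#include <cudnn.h>

// Apply the sum post-op entirely in f32: convert the original f16 destination
// to an f32 copy, then add it (scaled) to the f32 GEMM accumulator.
void split_f32_sum(cudnnHandle_t h, float sum_scale,
        cudnnTensorDescriptor_t dst_f16_desc, const void *dst_f16,
        cudnnTensorDescriptor_t f32_desc, void *dst_f32_copy, void *acc_f32) {
    float one = 1.f, zero = 0.f;
    // f16 destination -> f32 scratch copy (cudnnTransformTensor converts types)
    cudnnTransformTensor(h, &one, dst_f16_desc, dst_f16,
            &zero, f32_desc, dst_f32_copy);
    // acc_f32 += sum_scale * dst_f32_copy, all arithmetic in f32
    cudnnAddTensor(h, &sum_scale, f32_desc, dst_f32_copy,
            &one, f32_desc, acc_f32);
    // acc_f32 is then converted back to f16 into the real destination
    // (another cudnnTransformTensor call), omitted here.
}
```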

Contributor
When changing the data addresses the issue, it always means rounding/accumulation mechanics stand in the way. Smaller ranges usually lead to situations where the final numbers remain exact, so conversion to f16/f32 and back doesn't change the number and the check passes.

When the expected number is x.5, in reality it can be x.5002, which would be rounded towards x + 1, while the library might have x.4998, which would be rounded towards x. That library result can be a product of different effects (accumulation in lower data types, inexact division/multiplication, intermediate roundings, etc.) which can't be examined with CUDA.
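
To make that concrete with the numbers from the log above: in the [1024, 2048) range f16 can only represent whole numbers, so 1038.5 is an exact tie that rounds to even (1038), while a value perturbed up to 1038.5002 lands on 1039. A toy model of that rounding (an assumption about the mechanics, not an inspection of what cuBLAS actually does):

```cpp
#include <cmath>
#include <cstdio>

// Toy model: round a value in [1024, 2048) to the nearest f16, where the
// representable step is exactly 1.0 and ties go to the even neighbour.
double round_f16_in_1024_2048(double x) {
    double lo = std::floor(x), hi = lo + 1.0;
    if (x - lo < hi - x) return lo;
    if (x - lo > hi - x) return hi;
    return std::fmod(lo, 2.0) == 0.0 ? lo : hi; // exact tie: round to even
}

int main() {
    std::printf("1038.5000 -> %g\n", round_f16_in_1024_2048(1038.5000)); // 1038
    std::printf("1038.5002 -> %g\n", round_f16_in_1024_2048(1038.5002)); // 1039
    std::printf("1038.4998 -> %g\n", round_f16_in_1024_2048(1038.4998)); // 1038
    return 0;
}
```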

Contributor Author
@dzarukin thanks for the suggestion. I tested doing the sum post-op separately with cudnnAddTensor, and the test passes for all sum scale values I tried.

Contributor
@sgeor255 thanks for checking it!

Then it likely means that a non-zero beta is not aligned with proper f32 conversions.
I would suggest keeping the "split" implementation as the gold version and using the beta fusion only when acc-mode is specified as relaxed. acc-mode isn't used much inside the library yet, but given the observed behavior it can become an essential requirement for CUDA/cuDNN/cuBLAS functionality.

Contributor Author
@dzarukin updated the PR.

}

std::vector<int> supported_exec_args(dir_t dir) {