Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: prevent division by zero in soft_max vulkan shader #2604

Closed
wants to merge 3 commits into from

Conversation

gn64
Copy link
Contributor

@gn64 gn64 commented Dec 2, 2024

Fix division by zero error in soft_max vulkan shader

This PR fixes #2596 by adding a check for p.KY being zero in the soft_max compute shader.

Changes:

  • Modified the calculation of rowy in soft_max.comp to handle the case where p.KY is 0
  • Changed const uint rowy = rowx % p.KY; to const uint rowy = (p.KY > 0) ? (rowx % p.KY) : 0;

The original code would cause a division by zero error when p.KY is 0. This fix ensures that rowy is set to 0 in such cases, preventing the crash while maintaining the expected behavior in normal scenarios.

Testing:

  • Confirmed the fix resolves the crash in my environment
  • Existing functionality remains unchanged when p.KY > 0

gn64 added 3 commits December 3, 2024 06:17
This change prevents a division by zero error when p.KY is 0.
@slaren
Copy link
Collaborator

slaren commented Dec 15, 2024

The Vulkan fix looks correct, this can happen when using ggml_soft_max. @0cc4m please take a look.

@0cc4m
Copy link
Contributor

0cc4m commented Dec 15, 2024

Looks good, but what are the other two changes?

@ggerganov
Copy link
Owner

@gn64 Please update the PR to include just the Vulkan fix.

@0cc4m In which cases p.KY == 0? We should add a test to test-backend-ops to cover these.

@gn64
Copy link
Contributor Author

gn64 commented Dec 16, 2024

Closing this PR in favor of a more focused fix in #2633.
The new PR contains only the Vulkan shader division by zero fix, making the changes easier to review and maintain.

@gn64 gn64 closed this Dec 16, 2024
@ggerganov
Copy link
Owner

Thank you. I'm still interested in which cases p.KY == 0 - I'm not able to deduce by looking at the code.

@slaren
Copy link
Collaborator

slaren commented Dec 16, 2024

I believe it happens when using using a NULL mask, which happens when using the plain ggml_soft_max.

    ggml_vk_op_f32<vk_op_soft_max_push_constants>(ctx, subctx, src0, src1, nullptr, dst, GGML_OP_SOFT_MAX, {
        ncols,
        src1 != nullptr ? nrows_y : (uint32_t)0, // this is KY
        scale, max_bias,
        m0, m1,
        n_head_log2,
        nrows_x,
    }, dryrun);

@ggerganov
Copy link
Owner

That's what I thought as well, but we do have tests in test-backend-ops without a mask, so I am not sure why they are passing.

@slaren
Copy link
Collaborator

slaren commented Dec 16, 2024

Division by zero is likely just undefined behavior, and in some GPUs it may just return zero. Still, can't rely on that behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid probability vector error with AMD iGPU on Vulkan backend Environment
4 participants