EXTREMELY SERIOUS BUGS THAT MAKE BLOCKSPARSE COMPLETELY USELESS IN TRAINING #419
Comments
I saw your testing script, and I believe you have tested it. Maybe I'm getting something wrong, but it really confuses me.
Yep, this big bug affects only the float32 gradient of sdd, and it's already been fixed in the v2.0 branch. I can push a hotfix to master that throws an error when float32 is used.
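Such a hotfix could amount to a dtype guard at the entry of the op; the helper below is a hypothetical sketch of that idea, not the actual patch.

```python
import torch

# Hypothetical sketch of a dtype guard (not the actual hotfix): reject float32
# inputs up front instead of silently producing wrong sdd gradients.
def _check_blocksparse_dtypes(a: torch.Tensor, b: torch.Tensor) -> None:
    if a.dtype == torch.float32 or b.dtype == torch.float32:
        raise RuntimeError(
            "blocksparse sdd backward is broken for float32 on master; "
            "use float16 or the v2.0 branch"
        )
```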
Thank you for the quick reply! However, I tested fp16 (by changing the dtype setting to `torch.float16`), and this time a RuntimeError still exists, but it has changed:
REALLY LOOKING FORWARD TO YOUR HELP AND FIX. THANK YOU.
And it seems some gradients become NaN or Inf.
Can you try v2.0? I have high confidence in the version there. We use Triton blocksparse internally at OpenAI.
OK, I'll try it right now!
OK, I tested v2.0, but it still has the problem.
@ptillet The gradient problem still exists in v2.0, as well as the RuntimeErrors.
I have confirmed that the bug doesn't appear if you replace the `sum()` over the blocksparse output with a different loss, which shows that the op works in general and explains why many groups have been able to use it successfully in large models. I am not sure why the sum makes things buggy -- it probably has to do with some contiguity requirement in the blocksparse op's backprop. Will look into it.
Because the incoming gradient is an expanded scalar when the blocksparse output is followed by a sum, it has strides (0, 0, 0, 0), and this apparently throws off the compiler. I'll fix this.
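The stride behavior can be observed in plain PyTorch, without Triton at all; here is a minimal sketch in which an elementwise multiply stands in for the blocksparse matmul.

```python
import torch

# When the loss is out.sum(), SumBackward expands the scalar gradient back to
# out's shape as a broadcasted view, so the gradient reaching the op that
# produced `out` can have all-zero strides.
x = torch.randn(2, 4, 64, 64, requires_grad=True)
out = x * 2.0  # stand-in for the blocksparse matmul output
out.register_hook(lambda g: print("incoming grad strides:", g.stride()))

out.sum().backward()  # typically prints (0, 0, 0, 0)
# A kernel that assumes dense, contiguous strides can mis-read such a
# gradient; calling g.contiguous() first restores normal strides.
```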
Indeed, I can confirm the behavior you describe with that replacement. As for the stride problem, I added a line on my side as a temporary workaround. Looking forward to your fix; please let me know when it's done. This is really essential for me right now. THANK YOU to you and your team!!!
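One possible user-side workaround of that kind, sketched here purely as an illustration and not necessarily the line that was actually added, is a gradient hook that forces the incoming gradient to be contiguous before the op's backward sees it.

```python
import torch

# Hypothetical workaround: a tensor hook that returns a new tensor replaces
# the gradient, so forcing contiguity here hands the backward pass a densely
# strided gradient instead of the stride-(0, 0, 0, 0) expanded view.
x = torch.randn(2, 4, 64, 64, requires_grad=True)
c = x * 2.0  # stand-in for the blocksparse matmul output
c.register_hook(lambda g: g.contiguous())
c.sum().backward()
```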
Hey! This is fixed in #420. As for the error difference, I think it is nothing to be concerned about. It seems within the normal error range for FP16 inputs, so in a sense the Torch result is not more accurate than Triton's.
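For comparisons in FP16 it is common to check against the PyTorch baseline with a loosened tolerance rather than expecting identical values; the helper below is only an illustration with assumed tolerances.

```python
import torch

# Illustrative FP16 comparison: cast both results to float32 and allow a
# tolerance appropriate for half-precision rounding and accumulation order.
def close_enough(triton_out: torch.Tensor, torch_out: torch.Tensor) -> bool:
    return torch.allclose(triton_out.float(), torch_out.float(),
                          rtol=1e-2, atol=1e-3)
```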
So fascinating and so nice that you have fixed this so quickly!
The script runs one given config for debug purposes.
I found that the blocksparse ops' backward gradients are totally wrong, which makes training meaningless. Take matmul (mode='sdd') as a simple example; the following is my test code. As you can see, the forward result matches PyTorch's implementation, while the backward is wrong, with large errors, even when the layout is fully dense.
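The original script itself is not shown here; a minimal sketch of an equivalent gradient check follows. The `triton.ops.blocksparse.matmul` constructor arguments are assumed from the Triton 1.x API of that time and may differ in other releases.

```python
import torch
from triton.ops.blocksparse import matmul  # constructor signature assumed (Triton 1.x era)

block, H, M, N, K = 16, 2, 64, 64, 64
device, dtype = "cuda", torch.float32

# Fully dense layout: every (block x block) tile of the output is kept, so the
# input gradients should match those of a plain dense matmul.
layout = torch.ones(H, M // block, N // block, dtype=torch.int64)
op = matmul(layout, block, "sdd", trans_a=False, trans_b=False)

a = torch.randn(1, H, M, K, dtype=dtype, device=device, requires_grad=True)
b = torch.randn(1, H, K, N, dtype=dtype, device=device, requires_grad=True)

# Triton blocksparse forward/backward (the sum reproduces the stride-0 case).
c_tri = op(a, b)
da_tri, db_tri = torch.autograd.grad(c_tri.sum(), [a, b])

# Dense PyTorch reference.
c_ref = torch.matmul(a, b)
da_ref, db_ref = torch.autograd.grad(c_ref.sum(), [a, b])

print("forward sum err:", (c_tri.sum() - c_ref.sum()).abs().item())
print("grad a  max err:", (da_tri - da_ref).abs().max().item())
print("grad b  max err:", (db_tri - db_ref).abs().max().item())
```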
I tested the old version, and the problem still exists. For other operators and other modes, the gradient results also seem to be wrong. My environment is:
One of my runs:
Sometimes a RuntimeError appears, as you can see above, but sometimes it doesn't.
Could you please fix it as soon as possible? It's really serious, and many people use this module without being aware of the problem.
Thank you very much.