
Regression in 0.36.0, training broken when fine-tuning RoBERTa on 3090 #121

Closed
rationalism opened this issue Jan 7, 2023 · 4 comments

@rationalism

I'm using HuggingFace's library to fine-tune RoBERTa on an NVIDIA 3090 for a text classification task. It works fine with PyTorch's Adam, and fine with Adam8Bit from bitsandbytes 0.35.4, but breaks when I upgrade to 0.36.0: training loss bounces around wildly and doesn't drop as expected. E.g., here's some log output from a normal run:

0%| | 0/30471 [00:00<?, ?it/s][WARNING|logging.py:281] 2023-01-06 18:10:56,000 >> You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
0%| | 50/30471 [00:15<47:42, 10.63it/s]{'loss': 1.3128, 'learning_rate': 3.9999744778317457e-05, 'epoch': 0.0}
0%| | 100/30471 [00:30<44:00, 11.50it/s]{'loss': 0.977, 'learning_rate': 3.999895817937791e-05, 'epoch': 0.01}
0%| | 150/30471 [00:44<43:29, 11.62it/s]{'loss': 0.9848, 'learning_rate': 3.99976716871491e-05, 'epoch': 0.01}
1%| | 200/30471 [00:59<43:42, 11.54it/s]{'loss': 0.8877, 'learning_rate': 3.9995832826050655e-05, 'epoch': 0.02}
1%| | 250/30471 [01:15<48:04, 10.48it/s]{'loss': 0.9048, 'learning_rate': 3.9993462585355154e-05, 'epoch': 0.02}
1%| | 300/30471 [01:30<45:02, 11.16it/s]{'loss': 0.87, 'learning_rate': 3.9990561028050576e-05, 'epoch': 0.03}
1%| | 350/30471 [01:45<44:52, 11.19it/s]{'loss': 0.8576, 'learning_rate': 3.998712823124443e-05, 'epoch': 0.03}
1%| | 354/30471 [01:49<4:53:51, 1.71it/s]

and with the new bitsandbytes:

0%| | 0/30471 [00:00<?, ?it/s][WARNING|logging.py:281] 2023-01-06 18:01:14,752 >> You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
1%| | 277/30471 [01:25<1:46:31, 4.72it/s]
{'loss': 1.3541, 'learning_rate': 3.999975508920987e-05, 'epoch': 0.0}
{'loss': 1.295, 'learning_rate': 3.999899984760361e-05, 'epoch': 0.01}
{'loss': 1.3924, 'learning_rate': 3.99977030439398e-05, 'epoch': 0.01}
{'loss': 1.4335, 'learning_rate': 3.999587481097879e-05, 'epoch': 0.02}
{'loss': 1.5828, 'learning_rate': 3.999351519730499e-05, 'epoch': 0.02}

Happy to provide config settings, run tests, etc. as needed to help with debugging. Thanks!
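
For reference, a minimal sketch of this kind of setup (the checkpoint, label count, toy dataset, and hyperparameters below are placeholders rather than my exact config; the only line that changes between the working and broken runs is the optimizer):

import torch
import bitsandbytes as bnb
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class ToyDataset(Dataset):
    # Tiny stand-in dataset so the sketch runs end to end.
    def __init__(self, tokenizer, n=64):
        self.enc = tokenizer(["an example sentence"] * n, truncation=True,
                             padding="max_length", max_length=32)
        self.labels = [0] * n
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=4)

# Swapping torch.optim.Adam for bnb.optim.Adam8bit is the only difference
# between the run that trains normally (0.35.4) and the broken one (0.36.0).
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=4e-5)

args = TrainingArguments(output_dir="out", per_device_train_batch_size=16,
                         logging_steps=50, num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(tokenizer),
                  optimizers=(optimizer, None))  # (optimizer, lr_scheduler)
trainer.train()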

@rationalism rationalism changed the title Regression in 0.36.0, training broken Regression in 0.36.0, training broken when fine-tuning RoBERTa on 3090 Jan 7, 2023
@TimDettmers
Collaborator

TimDettmers commented Jan 7, 2023

Thank you for promptly reporting this bug!

I think it would be very helpful to replicate this in the bitsandbytes tests for 8-bit Adam. You can run the tests via:

pytest -vsk adam8bit

One possible problem is that the criterion for failing a test is set too high. If that's true, the regression was not caught because the test conditions are too lenient. To debug this, you could look at the mean error relative to regular Adam and check whether it increases or decreases between versions.
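
As a rough sketch of that comparison (the tensor shape, learning rate, and step count below are arbitrary, not the values from the test suite):

import torch
import bitsandbytes as bnb

torch.manual_seed(0)
# Two copies of the same parameter: one updated by regular Adam, one by Adam8bit.
p_ref = torch.randn(4096, 4096, device="cuda", requires_grad=True)
p_8bit = p_ref.detach().clone().requires_grad_(True)

opt_ref = torch.optim.Adam([p_ref], lr=1e-3)
opt_8bit = bnb.optim.Adam8bit([p_8bit], lr=1e-3)

for _ in range(100):
    g = torch.randn_like(p_ref) * 0.01
    p_ref.grad = g
    p_8bit.grad = g.clone()
    opt_ref.step()
    opt_8bit.step()

# If this mean error grows noticeably between 0.35.4 and 0.36.0, the tests'
# failure threshold is probably masking the regression.
print("mean abs error vs. Adam:", (p_ref - p_8bit).abs().mean().item())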

It would be great if you were able to do this. If not, let me know and I will try to do it sometime in the next few days.

Right now, I am not sure whether anything relevant to the 8-bit optimizers changed between 0.35.4 and 0.36.0post2, but I might be missing something here.

@TimDettmers TimDettmers added the bug Something isn't working label Jan 7, 2023
@bonlime

bonlime commented Jan 19, 2023

@TimDettmers
This is critical! I'm also observing a regression when fine-tuning Stable Diffusion with 0.36.0post2, while it works fine with 0.35.0.

@ArrowM

ArrowM commented Feb 5, 2023

I am also experiencing this issue (training will still run with 0.36.0post2, but the results are distorted). I tried to run the tests locally but couldn't quite get them to work.

@rationalism
Author

This has now been resolved by the bug fix in 0.41.1; closing.
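
For anyone hitting this later, upgrading past that release should be enough, e.g.:

pip install -U "bitsandbytes>=0.41.1"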
