I'm using HuggingFace's library to fine-tune RoBERTa on an NVIDIA 3090 for a text classification task. It works fine with PyTorch's Adam, and fine with Adam8Bit from bitsandbytes 0.35.4, but breaks when I upgrade to 0.36.0: training loss bounces around wildly and doesn't drop as expected. E.g., here's some log output from a normal run:
0%| | 0/30471 [00:00<?, ?it/s][WARNING|logging.py:281] 2023-01-06 18:10:56,000 >> You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
0%| | 50/30471 [00:15<47:42, 10.63it/s]{'loss': 1.3128, 'learning_rate': 3.9999744778317457e-05, 'epoch': 0.0}
0%| | 100/30471 [00:30<44:00, 11.50it/s]{'loss': 0.977, 'learning_rate': 3.999895817937791e-05, 'epoch': 0.01}
0%| | 150/30471 [00:44<43:29, 11.62it/s]{'loss': 0.9848, 'learning_rate': 3.99976716871491e-05, 'epoch': 0.01}
1%| | 200/30471 [00:59<43:42, 11.54it/s]{'loss': 0.8877, 'learning_rate': 3.9995832826050655e-05, 'epoch': 0.02}
1%| | 250/30471 [01:15<48:04, 10.48it/s]{'loss': 0.9048, 'learning_rate': 3.9993462585355154e-05, 'epoch': 0.02}
1%| | 300/30471 [01:30<45:02, 11.16it/s]{'loss': 0.87, 'learning_rate': 3.9990561028050576e-05, 'epoch': 0.03}
1%| | 350/30471 [01:45<44:52, 11.19it/s]{'loss': 0.8576, 'learning_rate': 3.998712823124443e-05, 'epoch': 0.03}
1%| | 354/30471 [01:49<4:53:51, 1.71it/s]
and with the new bitsandbytes:
0%| | 0/30471 [00:00<?, ?it/s][WARNING|logging.py:281] 2023-01-06 18:01:14,752 >> You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
1%| | 277/30471 [01:25<1:46:31, 4.72it/s]
{'loss': 1.3541, 'learning_rate': 3.999975508920987e-05, 'epoch': 0.0}
{'loss': 1.295, 'learning_rate': 3.999899984760361e-05, 'epoch': 0.01}
{'loss': 1.3924, 'learning_rate': 3.99977030439398e-05, 'epoch': 0.01}
{'loss': 1.4335, 'learning_rate': 3.999587481097879e-05, 'epoch': 0.02}
{'loss': 1.5828, 'learning_rate': 3.999351519730499e-05, 'epoch': 0.02}
Happy to give config settings, run tests, etc. as needed to help with debugging. Thanks
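For context, the setup looks roughly like the sketch below: bnb.optim.Adam8bit passed to the HF Trainer via its optimizers argument. The toy dataset, batch size, and label count here are placeholder values, not my actual config.

```python
import bitsandbytes as bnb
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny stand-in dataset; the real run covers ~30k optimizer steps of classification data.
train_dataset = Dataset.from_dict(
    {"text": ["a short example", "another short example"] * 64, "label": [0, 1] * 64}
).map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,  # placeholder; fits easily on a 3090
    learning_rate=4e-5,              # matches the learning_rate seen in the logs above
    num_train_epochs=1,
    logging_steps=50,
)

# The 8-bit optimizer under suspicion; swapping in torch.optim.AdamW reproduces
# the well-behaved baseline run.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=args.learning_rate)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,          # enables dynamic padding via the default collator
    optimizers=(optimizer, None), # None lets the Trainer build its own LR scheduler
)
trainer.train()
```

The only change between the good and bad runs is the installed bitsandbytes version; the script itself is identical.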
I think it would be very helpful to replicate this in the bitsandbytes tests for 8-bit Adam. You can run the tests via:
pytest -vsk adam8bit
I suspect the problem is that the criterion for failing a test is set too high. If that's true, the regression wasn't caught because the test conditions are too lenient. To debug this, you could look at the mean error compared to regular Adam and check whether it increases or decreases between versions.
It would be great if you could do this. If not, let me know and I'll try to do it sometime in the next few days.
Right now, I am not sure whether anything relevant to the 8-bit optimizers changed between 0.35.4 and 0.36.0post2, but I might be missing something here.
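Something along the lines of the rough sketch below (not the actual test harness, just an illustration with synthetic gradients) would measure the drift between the two optimizers; running it under 0.35.4 and then 0.36.0 should show whether the mean error grows between versions.

```python
import torch
import bitsandbytes as bnb

torch.manual_seed(0)
dim, steps, lr = 1024, 100, 1e-3

# Two identical parameter tensors, one updated by regular Adam, one by Adam8bit.
p_ref = torch.randn(dim, dim, device="cuda", requires_grad=True)
p_8bit = p_ref.detach().clone().requires_grad_(True)

opt_ref = torch.optim.Adam([p_ref], lr=lr)
opt_8bit = bnb.optim.Adam8bit([p_8bit], lr=lr)

for step in range(steps):
    grad = torch.randn(dim, dim, device="cuda")  # same synthetic gradient for both
    p_ref.grad = grad.clone()
    p_8bit.grad = grad.clone()
    opt_ref.step()
    opt_8bit.step()
    opt_ref.zero_grad()
    opt_8bit.zero_grad()

mean_err = (p_ref - p_8bit).abs().mean().item()
print(f"mean |Adam - Adam8bit| after {steps} steps: {mean_err:.6f}")
```

Comparing the printed number across the two installed versions would directly tell us whether the 8-bit update path itself regressed, independent of the training setup.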
I am also experiencing this issue (training still runs with 0.36.0post2, but the results are distorted). I tried to run the tests locally but couldn't quite get them to work.