I'm using HuggingFace's library to fine-tune RoBERTa on an NVIDIA 3090 for a text classification task. It works fine with PyTorch's Adam, and fine with Adam8Bit from bitsandbytes 0.35.4, but breaks when I upgrade to 0.36.0: training loss bounces around wildly and doesn't drop as expected. E.g., here's some log output from a normal run:
0%| | 0/30471 [00:00<?, ?it/s][WARNING|logging.py:281] 2023-01-06 18:10:56,000 >> You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
0%| | 50/30471 [00:15<47:42, 10.63it/s]{'loss': 1.3128, 'learning_rate': 3.9999744778317457e-05, 'epoch': 0.0}
0%| | 100/30471 [00:30<44:00, 11.50it/s]{'loss': 0.977, 'learning_rate': 3.999895817937791e-05, 'epoch': 0.01}
0%| | 150/30471 [00:44<43:29, 11.62it/s]{'loss': 0.9848, 'learning_rate': 3.99976716871491e-05, 'epoch': 0.01}
1%| | 200/30471 [00:59<43:42, 11.54it/s]{'loss': 0.8877, 'learning_rate': 3.9995832826050655e-05, 'epoch': 0.02}
1%| | 250/30471 [01:15<48:04, 10.48it/s]{'loss': 0.9048, 'learning_rate': 3.9993462585355154e-05, 'epoch': 0.02}
1%| | 300/30471 [01:30<45:02, 11.16it/s]{'loss': 0.87, 'learning_rate': 3.9990561028050576e-05, 'epoch': 0.03}
1%| | 350/30471 [01:45<44:52, 11.19it/s]{'loss': 0.8576, 'learning_rate': 3.998712823124443e-05, 'epoch': 0.03}
1%| | 354/30471 [01:49<4:53:51, 1.71it/s]
and with the new bitsandbytes:
0%| | 0/30471 [00:00<?, ?it/s][WARNING|logging.py:281] 2023-01-06 18:01:14,752 >> You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
1%| | 277/30471 [01:25<1:46:31, 4.72it/s]
{'loss': 1.3541, 'learning_rate': 3.999975508920987e-05, 'epoch': 0.0}
{'loss': 1.295, 'learning_rate': 3.999899984760361e-05, 'epoch': 0.01}
{'loss': 1.3924, 'learning_rate': 3.99977030439398e-05, 'epoch': 0.01}
{'loss': 1.4335, 'learning_rate': 3.999587481097879e-05, 'epoch': 0.02}
{'loss': 1.5828, 'learning_rate': 3.999351519730499e-05, 'epoch': 0.02}
Happy to give config settings, run tests, etc. as needed to help with debugging. Thanks
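For context, the setup looks roughly like the sketch below: bnb.optim.Adam8bit passed to the HF Trainer via its optimizers argument. The toy dataset, batch size, and label count here are placeholder values, not my actual config.

```python
import bitsandbytes as bnb
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny stand-in dataset; the real run covers ~30k optimizer steps of classification data.
train_dataset = Dataset.from_dict(
    {"text": ["a short example", "another short example"] * 64, "label": [0, 1] * 64}
).map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,  # placeholder; fits easily on a 3090
    learning_rate=4e-5,              # matches the learning_rate seen in the logs above
    num_train_epochs=1,
    logging_steps=50,
)

# The 8-bit optimizer under suspicion; swapping in torch.optim.AdamW reproduces
# the well-behaved baseline run.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=args.learning_rate)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,          # enables dynamic padding via the default collator
    optimizers=(optimizer, None), # None lets the Trainer build its own LR scheduler
)
trainer.train()
```

The only change between the good and bad runs is the installed bitsandbytes version; the script itself is identical.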
I think it would be very helpful to replicate this in the bitsandbytes tests for 8-bit Adam. You can run the tests via:
pytest -vsk adam8bit
I suspect the problem is that the criterion for failing a test is set too high. If that's true, the regression wasn't caught because the test conditions are too lenient. To debug this, you could look at the mean error compared to regular Adam and check whether it increases or decreases between versions.
It would be great if you could do this. If not, let me know and I'll try to do it sometime in the next few days.
Right now, I am not sure whether anything relevant to the 8-bit optimizers changed between 0.35.4 and 0.36.0post2, but I might be missing something here.
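Something along the lines of the rough sketch below (not the actual test harness, just an illustration with synthetic gradients) would measure the drift between the two optimizers; running it under 0.35.4 and then 0.36.0 should show whether the mean error grows between versions.

```python
import torch
import bitsandbytes as bnb

torch.manual_seed(0)
dim, steps, lr = 1024, 100, 1e-3

# Two identical parameter tensors, one updated by regular Adam, one by Adam8bit.
p_ref = torch.randn(dim, dim, device="cuda", requires_grad=True)
p_8bit = p_ref.detach().clone().requires_grad_(True)

opt_ref = torch.optim.Adam([p_ref], lr=lr)
opt_8bit = bnb.optim.Adam8bit([p_8bit], lr=lr)

for step in range(steps):
    grad = torch.randn(dim, dim, device="cuda")  # same synthetic gradient for both
    p_ref.grad = grad.clone()
    p_8bit.grad = grad.clone()
    opt_ref.step()
    opt_8bit.step()
    opt_ref.zero_grad()
    opt_8bit.zero_grad()

mean_err = (p_ref - p_8bit).abs().mean().item()
print(f"mean |Adam - Adam8bit| after {steps} steps: {mean_err:.6f}")
```

Comparing the printed number across the two installed versions would directly tell us whether the 8-bit update path itself regressed, independent of the training setup.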
I am also experiencing this issue (training still runs with 0.36.0post2, but the results are distorted). I tried to run the tests locally but couldn't quite get them to work.