Deepspeed and T5-11B for multitask training #14531
I have a feeling that the issue is not in using deepspeed but somewhere else in your setup. Let's remove deepspeed from the equation for a moment and try your setup on a single GPU. Once this is working you can progress to a larger model size and eventually just plug deepspeed back in to work with t5-11b. It will also make your debugging process much easier, since it takes forever to even load t5-11b. Always start small and simple, then progress to bigger and slightly more complex, and then big and complex.
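To make the "start small" step concrete, here is a minimal single-GPU sanity-check sketch with no deepspeed. The model name (t5-small), the toy data, and the hyperparameters are illustrative assumptions, not the poster's actual setup, and the tokenizer call uses the `text_target` argument from recent transformers versions.

```python
# Minimal single-GPU sanity check: t5-small, no deepspeed, no label smoothing.
# Everything here (model, toy data, hyperparameters) is illustrative.
from torch.utils.data import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")


class ToyDataset(Dataset):
    """A handful of repeated pairs, just to watch the loss move."""
    pairs = [("translate English to German: Hello", "Hallo"),
             ("translate English to German: Thank you", "Danke")] * 50

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        enc = tokenizer(src, truncation=True, max_length=64)
        enc["labels"] = tokenizer(text_target=tgt, truncation=True,
                                  max_length=64)["input_ids"]
        return enc


args = Seq2SeqTrainingArguments(
    output_dir="sanity_check",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    logging_steps=10,              # watch loss and learning rate here
    label_smoothing_factor=0.0,    # label smoothing off for the sanity check
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=ToyDataset(),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()   # expect a non-zero, slowly decreasing loss
```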
@stas00 thanks. I tried t5-small with and without deepspeed and the loss was non-zero: it was in the range of 3.6 and slowly decreasing. I removed the label smoothing before training.
I started doing T5-11B with deepspeed
t5-large with deepspeed is back to zero LR.
So first, it's good to see that you have the non-DS setup working. I don't understand this in your log 2 comments up: is it with or without DS?
Re: last comment: As the warning says, the optimizer hasn't started running, so it doesn't have an LR yet and just returns 0. So we need to figure out why the optimizer isn't running. For example, you can edit the ds config file to remove the optimizer section and it will use Transformers' AdamW instead of DS's. Meanwhile, could you help me reproduce the issue on my side? Could you perhaps make a tarball with the data and your customizations, so that I can run the same setup as you do?
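As a rough sketch of what "remove the optimizer section" means (the surrounding values are the usual `auto` placeholders from the HF deepspeed docs, not the poster's exact file):

```python
# ZeRO stage-2 config with no "optimizer" (and no "scheduler") section, so the
# HF Trainer creates its own AdamW and lr scheduler. Values are illustrative.
ds_config_no_ds_optimizer = {
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,              # dynamic loss scaling
        "initial_scale_power": 16,
    },
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# The Trainer accepts a dict as well as a path to a json file, e.g.
# Seq2SeqTrainingArguments(..., deepspeed=ds_config_no_ds_optimizer)
```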
I also ran a sanity check with this and verified that in general things work correctly:
So something is different about your setup.
You are trying t5-small in your sanity check. t5-small works for me with deepspeed as well.
Additionally, I have noticed you're using Adafactor, so it's very likely this could be related as well. E.g. try to use the default ds_config with its optimizer section and remove --adafactor.
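For reference, the optimizer/scheduler sections of the stock ds_config look roughly like this. This is a sketch of the documented defaults, used instead of passing --adafactor:

```python
# DeepSpeed's own AdamW plus a warmup scheduler, as in the documented default
# config; "auto" lets the HF integration fill in the Trainer's values.
ds_optimizer_and_scheduler = {
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto",
                   "eps": "auto", "weight_decay": "auto"},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": "auto", "warmup_max_lr": "auto",
                   "warmup_num_steps": "auto"},
    },
}
```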
Understood. As I explained earlier, for some reason the optimizer isn't stepping in your t5-11b example, so we need to figure out why that is. You can also try the larger ones first: t5-base, t5-large.
I removed adafactor. This is my config for t5-large:
In general try to use ... But I don't see any fault with your config. And you haven't tried t5-base, t5-large, t5-3b to see whether they work and whether the issue is specific to t5-11b. Can you please send me a sample of the data you train with? If it's not for the public eye, let me know. It would be easier to experiment directly rather than ask you to do this and that all the time. And I suppose you have custom code, so it's best to send me a tarball of the whole thing (custom script + data), so that I don't have to spend time sorting it out. Thanks. p.s. I don't actually have access to an A100 at the moment, but I hope to sort it out on a smaller gpu.
I can't share it publicly on this thread, but I emailed you the zip file containing code and data.
missing your custom
I already made changes in run_translation.py. Check this function and you will know:
Could you please try after applying this patch to deepspeed:
This should now tell you if an OVERFLOW happens and that's why it skips the optimizer step.
Does this solve the issue? I think for t5-large I was getting 0 LR; however, for T5-11B the loss was zero. I am just trying to understand here.
Yes, so I guess an OVERFLOW is happening.
No, it's not solving the issue - I just added diagnostic logging; it was already in ... So why does it start with loss scale: 1? E.g. when I run with t5-small I get: (Also added ...)
In the ds config file:
which is 2**16, hence you can see that its first step on my t5-small setup is:
well, it's actually a minor bug, but ignore it, as the next one does the right thing:
but in your case it appears that
instead of starting with 2**16. So it gets an overflow and it's already at loss scale 1, and it can't go anywhere from here.
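For context, the fp16 block being discussed looks roughly like this (a sketch with the stock values; `loss_scale: 0` means dynamic scaling, starting at 2**initial_scale_power and halving on every overflow):

```python
# fp16 section of a deepspeed config with dynamic loss scaling; the first
# attempted scale is 2**16 = 65536, halved each time an overflow is hit.
fp16_section = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,             # 0 = dynamic loss scaling
        "initial_scale_power": 16,   # start at 2**16
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    }
}
```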
I can reproduce your issue with
I hope to have time tomorrow to debug this.
zero3 does the right thing, starting with loss scale 2**16. If you want to experiment before I get a chance, the next step is for you to try the same setup without deepspeed. And it fails too:
So your issue is not with deepspeed, but either in your code or elsewhere in the setup (I just left the deepspeed launcher in place, but it's not running deepspeed).
OK, your issue is that bf16 has a much larger dynamic range than fp16, and models pretrained in the former often overflow on the first step in fp16; e.g. mt5 overflows on the first step even with a small model. Removing fp16 (i.e. running in full fp32) avoids the overflow. But you want speed of course, so here is what you can do next:
For 3, make sure you use torch>=1.10 and enable:
I recommend you try 3 first, then 2, and then 1.
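A guess at the torch>=1.10 setting referred to for option 3 (fp32): enabling TF32 so that fp32 matmuls still use the A100 tensor cores. This is an assumption about the elided snippet, not a quote from the thread.

```python
# TF32 keeps fp32 numerics (no loss scaling, no overflow) while recovering
# much of the tensor-core speed on Ampere GPUs such as the A100.
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN ops
```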
And DS has recently added bf16 support:
so that's option 4 to try with deepspeed: just replace the float16 section with a bf16 one and don't pass --fp16. I think it only works with z2 (ZeRO stage 2).
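A sketch of what that replacement could look like; the `bf16` key follows deepspeed's config schema as I understand it:

```python
# Option 4: drop the fp16 section and enable bf16 instead (and do not pass
# --fp16 on the command line).
bf16_fragment = {
    "bf16": {
        "enabled": True
    }
}
```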
Stas, you are amazing and I appreciate all the help and the fast turnaround. I am just trying to understand: if I use OPTION 3 (fp32), won't it give me OOM eventually? I just wanted to let you know that my entire research question rests on the ability to finetune T5-11B, so unless that works, t5-small/large/3B doesn't really help me. Just to be sure and consistent: I have 4 A100 GPUs, so please tell me what would be the best way for me to use T5-11B. I am trying to reproduce (https://arxiv.org/abs/2110.08207) and honestly it's been a bit difficult for me to get T5-11B to train. :(
I got t5-large to work with fp32, but of course got OOM with batch size 1 for fp32 T5-11B with zero2. Appreciate any help here.
Option 4 gave me this
The first step is to make things work w/o overflow; the second step is dealing with memory. As bf16 support is all new, it will take some time to fully sort it out. You can try solution (1) as well - it might just work. So was your fp32 OOM with or without deepspeed? fp32 takes about the same amount of memory as fp16 mixed precision, because the latter still allocates 4 bytes per param for master weights. So the latter saves some memory in some places, but uses more memory in others; fp16 amp is really about an up-to-5x speedup, not saving memory.

Here are the next things to try:

Experiment A. Try deepspeed with both fp16 and bf16 disabled and stage2 (your current setup) and add on top of ...
How does that fare?

Experiment B. Same as A, but use stage 3 in the config file, and ensure CPU offload is enabled; the default config file from the docs will do. I of course assume you're also using torch==1.10 and a fairly recent CUDA, at least cuda=11.3. Re: bf16 support in deepspeed, I haven't tried it myself yet as it was literally just added. I will give it a try.
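Experiment B as a sketch, trimmed from the stock ZeRO-3 config in the HF deepspeed docs; the values are illustrative and both mixed-precision modes are disabled:

```python
# Full fp32 with ZeRO stage 3 and CPU offload for both optimizer states and
# parameters. Trimmed/illustrative; see the HF deepspeed docs for the full
# stock ds_config_zero3.json.
ds_config_zero3_fp32 = {
    "fp16": {"enabled": False},
    "bf16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```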
Additionally, I know you're trying to use Adafactor, but if nothing else works right away and you're in a hurry, one other thing to consider is the 8-bit AdamW optimizer from https://github.com/facebookresearch/bitsandbytes. It will save you 6 out of 8 bytes per param of optimizer state. This is a huge memory saving, hence the suggestion. Here is the saving breakdown:
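A sketch of wiring the 8-bit optimizer into the Trainer; the class and argument names follow the bitsandbytes API as I know it, and the learning rate is an arbitrary placeholder:

```python
# Replace the regular AdamW with bitsandbytes' 8-bit Adam; its state uses
# roughly 2 bytes/param instead of 8, which is where the memory saving comes from.
import bitsandbytes as bnb

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=3e-5,
                               betas=(0.9, 0.999), weight_decay=0.0)

# The Trainer accepts a pre-built optimizer; passing None for the scheduler
# lets it create its default one.
# trainer = Seq2SeqTrainer(..., optimizers=(optimizer, None))
```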
We are testing it (BNB) out right now at BigScience and so far it tracks the normal AdamW performance quality-wise. The main issue with BNB is that it needs an embedding norm, which transformers models don't have at the moment. So we need to discuss this.
Turns out it works with zero3 and fp32. I was training it and it went OOM after 25% of training, so I reduced the batch size from 16 to 12. If it still fails I will fall back to 8. The time it's taking is definitely longer, but at least it's working.
That's progress. Additionally, have you made sure that you have ...
Yes, done, and yes, confirming.
Watch this PR: microsoft/DeepSpeed#1453. I guess you can already try it if you are in need.
I can confirm I could train and evaluate using fp32 and zero3. It does take me 28 hours even with 4 GPUs.
Thank you for the confirmation, @tuhinjubcse, that it works, just not very fast. To be faster you want bf16 support, which is a work in progress. The plan is as follows:
Once 3 is done, or at least I have it working, you should be able to use bf16 with the Deepspeed/HF integration. I will let you know once this happens.
Many many thanks
One thing I have been noticing is that my performance when using run_translation, which indirectly uses the Trainer, is significantly lower. In my earlier code, where I did not use the Trainer, my perplexity/loss was so much better than what I am getting now. Are there any Trainer-specific hyperparameters that I might be missing? This was my training code prior to deepspeed; you can see the train function: https://github.com/tuhinjubcse/tuhinjubcse.github.io/blob/master/fine_tune_lm.py
But you're not using --fp16 now, which sometimes makes a huge difference, so it's not the code base that is different. And your original finetune script was using the Trainer; the script was rewritten, but it's the same Trainer. That is not to say there is surely no regression here; we have been talking about adding speed regression tests, but so far we haven't gone beyond talking about it. Once deepspeed releases bfloat16 support I think you should be back to fast speed. I will start working on the integration now, against the Deepspeed PR, now that we have completed --bf16 in transformers. So perhaps you will have something to experiment with shortly. I will keep you posted.
No, before finetune_trainer I was using something else, as you can see in the link above. I would expect that since T5-11B is a bigger model it should give better performance anyway. I will put together a comparative result of T5-3B using model.parallelize and deepspeed. I am wondering if there is performance degradation with deepspeed.
Thank you for clarifying that you were talking about the pre-finetune_trainer script; I had assumed that one was OK. So, to understand what Deepspeed ZeRO does conceptually: it shards the tensors over multiple gpus and then, at the point of calculation (forward/backward), restores them to their normal unsharded state, so the model doesn't see anything different - it has no idea ZeRO even exists. I.e. Deepspeed ZeRO itself doesn't change anything and can't make any difference to the math, so you should be getting identical numerical output with or without ZeRO. Now, it's possible that you are using a different optimizer or lr scheduler when you're using Deepspeed, since it lets you use your own or provides its own, and they aren't identical most of the time. And you could be mismatching on whatever other hparams are involved. So when comparing such things you need to make sure you compare oranges to oranges. Besides Deepspeed you have ... Now, bigger models typically give better performance, but usually they take much longer to get into the same range of loss. Now that you understand these things, let's focus on how you could get better results faster, since we want the same thing.
Deepspeed/bf16/zero2 should work with #14569. Please let me know if you run into any problems if you choose to try that branch, and follow up directly in that PR's comments if you run into issues. To use bf16 you need 2 things:
On the deepspeed side I'm not sure if you just need deepspeed@master, or you actually need this branch: microsoft/DeepSpeed#1453 - I was testing with the latter. zero3 doesn't seem to be ready on the deepspeed side, but it's all ready on the transformers side. p.s. remember zero2 doesn't shard params, so it will be more memory demanding. p.p.s. I think I need to do some tweaks to t5 models as well to save more memory for bf16 - I will have a look in the next few days.
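Putting the two pieces together, a sketch of how the bf16 run might be wired up once that branch/PR is in. The flag and key names reflect the released --bf16 support in transformers and deepspeed's bf16 config section; batch size and paths are placeholders:

```python
# bf16 on the transformers side (bf16=True / --bf16, needs torch>=1.10 and an
# Ampere GPU) plus a matching bf16 section in a ZeRO-2 deepspeed config.
from transformers import Seq2SeqTrainingArguments

ds_config_zero2_bf16 = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2, "overlap_comm": True},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = Seq2SeqTrainingArguments(
    output_dir="t5_11b_bf16",
    bf16=True,                       # transformers side
    deepspeed=ds_config_zero2_bf16,  # deepspeed side
    per_device_train_batch_size=1,
)
```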
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
For future reference, I could help with LR logging from ...
Are these 40GB or 80 GB A100s? |
I'm running into this exact same issue except with bf16 and llama 13b+ combo. Turning off bf16 fixes it, but I then can't fit 65b onto my GPUs. Any idea why bf16 is causing problems? |
I also hit the same error when the setup is ds_stage2/bf16 with a baichuan13b model. I want to ask about ...
I ran into issues with fp16 as well, so I used fp32. |
For those who do not have enough GPU memory to train a full-precision model: I fixed this issue by decreasing "initial_scale_power" in the fp16 section of the deepspeed config from 16 to 2.
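That workaround as a config fragment (a sketch; only the keys relevant to the change are shown):

```python
# Start dynamic loss scaling at 2**2 = 4 instead of 2**16, so training does
# not spend its first steps overflowing and skipping optimizer updates.
fp16_low_initial_scale = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,           # keep dynamic scaling
        "initial_scale_power": 2,  # 2**2 instead of 2**16
    }
}
```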
Carrying on my conversation here @stas00
#9996 (comment)
I used run_translation.py and now my loss is 0.0 :( This is probably doomed to fail.
Script
Data format