Why do we need to take care of num_process in lr_scheduler? #1382
A relevant problem is here: huggingface/transformers#18436 (comment)
That's because the lr_scheduler has been defined taking into account the unprepared dataloader, which has the original batch size.
But in my case (2 GPUs), the size of the dataloader does not seem to change. BTW, thanks for the prompt response!
So, if I want to assign a fixed number of warmup steps (4000), how should I set it?
@Hannibal046 Yes, your dataloader size will be divided by 2 after you have prepared it on 2 GPUs, because your total batch size has been multiplied by 2. If you want your scheduler to peak at 4000 steps independently of the number of GPUs, you shouldn't send it to prepare().
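To make the arithmetic concrete, here is a small sketch with made-up numbers (assuming `split_batches` is left at its default of `False`; the 2 GPUs and 4000 steps come from the discussion above):

```python
num_processes = 2            # GPUs
batches_before_prepare = 10_000

# prepare() shards the dataloader, so each process sees half the batches
batches_per_process = batches_before_prepare // num_processes  # 5_000

# A scheduler that also went through prepare() is advanced num_processes
# times per .step() call, so it reaches "4000 scheduler steps" after only
# 4000 // num_processes real optimizer updates on each process.
updates_until_peak_prepared = 4000 // num_processes    # 2000

# Kept out of prepare(), it peaks after exactly 4000 optimizer updates.
updates_until_peak_unprepared = 4000
```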
Sorry, it was my mistake. The dataloader size does change after the prepare method. Could you please give some conceptual guidance on how to set the warmup steps in Accelerate? Take the paper "Attention Is All You Need" as an example: it says the model was trained on 8 GPUs with 4000 warmup steps. So should I give 4000*8 to the scheduler function?

```python
from transformers import get_polynomial_decay_schedule_with_warmup

lr_scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=4000 * 8 * cfg.trainer.gradient_accumulation_steps,
    num_training_steps=total_steps,
)
```
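For what it's worth, my back-of-the-envelope for the 4000*8 idea, assuming the scheduler is passed to prepare() and advances `num_processes` internal steps per call (whether the `gradient_accumulation_steps` factor also belongs in there depends on whether you step the scheduler every batch or only on real optimizer updates):

```python
num_processes = 8
internal_warmup = 4000 * num_processes        # 32_000 internal scheduler steps

# Each scheduler.step() call on a prepared scheduler advances it
# num_processes times, so the LR peaks after this many real updates:
real_updates_at_peak = internal_warmup // num_processes   # 4000
```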
If that is the case, what if I only have 4 GPUs? Typically, I would keep the number of warmup steps the same regardless of the number of GPUs.
Hi @sgugger, would you mind sparing some time to check this?
I don't really get what you want me to say. It looks like you want a scheduler that always does the same number of steps regardless of distributed/gradient accumulation, so maybe don't pass it to prepare()?
Hi @sgugger, really sorry for bothering you. I am new to Accelerate. For anyone who is seeking fixed warmup steps in Accelerate, a simple workaround would be:

```python
# build the scheduler as usual, but do NOT pass it to accelerator.prepare()
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=4000, num_training_steps=total_steps)

for batch in dataloader:
    with accelerator.accumulate(model):
        loss = compute_loss(batch)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        # only advance the LR on real optimizer updates
        if accelerator.sync_gradients and not accelerator.optimizer_step_was_skipped:
            lr_scheduler.step()
```
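One way to sanity-check this (a usage sketch building on the snippet above; `completed_steps` is a hypothetical step counter you would maintain yourself): with the unprepared scheduler, the printed schedule should be identical whether you launch on 1, 4, or 8 GPUs.

```python
# log the LR only from the main process, once per real update
if accelerator.is_main_process and accelerator.sync_gradients:
    accelerator.print(f"step={completed_steps} lr={lr_scheduler.get_last_lr()[0]:.2e}")
```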
Hi, I am a little bit confused about the behavior of the lr_scheduler, especially this block of code:

accelerate/src/accelerate/scheduler.py, lines 73 to 82 at 995563f
In DDP mode, `gradient_accumulation_steps` and more GPUs essentially increase the batch size. Suppose the `batch_size` parameter we give to the dataloader is 10, we have 8 GPUs, and the gradient accumulation steps is 4. The real batch size is then 10*8*4 = 320, and that is, I believe, when we should do one optimization step as well as one LR scheduler step. But the code above would do more than one LR scheduler step. Briefly speaking, my question is: why do we need to make the lr_scheduler depend on the number of GPUs when we already assign a value to warmup_steps in the user config?
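My reading of the referenced block (paraphrased, not the exact source): when `split_batches` is `False`, each call to the prepared scheduler's step() advances the underlying scheduler once per process. A tiny standalone sketch of that behavior, with plain PyTorch objects and a hardcoded `num_processes` standing in for what Accelerate reads from its state:

```python
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer, num_warmup_steps=4000, num_training_steps=100_000
)

num_processes = 8  # stand-in for the number of GPUs seen by Accelerate

def prepared_style_step():
    # what the prepared wrapper effectively does when split_batches=False:
    # one underlying scheduler step per process
    for _ in range(num_processes):
        scheduler.step()

# after 500 "real" optimizer updates the underlying counter is at 4000,
# i.e. the warmup is already finished
for _ in range(500):
    optimizer.step()
    prepared_style_step()

print(scheduler.last_epoch)  # 4000
```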